public inbox for linux-mm@kvack.org
* [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
@ 2026-03-16  5:12 Rakie Kim
  2026-03-16  5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-16  5:12 UTC (permalink / raw)
  To: akpm
  Cc: gourry, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	rakie.kim

This patch series is an RFC to propose and discuss the overall design
and concept of a socket-aware weighted interleave mechanism. As there
are areas requiring further refinement, the primary goal at this stage
is to gather feedback on the architectural approach rather than focusing
on fine-grained implementation details.

Weighted interleave distributes page allocations across multiple nodes
based on configured weights. However, the current implementation applies
a single global weight vector regardless of where a task runs. On
multi-socket systems this creates a mismatch between configured weights
and actual hardware performance, because a flat vector cannot account
for inter-socket interconnect costs. To address this, we propose a
socket-aware approach that restricts candidate nodes to the local
socket before applying weights.

Consider a dual-socket system:

          node0             node1
        +-------+         +-------+
        | CPU 0 |---------| CPU 1 |
        +-------+         +-------+
        | DRAM0 |         | DRAM1 |
        +---+---+         +---+---+
            |                 |
        +---+---+         +---+---+
        | CXL 0 |         | CXL 1 |
        +-------+         +-------+
          node2             node3

Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
the effective bandwidth varies significantly from the perspective of
each CPU due to inter-socket interconnect penalties.

Effective bandwidth (GB/s) from each CPU to each node, including the
inter-socket interconnect penalty:

         0     1     2     3
CPU 0  300   150   100    50
CPU 1  150   300    50   100

A reasonable global weight vector reflecting the base capabilities is:

     node0=3 node1=3 node2=1 node3=1

However, because these configured node weights do not account for
interconnect degradation between sockets, applying them flatly to all
sources yields the following effective map from each CPU's perspective:

         0     1     2     3
CPU 0    3     3     1     1
CPU 1    3     3     1     1

This flat map ignores the interconnect penalty (e.g., from CPU 0,
node1 drops 300->150 GB/s and node3 drops 100->50 GB/s), so allocations
are distributed in proportions that no longer match actual performance.

This series makes weighted interleave socket-aware. Before weights are
applied, the candidate nodes are restricted to the current socket; only
if no eligible local nodes remain does the policy fall back to the
wider set.

Even with the configured global weights left unchanged:

     node0=3 node1=3 node2=1 node3=1

The resulting effective map from the perspective of each CPU becomes:

         0     1     2     3
CPU 0    3     0     1     0
CPU 1    0     3     0     1

Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.

To make this possible, the system requires a mechanism to understand
the physical topology. The existing NUMA distance model provides only
relative latency values between nodes and lacks any notion of
structural grouping such as socket boundaries. This is especially
problematic for CXL memory nodes, which appear without an explicit
socket association.

This patch series introduces a socket-aware topology management layer
that groups NUMA nodes according to their physical package. It
explicitly links CPU and memory-only nodes (such as CXL) under the
same socket using an initiator CPU node. This captures the true
hardware hierarchy rather than relying solely on flat distance values.


[Experimental Results]

System Configuration:
- Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)

               node0                       node1
             +-------+                   +-------+
             | CPU 0 |-------------------| CPU 1 |
             +-------+                   +-------+
12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                 |                           |
             +---+---+                   +---+---+
8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
DDR5-6400    +-------+                   +-------+  DDR5-6400
               node2                       node3

1) Throughput (System Bandwidth)
   - DRAM Only: 966 GB/s
   - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
   - Socket-Aware Weighted Interleave: 1329 GB/s (1.33 TB/s)
     (38% increase compared to DRAM Only,
      47% increase compared to Weighted Interleave)

2) Loaded Latency (Under High Bandwidth)
   - DRAM Only: 544 ns
   - Weighted Interleave: 545 ns
   - Socket-Aware Weighted Interleave: 436 ns
     (20% reduction compared to both)


[Additional Considerations]

Please note that this series includes modifications to the CXL driver
to register these nodes. However, the necessity and approach of these
driver-side changes require further discussion.
Additionally, this topology layer was originally designed to support
both memory tiering and weighted interleave. Currently, it is only
utilized by the weighted interleave policy. As a result, several
functions exposed by this layer are not actively used in this RFC.
Unused portions will be cleaned up and removed in the final patch
submission.

Summary of patches:

  [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
  This patch adds a new NUMA helper function to find all nodes in a
  given nodemask that share the minimum distance from a specified
  source node.

  [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
  This patch introduces a management layer that groups NUMA nodes by
  their physical package (socket). It forms a "memory package" to
  abstract real hardware locality for predictable NUMA memory
  management.

  [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
  This patch implements a registration path to bind CXL memory nodes
  to a socket-aware memory package using an initiator CPU node. This
  ensures CXL nodes are deterministically grouped with the CPUs they
  service.

  [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
  This patch modifies the weighted interleave policy to restrict
  candidate nodes to the current socket before applying weights. It
  reduces cross-socket traffic and aligns memory allocation with
  actual bandwidth.

Any feedback and discussion would be highly appreciated.

Thanks

Rakie Kim (4):
  mm/numa: introduce nearest_nodes_nodemask()
  mm/memory-tiers: introduce socket-aware topology management for NUMA
    nodes
  mm/memory-tiers: register CXL nodes to socket-aware packages via
    initiator
  mm/mempolicy: enhance weighted interleave with socket-aware locality

 drivers/cxl/core/region.c    |  46 +++
 drivers/cxl/cxl.h            |   1 +
 drivers/dax/kmem.c           |   2 +
 include/linux/memory-tiers.h |  93 +++++
 include/linux/numa.h         |   8 +
 mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
 mm/mempolicy.c               | 135 +++++-
 7 files changed, 1047 insertions(+), 4 deletions(-)


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.34.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
@ 2026-03-16  5:12 ` Rakie Kim
  2026-03-16  5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-16  5:12 UTC (permalink / raw)
  To: akpm
  Cc: gourry, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	rakie.kim

Add a new NUMA helper, nearest_nodes_nodemask(), to find all nodes in a
given nodemask that are located at the minimum distance from a specified
source node.

Unlike nearest_node_nodemask(), which returns only a single node, this
function identifies all nodes that share the closest distance value. This
is useful when multiple nodes are equally near in the NUMA topology and
a complete set of nearest candidates is required.

The helper clears the output nodemask and sets all nodes that meet the
minimum distance condition. It returns 0 on success or -EINVAL if the
output argument is invalid.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 include/linux/numa.h |  8 ++++++++
 mm/mempolicy.c       | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/include/linux/numa.h b/include/linux/numa.h
index e6baaf6051bc..aa9526e9078b 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -33,6 +33,8 @@ int numa_nearest_node(int node, unsigned int state);
 
 int nearest_node_nodemask(int node, nodemask_t *mask);
 
+int nearest_nodes_nodemask(int node, const nodemask_t *mask, nodemask_t *out);
+
 #ifndef memory_add_physaddr_to_nid
 int memory_add_physaddr_to_nid(u64 start);
 #endif
@@ -54,6 +56,12 @@ static inline int nearest_node_nodemask(int node, nodemask_t *mask)
 	return NUMA_NO_NODE;
 }
 
+static inline int nearest_nodes_nodemask(int node, const nodemask_t *mask,
+					 nodemask_t *out)
+{
+	if (!out)
+		return -EINVAL;
+
+	nodes_clear(*out);
+	return 0;
+}
+
 static inline int memory_add_physaddr_to_nid(u64 start)
 {
 	return 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 68a98ba57882..a3f0fde6c626 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -338,6 +338,47 @@ int nearest_node_nodemask(int node, nodemask_t *mask)
 }
 EXPORT_SYMBOL_GPL(nearest_node_nodemask);
 
+/**
+ * nearest_nodes_nodemask - Find all nodes in @mask that are nearest to @node
+ * @node: The reference node ID to measure distance from
+ * @mask: The set of candidate nodes to compare against
+ * @out:  Pointer to a nodemask that will store the nearest node(s)
+ *
+ * This function iterates over all nodes in @mask and measures the distance
+ * between each candidate node and the given @node using node_distance().
+ * It finds the minimum distance and then records all nodes in @mask that
+ * share that same minimum distance into the output mask @out.
+ *
+ * If multiple nodes share the minimal distance to @node, all of them are
+ * included in @out.
+ *
+ * Return: 0 on success, or -EINVAL if @out is NULL.
+ */
+int nearest_nodes_nodemask(int node, const nodemask_t *mask, nodemask_t *out)
+{
+	int dist, n, min_dist = INT_MAX;
+
+	if (!out)
+		return -EINVAL;
+
+	nodes_clear(*out);
+
+	for_each_node_mask(n, *mask) {
+		dist = node_distance(node, n);
+
+		if (dist < min_dist) {
+			min_dist = dist;
+			nodes_clear(*out);
+			node_set(n, *out);
+		} else if (dist == min_dist) {
+			node_set(n, *out);
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nearest_nodes_nodemask);
+
 struct mempolicy *get_task_policy(struct task_struct *p)
 {
 	struct mempolicy *pol = p->mempolicy;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
  2026-03-16  5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
@ 2026-03-16  5:12 ` Rakie Kim
  2026-03-18 12:22   ` Jonathan Cameron
  2026-03-16  5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Rakie Kim @ 2026-03-16  5:12 UTC (permalink / raw)
  To: akpm
  Cc: gourry, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	rakie.kim

The existing NUMA distance model provides only relative latency values
between nodes and lacks any notion of structural grouping such as socket
or package boundaries. As a result, memory policies based solely on
distance cannot differentiate between nodes that are physically local
to the same socket and those that belong to different sockets. This
often leads to inefficient cross-socket demotion and suboptimal memory
placement.

This patch introduces a socket-aware topology management layer that
groups NUMA nodes according to their physical package (socket)
association. Each group forms a "memory package" that explicitly links
CPU and memory-only nodes (such as CXL or HBM) under the same socket.
This structure allows the kernel to interpret NUMA topology in a way
that reflects real hardware locality rather than relying solely on
flat distance values.

By maintaining socket-level grouping, the kernel can:
 - Enforce demotion and promotion policies that stay within the same
   socket.
 - Avoid unintended cross-socket migrations that degrade performance.
 - Provide a structural abstraction for future policy and tiering logic.

Unlike ACPI-provided distance tables, which offer static and symmetric
relationships, this socket-aware model captures the true hardware
hierarchy and provides a flexible foundation for systems where the
distance matrix alone cannot accurately express socket boundaries or
asymmetric topologies.

This establishes a topology-aware basis for more predictable and
performance-consistent NUMA memory management.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 include/linux/memory-tiers.h |  93 +++++
 mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
 2 files changed, 859 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 7a805796fcfd..406b50ac7d88 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -52,10 +52,24 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
 struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 						  struct list_head *memory_types);
 void mt_put_memory_types(struct list_head *memory_types);
+
+int register_mp_package_notifier(struct notifier_block *notifier);
+void unregister_mp_package_notifier(struct notifier_block *notifier);
+int mp_probe_package_id(int nid);
+int mp_add_package_node_by_initiator(int nid, int initiator_nid);
+int mp_add_package_node(int nid);
+int mp_get_package_nodes(int nid, nodemask_t *out);
+int mp_get_package_cpu_nodes(int nid, nodemask_t *out);
+int mp_get_package_memory_only_nodes(int nid, nodemask_t *out);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 bool node_is_toptier(int node);
+
+int mp_next_demotion_nodemask(int nid, nodemask_t *out);
+int mp_next_demotion_node(int nid);
+int mp_next_promotion_nodemask(int nid, nodemask_t *out);
+int mp_next_promotion_node(int nid);
 #else
 static inline int next_demotion_node(int node)
 {
@@ -71,6 +85,26 @@ static inline bool node_is_toptier(int node)
 {
 	return true;
 }
+
+static inline int mp_next_demotion_nodemask(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_next_demotion_node(int nid)
+{
+	return NUMA_NO_NODE;
+}
+
+static inline int mp_next_promotion_nodemask(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_next_promotion_node(int nid)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 #else
@@ -151,5 +185,64 @@ static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 }
+
+static inline int register_mp_package_notifier(struct notifier_block *notifier)
+{
+	return 0;
+}
+
+static inline void unregister_mp_package_notifier(struct notifier_block *notifier)
+{
+}
+
+static inline int mp_probe_package_id(int nid)
+{
+	return NOTIFY_DONE;
+}
+
+static inline int mp_add_package_node_by_initiator(int nid, int initiator_nid)
+{
+	return 0;
+}
+
+static inline int mp_add_package_node(int nid)
+{
+	return 0;
+}
+
+static inline int mp_get_package_nodes(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_get_package_cpu_nodes(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_get_package_memory_only_nodes(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_next_demotion_nodemask(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_next_demotion_node(int nid)
+{
+	return NUMA_NO_NODE;
+}
+
+static inline int mp_next_promotion_nodemask(int nid, nodemask_t *out)
+{
+	return 0;
+}
+
+static inline int mp_next_promotion_node(int nid)
+{
+	return NUMA_NO_NODE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 864811fff409..47d323e5466e 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -998,3 +998,769 @@ static int __init numa_init_sysfs(void)
 subsys_initcall(numa_init_sysfs);
 #endif /* CONFIG_SYSFS */
 #endif
+
+/**
+ * enum mp_nodes_type - Selector for which subset of a package to return
+ * @MP_NODES_ALL:       All NUMA nodes that belong to the package.
+ * @MP_NODES_CPU:       Only CPU nodes in the package.
+ * @MP_NODES_MEM_ONLY:  Only memory-only nodes (e.g. CXL/HBM) in the package.
+ *
+ * Used internally to choose which nodemask to expose for a given package.
+ */
+enum mp_nodes_type {
+	MP_NODES_ALL,
+	MP_NODES_CPU,
+	MP_NODES_MEM_ONLY
+};
+
+/**
+ * struct memory_package - Per-socket (physical package) container
+ * @package_id:          Physical socket/package id (from topology).
+ * @nodes:               Nodemask of all member nodes in this package.
+ * @cpu_nodes:           Nodemask of CPU nodes in this package.
+ * @memory_only_nodes:   Nodemask of memory-only nodes in this package.
+ * @cpu_list:            List head of CPU-type members.
+ * @memory_only_list:    List head of memory-only members.
+ * @list:                Linkage on the global @memory_packages list.
+ *
+ * A memory_package groups NUMA nodes that share the same physical CPU package.
+ * The masks are used to implement socket-local placement/demotion/promotion.
+ */
+struct memory_package {
+	int package_id;
+	nodemask_t nodes;
+	nodemask_t cpu_nodes;
+	nodemask_t memory_only_nodes;
+	struct list_head cpu_list;
+	struct list_head memory_only_list;
+	struct list_head list;
+};
+
+/**
+ * enum mpn_source_flags - Source used to resolve a node's package membership
+ * @MPN_SRC_UNKNOWN:     Unknown/unspecified.
+ * @MPN_SRC_CPU:         Directly resolved from a CPU node (1:1).
+ * @MPN_SRC_INITIATOR:   Resolved via an initiator CPU node provided by a driver.
+ * @MPN_SRC_SLIT:        Resolved via SLIT/nearest-node.
+ *
+ * These flags are informational; they describe how a given node was bound to
+ * its package and help with policy decisions later.
+ */
+enum mpn_source_flags {
+	MPN_SRC_UNKNOWN		= 0,
+	MPN_SRC_CPU		= BIT(1),
+	MPN_SRC_INITIATOR	= BIT(2),
+	MPN_SRC_SLIT		= BIT(3)
+};
+
+/**
+ * struct memory_package_node - Per-node membership and preferences
+ * @nid:              NUMA node id for this entry.
+ * @initiator_nid:    CPU nid that served as the initiator when resolving @nid.
+ * @package_id:       Resolved package id that @nid belongs to.
+ * @source_flags:     One of &enum mpn_source_flags describing the resolution.
+ * @preferred:        Opposite-type nearest candidates inside the same package.
+ * @package:          Pointer to the owning &struct memory_package (NULL until bound).
+ * @package_entry:    Linkage on the owning package's type list.
+ *
+ * Each NUMA node that participates in socket-aware policy gets a wrapper entry
+ * that caches package membership and the precomputed set of preferred targets.
+ */
+struct memory_package_node {
+	int nid;
+	int initiator_nid;
+	int package_id;
+	int source_flags;
+	nodemask_t preferred;
+	struct memory_package *package;
+	struct list_head package_entry;
+};
+
+#define node_is_memory_only(_nid) \
+	(node_state((_nid), N_MEMORY) && !node_state((_nid), N_CPU))
+
+static BLOCKING_NOTIFIER_HEAD(mp_package_algorithms);
+
+static LIST_HEAD(memory_packages);
+static struct memory_package_node *mpns[MAX_NUMNODES];
+static DEFINE_MUTEX(memory_package_lock);
+
+/**
+ * register_mp_package_notifier - Register a package resolution algorithm
+ * @notifier: Notifier called with the nid to resolve (see mp_probe_package_id()).
+ *
+ * Drivers (e.g., CXL region/decoder code) register here to supply a package
+ * hint for newly appearing nodes. The notifier is invoked during nid->package
+ * resolution.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int register_mp_package_notifier(struct notifier_block *notifier)
+{
+	return blocking_notifier_chain_register(&mp_package_algorithms, notifier);
+}
+EXPORT_SYMBOL_GPL(register_mp_package_notifier);
+
+/**
+ * unregister_mp_package_notifier - Unregister a package resolution algorithm
+ * @notifier: Notifier previously registered with register_mp_package_notifier().
+ */
+void unregister_mp_package_notifier(struct notifier_block *notifier)
+{
+	blocking_notifier_chain_unregister(&mp_package_algorithms, notifier);
+}
+EXPORT_SYMBOL_GPL(unregister_mp_package_notifier);
+
+/**
+ * mp_probe_package_id - Invoke registered notifiers to resolve a node's package
+ * @nid: NUMA node id to resolve.
+ *
+ * Calls the blocking notifier chain to let subsystems provide an initiator or
+ * package id for @nid.
+ *
+ * Return: The result of the notifier chain call (NOTIFY_* value).
+ */
+int mp_probe_package_id(int nid)
+{
+	return blocking_notifier_call_chain(&mp_package_algorithms, nid, NULL);
+}
+EXPORT_SYMBOL_GPL(mp_probe_package_id);
+
+static int mp_node_to_package_id(int nid)
+{
+	int package_id = -EINVAL;
+	unsigned int first_cpu;
+	const struct cpumask *cpu_mask;
+
+	if (!node_state(nid, N_CPU))
+		goto out;
+
+	cpu_mask = cpumask_of_node(nid);
+	if (cpumask_empty(cpu_mask)) {
+		pr_err("node%d: CPU mask is empty\n", nid);
+		goto out;
+	}
+
+	first_cpu = cpumask_first(cpu_mask);
+	if (first_cpu >= nr_cpu_ids) {
+		pr_err("node%d: CPU (%d) out of range\n", nid, first_cpu);
+		goto out;
+	}
+
+	/*
+	 * Map the first CPU in this node's cpumask to its physical package id.
+	 * This ties the NUMA node to a socket (package) using topology info.
+	 */
+	package_id = topology_physical_package_id(first_cpu);
+	if (package_id < 0) {
+		pr_err("node%d: failed to resolve package id (%d)\n", nid, package_id);
+		package_id = -EINVAL;
+		goto out;
+	}
+
+out:
+	return package_id;
+}
+
+static void update_package_preferred(struct memory_package *mp)
+{
+	struct memory_package_node *mpn;
+
+	lockdep_assert_held(&memory_package_lock);
+
+	/*
+	 * For each CPU node, compute its preferred set as the nearest
+	 * memory-only node(s) within the same package. If the package has
+	 * no memory-only nodes, fall back to a self-reference so callers
+	 * never see an empty preferred set.
+	 */
+	list_for_each_entry(mpn, &mp->cpu_list, package_entry) {
+		nodes_clear(mpn->preferred);
+		if (!nodes_empty(mp->memory_only_nodes))
+			nearest_nodes_nodemask(mpn->nid, &mp->memory_only_nodes,
+					       &mpn->preferred);
+		else
+			node_set(mpn->nid, mpn->preferred);
+	}
+
+	/*
+	 * Symmetrically, for each memory-only node, compute its preferred set
+	 * as the nearest CPU node(s) within the same package. If the package
+	 * has no CPU nodes, fall back to a self-reference.
+	 */
+	list_for_each_entry(mpn, &mp->memory_only_list, package_entry) {
+		nodes_clear(mpn->preferred);
+		if (!nodes_empty(mp->cpu_nodes))
+			nearest_nodes_nodemask(mpn->nid, &mp->cpu_nodes,
+					       &mpn->preferred);
+		else
+			node_set(mpn->nid, mpn->preferred);
+	}
+}
+
+static inline bool memory_package_is_empty(struct memory_package *mp)
+{
+	lockdep_assert_held(&memory_package_lock);
+
+	return (nodes_empty(mp->cpu_nodes) && nodes_empty(mp->memory_only_nodes));
+}
+
+static inline bool package_node_is_valid(int nid)
+{
+	if (!mpns[nid]) {
+		pr_err("mpns[%d] is NULL\n", nid);
+		return false;
+	}
+
+	if (nodes_empty(mpns[nid]->preferred) || (mpns[nid]->package == NULL)) {
+		pr_err("nid %d: package or preferred mask not initialized\n", nid);
+		return false;
+	}
+
+	return true;
+}
+
+static struct memory_package *create_memory_package(int package_id)
+{
+	struct memory_package *mempackage;
+
+	mempackage = kzalloc(sizeof(*mempackage), GFP_KERNEL);
+	if (!mempackage)
+		return ERR_PTR(-ENOMEM);
+
+	mempackage->package_id = package_id;
+	mempackage->nodes = NODE_MASK_NONE;
+	mempackage->cpu_nodes = NODE_MASK_NONE;
+	mempackage->memory_only_nodes = NODE_MASK_NONE;
+	INIT_LIST_HEAD(&mempackage->cpu_list);
+	INIT_LIST_HEAD(&mempackage->memory_only_list);
+	INIT_LIST_HEAD(&mempackage->list);
+
+	return mempackage;
+}
+
+static void destroy_memory_package(struct memory_package *mp)
+{
+	lockdep_assert_held(&memory_package_lock);
+
+	if (memory_package_is_empty(mp)) {
+		list_del(&mp->list);
+		kfree(mp);
+	}
+}
+
+static struct memory_package *find_create_memory_package(int package_id)
+{
+	struct memory_package *mempackage;
+
+	mutex_lock(&memory_package_lock);
+	list_for_each_entry(mempackage, &memory_packages, list) {
+		/*
+		 * If a package for this package_id already exists, reuse it
+		 * instead of allocating a new one.
+		 */
+		if (mempackage->package_id == package_id) {
+			mutex_unlock(&memory_package_lock);
+			return mempackage;
+		}
+	}
+	mutex_unlock(&memory_package_lock);
+
+	mempackage = create_memory_package(package_id);
+	if (IS_ERR(mempackage))
+		return mempackage;
+
+	mutex_lock(&memory_package_lock);
+	list_add(&mempackage->list, &memory_packages);
+	mutex_unlock(&memory_package_lock);
+
+	return mempackage;
+}
+
+static int bind_node_to_package(int nid)
+{
+	int ret = 0, package_id;
+	struct memory_package *mp;
+
+	mutex_lock(&memory_package_lock);
+	if (!mpns[nid]) {
+		ret = -EINVAL;
+		goto unlock_out;
+	}
+	package_id = mpns[nid]->package_id;
+	mutex_unlock(&memory_package_lock);
+
+	mp = find_create_memory_package(package_id);
+	if (IS_ERR(mp)) {
+		ret = PTR_ERR(mp);
+		goto out;
+	}
+
+	mutex_lock(&memory_package_lock);
+	mpns[nid]->package = mp;
+	node_set(mpns[nid]->nid, mp->nodes);
+	if (node_is_memory_only(mpns[nid]->nid)) {
+		node_set(mpns[nid]->nid, mp->memory_only_nodes);
+		list_add(&mpns[nid]->package_entry, &mp->memory_only_list);
+	} else {
+		node_set(mpns[nid]->nid, mp->cpu_nodes);
+		list_add(&mpns[nid]->package_entry, &mp->cpu_list);
+	}
+	update_package_preferred(mp);
+
+unlock_out:
+	mutex_unlock(&memory_package_lock);
+out:
+	if (!ret)
+		pr_info("memory_package %d: nodes=%*pbl cpu=%*pbl memory_only=%*pbl\n",
+			mp->package_id,
+			nodemask_pr_args(&mp->nodes),
+			nodemask_pr_args(&mp->cpu_nodes),
+			nodemask_pr_args(&mp->memory_only_nodes));
+
+	return ret;
+}
+
+static void unbind_node_to_package(struct memory_package *mp, int nid)
+{
+	lockdep_assert_held(&memory_package_lock);
+
+	node_clear(nid, mp->nodes);
+	/*
+	 * Node states may have changed since bind time, so clear the nid
+	 * from both type masks unconditionally.
+	 */
+	node_clear(nid, mp->cpu_nodes);
+	node_clear(nid, mp->memory_only_nodes);
+
+	if (mpns[nid])
+		list_del(&mpns[nid]->package_entry);
+
+	update_package_preferred(mp);
+}
+
+static struct memory_package_node *create_package_node(int nid, int initiator_nid)
+{
+	int cpu_nid, package_id;
+	int source_flags;
+	struct memory_package_node *mpn;
+
+	if (node_state(nid, N_CPU)) {
+		cpu_nid = nid;
+		source_flags = MPN_SRC_CPU;
+	} else {
+		if (initiator_nid >= 0) {
+			cpu_nid = initiator_nid;
+			source_flags = MPN_SRC_INITIATOR;
+		} else {
+			/*
+			 * No driver-supplied initiator: fall back to the
+			 * nearest CPU node (via SLIT/numa_distance).
+			 */
+			cpu_nid = numa_nearest_node(nid, N_CPU);
+			if (cpu_nid < 0)
+				return ERR_PTR(-EINVAL);
+			source_flags = MPN_SRC_SLIT;
+		}
+	}
+
+	package_id = mp_node_to_package_id(cpu_nid);
+	if (package_id < 0)
+		return ERR_PTR(-EINVAL);
+
+	mpn = kzalloc(sizeof(*mpn), GFP_KERNEL);
+	if (!mpn)
+		return ERR_PTR(-ENOMEM);
+
+	mpn->nid = nid;
+	mpn->initiator_nid = cpu_nid;
+	mpn->package_id = package_id;
+	mpn->source_flags = source_flags;
+	mpn->preferred = NODE_MASK_NONE;
+	mpn->package = NULL;
+	INIT_LIST_HEAD(&mpn->package_entry);
+
+	return mpn;
+}
+
+static void __destroy_package_node(int nid)
+{
+	struct memory_package_node *mpn;
+	struct memory_package *mp;
+
+	lockdep_assert_held(&memory_package_lock);
+
+	mpn = mpns[nid];
+	if (!mpn)
+		return;
+
+	mp = mpn->package;
+	if (mp) {
+		unbind_node_to_package(mp, nid);
+		mpn->package = NULL;
+
+		if (memory_package_is_empty(mp))
+			destroy_memory_package(mp);
+	}
+
+	mpns[nid] = NULL;
+	kfree(mpn);
+}
+
+static void destroy_package_node(int nid)
+{
+	mutex_lock(&memory_package_lock);
+	__destroy_package_node(nid);
+	mutex_unlock(&memory_package_lock);
+}
+
+static int find_package_node(int nid, int initiator_nid)
+{
+	int mpn_nid = NUMA_NO_NODE;
+
+	mutex_lock(&memory_package_lock);
+	if (mpns[nid]) {
+		/*
+		 * SLIT-derived entries are provisional; if a driver later
+		 * provides an explicit initiator, drop the provisional
+		 * entry and rebuild with the stronger hint.
+		 */
+		if (mpns[nid]->source_flags == MPN_SRC_SLIT && initiator_nid >= 0)
+			__destroy_package_node(nid);
+		else
+			mpn_nid = nid;
+	}
+	mutex_unlock(&memory_package_lock);
+
+	return mpn_nid;
+}
+
+static int find_create_package_node(int nid, int initiator_nid)
+{
+	int mpn_nid;
+	struct memory_package_node *mpn;
+
+	mpn_nid = find_package_node(nid, initiator_nid);
+	if (mpn_nid != NUMA_NO_NODE)
+		return mpn_nid;
+
+	mpn = create_package_node(nid, initiator_nid);
+	if (IS_ERR(mpn))
+		return PTR_ERR(mpn);
+
+	mutex_lock(&memory_package_lock);
+	mpns[nid] = mpn;
+	mutex_unlock(&memory_package_lock);
+
+	return nid;
+}
+
+static int create_node_with_package(int nid)
+{
+	int ret;
+
+	ret = find_create_package_node(nid, NUMA_NO_NODE);
+	if (ret < 0) {
+		pr_err("package_node(%d) failed: %d\n", nid, ret);
+		return ret;
+	}
+
+	ret = bind_node_to_package(nid);
+	if (ret) {
+		pr_err("bind_node_to_package(%d) failed: %d\n", nid, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * mp_add_package_node_by_initiator - Add a node with an initiator
+ * @nid:            Target NUMA node to add.
+ * @initiator_nid:  CPU nid used to resolve @nid's package, or NUMA_NO_NODE.
+ *
+ * Ensures that a &struct memory_package_node exists for @nid and that its
+ * package_id is determined using @initiator_nid when provided. Binding to the
+ * package is not performed here.
+ *
+ * Return: 0 on success; negative errno on failure.
+ */
+int mp_add_package_node_by_initiator(int nid, int initiator_nid)
+{
+	int ret;
+
+	ret = find_create_package_node(nid, initiator_nid);
+	if (ret < 0) {
+		pr_err("find_create_package_node(nid=%d, initiator=%d) failed: %d\n",
+		       nid, initiator_nid, ret);
+		return ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mp_add_package_node_by_initiator);
+
+/**
+ * mp_add_package_node - Add a node, resolving package automatically
+ * @nid: Target NUMA node to add.
+ *
+ * Wrapper over mp_add_package_node_by_initiator() that requests automatic
+ * initiator resolution (e.g., nearest CPU).
+ *
+ * Return: 0 on success; negative errno on failure.
+ */
+int mp_add_package_node(int nid)
+{
+	return mp_add_package_node_by_initiator(nid, NUMA_NO_NODE);
+}
+EXPORT_SYMBOL_GPL(mp_add_package_node);
+
+static int __mp_get_preferred_nodemask(int nid, enum mp_nodes_type node_type,
+				    nodemask_t *out)
+{
+	int ret = 0;
+
+	if (!out) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nodes_clear(*out);
+
+	if (nid < 0 || nid >= MAX_NUMNODES) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (node_type == MP_NODES_CPU) {
+		if (node_is_memory_only(nid)) {
+			pr_err("nid %d is a memory-only node\n", nid);
+			ret = -EINVAL;
+			goto out;
+		}
+	} else if (node_type == MP_NODES_MEM_ONLY) {
+		if (!node_is_memory_only(nid)) {
+			pr_err("nid %d is a CPU node\n", nid);
+			ret = -EINVAL;
+			goto out;
+		}
+	} else {
+		pr_err("invalid node type: %d\n", (int)node_type);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!package_node_is_valid(nid)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	nodes_copy(*out, mpns[nid]->preferred);
+
+out:
+	return ret;
+}
+
+static int __mp_get_package_nodemask(int nid, enum mp_nodes_type node_type,
+				     nodemask_t *out)
+{
+	int ret = 0;
+
+	if (!out) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nodes_clear(*out);
+
+	if (nid < 0 || nid >= MAX_NUMNODES) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!package_node_is_valid(nid)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	switch (node_type) {
+	case MP_NODES_ALL:
+		nodes_copy(*out, mpns[nid]->package->nodes);
+		break;
+	case MP_NODES_CPU:
+		nodes_copy(*out, mpns[nid]->package->cpu_nodes);
+		break;
+	case MP_NODES_MEM_ONLY:
+		nodes_copy(*out, mpns[nid]->package->memory_only_nodes);
+		break;
+	default:
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+#ifdef CONFIG_MIGRATION
+/**
+ * mp_next_demotion_nodemask - Demotion candidates within a package
+ * @nid: CPU node from which memory would be demoted.
+ * @out: Output nodemask of nearest memory-only targets in the same package.
+ *
+ * Return: 0 on success; negative errno if @nid is invalid or not initialized.
+ */
+int mp_next_demotion_nodemask(int nid, nodemask_t *out)
+{
+	return __mp_get_preferred_nodemask(nid, MP_NODES_CPU, out);
+}
+EXPORT_SYMBOL_GPL(mp_next_demotion_nodemask);
+
+/**
+ * mp_next_demotion_node - Pick one demotion target
+ * @nid: CPU node from which memory would be demoted.
+ *
+ * Picks one target (random among the nearest) from mp_next_demotion_nodemask().
+ *
+ * Return: target nid on success, or NUMA_NO_NODE if no candidate is available.
+ */
+int mp_next_demotion_node(int nid)
+{
+	int target_nid;
+	nodemask_t target_nodemask;
+
+	if (mp_next_demotion_nodemask(nid, &target_nodemask))
+		return NUMA_NO_NODE;
+	if (nodes_empty(target_nodemask))
+		return NUMA_NO_NODE;
+
+	target_nid = node_random(&target_nodemask);
+
+	return target_nid;
+}
+EXPORT_SYMBOL_GPL(mp_next_demotion_node);
+
+/**
+ * mp_next_promotion_nodemask - Promotion candidates within a package
+ * @nid: Memory-only node towards which promotion seeks CPU locality.
+ * @out: Output nodemask of nearest CPU targets in the same package.
+ *
+ * Return: 0 on success; negative errno if @nid is invalid or not initialized.
+ */
+int mp_next_promotion_nodemask(int nid, nodemask_t *out)
+{
+	return __mp_get_preferred_nodemask(nid, MP_NODES_MEM_ONLY, out);
+}
+EXPORT_SYMBOL_GPL(mp_next_promotion_nodemask);
+
+/**
+ * mp_next_promotion_node - Pick one promotion target
+ * @nid: Memory-only node to be promoted towards CPUs.
+ *
+ * Picks one target (random among the nearest) from mp_next_promotion_nodemask().
+ *
+ * Return: target nid on success, or NUMA_NO_NODE if no candidate is available.
+ */
+int mp_next_promotion_node(int nid)
+{
+	int target_nid;
+	nodemask_t target_nodemask;
+
+	if (mp_next_promotion_nodemask(nid, &target_nodemask))
+		return NUMA_NO_NODE;
+	if (nodes_empty(target_nodemask))
+		return NUMA_NO_NODE;
+
+	target_nid = node_random(&target_nodemask);
+
+	return target_nid;
+}
+EXPORT_SYMBOL_GPL(mp_next_promotion_node);
+#endif /* CONFIG_MIGRATION */
+
+/**
+ * mp_get_package_nodes - Return all members of @nid's package
+ * @nid: Any NUMA node in the package.
+ * @out: Output nodemask to receive all members.
+ *
+ * Return: 0 on success; negative errno if @nid is invalid or not initialized.
+ */
+int mp_get_package_nodes(int nid, nodemask_t *out)
+{
+	return __mp_get_package_nodemask(nid, MP_NODES_ALL, out);
+}
+EXPORT_SYMBOL_GPL(mp_get_package_nodes);
+
+/**
+ * mp_get_package_cpu_nodes - Return CPU members of @nid's package
+ * @nid: Any NUMA node in the package.
+ * @out: Output nodemask to receive CPU members.
+ *
+ * Return: 0 on success; negative errno if @nid is invalid or not initialized.
+ */
+int mp_get_package_cpu_nodes(int nid, nodemask_t *out)
+{
+	return __mp_get_package_nodemask(nid, MP_NODES_CPU, out);
+}
+EXPORT_SYMBOL_GPL(mp_get_package_cpu_nodes);
+
+/**
+ * mp_get_package_memory_only_nodes - Return memory-only members of @nid's package
+ * @nid: Any NUMA node in the package.
+ * @out: Output nodemask to receive memory-only members.
+ *
+ * Return: 0 on success; negative errno if @nid is invalid or not initialized.
+ */
+int mp_get_package_memory_only_nodes(int nid, nodemask_t *out)
+{
+	return __mp_get_package_nodemask(nid, MP_NODES_MEM_ONLY, out);
+}
+EXPORT_SYMBOL_GPL(mp_get_package_memory_only_nodes);
+
+static int __meminit mp_hotplug_callback(struct notifier_block *nb,
+		unsigned long action, void *_arg)
+{
+	int nid;
+	struct node_notify *nn = _arg;
+
+	nid = nn->nid;
+	if (nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case NODE_REMOVED_LAST_MEMORY:
+		destroy_package_node(nid);
+		break;
+
+	case NODE_ADDED_FIRST_MEMORY:
+		create_node_with_package(nid);
+		break;
+
+	default:
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static int __init memory_package_init(void)
+{
+	int ret = 0, nid;
+
+	for_each_online_node(nid) {
+		if (!node_state(nid, N_MEMORY))
+			continue;
+
+		/*
+		 * On boot, enumerate already-present NUMA nodes and build the
+		 * initial package topology. CPU nodes are the common case,
+		 * but memory-only nodes are handled as well.
+		 */
+		ret = create_node_with_package(nid);
+		if (ret) {
+			pr_err("create nid(%d) failed: %d\n", nid, ret);
+			goto out;
+		}
+	}
+
+	hotplug_node_notifier(mp_hotplug_callback, MEMTIER_HOTPLUG_PRI);
+
+out:
+	return ret;
+}
+late_initcall(memory_package_init);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
  2026-03-16  5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
  2026-03-16  5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
@ 2026-03-16  5:12 ` Rakie Kim
  2026-03-16  5:12 ` [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality Rakie Kim
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-16  5:12 UTC (permalink / raw)
  To: akpm
  Cc: gourry, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	rakie.kim

CXL memory nodes appear without an explicit socket association.
Relying on plain NUMA distance does not convey which physical package
(CPU socket) they should belong to, which in turn makes locality-aware
placement ambiguous.

This change introduces a registration path that binds a CXL memory node
to a socket-aware "memory package" using an initiator CPU node. The
initiator is the CPU nid that best represents the host-side attachment
of the region (e.g., the CPU closest to the region’s target). By using
this nid to resolve the package, the CXL node is grouped with the CPUs
it actually services.

The flow is:
  - Determine an initiator CPU nid for the CXL region.
  - Register the CXL node with the package layer using that initiator.

This provides a deterministic and topology-consistent way to place CXL
nodes into the correct socket grouping, reducing the risk of inadvertent
cross-socket choices that distance alone cannot prevent.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 drivers/cxl/core/region.c | 46 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h         |  1 +
 drivers/dax/kmem.c        |  2 ++
 3 files changed, 49 insertions(+)

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 5bd1213737fa..2733e0d465cc 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2570,6 +2570,47 @@ static int cxl_region_calculate_adistance(struct notifier_block *nb,
 	return NOTIFY_STOP;
 }
 
+static int cxl_region_find_nearest_node(struct cxl_region *cxlr)
+{
+	struct cxl_region_params *p = &cxlr->params;
+	struct cxl_endpoint_decoder *cxled = NULL;
+	struct cxl_memdev *cxlmd = NULL;
+	int i, numa_node;
+
+	for (i = 0; i < p->nr_targets; i++) {
+		cxled = p->targets[i];
+		cxlmd = cxled_to_memdev(cxled);
+		numa_node = dev_to_node(&cxlmd->dev);
+		if (numa_node != NUMA_NO_NODE)
+			return numa_node;
+	}
+	return NUMA_NO_NODE;
+}
+
+static int cxl_region_add_package_node(struct notifier_block *nb,
+				       unsigned long dax_nid, void *data)
+{
+	int region_nid, nearest_nid, ret;
+	struct cxl_region *cxlr = container_of(nb, struct cxl_region, package_notifier);
+
+	region_nid = phys_to_target_node(cxlr->params.res->start);
+	if (region_nid != dax_nid)
+		return NOTIFY_DONE;
+
+	nearest_nid = cxl_region_find_nearest_node(cxlr);
+	if (nearest_nid == NUMA_NO_NODE)
+		return NOTIFY_DONE;
+
+	ret = mp_add_package_node_by_initiator(dax_nid, nearest_nid);
+	if (ret) {
+		dev_info(&cxlr->dev, "failed to add package node (%lu), nearest_nid (%d)\n",
+			 dax_nid, nearest_nid);
+		return NOTIFY_DONE;
+	}
+
+	return NOTIFY_OK;
+}
+
 /**
  * devm_cxl_add_region - Adds a region to a decoder
  * @cxlrd: root decoder
@@ -3788,6 +3829,7 @@ static void shutdown_notifiers(void *_cxlr)
 
 	unregister_node_notifier(&cxlr->node_notifier);
 	unregister_mt_adistance_algorithm(&cxlr->adist_notifier);
+	unregister_mp_package_notifier(&cxlr->package_notifier);
 }
 
 static void remove_debugfs(void *dentry)
@@ -3940,6 +3982,10 @@ static int cxl_region_probe(struct device *dev)
 	cxlr->adist_notifier.priority = 100;
 	register_mt_adistance_algorithm(&cxlr->adist_notifier);
 
+	cxlr->package_notifier.notifier_call = cxl_region_add_package_node;
+	cxlr->package_notifier.priority = 100;
+	register_mp_package_notifier(&cxlr->package_notifier);
+
 	rc = devm_add_action_or_reset(&cxlr->dev, shutdown_notifiers, cxlr);
 	if (rc)
 		return rc;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ba17fa86d249..6b6653e31135 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -551,6 +551,7 @@ struct cxl_region {
 	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
 	struct notifier_block node_notifier;
 	struct notifier_block adist_notifier;
+	struct notifier_block package_notifier;
 };
 
 struct cxl_nvdimm_bridge {
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index c036e4d0b610..32ee66b82cd3 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -94,6 +94,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (IS_ERR(mtype))
 		return PTR_ERR(mtype);
 
+	mp_probe_package_id(numa_node);
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
-- 
2.34.1




* [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
                   ` (2 preceding siblings ...)
  2026-03-16  5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
@ 2026-03-16  5:12 ` Rakie Kim
  2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-16  5:12 UTC (permalink / raw)
  To: akpm
  Cc: gourry, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	rakie.kim

Flat weighted interleave applies one global weight vector regardless of
where a task runs. On multi-socket systems this ignores inter-socket
interconnect costs and can steer allocations to remote sockets even when
local capacity exists, degrading effective bandwidth and increasing
latency.

Consider a dual-socket system:

          node0             node1
        +-------+         +-------+
        | CPU0  |---------| CPU1  |
        +-------+         +-------+
        | DRAM0 |         | DRAM1 |
        +---+---+         +---+---+
            |                 |
        +---+---+         +---+---+
        | CXL0  |         | CXL1  |
        +-------+         +-------+
          node2             node3

Local device capabilities (GB/s) versus cross-socket effective bandwidth:

         0     1     2     3
     0  300   150   100    50
     1  150   300    50   100

A reasonable global weight vector reflecting device capabilities is:

     node0=3 node1=3 node2=1 node3=1

However, applying it flat to all sources yields the effective map:

         0     1     2     3
     0   3     3     1     1
     1   3     3     1     1

This does not account for the interconnect penalty (e.g., node0->node1
drops 300->150, node0->node3 drops 100->50) and thus permits cross-socket
allocations that underutilize local bandwidth.

This patch makes weighted interleave socket-aware. Before weighting is
applied, the candidate nodes are restricted to the current socket; only
if no eligible local nodes remain does the policy fall back to the wider
set. The resulting effective map becomes:

         0     1     2     3
     0   3     0     1     0
     1   0     3     0     1

Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 mm/mempolicy.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 90 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a3f0fde6c626..541853ac08bc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -117,6 +117,7 @@
 #include <asm/tlb.h>
 #include <linux/uaccess.h>
 #include <linux/memory.h>
+#include <linux/memory-tiers.h>
 
 #include "internal.h"
 
@@ -2134,17 +2135,87 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+/**
+ * policy_resolve_package_nodes - Restrict policy nodes to the current package
+ * @policy: Target mempolicy whose user-selected nodes are in @policy->nodes.
+ * @mask:   Output nodemask. On success, contains policy->nodes limited to
+ *          the package that should be used for the allocation.
+ *
+ * This helper combines two constraints to decide where within a socket/package
+ * memory may be allocated:
+ *
+ *   1) The caller's package: derived via mp_get_package_nodes(numa_node_id()).
+ *   2) The user's preselected set @policy->nodes (cpusets/mempolicy).
+ *
+ * The function obtains the nodemask of the current CPU's package and
+ * intersects it with @policy->nodes. If the intersection is empty (e.g. the
+ * user excluded every node of the current package), it falls back to the
+ * node in @policy->nodes, derives that node's package, and intersects
+ * again. If the fallback also yields an empty set, @mask stays empty and a
+ * non-zero error is returned.
+ *
+ * Examples (packages: P0={CPU:0, MEM:2}, P1={CPU:1, MEM:3}):
+ *   - policy->nodes = {0,1,2,3}
+ *       on P0: mask = {0,2}; on P1: mask = {1,3}.
+ *   - policy->nodes = {0,1,3}
+ *       on P0: mask = {0}      (only node 0 from P0 is allowed).
+ *   - policy->nodes = {1,2,3}
+ *       on P0: mask = {2}      (only node 2 from P0 is allowed).
+ *   - policy->nodes = {1,3}
+ *       on P0: current package (P0) & policy = {} -> fall back to node 1,
+ *               package(1)=P1, mask = {1,3}. (User effectively opted out of P0.)
+ *
+ * Return:
+ *   0 on success with @mask set as above;
+ *   -EINVAL if @policy/@mask is NULL; -ENOENT if the resolved set is empty;
+ *   otherwise the error propagated from mp_get_package_nodes().
+ */
+static int policy_resolve_package_nodes(struct mempolicy *policy, nodemask_t *mask)
+{
+	int node, ret = 0;
+	nodemask_t package_mask;
+
+	if (!policy || !mask)
+		return -EINVAL;
+
+	nodes_clear(*mask);
+
+	node = numa_node_id();
+	ret = mp_get_package_nodes(node, &package_mask);
+	if (!ret) {
+		nodes_and(*mask, package_mask, policy->nodes);
+
+		if (nodes_empty(*mask)) {
+			node = first_node(policy->nodes);
+			ret = mp_get_package_nodes(node, &package_mask);
+			if (ret)
+				goto out;
+			nodes_and(*mask, package_mask, policy->nodes);
+			if (nodes_empty(*mask))
+				ret = -ENOENT;
+		}
+	}
+
+out:
+	return ret;
+}
+
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int node;
 	unsigned int cpuset_mems_cookie;
+	nodemask_t mask;
 
 retry:
 	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	node = current->il_prev;
-	if (!current->il_weight || !node_isset(node, policy->nodes)) {
-		node = next_node_in(node, policy->nodes);
+
+	if (policy_resolve_package_nodes(policy, &mask))
+		mask = policy->nodes;
+
+	if (!current->il_weight || !node_isset(node, mask)) {
+		node = next_node_in(node, mask);
 		if (read_mems_allowed_retry(cpuset_mems_cookie))
 			goto retry;
 		if (node == MAX_NUMNODES)
@@ -2237,6 +2308,21 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }
 
+static unsigned int read_once_policy_package_nodemask(struct mempolicy *pol,
+						      nodemask_t *mask)
+{
+	nodemask_t package_mask;
+
+	barrier();
+	if (policy_resolve_package_nodes(pol, &package_mask))
+		memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+	else
+		memcpy(mask, &package_mask, sizeof(nodemask_t));
+	barrier();
+
+	return nodes_weight(*mask);
+}
+
 static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	struct weighted_interleave_state *state;
@@ -2247,7 +2333,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	u8 weight;
 	int nid = 0;
 
-	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	nr_nodes = read_once_policy_package_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();
 
@@ -2691,7 +2777,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 	/* read the nodes onto the stack, retry if done during rebind */
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		nnodes = read_once_policy_nodemask(pol, &nodes);
+		nnodes = read_once_policy_package_nodemask(pol, &nodes);
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 
 	/* if the nodemask has become invalid, we cannot do anything */
-- 
2.34.1




* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
                   ` (3 preceding siblings ...)
  2026-03-16  5:12 ` [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality Rakie Kim
@ 2026-03-16 14:01 ` Gregory Price
  2026-03-17  9:50   ` Rakie Kim
  2026-03-16 15:19 ` Joshua Hahn
  2026-03-18 12:02 ` Jonathan Cameron
  6 siblings, 1 reply; 18+ messages in thread
From: Gregory Price @ 2026-03-16 14:01 UTC (permalink / raw)
  To: Rakie Kim
  Cc: akpm, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun

On Mon, Mar 16, 2026 at 02:12:48PM +0900, Rakie Kim wrote:
> This patch series is an RFC to propose and discuss the overall design
> and concept of a socket-aware weighted interleave mechanism. As there
> are areas requiring further refinement, the primary goal at this stage
> is to gather feedback on the architectural approach rather than focusing
> on fine-grained implementation details.
> 

I gave this a brief browse this morning, and I rather like this
approach, more so than the original proposals for socket-awareness
that encoded the weights in a 2-dimensional array.

I think this would be a great discussion at LSF, and I wonder if
something like memory-package could be used for more purposes than just
weighted interleave.

~Gregory



* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
                   ` (4 preceding siblings ...)
  2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
@ 2026-03-16 15:19 ` Joshua Hahn
  2026-03-16 19:45   ` Gregory Price
  2026-03-17 11:36   ` Rakie Kim
  2026-03-18 12:02 ` Jonathan Cameron
  6 siblings, 2 replies; 18+ messages in thread
From: Joshua Hahn @ 2026-03-16 15:19 UTC (permalink / raw)
  To: Rakie Kim
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, kernel_team,
	honggyu.kim, yunjeong.mun

Hello Rakie! I hope you have been doing well. Thank you for this
RFC, I think it is a very interesting idea. 

[...snip...]

> Consider a dual-socket system:
> 
>           node0             node1
>         +-------+         +-------+
>         | CPU 0 |---------| CPU 1 |
>         +-------+         +-------+
>         | DRAM0 |         | DRAM1 |
>         +---+---+         +---+---+
>             |                 |
>         +---+---+         +---+---+
>         | CXL 0 |         | CXL 1 |
>         +-------+         +-------+
>           node2             node3
> 
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.
> 
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> 
>          0     1     2     3
> CPU 0  300   150   100    50
> CPU 1  150   300    50   100
> 
> A reasonable global weight vector reflecting the base capabilities is:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
> 
>          0     1     2     3
> CPU 0    3     3     1     1
> CPU 1    3     3     1     1
> 
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
> 
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.

So when I saw this, I thought the idea was that we would attempt an
allocation with these socket-aware weights, and upon failure, fall back
to the global weights that are set so that we can try to fulfill the
allocation from cross-socket nodes.

However, reading the implementation in 4/4, it seems like what is meant
by "fallback" here is not in the sense of a fallback allocation, but
in the sense of "if there is a misconfiguration and the intersection
between policy nodes and the CPU's package is empty, use the global
nodes instead". 

Am I understanding this correctly? 

And, it seems like what this also means is that under sane configurations,
there is no more cross-socket memory allocation, since it will always
try to fulfill it from the local node. 

> Even if the configured global weights remain identically set:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> The resulting effective map from the perspective of each CPU becomes:
> 
>          0     1     2     3
> CPU 0    3     0     1     0
> CPU 1    0     3     0     1

> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.

In that sense I thought the word "prefer" was a bit confusing, since I
thought it would mean that it would try to fulfill the allocations
from within a package first, then fall back to remote packages if that
failed. (Or maybe I am just misunderstanding your explanation. Please
do let me know if that is the case : -) )

If what I understand is the case, I think this is the same thing as
just restricting allocations to be socket-local. I also wonder if
this idea applies to other mempolicies as well (i.e. unweighted interleave)

I think we should consider what the expected and desirable behavior is
when one socket is fully saturated but the other socket is empty. In my
mind this is no different from considering within-package remote NUMA
allocations; the tradeoff becomes between reclaiming locally and
keeping allocations local, vs. skipping reclaiming and consuming
free memory while eating the remote access latency, similar to
zone_reclaim_mode (package_reclaim_mode? ; -) )

In my mind (without doing any benchmarking myself or looking at the numbers)
I imagine that there are some scenarios where we actually do want
cross-socket allocations, like in the example above when we have very asymmetric
saturations across sockets. Is this something that could be worth
benchmarking as well?

I will end by saying that in the normal case (sockets have similar saturation)
I think this series is a definite win and improvement to weighted interleave.
I just was curious whether we can handle the worst-case scenarios.

Thank you again for the series. Have a great day!
Joshua



* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16 15:19 ` Joshua Hahn
@ 2026-03-16 19:45   ` Gregory Price
  2026-03-17 11:50     ` Rakie Kim
  2026-03-17 11:36   ` Rakie Kim
  1 sibling, 1 reply; 18+ messages in thread
From: Gregory Price @ 2026-03-16 19:45 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Rakie Kim, akpm, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun

On Mon, Mar 16, 2026 at 08:19:32AM -0700, Joshua Hahn wrote:
> 
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the alloactions
> from within a packet first, then fall back to remote packets if that
> failed. (Or maybe I am just misunderstanding your explanation. Please
> do let me know if that is the case : -) )
> 
> If what I understand is the case , I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted interleave)
> 

I was thinking about this as well, and in my head i think you have to
consider a 2x2 situation

cpuset             |   multi-socket-cpu      single-socket-cpu
==================================================================
single-socket-mem  |     mem-package            mem-package
------------------------------------------------------------------
multi-socket-mem   |       global                 global
------------------------------------------------------------------

But I think this reduces to the cpuset nodes dictating the weights used -
which should already be the case with the existing code.

I think you are right that we need to be very explicit about the
fallback semantics here - but that may just be a matter of dictating
whether the allocation falls back or prefers direct reclaim to push
pages out of their requested nodes.

~Gregory



* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
@ 2026-03-17  9:50   ` Rakie Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-17  9:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: akpm, linux-mm, linux-kernel, linux-cxl, ziy, matthew.brost,
	joshua.hahnjy, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, harry.yoo, lsf-pc, kernel_team,
	honggyu.kim, yunjeong.mun, Rakie Kim

On Mon, 16 Mar 2026 10:01:48 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 16, 2026 at 02:12:48PM +0900, Rakie Kim wrote:
> > This patch series is an RFC to propose and discuss the overall design
> > and concept of a socket-aware weighted interleave mechanism. As there
> > are areas requiring further refinement, the primary goal at this stage
> > is to gather feedback on the architectural approach rather than focusing
> > on fine-grained implementation details.
> > 
> 
> I gave this a brief browse this morning, and I rather like this
> approach, more-so than the original proposals for socket-awareness
> that encoded the weights in a 2-dimensional array.

Hello Gregory,

Thanks for your review and feedback. I also think this approach is
much better than the previous 2-dimensional array idea. Since this
is still an early draft, I hope this code will be developed into a
better design through community discussions.

> 
> I think this would be a great discussion at LSF, and I wonder if
> something like memory-package could be used for more purposes than just
> weighted interleave.
> 
> ~Gregory

Honggyu Kim and I are preparing to propose this exact topic
for the upcoming LSF/MM/BPF summit. However, I accidentally missed
adding lsf-pc@lists.linux-foundation.org to the CC list. I will
re-post or forward this to the LSF PC list soon.

You are exactly right about the memory-package. When I first designed
it, I wanted to use it for memory tiering and other areas, not just
for weighted interleave. For now, weighted interleave is the only
implemented use case, but I hope to keep improving it so it can be
used in other subsystems as well.

Thanks again for your time and review.

Rakie Kim



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16 15:19 ` Joshua Hahn
  2026-03-16 19:45   ` Gregory Price
@ 2026-03-17 11:36   ` Rakie Kim
  1 sibling, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-17 11:36 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, harry.yoo, lsf-pc, kernel_team,
	honggyu.kim, yunjeong.mun, Rakie Kim

On Mon, 16 Mar 2026 08:19:32 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Hello Rakie! I hope you have been doing well. Thank you for this
> RFC, I think it is a very interesting idea. 

Hello Joshua,

I hope you are doing well. Thanks for your review and feedback on this RFC.

> 
> [...snip...]
> 
> > Consider a dual-socket system:
> > 
> >           node0             node1
> >         +-------+         +-------+
> >         | CPU 0 |---------| CPU 1 |
> >         +-------+         +-------+
> >         | DRAM0 |         | DRAM1 |
> >         +---+---+         +---+---+
> >             |                 |
> >         +---+---+         +---+---+
> >         | CXL 0 |         | CXL 1 |
> >         +-------+         +-------+
> >           node2             node3
> > 
> > Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> > the effective bandwidth varies significantly from the perspective of
> > each CPU due to inter-socket interconnect penalties.
> > 
> > Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> > 
> >          0     1     2     3
> > CPU 0  300   150   100    50
> > CPU 1  150   300    50   100
> > 
> > A reasonable global weight vector reflecting the base capabilities is:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > However, because these configured node weights do not account for
> > interconnect degradation between sockets, applying them flatly to all
> > sources yields the following effective map from each CPU's perspective:
> > 
> >          0     1     2     3
> > CPU 0    3     3     1     1
> > CPU 1    3     3     1     1
> > 
> > This does not account for the interconnect penalty (e.g., node0->node1
> > drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> > that cause a mismatch with actual performance.
> > 
> > This patch makes weighted interleave socket-aware. Before weighting is
> > applied, the candidate nodes are restricted to the current socket; only
> > if no eligible local nodes remain does the policy fall back to the
> > wider set.
> 
> So when I saw this, I thought the idea was that we would attempt an
> allocation with these socket-aware weights, and upon failure, fall back
> to the global weights that are set so that we can try to fulfill the
> allocation from cross-socket nodes.
> 
> However, reading the implementation in 4/4, it seems like what is meant
> by "fallback" here is not in the sense of a fallback allocation, but
> in the sense of "if there is a misconfiguration and the intersection
> between policy nodes and the CPU's package is empty, use the global
> nodes instead". 
> 
> Am I understanding this correctly? 
> 
> And, it seems like what this also means is that under sane configurations,
> there is no more cross socket memory allocation, since it will always
> try to fulfill it from the local node. 
> 

Your analysis of the code in patch 4/4 is exactly correct. I apologize
for using the term "fallback" in the cover letter, which caused some
confusion. As you understood, the current implementation strictly
restricts allocations to the local socket to avoid cross-socket traffic.

> > Even if the configured global weights remain identically set:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > The resulting effective map from the perspective of each CPU becomes:
> > 
> >          0     1     2     3
> > CPU 0    3     0     1     0
> > CPU 1    0     3     0     1
> 
> > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> > effective bandwidth, preserves NUMA locality, and reduces cross-socket
> > traffic.
> 
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the allocations
> from within a packet first, then fall back to remote packets if that
> failed. (Or maybe I am just misunderstanding your explanation. Please
> do let me know if that is the case : -) )
> 
> If what I understand is the case, I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted interleave)

Again, I apologize for the confusion caused by words like "prefer" and
"fallback" in the commit message. Your understanding is correct; the
current code strictly restricts allocations to the socket-local nodes.

To determine where memory may be allocated within a socket, the code uses
a function named policy_resolve_package_nodes(). As described in the
comments, the logic works as follows:

1. Success case: It tries to use the intersection of the current CPU's
   package nodes and the user's preselected policy nodes. If the
   intersection is not empty, it uses these local nodes.
2. Failure case: If the intersection is empty (e.g., the user opted out
   of the current package), it finds the package of another node in the
   policy nodes and gets the intersection again. If this also yields an
   empty set, it completely falls back to the original global policy nodes.

In this early version, the handling of various detailed cases is
still insufficient. Also, as you pointed out, applying this strict
local restriction directly to other policies like unweighted interleave
might be difficult, as it could conflict with the original purpose of
interleaving. I plan to consider these aspects further and prepare a
more complete design.

> 
> I think we should consider what the expected and desirable behavior is
> when one socket is fully saturated but the other socket is empty. In my
> mind this is no different from considering within-packet remote NUMA
> allocations; the tradeoff becomes between reclaiming locally and
> keeping allocations local, vs. skipping reclaiming and consuming
> free memory while eating the remote access latency, similar to
> zone_reclaim mode (packet_reclaim_mode? ; -) )

This is an issue I have been thinking about since the early design phase,
and it must be resolved to improve this patch series. The trade-off
between forcing local memory reclaim to stay local versus accepting the
latency penalty of using a remote socket is a point we need to address.
I will continue to think about how to handle this properly.

> 
> In my mind (without doing any benchmarking myself or looking at the numbers)
> I imagine that there are some scenarios where we actually do want cross
> socket allocations, like in the example above when we have very asymmetric
> saturations across sockets. Is this something that could be worth
> benchmarking as well?

Your suggestion is valid and worth considering. I am currently analyzing
the behavior of this feature under various workloads. I will also
consider the asymmetric saturation scenarios you suggested.

> 
> I will end by saying that in the normal case (sockets have similar saturation)
> I think this series is a definite win and improvement to weighted interleave.
> I just was curious whether we can handle the worst-case scenarios.
> 
> Thank you again for the series. Have a great day!
> Joshua

Thanks again for the review. I will prepare a more considered design
for the next version based on these points.

Rakie Kim



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16 19:45   ` Gregory Price
@ 2026-03-17 11:50     ` Rakie Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-17 11:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: Rakie Kim, akpm, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, byungchul, ying.huang, apopple, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, harry.yoo, lsf-pc, kernel_team,
	honggyu.kim, yunjeong.mun, Joshua Hahn

On Mon, 16 Mar 2026 15:45:24 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 16, 2026 at 08:19:32AM -0700, Joshua Hahn wrote:
> > 
> > In that sense I thought the word "prefer" was a bit confusing, since I
> > thought it would mean that it would try to fulfill the allocations
> > from within a packet first, then fall back to remote packets if that
> > failed. (Or maybe I am just misunderstanding your explanation. Please
> > do let me know if that is the case : -) )
> > 
> > If what I understand is the case, I think this is the same thing as
> > just restricting allocations to be socket-local. I also wonder if
> > this idea applies to other mempolicies as well (i.e. unweighted interleave)
> > 
> 
> I was thinking about this as well, and in my head I think you have to
> consider a 2x2 situation
> 
> cpuset             |   multi-socket-cpu      single-socket-cpu
> ==================================================================
> single-socket-mem  |     mem-package            mem-package
> ------------------------------------------------------------------
> multi-socket-mem   |       global                 global
> ------------------------------------------------------------------
> 
> But I think this reduces to cpuset nodes dictates the weights used -
> which should already be the case with the existing code.

Hello Gregory,

Thanks for your additional feedback.

I agree with your analysis. The final behavior should follow the nodes
dictated by the cpuset or mempolicy configurations.

> 
> I think you are right that we need to be very explicit about the
> fallback semantics here - but that may just be a matter of dictating
> whether the allocation falls back or prefers direct reclaim to push
> pages out of their requested nodes.
> 
> ~Gregory

As you and Joshua pointed out, making the fallback semantics explicit
is the most critical issue for this patch series. We need a clear policy
to decide whether the allocation should fall back to a remote node or
force direct reclaim to keep the allocation local.

I will explicitly define these fallback semantics and address this
trade-off in the design for the next version.

Thanks again for your time and review.

Rakie Kim



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
                   ` (5 preceding siblings ...)
  2026-03-16 15:19 ` Joshua Hahn
@ 2026-03-18 12:02 ` Jonathan Cameron
  2026-03-19  7:55   ` Rakie Kim
  6 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2026-03-18 12:02 UTC (permalink / raw)
  To: Rakie Kim
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	Keith Busch

On Mon, 16 Mar 2026 14:12:48 +0900
Rakie Kim <rakie.kim@sk.com> wrote:

> This patch series is an RFC to propose and discuss the overall design
> and concept of a socket-aware weighted interleave mechanism. As there
> are areas requiring further refinement, the primary goal at this stage
> is to gather feedback on the architectural approach rather than focusing
> on fine-grained implementation details.
> 
> Weighted interleave distributes page allocations across multiple nodes
> based on configured weights. However, the current implementation applies
> a single global weight vector. In multi-socket systems, this creates a
> mismatch between configured weights and actual hardware performance, as
> it cannot account for inter-socket interconnect costs. To address this,
> we propose a socket-aware approach that restricts candidate nodes to
> the local socket before applying weights.
> 
> Flat weighted interleave applies one global weight vector regardless of
> where a task runs. On multi-socket systems, this ignores inter-socket
> interconnect costs, meaning the configured weights do not accurately
> reflect the actual hardware performance.
> 
> Consider a dual-socket system:
> 
>           node0             node1
>         +-------+         +-------+
>         | CPU 0 |---------| CPU 1 |
>         +-------+         +-------+
>         | DRAM0 |         | DRAM1 |
>         +---+---+         +---+---+
>             |                 |
>         +---+---+         +---+---+
>         | CXL 0 |         | CXL 1 |
>         +-------+         +-------+
>           node2             node3
> 
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.

I'm fully on board with this problem and very pleased to see someone
working on it!

I have some questions about the example.
The condition definitely applies when the local node to
CXL bandwidth > interconnect bandwidth, but that's not true here, so this
is more complex, and I'm curious about the example.

> 
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> 
>          0     1     2     3
> CPU 0  300   150   100    50
> CPU 1  150   300    50   100

These numbers don't seem consistent with the 100 / 300 numbers above.
These aren't low load bandwidths because if they were you'd not see any
drop on the CXL numbers as the bottleneck is still the CXL bus.  Given the
game here is bandwidth interleaving - fair enough that these should be
loaded bandwidths.

If these are fully loaded bandwidth then the headline DRAM / CXL numbers need
to be the sum of all access paths.  So DRAM must be 450GiB/s and CXL 150GiB/s
The cross CPU interconnect is 200GiB/s in each direction I think.
This is ignoring caching etc which can make judging interconnect effects tricky
at best!

Years ago there were some attempts to standardize the information available
on topology under load. To put it lightly it got tricky fast and no one
could agree on how to measure it for an empirical solution.

> 
> A reasonable global weight vector reflecting the base capabilities is:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
> 
>          0     1     2     3
> CPU 0    3     3     1     1
> CPU 1    3     3     1     1
> 
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
> 
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.
> 
> Even if the configured global weights remain identically set:
> 
>      node0=3 node1=3 node2=1 node3=1
> 
> The resulting effective map from the perspective of each CPU becomes:
> 
>          0     1     2     3
> CPU 0    3     0     1     0
> CPU 1    0     3     0     1
> 
> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.

Workload wise this is kind of assuming each NUMA node is doing something
similar and keeping to itself. Assuming a nice balanced setup that is
fine. However, with certain CPU topologies you are likely to see slightly
messier things.

> 
> To make this possible, the system requires a mechanism to understand
> the physical topology. The existing NUMA distance model provides only
> relative latency values between nodes and lacks any notion of
> structural grouping such as socket boundaries. This is especially
> problematic for CXL memory nodes, which appear without an explicit
> socket association.

So in a general sense, the missing info here is effectively the same
stuff we are missing from the HMAT presentation (it's there in the
table and it's there to compute in CXL cases) just because we decided
not to surface anything other than distances to memory from nearest
initiator.  I chatted to Joshua and Keith about filling in that stuff
at last LSFMM. To me that's just a bit of engineering work that needs
doing now we have proven use cases for the data. Mostly it's figuring out
the presentation to userspace and kernel data structures as it's a
lot of data in a big system (typically at least 32 NUMA nodes).

> 
> This patch series introduces a socket-aware topology management layer
> that groups NUMA nodes according to their physical package. It
> explicitly links CPU and memory-only nodes (such as CXL) under the
> same socket using an initiator CPU node. This captures the true
> hardware hierarchy rather than relying solely on flat distance values.
> 
> 
> [Experimental Results]
> 
> System Configuration:
> - Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)
> 
>                node0                       node1
>              +-------+                   +-------+
>              | CPU 0 |-------------------| CPU 1 |
>              +-------+                   +-------+
> 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
> DDR5-6400    +---+---+                   +---+---+  DDR5-6400
>                  |                           |
>              +---+---+                   +---+---+
> 8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
> DDR5-6400    +-------+                   +-------+  DDR5-6400
>                node2                       node3
> 
> 1) Throughput (System Bandwidth)
>    - DRAM Only: 966 GB/s
>    - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
>    - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
>      (38% increase compared to DRAM Only,
>       47% increase compared to Weighted Interleave)
> 
> 2) Loaded Latency (Under High Bandwidth)
>    - DRAM Only: 544 ns
>    - Weighted Interleave: 545 ns
>    - Socket-Aware Weighted Interleave: 436 ns
>      (20% reduction compared to both)
> 

This may prove too simplistic so we need to be a little careful.
It may be enough for now though so I'm not saying we necessarily
need to change things (yet)! Just highlighting things I've seen
turn up before in such discussions.

Simplest one is that we have more CXL memory on some nodes than
others.  Only so many lanes and we probably want some of them for
other purposes!

More fun: multi-NUMA-node-per-socket systems.

A typical CPU Die with memory controllers (e.g. taking one of
our old parts where there are dieshots online kunpeng 920 to
avoid any chance of leaking anything...).

                  Socket 0             Socket 1
 |    node0      |   node 1|       | node2 | |    node 3     |
 +-----+ +-------+ +-------+       +-------+ +-------+ +-----+
 | IO  | | CPU 0 | | CPU 1 |-------| CPU 2 | | CPU 3 | | IO  |
 | DIE | +-------+ +-------+       +-------+ +-------+ | DIE |
 +--+--+ | DRAM0 | | DRAM1 |       | DRAM2 | | DRAM2 | +--+--+
    |    +-------+ +-------+       +-------+ +-------+    |
    |                                                     |
+---+---+                                             +---+---+ 
| CXL 0 |                                             | CXL 1 |
+-------+                                             +-------+

So only a single CXL device per socket and the socket is multiple
NUMA nodes as the DRAM interfaces are on the CPU Dies (unlike some
others where they are on the IO Die alongside the CXL interfaces).

CXL topology cases:

A simple dual socket setup with a CXL switch and MLD below it
makes for a shared link to the CXL memory (and hence a bandwidth
restriction) that this can't model.

                node0                       node1
              +-------+                   +-------+
              | CPU 0 |-------------------| CPU 1 |
              +-------+                   +-------+
 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
 DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                  |                           |
                  |___________________________| 
                                |
                                |
                            +---+---+       
            Many Channels   | CXL 0 |    
               DDR5-6400    +-------+   
                node2/3     
 
Note it's still two nodes for the CXL as we aren't accessing the same DPA for
each host node but their actual memory is interleaved across the same devices
to give peak BW.

The reason you might do this is load balancing across lots of CXL devices
downstream of the switch.

Note this also effectively happens with MHDs, just the load balancing is across
backend memory being provided via multiple heads.  Whether people wire MHDs
that way or tend to have multiple top of rack devices with each CPU
socket connecting to a different one is an open question to me.

I have no idea yet on how you'd present the resulting bandwidth interference
effects of such a setup.

IO Expanders on the CPU interconnect:

Just for fun, on similar interconnects we've previously also seen
the following and I'd be surprised if those going for max bandwidth
don't do this for CXL at some point soon.


                node0                       node1
              +-------+                   +-------+
              | CPU 0 |-------------------| CPU 1 |
              +-------+                   +-------+
 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
 DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                  |                           |
                  |___________________________|
                      |  IO Expander      |
                      |  CPU interconnect |
                      |___________________|
                                |
                            +---+---+       
            Many Channels   | CXL 0 |    
               DDR5-6400    +-------+   
                node2

That is the CXL memory is effectively the same distance from
CPU0 and CPU1 - they probably have their own local CXL as well
as this approach is done to scale up interconnect lanes in a system
when bandwidth is way more important than compute. Similar to the
MHD case but in this case we are accessing the same DPAs via
both paths.

Anyhow, the exact details of those don't matter beyond the general
point that even in 'balanced' high performance configurations there
may not be a clean 1:1 relationship between NUMA nodes and CXL memory
devices.  Maybe some maths that aggregates some groups of nodes
together would be enough. I've not really thought it through yet.

Fun and useful topic.  Whilst I won't be at LSFMM it is definitely
something I'd like to see move forward in general.

Thanks,

Jonathan

> 
> [Additional Considerations]
> 
> Please note that this series includes modifications to the CXL driver
> to register these nodes. However, the necessity and the approach of
> these driver-side changes require further discussion and consideration.
> Additionally, this topology layer was originally designed to support
> both memory tiering and weighted interleave. Currently, it is only
> utilized by the weighted interleave policy. As a result, several
> functions exposed by this layer are not actively used in this RFC.
> Unused portions will be cleaned up and removed in the final patch
> submission.
> 
> Summary of patches:
> 
>   [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
>   This patch adds a new NUMA helper function to find all nodes in a
>   given nodemask that share the minimum distance from a specified
>   source node.
> 
>   [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
>   This patch introduces a management layer that groups NUMA nodes by
>   their physical package (socket). It forms a "memory package" to
>   abstract real hardware locality for predictable NUMA memory
>   management.
> 
>   [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
>   This patch implements a registration path to bind CXL memory nodes
>   to a socket-aware memory package using an initiator CPU node. This
>   ensures CXL nodes are deterministically grouped with the CPUs they
>   service.
> 
>   [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
>   This patch modifies the weighted interleave policy to restrict
>   candidate nodes to the current socket before applying weights. It
>   reduces cross-socket traffic and aligns memory allocation with
>   actual bandwidth.
> 
> Any feedback and discussions are highly appreciated.
> 
> Thanks
> 
> Rakie Kim (4):
>   mm/numa: introduce nearest_nodes_nodemask()
>   mm/memory-tiers: introduce socket-aware topology management for NUMA
>     nodes
>   mm/memory-tiers: register CXL nodes to socket-aware packages via
>     initiator
>   mm/mempolicy: enhance weighted interleave with socket-aware locality
> 
>  drivers/cxl/core/region.c    |  46 +++
>  drivers/cxl/cxl.h            |   1 +
>  drivers/dax/kmem.c           |   2 +
>  include/linux/memory-tiers.h |  93 +++++
>  include/linux/numa.h         |   8 +
>  mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
>  mm/mempolicy.c               | 135 +++++-
>  7 files changed, 1047 insertions(+), 4 deletions(-)
> 
> 
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes
  2026-03-16  5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
@ 2026-03-18 12:22   ` Jonathan Cameron
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Cameron @ 2026-03-18 12:22 UTC (permalink / raw)
  To: Rakie Kim
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun

On Mon, 16 Mar 2026 14:12:50 +0900
Rakie Kim <rakie.kim@sk.com> wrote:

> The existing NUMA distance model provides only relative latency values
> between nodes and lacks any notion of structural grouping such as socket
> or package boundaries. As a result, memory policies based solely on
> distance cannot differentiate between nodes that are physically local
> to the same socket and those that belong to different sockets. This
> often leads to inefficient cross-socket demotion and suboptimal memory
> placement.
> 
> This patch introduces a socket-aware topology management layer that
> groups NUMA nodes according to their physical package (socket)
> association. Each group forms a "memory package" that explicitly links
> CPU and memory-only nodes (such as CXL or HBM) under the same socket.
> This structure allows the kernel to interpret NUMA topology in a way
> that reflects real hardware locality rather than relying solely on
> flat distance values.
> 
> By maintaining socket-level grouping, the kernel can:
>  - Enforce demotion and promotion policies that stay within the same
>    socket.
>  - Avoid unintended cross-socket migrations that degrade performance.
>  - Provide a structural abstraction for future policy and tiering logic.
> 
> Unlike ACPI-provided distance tables, which offer static and symmetric
> relationships, this socket-aware model captures the true hardware
> hierarchy and provides a flexible foundation for systems where the
> distance matrix alone cannot accurately express socket boundaries or
> asymmetric topologies.

Careful with the generalities in here. There is no way to derive the
'true' hierarchy. What this is doing is applying a particular set
of heuristics to the data that ACPI provided and attempting to use
that to derive relationships. In simple cases that might work fine.

Doing so is OK in an RFC for discussion but this will need testing
against a wide range of topologies to at least ensure it fails gracefully.
Note we've had to paper over quite a few topology assumptions in the
kernel and this feels like another one that will bite us later.

I'd avoid the socket terminology as multiple NUMA nodes in sockets
have been a thing for many years. Today there can even be multiple
IO dies with a complex 'distance' relationship wrt to the CPUs
in that socket. Topologies of memory controllers in those
packages are another level of complexity.


Otherwise a few general things from a quick look. 

I'd avoid goto out; where out just returns.  That just makes code
flow more complex and often makes for longer code. When you have
an error and there is nothing to cleanup just return immediately.

guard() / scoped_guard() will help simplify some of the locking.

Thanks,

Jonathan



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-18 12:02 ` Jonathan Cameron
@ 2026-03-19  7:55   ` Rakie Kim
  2026-03-20 16:56     ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Rakie Kim @ 2026-03-19  7:55 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	Keith Busch, Rakie Kim

On Wed, 18 Mar 2026 12:02:45 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Mon, 16 Mar 2026 14:12:48 +0900
> Rakie Kim <rakie.kim@sk.com> wrote:
> 

Hello Jonathan,

Thanks for your detailed review and the insights on various topology cases.

> > This patch series is an RFC to propose and discuss the overall design
> > and concept of a socket-aware weighted interleave mechanism. As there
> > are areas requiring further refinement, the primary goal at this stage
> > is to gather feedback on the architectural approach rather than focusing
> > on fine-grained implementation details.
> > 
> > Weighted interleave distributes page allocations across multiple nodes
> > based on configured weights. However, the current implementation applies
> > a single global weight vector. In multi-socket systems, this creates a
> > mismatch between configured weights and actual hardware performance, as
> > it cannot account for inter-socket interconnect costs. To address this,
> > we propose a socket-aware approach that restricts candidate nodes to
> > the local socket before applying weights.
> > 
> > Flat weighted interleave applies one global weight vector regardless of
> > where a task runs. On multi-socket systems, this ignores inter-socket
> > interconnect costs, meaning the configured weights do not accurately
> > reflect the actual hardware performance.
> > 
> > Consider a dual-socket system:
> > 
> >           node0             node1
> >         +-------+         +-------+
> >         | CPU 0 |---------| CPU 1 |
> >         +-------+         +-------+
> >         | DRAM0 |         | DRAM1 |
> >         +---+---+         +---+---+
> >             |                 |
> >         +---+---+         +---+---+
> >         | CXL 0 |         | CXL 1 |
> >         +-------+         +-------+
> >           node2             node3
> > 
> > Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> > the effective bandwidth varies significantly from the perspective of
> > each CPU due to inter-socket interconnect penalties.
> 
> I'm fully on board with this problem and very pleased to see someone
> working on it!
> 
> I have some questions about the example.
> The condition definitely applies when the local node to
> CXL bandwidth > interconnect bandwidth, but that's not true here, so this
> is a more complex case and I'm curious about the example.
> 
> > 
> > Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> > 
> >          0     1     2     3
> > CPU 0  300   150   100    50
> > CPU 1  150   300    50   100
> 
> These numbers don't seem consistent with the 100 / 300 numbers above.
> These aren't low load bandwidths because if they were you'd not see any
> drop on the CXL numbers as the bottleneck is still the CXL bus.  Given the
> game here is bandwidth interleaving - fair enough that these should be
> loaded bandwidths.
> 
> If these are fully loaded bandwidth then the headline DRAM / CXL numbers need
> to be the sum of all access paths.  So DRAM must be 450GiB/s and CXL 150GiB/s.
> The cross CPU interconnect is 200GiB/s in each direction I think.
> This is ignoring caching etc which can make judging interconnect effects tricky
> at best!
> 
> Years ago there were some attempts to standardize the information available
> on topology under load. To put it lightly it got tricky fast and no one
> could agree on how to measure it for an empirical solution.
> 

You are exactly right about the numbers. The values used in the example
were overly simplified just to briefly illustrate the concept of the
interconnect penalty. I realize that this oversimplification caused
confusion regarding the actual bottleneck and fully loaded bandwidth.
In the next update, I will revise the example to use more accurate
numbers based on the actual system I am currently using.

> > 
> > A reasonable global weight vector reflecting the base capabilities is:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > However, because these configured node weights do not account for
> > interconnect degradation between sockets, applying them flatly to all
> > sources yields the following effective map from each CPU's perspective:
> > 
> >          0     1     2     3
> > CPU 0    3     3     1     1
> > CPU 1    3     3     1     1
> > 
> > This does not account for the interconnect penalty (e.g., node0->node1
> > drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> > that cause a mismatch with actual performance.
> > 
> > This patch makes weighted interleave socket-aware. Before weighting is
> > applied, the candidate nodes are restricted to the current socket; only
> > if no eligible local nodes remain does the policy fall back to the
> > wider set.
> > 
> > Even if the configured global weights remain identically set:
> > 
> >      node0=3 node1=3 node2=1 node3=1
> > 
> > The resulting effective map from the perspective of each CPU becomes:
> > 
> >          0     1     2     3
> > CPU 0    3     0     1     0
> > CPU 1    0     3     0     1
> > 
> > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> > effective bandwidth, preserves NUMA locality, and reduces cross-socket
> > traffic.
> 
> Workload wise this is kind of assuming each NUMA node is doing something
> similar and keeping to itself. Assuming a nice balanced setup that is
> fine. However, with certain CPU topologies you are likely to see slightly
> messier things.
> 

I agree with your point. Since the current design is still an early draft,
I understand that this assumption may not hold true for all workloads.
This is an area that requires further consideration.
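The socket-local restriction quoted above is simple to state; the following is a hypothetical userspace model of just the selection rule (restrict candidates to the caller's socket, widen again when nothing local remains), not the kernel code from the series. The `socket_of[]` and `allowed[]` arrays are illustrative stand-ins for the real topology and policy nodemask:

```c
#include <string.h>

#define MAX_NODES 8

/*
 * Hypothetical model of the rule described in the cover letter:
 * restrict the candidate nodes to the caller's socket before weights
 * are applied, and fall back to the full mask when no eligible local
 * node remains.  Returns 1 if any local candidate was found.
 */
static int restrict_to_socket(const int *socket_of, int nr_nodes,
                              const unsigned char *allowed,
                              int local_socket, unsigned char *out)
{
    int n, found = 0;

    for (n = 0; n < nr_nodes; n++) {
        out[n] = allowed[n] && socket_of[n] == local_socket;
        found |= out[n];
    }
    if (!found)                 /* no local candidate: widen again */
        memcpy(out, allowed, nr_nodes);
    return found;
}
```

For the dual-socket example (nodes 0/2 on socket 0, nodes 1/3 on socket 1), a task on socket 0 keeps only node0 and node2, matching the effective map shown in the cover letter.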

> > 
> > To make this possible, the system requires a mechanism to understand
> > the physical topology. The existing NUMA distance model provides only
> > relative latency values between nodes and lacks any notion of
> > structural grouping such as socket boundaries. This is especially
> > problematic for CXL memory nodes, which appear without an explicit
> > socket association.
> 
> So in a general sense, the missing info here is effectively the same
> stuff we are missing from the HMAT presentation (it's there in the
> table and it's there to compute in CXL cases) just because we decided
> not to surface anything other than distances to memory from nearest
> initiator.  I chatted to Joshua and Keith about filling in that stuff
> at last LSFMM. To me that's just a bit of engineering work that needs
> doing now that we have proven use cases for the data. Mostly it's figuring out
> the presentation to userspace and kernel data structures as it's a
> lot of data in a big system (typically at least 32 NUMA nodes).
> 

Hearing about the discussion on exposing HMAT data is very welcome news.
Because this detailed topology information is not yet fully exposed to
the kernel and userspace, I used a temporary package-based restriction.
Figuring out how to expose and integrate this data into the kernel data
structures is indeed a crucial engineering task we need to solve.

Actually, when I first started this work, I considered fetching the
topology information from HMAT before adopting the current approach.
However, I encountered a firmware issue on my test systems
(Granite Rapids and Sierra Forest).

Although each socket has its own locally attached CXL device, the HMAT
only registers node1 (Socket 1) as the initiator for both CXL memory
nodes (node2 and node3). As a result, the sysfs HMAT initiators for
both node2 and node3 only expose node1.

Even though the distance map shows node2 is physically closer to
Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
routing path strictly through Socket 1. Because the HMAT alone made it
difficult to determine the exact physical socket connections on these
systems, I ended up using the current CXL driver-based approach.

I wonder if others have experienced similar broken HMAT cases with CXL.
If HMAT information becomes more reliable in the future, we could
build a much more efficient structure.

> > 
> > This patch series introduces a socket-aware topology management layer
> > that groups NUMA nodes according to their physical package. It
> > explicitly links CPU and memory-only nodes (such as CXL) under the
> > same socket using an initiator CPU node. This captures the true
> > hardware hierarchy rather than relying solely on flat distance values.
> > 
> > 
> > [Experimental Results]
> > 
> > System Configuration:
> > - Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)
> > 
> >                node0                       node1
> >              +-------+                   +-------+
> >              | CPU 0 |-------------------| CPU 1 |
> >              +-------+                   +-------+
> > 12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
> > DDR5-6400    +---+---+                   +---+---+  DDR5-6400
> >                  |                           |
> >              +---+---+                   +---+---+
> > 8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
> > DDR5-6400    +-------+                   +-------+  DDR5-6400
> >                node2                       node3
> > 
> > 1) Throughput (System Bandwidth)
> >    - DRAM Only: 966 GB/s
> >    - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
> >    - Socket-Aware Weighted Interleave: 1329 GB/s (1.33TB/s)
> >      (38% increase compared to DRAM Only,
> >       47% increase compared to Weighted Interleave)
> > 
> > 2) Loaded Latency (Under High Bandwidth)
> >    - DRAM Only: 544 ns
> >    - Weighted Interleave: 545 ns
> >    - Socket-Aware Weighted Interleave: 436 ns
> >      (20% reduction compared to both)
> > 
> 
> This may prove too simplistic so we need to be a little careful.
> It may be enough for now though, so I'm not saying we necessarily
> need to change things (yet)! Just highlighting things I've seen
> turn up before in such discussions.
> 
> Simplest one is that we have more CXL memory on some nodes than
> others.  Only so many lanes and we probably want some of them for
> other purposes!
> 
> More fun, multi NUMA node per sockets systems.
> 
> A typical CPU die with memory controllers (e.g. taking one of
> our old parts, the Kunpeng 920, for which there are die shots online,
> to avoid any chance of leaking anything...).
> 
>                   Socket 0             Socket 1
>  |    node0      |   node 1|       | node2 | |    node 3     |
>  +-----+ +-------+ +-------+       +-------+ +-------+ +-----+
>  | IO  | | CPU 0 | | CPU 1 |-------| CPU 2 | | CPU 3 | | IO  |
>  | DIE | +-------+ +-------+       +-------+ +-------+ | DIE |
>  +--+--+ | DRAM0 | | DRAM1 |       | DRAM2 | | DRAM3 | +--+--+
>     |    +-------+ +-------+       +-------+ +-------+    |
>     |                                                     |
> +---+---+                                             +---+---+ 
> | CXL 0 |                                             | CXL 1 |
> +-------+                                             +-------+
> 
> So only a single CXL device per socket and the socket is multiple
> NUMA nodes as the DRAM interfaces are on the CPU Dies (unlike some
> others where they are on the IO Die alongside the CXL interfaces).
> 
> CXL topology cases:
> 
> A simple dual socket setup with a CXL switch and MLD below it
> makes for a shared link to the CXL memory (and hence a bandwidth
> restriction) that this can't model.
> 
>                 node0                       node1
>               +-------+                   +-------+
>               | CPU 0 |-------------------| CPU 1 |
>               +-------+                   +-------+
>  12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
>  DDR5-6400    +---+---+                   +---+---+  DDR5-6400
>                   |                           |
>                   |___________________________| 
>                                 |
>                                 |
>                             +---+---+       
>             Many Channels   | CXL 0 |    
>                DDR5-6400    +-------+   
>                 node2/3     
>  
> Note it's still two nodes for the CXL as we aren't accessing the same DPA for
> each host node but their actual memory is interleaved across the same devices
> to give peak BW.
> 
> The reason you might do this is load balancing across lots of CXL devices
> downstream of the switch.
> 
> Note this also effectively happens with MHDs; it's just that the load balancing
> is across backend memory provided via multiple heads.  Whether people wire MHDs
> that way or tend to have multiple top of rack devices with each CPU
> socket connecting to a different one is an open question to me.
> 
> I have no idea yet on how you'd present the resulting bandwidth interference
> effects of such a setup.
> 
> IO Expanders on the CPU interconnect:
> 
> Just for fun, on similar interconnects we've previously also seen
> the following and I'd be surprised if those going for max bandwidth
> don't do this for CXL at some point soon.
> 
> 
>                 node0                       node1
>               +-------+                   +-------+
>               | CPU 0 |-------------------| CPU 1 |
>               +-------+                   +-------+
>  12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
>  DDR5-6400    +---+---+                   +---+---+  DDR5-6400
>                   |                           |
>                   |___________________________|
>                       |  IO Expander      |
>                       |  CPU interconnect |
>                       |___________________|
>                                 |
>                             +---+---+       
>             Many Channels   | CXL 0 |    
>                DDR5-6400    +-------+   
>                 node2
> 
> That is, the CXL memory is effectively the same distance from
> CPU0 and CPU1 - they probably have their own local CXL as well,
> as this approach is done to scale up interconnect lanes in a system
> when bandwidth is way more important than compute. Similar to the
> MHD case but in this case we are accessing the same DPAs via
> both paths.
> 
> Anyhow, the exact details of those don't matter beyond the general
> point that even in 'balanced' high performance configurations there
> may not be a clean 1:1 relationship between NUMA nodes and CXL memory
> devices.  Maybe some maths that aggregates some groups of nodes
> together would be enough. I've not really thought it through yet.
> 
> Fun and useful topic.  Whilst I won't be at LSFMM it is definitely
> something I'd like to see move forward in general.
> 
> Thanks,
> 
> Jonathan
> 

The complex topology cases you presented, such as multi-NUMA per socket,
shared CXL switches, and IO expanders, are very important points.
I clearly understand that the simple package-level grouping does not fully
reflect the 1:1 relationship in these future hardware architectures.

I have also thought about the shared CXL switch scenario you mentioned,
and I know the current design falls short in addressing it properly.
While the current implementation starts with a simple socket-local
restriction, I plan to evolve it into a more flexible node aggregation
model to properly reflect all the diverse topologies you suggested.

Thanks again for your time and review.

Rakie Kim

> > 
> > [Additional Considerations]
> > 
> > Please note that this series includes modifications to the CXL driver
> > to register these nodes. However, the necessity and the approach of
> > these driver-side changes require further discussion and consideration.
> > Additionally, this topology layer was originally designed to support
> > both memory tiering and weighted interleave. Currently, it is only
> > utilized by the weighted interleave policy. As a result, several
> > functions exposed by this layer are not actively used in this RFC.
> > Unused portions will be cleaned up and removed in the final patch
> > submission.
> > 
> > Summary of patches:
> > 
> >   [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
> >   This patch adds a new NUMA helper function to find all nodes in a
> >   given nodemask that share the minimum distance from a specified
> >   source node.
> > 
> >   [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
> >   This patch introduces a management layer that groups NUMA nodes by
> >   their physical package (socket). It forms a "memory package" to
> >   abstract real hardware locality for predictable NUMA memory
> >   management.
> > 
> >   [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
> >   This patch implements a registration path to bind CXL memory nodes
> >   to a socket-aware memory package using an initiator CPU node. This
> >   ensures CXL nodes are deterministically grouped with the CPUs they
> >   service.
> > 
> >   [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
> >   This patch modifies the weighted interleave policy to restrict
> >   candidate nodes to the current socket before applying weights. It
> >   reduces cross-socket traffic and aligns memory allocation with
> >   actual bandwidth.
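The helper summarized in patch 1 can be sketched in userspace to show the selection rule it implements: keep every candidate node that shares the minimum distance from a source node. This is only an illustration under assumed inputs; the kernel version would operate on nodemask_t and node_distance(), and the distance table here is a made-up SLIT-style example:

```c
#include <limits.h>

#define MAX_NODES 8

/*
 * Userspace sketch of the rule behind the proposed
 * nearest_nodes_nodemask(): among the candidate nodes, keep all that
 * share the minimum distance from @src.  Returns that minimum.
 */
static int nearest_nodes_mask(int dist[][MAX_NODES], int nr_nodes,
                              int src, const unsigned char *candidates,
                              unsigned char *nearest)
{
    int n, min = INT_MAX;

    /* first pass: find the minimum distance among candidates */
    for (n = 0; n < nr_nodes; n++)
        if (candidates[n] && dist[src][n] < min)
            min = dist[src][n];
    /* second pass: keep every candidate at exactly that distance */
    for (n = 0; n < nr_nodes; n++)
        nearest[n] = candidates[n] && dist[src][n] == min;
    return min;
}
```

With a candidate mask covering only the CXL nodes, this picks node2 as nearest to node0 and node3 as nearest to node1 in the dual-socket example.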
> > 
> > Any feedback and discussions are highly appreciated.
> > 
> > Thanks
> > 
> > Rakie Kim (4):
> >   mm/numa: introduce nearest_nodes_nodemask()
> >   mm/memory-tiers: introduce socket-aware topology management for NUMA
> >     nodes
> >   mm/memory-tiers: register CXL nodes to socket-aware packages via
> >     initiator
> >   mm/mempolicy: enhance weighted interleave with socket-aware locality
> > 
> >  drivers/cxl/core/region.c    |  46 +++
> >  drivers/cxl/cxl.h            |   1 +
> >  drivers/dax/kmem.c           |   2 +
> >  include/linux/memory-tiers.h |  93 +++++
> >  include/linux/numa.h         |   8 +
> >  mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
> >  mm/mempolicy.c               | 135 +++++-
> >  7 files changed, 1047 insertions(+), 4 deletions(-)
> > 
> > 
> > base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-19  7:55   ` Rakie Kim
@ 2026-03-20 16:56     ` Jonathan Cameron
  2026-03-24  5:35       ` Rakie Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2026-03-20 16:56 UTC (permalink / raw)
  To: Rakie Kim
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	Keith Busch


> > > 
> > > To make this possible, the system requires a mechanism to understand
> > > the physical topology. The existing NUMA distance model provides only
> > > relative latency values between nodes and lacks any notion of
> > > structural grouping such as socket boundaries. This is especially
> > > problematic for CXL memory nodes, which appear without an explicit
> > > socket association.  
> > 
> > So in a general sense, the missing info here is effectively the same
> > stuff we are missing from the HMAT presentation (it's there in the
> > table and it's there to compute in CXL cases) just because we decided
> > not to surface anything other than distances to memory from nearest
> > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > at last LSFMM. To me that's just a bit of engineering work that needs
> > doing now that we have proven use cases for the data. Mostly it's figuring out
> > the presentation to userspace and kernel data structures as it's a
> > lot of data in a big system (typically at least 32 NUMA nodes).
> >   
> 
> Hearing about the discussion on exposing HMAT data is very welcome news.
> Because this detailed topology information is not yet fully exposed to
> the kernel and userspace, I used a temporary package-based restriction.
> Figuring out how to expose and integrate this data into the kernel data
> structures is indeed a crucial engineering task we need to solve.
> 
> Actually, when I first started this work, I considered fetching the
> topology information from HMAT before adopting the current approach.
> However, I encountered a firmware issue on my test systems
> (Granite Rapids and Sierra Forest).
> 
> Although each socket has its own locally attached CXL device, the HMAT
> only registers node1 (Socket 1) as the initiator for both CXL memory
> nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> both node2 and node3 only expose node1.

Do you mean the Memory Proximity Domain Attributes Structure has
the "Proximity Domain for the Attached Initiator" set wrong?
Was this for its presentation of the full path to CXL mem nodes, or
to a PXM with a generic port?  Sounds like you have SRAT covering
the CXL mem so ideal would be to have the HMAT data to GP and to
the CXL PXMs that BIOS has set up.

Either way having that set at all for CXL memory is fishy as it's about
where the 'memory controller' is and on CXL mem that should be at the
device end of the link.  My understanding is that it was only meant
to be set when you have separate memory-only nodes where the physical
controller is in a particular other node (e.g. what you do
if you have a CPU with DRAM and HBM).  Maybe we need to make the
kernel warn + ignore that if it is set to something odd like yours.

> 
> Even though the distance map shows node2 is physically closer to
> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> routing path strictly through Socket 1. Because the HMAT alone made it
> difficult to determine the exact physical socket connections on these
> systems, I ended up using the current CXL driver-based approach.

Are the HMAT latencies and bandwidths all there?  Or are some missing
and you have to use SLIT (which generally is garbage for historical
reasons of tuning SLIT to particular OS behaviour).

> 
> I wonder if others have experienced similar broken HMAT cases with CXL.
> If HMAT information becomes more reliable in the future, we could
> build a much more efficient structure.

Given it's being lightly used I suspect there will be many bugs :(
I hope we can assume they will get fixed however!

...

> 
> The complex topology cases you presented, such as multi-NUMA per socket,
> shared CXL switches, and IO expanders, are very important points.
> I clearly understand that the simple package-level grouping does not fully
> reflect the 1:1 relationship in these future hardware architectures.
> 
> I have also thought about the shared CXL switch scenario you mentioned,
> and I know the current design falls short in addressing it properly.
> While the current implementation starts with a simple socket-local
> restriction, I plan to evolve it into a more flexible node aggregation
> model to properly reflect all the diverse topologies you suggested.

If we can ensure it fails cleanly when it finds a topology that it can't
cope with (and I guess falls back to current) then I'm fine with a partial
solution that evolves.


> 
> Thanks again for your time and review.

You are welcome.

Thanks

Jonathan

> 
> Rakie Kim
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-20 16:56     ` Jonathan Cameron
@ 2026-03-24  5:35       ` Rakie Kim
  2026-03-25 12:33         ` Jonathan Cameron
  0 siblings, 1 reply; 18+ messages in thread
From: Rakie Kim @ 2026-03-24  5:35 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, harry.yoo, lsf-pc, kernel_team,
	honggyu.kim, yunjeong.mun, Keith Busch, Rakie Kim

On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> 
> > > > 
> > > > To make this possible, the system requires a mechanism to understand
> > > > the physical topology. The existing NUMA distance model provides only
> > > > relative latency values between nodes and lacks any notion of
> > > > structural grouping such as socket boundaries. This is especially
> > > > problematic for CXL memory nodes, which appear without an explicit
> > > > socket association.  
> > > 
> > > So in a general sense, the missing info here is effectively the same
> > > stuff we are missing from the HMAT presentation (it's there in the
> > > table and it's there to compute in CXL cases) just because we decided
> > > not to surface anything other than distances to memory from nearest
> > > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > doing now that we have proven use cases for the data. Mostly it's figuring out
> > > the presentation to userspace and kernel data structures as it's a
> > > lot of data in a big system (typically at least 32 NUMA nodes).
> > >   
> > 
> > Hearing about the discussion on exposing HMAT data is very welcome news.
> > Because this detailed topology information is not yet fully exposed to
> > the kernel and userspace, I used a temporary package-based restriction.
> > Figuring out how to expose and integrate this data into the kernel data
> > structures is indeed a crucial engineering task we need to solve.
> > 
> > Actually, when I first started this work, I considered fetching the
> > topology information from HMAT before adopting the current approach.
> > However, I encountered a firmware issue on my test systems
> > (Granite Rapids and Sierra Forest).
> > 
> > Although each socket has its own locally attached CXL device, the HMAT
> > only registers node1 (Socket 1) as the initiator for both CXL memory
> > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > both node2 and node3 only expose node1.
> 
> Do you mean the Memory Proximity Domain Attributes Structure has
> the "Proximity Domain for the Attached Initiator" set wrong?
> Was this for its presentation of the full path to CXL mem nodes, or
> to a PXM with a generic port?  Sounds like you have SRAT covering
> the CXL mem so ideal would be to have the HMAT data to GP and to
> the CXL PXMs that BIOS has set up.
> 
> Either way having that set at all for CXL memory is fishy as it's about
> where the 'memory controller' is and on CXL mem that should be at the
> device end of the link.  My understanding is that it was only meant
> to be set when you have separate memory-only nodes where the physical
> controller is in a particular other node (e.g. what you do
> if you have a CPU with DRAM and HBM).  Maybe we need to make the
> kernel warn + ignore that if it is set to something odd like yours.
> 

Hello Jonathan,

Your insight is incredibly accurate. To clarify the situation, here is
the actual configuration of my system:

NODE   Type          PXM
node0  local memory  0x00
node1  local memory  0x01
node2  cxl memory    0x0A
node3  cxl memory    0x0B

Physically, the node2 CXL is attached to node0 (Socket 0), and the
node3 CXL is attached to node1 (Socket 1). However, extracting the
HMAT.dsl reveals the following:

- local memory
  [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
         Attached Initiator Proximity Domain: 0x00
         Memory Proximity Domain: 0x00
  [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
         Attached Initiator Proximity Domain: 0x01
         Memory Proximity Domain: 0x01

- cxl memory
  [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
         Attached Initiator Proximity Domain: 0x00
         Memory Proximity Domain: 0x0A
  [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
         Attached Initiator Proximity Domain: 0x00
         Memory Proximity Domain: 0x0B

As you correctly suspected, the flags for the CXL memory are 0000,
meaning the Processor Proximity Domain is marked as invalid. But when
checking the sysfs initiator configurations, it shows a different story:

Node   access0 Initiator  access1 Initiator
node0  node0              node0
node1  node1              node1
node2  node1              node1
node3  node1              node1

Although the Attached Initiator is set to 0 in HMAT with an invalid
flag, sysfs strangely registers node1 as the initiator for both CXL
nodes. Because both HMAT and sysfs are exposing abnormal values, it was
impossible for me to determine the true socket connections for CXL
using this data.
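One way to make such firmware inconsistencies machine-detectable is to cross-check the advertised initiator against the distance map, which (as noted above) does show node2 closer to socket 0. Here is a hypothetical sanity check, not part of the posted series; the distance values in the test are illustrative, chosen only to be consistent with "node2 is closer to node0":

```c
/*
 * Hypothetical consistency check: the initiator advertised for a
 * memory node would normally be the CPU node at minimum distance in
 * its SLIT row.  Returns 1 when the two disagree, as they do for
 * node2 on the firmware described above (advertised node1, but node0
 * is closer).
 */
static int initiator_suspect(const int *dist_row, const int *cpu_nodes,
                             int nr_cpu_nodes, int advertised)
{
    int i, best = cpu_nodes[0];

    for (i = 1; i < nr_cpu_nodes; i++)
        if (dist_row[cpu_nodes[i]] < dist_row[best])
            best = cpu_nodes[i];
    return best != advertised;
}
```

A kernel-side version of this kind of check could back the "warn + ignore" behaviour suggested earlier in the thread.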

> > 
> > Even though the distance map shows node2 is physically closer to
> > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > routing path strictly through Socket 1. Because the HMAT alone made it
> > difficult to determine the exact physical socket connections on these
> > systems, I ended up using the current CXL driver-based approach.
> 
> Are the HMAT latencies and bandwidths all there?  Or are some missing
> and you have to use SLIT (which generally is garbage for historical
> reasons of tuning SLIT to particular OS behaviour).
> 

The HMAT latencies and bandwidths are present, but the values seem
broken. Here is the latency table:

Init->Target | node0 | node1 | node2 | node3
node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
node1        | 0x89F | 0x38B | 0x3AFC| 0x4268

I used the identical type of DRAM and CXL memory for both sockets.
However, looking at the table, the local CXL access latency from
node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
unjustified difference. This asymmetry indicates that the table is
currently unreliable on these systems.
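Since the two links use identical devices, their HMAT latencies should be close; a simple relative-difference check makes the breakage above mechanically detectable. This is a hypothetical helper (the 20% threshold in the test is an arbitrary illustrative choice), sketched here only to show the comparison:

```c
/*
 * Flag a pair of latencies that differ by more than @pct percent.
 * Integer-only so it could run in kernel context; for the values
 * above, 0x9C4 (2500) vs 0x4268 (17000) is flagged at any sane
 * threshold.
 */
static int latency_mismatch(unsigned int a, unsigned int b,
                            unsigned int pct)
{
    unsigned int hi = a > b ? a : b;
    unsigned int lo = a > b ? b : a;

    return (hi - lo) * 100 > lo * pct;
}
```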

> > 
> > I wonder if others have experienced similar broken HMAT cases with CXL.
> > If HMAT information becomes more reliable in the future, we could
> > build a much more efficient structure.
> 
> Given it's being lightly used I suspect there will be many bugs :(
> I hope we can assume they will get fixed however!
> 
> ...
> 

The most critical issue caused by this broken initiator setting is that
topology analysis tools like `hwloc` are completely misled. Currently,
`hwloc` displays both CXL nodes as being attached to Socket 1.

I observed this exact same issue on both Sierra Forest and Granite
Rapids systems. I believe this broken topology exposure is a severe
problem that must be addressed, though I am not entirely sure what the
best fix would be yet. I would love to hear your thoughts on this.

> > 
> > The complex topology cases you presented, such as multi-NUMA per socket,
> > shared CXL switches, and IO expanders, are very important points.
> > I clearly understand that the simple package-level grouping does not fully
> > reflect the 1:1 relationship in these future hardware architectures.
> > 
> > I have also thought about the shared CXL switch scenario you mentioned,
> > and I know the current design falls short in addressing it properly.
> > While the current implementation starts with a simple socket-local
> > restriction, I plan to evolve it into a more flexible node aggregation
> > model to properly reflect all the diverse topologies you suggested.
> 
> If we can ensure it fails cleanly when it finds a topology that it can't
> cope with (and I guess falls back to current) then I'm fine with a partial
> solution that evolves.
> 

I completely agree with ensuring a clean failure. To stabilize this
partial solution, I am currently considering a few options for the
next version:

1. Enable this feature only when a strict 1:1 topology is detected.
2. Provide a sysfs knob allowing users to enable/disable it.
3. Allow users to manually override/configure the topology via sysfs.
4. Implement dynamic fallback behaviors depending on the detected
   topology shape (needs further thought).
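Option 1 above amounts to a topology validation pass at setup time. A hypothetical sketch of that gate, assuming the registration path from patch 3 can report which socket each CXL node was bound to (the `cxl_socket[]` input here is invented for illustration):

```c
#define MAX_SOCKETS 8

/*
 * Sketch of option 1: enable the socket-aware path only when a strict
 * 1:1 socket-to-CXL-node topology was detected, otherwise report 0 so
 * the policy can fall back to plain weighted interleave.
 */
static int strict_one_to_one(const int *cxl_socket, int nr_cxl,
                             int nr_sockets)
{
    int seen[MAX_SOCKETS] = { 0 };   /* per-socket CXL node count */
    int i;

    if (nr_cxl != nr_sockets || nr_sockets > MAX_SOCKETS)
        return 0;
    for (i = 0; i < nr_cxl; i++) {
        if (cxl_socket[i] < 0 || cxl_socket[i] >= nr_sockets)
            return 0;                /* unresolved socket binding */
        if (seen[cxl_socket[i]]++)
            return 0;                /* two CXL nodes share a socket */
    }
    return 1;
}
```

The shared-switch/MLD topology discussed earlier (both CXL nodes behind one socket) would fail this check and cleanly keep the current behaviour.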

By the way, when I first posted this RFC, I accidentally missed adding
lsf-pc@lists.linux-foundation.org to the CC list. I am considering
re-posting it to ensure it reaches the lsf-pc.

Thanks again for your profound insights and time. It is tremendously
helpful.

Rakie Kim

> 
> > 
> > Thanks again for your time and review.
> 
> You are welcome.
> 
> Thanks
> 
> Jonathan
> 
> > 
> > Rakie Kim
> > 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-24  5:35       ` Rakie Kim
@ 2026-03-25 12:33         ` Jonathan Cameron
  2026-03-26  8:54           ` Rakie Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2026-03-25 12:33 UTC (permalink / raw)
  To: Rakie Kim
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, harry.yoo, lsf-pc, kernel_team,
	honggyu.kim, yunjeong.mun, Keith Busch

On Tue, 24 Mar 2026 14:35:45 +0900
Rakie Kim <rakie.kim@sk.com> wrote:

> On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >   
> > > > > 
> > > > > To make this possible, the system requires a mechanism to understand
> > > > > the physical topology. The existing NUMA distance model provides only
> > > > > relative latency values between nodes and lacks any notion of
> > > > > structural grouping such as socket boundaries. This is especially
> > > > > problematic for CXL memory nodes, which appear without an explicit
> > > > > socket association.    
> > > > 
> > > > So in a general sense, the missing info here is effectively the same
> > > > stuff we are missing from the HMAT presentation (it's there in the
> > > > table and it's there to compute in CXL cases) just because we decided
> > > > not to surface anything other than distances to memory from nearest
> > > > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > > the presentation to userspace and kernel data structures as it's a
> > > > lot of data in a big system (typically at least 32 NUMA nodes).
> > > >     
> > > 
> > > Hearing about the discussion on exposing HMAT data is very welcome news.
> > > Because this detailed topology information is not yet fully exposed to
> > > the kernel and userspace, I used a temporary package-based restriction.
> > > Figuring out how to expose and integrate this data into the kernel data
> > > structures is indeed a crucial engineering task we need to solve.
> > > 
> > > Actually, when I first started this work, I considered fetching the
> > > topology information from HMAT before adopting the current approach.
> > > However, I encountered a firmware issue on my test systems
> > > (Granite Rapids and Sierra Forest).
> > > 
> > > Although each socket has its own locally attached CXL device, the HMAT
> > > only registers node1 (Socket 1) as the initiator for both CXL memory
> > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > > both node2 and node3 only expose node1.  
> > 
> > Do you mean the Memory Proximity Domain Attributes Structure has
> > the "Proximity Domain for the Attached Initiator" set wrong?
> > Was this for it's presentation of the full path to CXL mem nodes, or
> > to a PXM with a generic port?  Sounds like you have SRAT covering
> > the CXL mem so ideal would be to have the HMAT data to GP and to
> > the CXL PXMs that BIOS has set up.
> > 
> > Either way having that set at all for CXL memory is fishy as it's about
> > where the 'memory controller' is and on CXL mem that should be at the
> > device end of the link.  My understanding is that it was only meant
> > to be set when you have separate memory only Nodes where the physical
> > controller is in a particular other node (e.g. what you do
> > if you have a CPU with DRAM and HBM).  Maybe we need to make the
> > kernel warn + ignore that if it is set to something odd like yours.
> >   
> 
> Hello Jonathan,
> 
> Your insight is incredibly accurate. To clarify the situation, here is
> the actual configuration of my system:
> 
> NODE   Type          PXM
> node0  local memory  0x00
> node1  local memory  0x01
> node2  cxl memory    0x0A
> node3  cxl memory    0x0B
> 
> Physically, the node2 CXL is attached to node0 (Socket 0), and the
> node3 CXL is attached to node1 (Socket 1). However, extracting the
> HMAT.dsl reveals the following:
> 
> - local memory
>   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x00
>   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x01
>          Memory Proximity Domain: 0x01
> 
> - cxl memory
>   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0A
>   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0B

That's faintly amusing given it conveys no information at all.
Still, unless we have a bug it shouldn't cause anything odd.

> 
> As you correctly suspected, the flags for the CXL memory are 0000,
> meaning the Processor Proximity Domain is marked as invalid. But when
> checking the sysfs initiator configurations, it shows a different story:
> 
> Node   access0 Initiator  access1 Initiator
> node0  node0              node0
> node1  node1              node1
> node2  node1              node1
> node3  node1              node1
> 
> Although the Attached Initiator is set to 0 in HMAT with an invalid
> flag, sysfs strangely registers node1 as the initiator for both CXL
> nodes.
Been a while since I looked at the HMAT parser...

If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain()
shouldn't set the target. At the end of that function it should be set to PXM_INVALID.

It should therefore retain the state from alloc_memory_initiator() I think?

Given I did all my testing without the PD_VALID set (as it wasn't on my
test system) it should be fine with that.  Anyhow, let's look at the data
for proximity.



> Because both HMAT and sysfs are exposing abnormal values, it was
> impossible for me to determine the true socket connections for CXL
> using this data.
> 
> > > 
> > > Even though the distance map shows node2 is physically closer to
> > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > difficult to determine the exact physical socket connections on these
> > > systems, I ended up using the current CXL driver-based approach.  
> > 
> > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > and you have to use SLIT (which generally is garbage for historical
> > reasons of tuning SLIT to particular OS behaviour).
> >   
> 
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
> 
> Init->Target | node0 | node1 | node2 | node3
> node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC | 0x4268

Yeah. That would do it...  Looks like that final value is garbage.

> 
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.

Poke your favourite bios vendor I guess.

I asked one of the Intel folks to take a look and see if this is a broader issue
or just one particular bios.

> 
> > > 
> > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > If HMAT information becomes more reliable in the future, we could
> > > build a much more efficient structure.  
> > 
> > Given it's being lightly used I suspect there will be many bugs :(
> > I hope we can assume they will get fixed however!
> > 
> > ...
> >   
> 
> The most critical issue caused by this broken initiator setting is that
> topology analysis tools like `hwloc` are completely misled. Currently,
> `hwloc` displays both CXL nodes as being attached to Socket 1.
> 
> I observed this exact same issue on both Sierra Forest and Granite
> Rapids systems. I believe this broken topology exposure is a severe
> problem that must be addressed, though I am not entirely sure what the
> best fix would be yet. I would love to hear your thoughts on this.

Fix the bios.  If you don't mind, can you provide dumps of
cat /sys/firmware/acpi/tables/HMAT  just so we can check there is nothing
wrong with the parser.

> 
> > > 
> > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > shared CXL switches, and IO expanders, are very important points.
> > > I clearly understand that the simple package-level grouping does not fully
> > > reflect the 1:1 relationship in these future hardware architectures.
> > > 
> > > I have also thought about the shared CXL switch scenario you mentioned,
> > > and I know the current design falls short in addressing it properly.
> > > While the current implementation starts with a simple socket-local
> > > restriction, I plan to evolve it into a more flexible node aggregation
> > > model to properly reflect all the diverse topologies you suggested.  
> > 
> > If we can ensure it fails cleanly when it finds a topology that it can't
> > cope with (and I guess falls back to current) then I'm fine with a partial
> > solution that evolves.
> >   
> 
> I completely agree with ensuring a clean failure. To stabilize this
> partial solution, I am currently considering a few options for the
> next version:
> 
> 1. Enable this feature only when a strict 1:1 topology is detected.
Definitely default to off.  Maybe allow a user to say they want to do it
anyway. I can see there might be systems that are only a tiny bit off and
it makes no practical difference.

> 2. Provide a sysfs allowing users to enable/disable it.
Makes sense.
> 3. Allow users to manually override/configure the topology via sysfs.

No.  If people are in this state we should apply fixes to the HMAT table
either by injection of real data or some quirking.  If we add userspace
control via simpler means the motivation for people to fix bios goes out
the window and it never gets resolved.

> 4. Implement dynamic fallback behaviors depending on the detected
>    topology shape (needs further thought).

That would be interesting. But maybe not a 1st version thing :)

> 
> By the way, when I first posted this RFC, I accidentally missed adding
> lsf-pc@lists.linux-foundation.org to the CC list. I am considering
> re-posting it to ensure it reaches the lsf-pc.

Makes sense. Make sure to add a back link to this thread so the
discussion already going on is visible.
> 
> Thanks again for your profound insights and time. It is tremendously
> helpful.

Thanks to you for starting to solve the problem!

J
> 
> Rakie Kim
> 
> >   
> > > 
> > > Thanks again for your time and review.  
> > 
> > You are welcome.
> > 
> > Thanks
> > 
> > Jonathan
> >   
> > > 
> > > Rakie Kim
> > >   
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
  2026-03-25 12:33         ` Jonathan Cameron
@ 2026-03-26  8:54           ` Rakie Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Rakie Kim @ 2026-03-26  8:54 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: akpm, gourry, linux-mm, linux-kernel, linux-cxl, ziy,
	matthew.brost, joshua.hahnjy, byungchul, ying.huang, apopple,
	david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, dave, dave.jiang, alison.schofield, vishal.l.verma,
	ira.weiny, dan.j.williams, kernel_team, honggyu.kim, yunjeong.mun,
	Keith Busch, Rakie Kim

On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Tue, 24 Mar 2026 14:35:45 +0900
> Rakie Kim <rakie.kim@sk.com> wrote:
> 
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> > >   
> > > > > > 
> > > > > > To make this possible, the system requires a mechanism to understand
> > > > > > the physical topology. The existing NUMA distance model provides only
> > > > > > relative latency values between nodes and lacks any notion of
> > > > > > structural grouping such as socket boundaries. This is especially
> > > > > > problematic for CXL memory nodes, which appear without an explicit
> > > > > > socket association.    
> > > > > 
> > > > > So in a general sense, the missing info here is effectively the same
> > > > > stuff we are missing from the HMAT presentation (it's there in the
> > > > > table and it's there to compute in CXL cases) just because we decided
> > > > > not to surface anything other than distances to memory from nearest
> > > > > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > > > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > > > the presentation to userspace and kernel data structures as it's a
> > > > > lot of data in a big system (typically at least 32 NUMA nodes).
> > > > >     
> > > > 
> > > > Hearing about the discussion on exposing HMAT data is very welcome news.
> > > > Because this detailed topology information is not yet fully exposed to
> > > > the kernel and userspace, I used a temporary package-based restriction.
> > > > Figuring out how to expose and integrate this data into the kernel data
> > > > structures is indeed a crucial engineering task we need to solve.
> > > > 
> > > > Actually, when I first started this work, I considered fetching the
> > > > topology information from HMAT before adopting the current approach.
> > > > However, I encountered a firmware issue on my test systems
> > > > (Granite Rapids and Sierra Forest).
> > > > 
> > > > Although each socket has its own locally attached CXL device, the HMAT
> > > > only registers node1 (Socket 1) as the initiator for both CXL memory
> > > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > > > both node2 and node3 only expose node1.  
> > > 
> > > Do you mean the Memory Proximity Domain Attributes Structure has
> > > the "Proximity Domain for the Attached Initiator" set wrong?
> > > Was this for its presentation of the full path to CXL mem nodes, or
> > > to a PXM with a generic port?  Sounds like you have SRAT covering
> > > the CXL mem so ideal would be to have the HMAT data to GP and to
> > > the CXL PXMs that BIOS has set up.
> > > 
> > > Either way having that set at all for CXL memory is fishy as it's about
> > > where the 'memory controller' is and on CXL mem that should be at the
> > > device end of the link.  My understanding is that it was only meant
> > > to be set when you have separate memory only Nodes where the physical
> > > controller is in a particular other node (e.g. what you do
> > > if you have a CPU with DRAM and HBM).  Maybe we need to make the
> > > kernel warn + ignore that if it is set to something odd like yours.
> > >   
> > 
> > Hello Jonathan,
> > 
> > Your insight is incredibly accurate. To clarify the situation, here is
> > the actual configuration of my system:
> > 
> > NODE   Type          PXM
> > node0  local memory  0x00
> > node1  local memory  0x01
> > node2  cxl memory    0x0A
> > node3  cxl memory    0x0B
> > 
> > Physically, the node2 CXL is attached to node0 (Socket 0), and the
> > node3 CXL is attached to node1 (Socket 1). However, extracting the
> > HMAT.dsl reveals the following:
> > 
> > - local memory
> >   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x00
> >   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x01
> >          Memory Proximity Domain: 0x01
> > 
> > - cxl memory
> >   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0A
> >   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0B
> 
> That's faintly amusing given it conveys no information at all.
> Still, unless we have a bug it shouldn't cause anything odd.
> 
> > 
> > As you correctly suspected, the flags for the CXL memory are 0000,
> > meaning the Processor Proximity Domain is marked as invalid. But when
> > checking the sysfs initiator configurations, it shows a different story:
> > 
> > Node   access0 Initiator  access1 Initiator
> > node0  node0              node0
> > node1  node1              node1
> > node2  node1              node1
> > node3  node1              node1
> > 
> > Although the Attached Initiator is set to 0 in HMAT with an invalid
> > flag, sysfs strangely registers node1 as the initiator for both CXL
> > nodes.
> Been a while since I looked at the HMAT parser...
> 
> If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain()
> shouldn't set the target. At the end of that function it should be set to PXM_INVALID.
> 
> It should therefore retain the state from alloc_memory_initiator() I think?
> 
> Given I did all my testing without the PD_VALID set (as it wasn't on my
> test system) it should be fine with that.  Anyhow, let's look at the data
> for proximity.
> 
> 

Hello Jonathan,

Thank you for the deep insight into the HMAT parser code. As you
mentioned, considering the current state where node 1 is still
registered as the initiator in sysfs despite the flag being 0, it
seems highly likely that the kernel parser logic is not handling
this specific situation gracefully.

> 
> > Because both HMAT and sysfs are exposing abnormal values, it was
> > impossible for me to determine the true socket connections for CXL
> > using this data.
> > 
> > > > 
> > > > Even though the distance map shows node2 is physically closer to
> > > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > > difficult to determine the exact physical socket connections on these
> > > > systems, I ended up using the current CXL driver-based approach.  
> > > 
> > > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > > and you have to use SLIT (which generally is garbage for historical
> > > reasons of tuning SLIT to particular OS behaviour).
> > >   
> > 
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> > 
> > Init->Target | node0 | node1 | node2 | node3
> > node0        | 0x38B | 0x89F | 0x9C4 | 0x3AFC
> > node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
> 
> Yeah. That would do it...  Looks like that final value is garbage.
> 
> > 
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
> 
> Poke your favourite bios vendor I guess.
> 
> I asked one of the Intel folks to take a look and see if this is a broader issue
> or just one particular bios.
> 

I really appreciate you reaching out to the Intel contact to check if
this is a broader platform issue. I will also try to find a way to
report this BIOS issue to our system vendor, though I might need to
figure out the proper channel since I am not the system administrator.

Regarding the HMAT dump you requested, how should I provide it to you?
Would a hex dump converted via a utility like `xxd` be acceptable,
something like the snippet below?

00000000: 484d 4154 6806 0000 026a 4742 5420 2020  HMATh....jGBT
00000010: 4742 5455 4143 5049 0920 0701 414d 4920  GBTUACPI. ..AMI
00000020: 2806 2320 0000 0000 0000 0000 2800 0000  (.# ........(...
00000030: 0100 0000 0000 0000 0000 0000 0000 0000  ................

> > 
> > > > 
> > > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > > If HMAT information becomes more reliable in the future, we could
> > > > build a much more efficient structure.  
> > > 
> > > Given it's being lightly used I suspect there will be many bugs :(
> > > I hope we can assume they will get fixed however!
> > > 
> > > ...
> > >   
> > 
> > The most critical issue caused by this broken initiator setting is that
> > topology analysis tools like `hwloc` are completely misled. Currently,
> > `hwloc` displays both CXL nodes as being attached to Socket 1.
> > 
> > I observed this exact same issue on both Sierra Forest and Granite
> > Rapids systems. I believe this broken topology exposure is a severe
> > problem that must be addressed, though I am not entirely sure what the
> > best fix would be yet. I would love to hear your thoughts on this.
> 
> Fix the bios.  If you don't mind, can you provide dumps of
> cat /sys/firmware/acpi/tables/HMAT  just so we can check there is nothing
> wrong with the parser.
> 
> > 
> > > > 
> > > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > > shared CXL switches, and IO expanders, are very important points.
> > > > I clearly understand that the simple package-level grouping does not fully
> > > > reflect the 1:1 relationship in these future hardware architectures.
> > > > 
> > > > I have also thought about the shared CXL switch scenario you mentioned,
> > > > and I know the current design falls short in addressing it properly.
> > > > While the current implementation starts with a simple socket-local
> > > > restriction, I plan to evolve it into a more flexible node aggregation
> > > > model to properly reflect all the diverse topologies you suggested.  
> > > 
> > > If we can ensure it fails cleanly when it finds a topology that it can't
> > > cope with (and I guess falls back to current) then I'm fine with a partial
> > > solution that evolves.
> > >   
> > 
> > I completely agree with ensuring a clean failure. To stabilize this
> > partial solution, I am currently considering a few options for the
> > next version:
> > 
> > 1. Enable this feature only when a strict 1:1 topology is detected.
> Definitely default to off.  Maybe allow a user to say they want to do it
> anyway. I can see there might be systems that are only a tiny bit off and
> it makes no practical difference.
> 

Your suggestion is very reasonable. I will proceed with this approach
for the next version, keeping the feature disabled by default.

> > 2. Provide a sysfs allowing users to enable/disable it.
> Makes sense.

I will include this sysfs enable/disable feature in the next version.

> > 3. Allow users to manually override/configure the topology via sysfs.
> 
> No.  If people are in this state we should apply fixes to the HMAT table
> either by injection of real data or some quirking.  If we add userspace
> control via simpler means the motivation for people to fix bios goes out
> the window and it never gets resolved.
> 

Your reasoning is absolutely correct. I will not allow users to modify
the topology via sysfs. However, I plan to provide a read-only sysfs
interface so users can at least check the current topology information.

> > 4. Implement dynamic fallback behaviors depending on the detected
> >    topology shape (needs further thought).
> 
> That would be interesting. But maybe not a 1st version thing :)
> 

This is an area I also need to think more deeply about. I will not
include it in the initial version, but will consider implementing it
in the future.

Once again, I deeply appreciate your time, thorough review, and for
reaching out to Intel for further clarification. It is a huge help.

Rakie Kim

 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2026-03-26  8:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
2026-03-16  5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
2026-03-16  5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
2026-03-18 12:22   ` Jonathan Cameron
2026-03-16  5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
2026-03-16  5:12 ` [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality Rakie Kim
2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
2026-03-17  9:50   ` Rakie Kim
2026-03-16 15:19 ` Joshua Hahn
2026-03-16 19:45   ` Gregory Price
2026-03-17 11:50     ` Rakie Kim
2026-03-17 11:36   ` Rakie Kim
2026-03-18 12:02 ` Jonathan Cameron
2026-03-19  7:55   ` Rakie Kim
2026-03-20 16:56     ` Jonathan Cameron
2026-03-24  5:35       ` Rakie Kim
2026-03-25 12:33         ` Jonathan Cameron
2026-03-26  8:54           ` Rakie Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox