public inbox for linux-mm@kvack.org
From: Rakie Kim <rakie.kim@sk.com>
To: akpm@linux-foundation.org
Cc: gourry@gourry.net, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, dan.j.williams@intel.com,
	kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
	rakie.kim@sk.com
Subject: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Date: Mon, 16 Mar 2026 14:12:48 +0900
Message-ID: <20260316051258.246-1-rakie.kim@sk.com>

This patch series is an RFC proposing a socket-aware weighted
interleave mechanism. Several areas still need refinement, so the
primary goal at this stage is to gather feedback on the architectural
approach rather than on fine-grained implementation details.

Weighted interleave distributes page allocations across multiple nodes
according to configured weights. However, the current implementation
applies a single global weight vector regardless of where a task runs.
On multi-socket systems this ignores inter-socket interconnect costs,
so the configured weights do not reflect the bandwidth each CPU
actually observes. To address this, we propose a socket-aware approach
that restricts the candidate nodes to the task's local socket before
applying the weights.

Consider a dual-socket system:

          node0             node1
        +-------+         +-------+
        | CPU 0 |---------| CPU 1 |
        +-------+         +-------+
        | DRAM0 |         | DRAM1 |
        +---+---+         +---+---+
            |                 |
        +---+---+         +---+---+
        | CXL 0 |         | CXL 1 |
        +-------+         +-------+
          node2             node3

Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
the effective bandwidth varies significantly from the perspective of
each CPU due to inter-socket interconnect penalties.

Local device capabilities (GB/s) vs. cross-socket effective bandwidth:

         0     1     2     3
CPU 0  300   150   100    50
CPU 1  150   300    50   100

A reasonable global weight vector reflecting the base capabilities is:

     node0=3 node1=3 node2=1 node3=1

However, because these configured node weights do not account for
interconnect degradation between sockets, applying them flatly to all
sources yields the following effective map from each CPU's perspective:

         0     1     2     3
CPU 0    3     3     1     1
CPU 1    3     3     1     1

Because the flat map ignores the interconnect penalty (e.g.,
node0->node1 drops from 300 to 150 GB/s, node0->node3 from 100 to
50 GB/s), it steers allocations to nodes whose effective bandwidth no
longer matches their configured weight.

This series makes weighted interleave socket-aware: before the weights
are applied, the candidate nodes are restricted to the current socket,
and only if no eligible local nodes remain does the policy fall back to
the wider set.

With the global weights left unchanged:

     node0=3 node1=3 node2=1 node3=1

the effective map seen from each CPU becomes:

         0     1     2     3
CPU 0    3     0     1     0
CPU 1    0     3     0     1

Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
effective bandwidth, preserves NUMA locality, and reduces cross-socket
traffic.

To make this possible, the system requires a mechanism to understand
the physical topology. The existing NUMA distance model provides only
relative latency values between nodes and lacks any notion of
structural grouping such as socket boundaries. This is especially
problematic for CXL memory nodes, which appear without an explicit
socket association.

This patch series introduces a socket-aware topology management layer
that groups NUMA nodes according to their physical package. It
explicitly links CPU and memory-only nodes (such as CXL) under the
same socket using an initiator CPU node. This captures the true
hardware hierarchy rather than relying solely on flat distance values.


[Experimental Results]

System Configuration:
- Processor: Dual-Socket Intel Xeon 6980P (Granite Rapids)

               node0                       node1
             +-------+                   +-------+
             | CPU 0 |-------------------| CPU 1 |
             +-------+                   +-------+
12 Channels  | DRAM0 |                   | DRAM1 |  12 Channels
DDR5-6400    +---+---+                   +---+---+  DDR5-6400
                 |                           |
             +---+---+                   +---+---+
8 Channels   | CXL 0 |                   | CXL 1 |  8 Channels
DDR5-6400    +-------+                   +-------+  DDR5-6400
               node2                       node3

1) Throughput (System Bandwidth)
   - DRAM Only: 966 GB/s
   - Weighted Interleave: 903 GB/s (7% decrease compared to DRAM Only)
   - Socket-Aware Weighted Interleave: 1329 GB/s (1.33 TB/s)
     (38% increase compared to DRAM Only,
      47% increase compared to Weighted Interleave)

2) Loaded Latency (Under High Bandwidth)
   - DRAM Only: 544 ns
   - Weighted Interleave: 545 ns
   - Socket-Aware Weighted Interleave: 436 ns
     (20% reduction compared to both)


[Additional Considerations]

Please note that this series modifies the CXL driver to register CXL
memory nodes with the topology layer; the necessity and shape of these
driver-side changes need further discussion. Additionally, the
topology layer was originally designed to support both memory tiering
and weighted interleave, but only the weighted interleave policy uses
it so far, so several functions it exposes are not exercised in this
RFC. Unused portions will be cleaned up and removed in the final patch
submission.

Summary of patches:

  [PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask()
  This patch adds a new NUMA helper function to find all nodes in a
  given nodemask that share the minimum distance from a specified
  source node.

  [PATCH 2/4] mm/memory-tiers: introduce socket-aware topology mgmt
  This patch introduces a management layer that groups NUMA nodes by
  their physical package (socket). It forms a "memory package" to
  abstract real hardware locality for predictable NUMA memory
  management.

  [PATCH 3/4] mm/memory-tiers: register CXL nodes to socket packages
  This patch implements a registration path to bind CXL memory nodes
  to a socket-aware memory package using an initiator CPU node. This
  ensures CXL nodes are deterministically grouped with the CPUs they
  service.

  [PATCH 4/4] mm/mempolicy: enhance weighted interleave with locality
  This patch modifies the weighted interleave policy to restrict
  candidate nodes to the current socket before applying weights. It
  reduces cross-socket traffic and aligns memory allocation with
  actual bandwidth.

Any feedback and discussions are highly appreciated.

Thanks

Rakie Kim (4):
  mm/numa: introduce nearest_nodes_nodemask()
  mm/memory-tiers: introduce socket-aware topology management for NUMA
    nodes
  mm/memory-tiers: register CXL nodes to socket-aware packages via
    initiator
  mm/mempolicy: enhance weighted interleave with socket-aware locality

 drivers/cxl/core/region.c    |  46 +++
 drivers/cxl/cxl.h            |   1 +
 drivers/dax/kmem.c           |   2 +
 include/linux/memory-tiers.h |  93 +++++
 include/linux/numa.h         |   8 +
 mm/memory-tiers.c            | 766 +++++++++++++++++++++++++++++++++++
 mm/mempolicy.c               | 135 +++++-
 7 files changed, 1047 insertions(+), 4 deletions(-)


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.34.1


