From: Rakie Kim <rakie.kim@sk.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: akpm@linux-foundation.org, gourry@gourry.net, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
ziy@nvidia.com, matthew.brost@intel.com, byungchul@sk.com,
ying.huang@linux.alibaba.com, apopple@nvidia.com,
david@kernel.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, dave.jiang@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com,
ira.weiny@intel.com, dan.j.williams@intel.com,
harry.yoo@oracle.com, lsf-pc@lists.linux-foundation.org,
kernel_team@skhynix.com, honggyu.kim@sk.com, yunjeong.mun@sk.com,
Rakie Kim <rakie.kim@sk.com>
Subject: Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Date: Tue, 17 Mar 2026 20:36:53 +0900 [thread overview]
Message-ID: <20260317113656.283-1-rakie.kim@sk.com> (raw)
In-Reply-To: <20260316151933.3093626-1-joshua.hahnjy@gmail.com>
On Mon, 16 Mar 2026 08:19:32 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Hello Rakie! I hope you have been doing well. Thank you for this
> RFC, I think it is a very interesting idea.
Hello Joshua,
I hope you are doing well. Thanks for your review and feedback on this RFC.
>
> [...snip...]
>
> > Consider a dual-socket system:
> >
> >   node0             node1
> > +-------+         +-------+
> > | CPU 0 |---------| CPU 1 |
> > +-------+         +-------+
> > | DRAM0 |         | DRAM1 |
> > +---+---+         +---+---+
> >     |                 |
> > +---+---+         +---+---+
> > | CXL 0 |         | CXL 1 |
> > +-------+         +-------+
> >   node2             node3
> > Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> > the effective bandwidth varies significantly from the perspective of
> > each CPU due to inter-socket interconnect penalties.
> >
> > Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
> >
> >            0     1     2     3
> > CPU 0    300   150   100    50
> > CPU 1    150   300    50   100
> >
> > A reasonable global weight vector reflecting the base capabilities is:
> >
> > node0=3 node1=3 node2=1 node3=1
> >
> > However, because these configured node weights do not account for
> > interconnect degradation between sockets, applying them flatly to all
> > sources yields the following effective map from each CPU's perspective:
> >
> >          0   1   2   3
> > CPU 0    3   3   1   1
> > CPU 1    3   3   1   1
> >
> > This does not account for the interconnect penalty (e.g., node0->node1
> > drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> > that cause a mismatch with actual performance.
> >
> > This patch makes weighted interleave socket-aware. Before weighting is
> > applied, the candidate nodes are restricted to the current socket; only
> > if no eligible local nodes remain does the policy fall back to the
> > wider set.
>
> So when I saw this, I thought the idea was that we would attempt an
> allocation with these socket-aware weights, and upon failure, fall back
> to the global weights that are set so that we can try to fulfill the
> allocation from cross-socket nodes.
>
> However, reading the implementation in 4/4, it seems like what is meant
> by "fallback" here is not in the sense of a fallback allocation, but
> in the sense of "if there is a misconfiguration and the intersection
> between policy nodes and the CPU's package is empty, use the global
> nodes instead".
>
> Am I understanding this correctly?
>
> And, it seems like what this also means is that under sane configurations,
> there is no more cross socket memory allocation, since it will always
> try to fulfill it from the local node.
>
Your analysis of the code in patch 4/4 is exactly correct. I apologize
for using the term "fallback" in the cover letter, which caused some
confusion. As you understood, the current implementation strictly
restricts allocations to the local socket to avoid cross-socket traffic.
> > Even if the configured global weights remain identically set:
> >
> > node0=3 node1=3 node2=1 node3=1
> >
> > The resulting effective map from the perspective of each CPU becomes:
> >
> >          0   1   2   3
> > CPU 0    3   0   1   0
> > CPU 1    0   3   0   1
>
> > Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> > node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> > effective bandwidth, preserves NUMA locality, and reduces cross-socket
> > traffic.
>
> In that sense I thought the word "prefer" was a bit confusing, since I
> thought it would mean that it would try to fulfill the allocations
> from within a packet first, then fall back to remote packets if that
> failed. (Or maybe I am just misunderstanding your explanation. Please
> do let me know if that is the case :-) )
>
> If what I understand is the case, I think this is the same thing as
> just restricting allocations to be socket-local. I also wonder if
> this idea applies to other mempolicies as well (i.e. unweighted interleave)
Again, I apologize for the confusion caused by words like "prefer" and
"fallback" in the commit message. Your understanding is correct; the
current code strictly restricts allocations to the socket-local nodes.
To determine where memory may be allocated within a socket, the code uses
a function named policy_resolve_package_nodes(). As described in the
comments, the logic works as follows:
1. Success case: It tries to use the intersection of the current CPU's
package nodes and the user's preselected policy nodes. If the
intersection is not empty, it uses these local nodes.
2. Failure case: If the intersection is empty (e.g., the user opted out
of the current package), it finds the package of another node in the
policy nodes and takes the intersection again. If this also yields an
empty set, it falls back completely to the original global policy nodes.
This early version does not yet handle every detailed case. Also, as you
pointed out, applying this strict local restriction directly to other
policies such as unweighted interleave might be difficult, since it could
conflict with the original purpose of interleaving. I plan to consider
these aspects further and prepare a more complete design.
>
> I think we should consider what the expected and desirable behavior is
> when one socket is fully saturated but the other socket is empty. In my
> mind this is no different from considering within-packet remote NUMA
> allocations; the tradeoff becomes between reclaiming locally and
> keeping allocations local, vs. skipping reclaiming and consuming
> free memory while eating the remote access latency, similar to
> zone_reclaim mode (packet_reclaim_mode? ;-) )
This is an issue I have been thinking about since the early design phase,
and it must be resolved for this patch series to improve. The trade-off
you describe, between reclaiming locally to keep allocations socket-local
and accepting the latency penalty of allocating from a remote socket, is
exactly the point we need to address. I will continue to think about how
to handle it properly.
>
> In my mind (without doing any benchmarking myself or looking at the numbers)
> I imagine that there are some scenarios where we actually do want cross
> socket allocations, like in the example above when we have very asymmetric
> saturations across sockets. Is this something that could be worth
> benchmarking as well?
Your suggestion is valid and worth considering. I am currently analyzing
the behavior of this feature under various workloads. I will also
consider the asymmetric saturation scenarios you suggested.
>
> I will end by saying that in the normal case (sockets have similar saturation)
> I think this series is a definite win and improvement to weighted interleave.
> I just was curious whether we can handle the worst-case scenarios.
>
> Thank you again for the series. Have a great day!
> Joshua
Thanks again for the review. I will prepare a more carefully considered
design for the next version based on these points.
Rakie Kim
Thread overview: 18+ messages
2026-03-16 5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
2026-03-18 12:22 ` Jonathan Cameron
2026-03-16 5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality Rakie Kim
2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
2026-03-17 9:50 ` Rakie Kim
2026-03-16 15:19 ` Joshua Hahn
2026-03-16 19:45 ` Gregory Price
2026-03-17 11:50 ` Rakie Kim
2026-03-17 11:36 ` Rakie Kim [this message]
2026-03-18 12:02 ` Jonathan Cameron
2026-03-19 7:55 ` Rakie Kim
2026-03-20 16:56 ` Jonathan Cameron
2026-03-24 5:35 ` Rakie Kim
2026-03-25 12:33 ` Jonathan Cameron
2026-03-26 8:54 ` Rakie Kim