Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave


From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Rakie Kim <rakie.kim@sk.com>
Cc: <akpm@linux-foundation.org>, <gourry@gourry.net>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>, <ziy@nvidia.com>,
	<matthew.brost@intel.com>, <joshua.hahnjy@gmail.com>,
	<byungchul@sk.com>, <ying.huang@linux.alibaba.com>,
	<apopple@nvidia.com>, <david@kernel.org>,
	<lorenzo.stoakes@oracle.com>, <Liam.Howlett@oracle.com>,
	<vbabka@suse.cz>, <rppt@kernel.org>, <surenb@google.com>,
	<mhocko@suse.com>, <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<ira.weiny@intel.com>, <dan.j.williams@intel.com>,
	<kernel_team@skhynix.com>, <honggyu.kim@sk.com>,
	<yunjeong.mun@sk.com>, Keith Busch <kbusch@kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Date: Fri, 20 Mar 2026 16:56:05 +0000
Message-ID: <20260320165605.000024c0@huawei.com>
In-Reply-To: <20260319075512.309-1-rakie.kim@sk.com>


> > > 
> > > To make this possible, the system requires a mechanism to understand
> > > the physical topology. The existing NUMA distance model provides only
> > > relative latency values between nodes and lacks any notion of
> > > structural grouping such as socket boundaries. This is especially
> > > problematic for CXL memory nodes, which appear without an explicit
> > > socket association.  
> > 
> > So in a general sense, the missing info here is effectively the same
> > stuff we are missing from the HMAT presentation (it's there in the
> > table and it's there to compute in CXL cases) just because we decided
> > not to surface anything other than distances to memory from nearest
> > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > at last LSFMM. To me that's just a bit of engineering work that needs
> > doing now we have proven use cases for the data. Mostly it's figuring out
> > the presentation to userspace and kernel data structures as it's a
> > lot of data in a big system (typically at least 32 NUMA nodes).
> >   
> 
> Hearing about the discussion on exposing HMAT data is very welcome news.
> Because this detailed topology information is not yet fully exposed to
> the kernel and userspace, I used a temporary package-based restriction.
> Figuring out how to expose and integrate this data into the kernel data
> structures is indeed a crucial engineering task we need to solve.
> 
> Actually, when I first started this work, I considered fetching the
> topology information from HMAT before adopting the current approach.
> However, I encountered a firmware issue on my test systems
> (Granite Rapids and Sierra Forest).
> 
> Although each socket has its own locally attached CXL device, the HMAT
> only registers node1 (Socket 1) as the initiator for both CXL memory
> nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> both node2 and node3 only expose node1.

Do you mean the Memory Proximity Domain Attributes Structure has
the "Proximity Domain for the Attached Initiator" set wrong?
Was this for its presentation of the full path to CXL mem nodes, or
to a PXM with a generic port?  Sounds like you have SRAT covering
the CXL mem, so the ideal would be to have the HMAT data to the GP and
to the CXL PXMs that BIOS has set up.

Either way, having that set at all for CXL memory is fishy as it's about
where the 'memory controller' is, and on CXL mem that should be at the
device end of the link.  My understanding is that it was only meant
to be set when you have separate memory-only nodes where the physical
controller is in a particular other node (e.g. what you do
if you have a CPU with DRAM and HBM).  Maybe we need to make the
kernel warn + ignore that if it is set to something odd like yours.
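
For anyone wanting to compare what firmware reported against the physical
wiring, here is a minimal sketch that lists the initiator nodes sysfs
exposes for each NUMA node. It assumes the access0/initiators layout under
/sys/devices/system/node described in the kernel's numaperf documentation.
The function name hmat_initiators() and its parameters are made up for
illustration, and the node_dir argument exists only so the code can be
pointed at a test tree.

```python
import os
import re

NODE_DIR = "/sys/devices/system/node"  # usual sysfs location on Linux


def hmat_initiators(node_dir=NODE_DIR, access="access0"):
    """Map each NUMA node id to the sorted initiator node ids listed in sysfs.

    Nodes without an <access>/initiators directory are skipped, since no
    HMAT-derived data was registered for them.
    """
    result = {}
    if not os.path.isdir(node_dir):
        return result
    for entry in os.listdir(node_dir):
        target = re.fullmatch(r"node(\d+)", entry)
        if not target:
            continue
        init_dir = os.path.join(node_dir, entry, access, "initiators")
        if not os.path.isdir(init_dir):
            continue
        ids = []
        for name in os.listdir(init_dir):
            # Initiators show up as nodeN links next to the perf attributes.
            initiator = re.fullmatch(r"node(\d+)", name)
            if initiator:
                ids.append(int(initiator.group(1)))
        result[int(target.group(1))] = sorted(ids)
    return result
```

On a system like the one described in the quoted text, the returned mapping
would be expected to show node1 as the only initiator for both node2 and
node3.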

> 
> Even though the distance map shows node2 is physically closer to
> Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> routing path strictly through Socket 1. Because the HMAT alone made it
> difficult to determine the exact physical socket connections on these
> systems, I ended up using the current CXL driver-based approach.

Are the HMAT latencies and bandwidths all there?  Or are some missing
and you have to use SLIT (which generally is garbage for historical
reasons of tuning SLIT to particular OS behaviour)?
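
One way to answer that from userspace is sketched below. It assumes the
read/write latency and bandwidth attributes under access0/initiators
(nanoseconds and MiB/s per the kernel's numaperf documentation) and the
per-node "distance" file that carries the SLIT row. The helper names are
hypothetical and node_dir is a parameter only so a fake tree can be used.

```python
import os

NODE_DIR = "/sys/devices/system/node"
ATTRS = ("read_latency", "write_latency", "read_bandwidth", "write_bandwidth")


def hmat_perf_attrs(node, node_dir=NODE_DIR, access="access0"):
    """Return {attribute: int or None} for one target node.

    None means the attribute file is absent or unreadable, i.e. the kernel
    has no HMAT-derived value to show for it.
    """
    base = os.path.join(node_dir, f"node{node}", access, "initiators")
    values = {}
    for attr in ATTRS:
        try:
            with open(os.path.join(base, attr)) as f:
                values[attr] = int(f.read().strip())
        except (OSError, ValueError):
            values[attr] = None
    return values


def missing_perf_attrs(node, node_dir=NODE_DIR, access="access0"):
    """List the attributes that have no value for this node."""
    attrs = hmat_perf_attrs(node, node_dir, access)
    return [name for name in ATTRS if attrs[name] is None]


def slit_distances(node, node_dir=NODE_DIR):
    """Return the node's SLIT distance row as a list of ints, or None."""
    try:
        with open(os.path.join(node_dir, f"node{node}", "distance")) as f:
            return [int(x) for x in f.read().split()]
    except (OSError, ValueError):
        return None
```

If missing_perf_attrs() returns a non-empty list for a CXL node, the SLIT
row from slit_distances() is the only thing left to go on for that node.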

> 
> I wonder if others have experienced similar broken HMAT cases with CXL.
> If HMAT information becomes more reliable in the future, we could
> build a much more efficient structure.

Given it's being lightly used I suspect there will be many bugs :(
I hope we can assume they will get fixed however!

...

> 
> The complex topology cases you presented, such as multi-NUMA per socket,
> shared CXL switches, and IO expanders, are very important points.
> I clearly understand that the simple package-level grouping does not fully
> reflect the 1:1 relationship in these future hardware architectures.
> 
> I have also thought about the shared CXL switch scenario you mentioned,
> and I know the current design falls short in addressing it properly.
> While the current implementation starts with a simple socket-local
> restriction, I plan to evolve it into a more flexible node aggregation
> model to properly reflect all the diverse topologies you suggested.

If we can ensure it fails cleanly when it finds a topology that it can't
cope with (and I guess falls back to the current behaviour) then I'm fine
with a partial solution that evolves.
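
To make "fails cleanly" concrete, here is a rough userspace model of the
idea, not the patch set's code. Everything in it is hypothetical: the
function names, and the two inputs (a map from CPU node to package, and a
map from memory-only node to its initiator nodes). The rule it assumes is
that a memory node is only grouped when all of its initiators sit in
exactly one package; anything else rejects the whole grouping so the
caller can keep using every node as it does today.

```python
def group_nodes_by_package(cpu_node_package, mem_node_initiators):
    """Group nodes per package, or return None if the topology is unsupported.

    cpu_node_package: {cpu_node_id: package_id}
    mem_node_initiators: {memory_only_node_id: [initiator_node_id, ...]}
    """
    groups = {}
    for node, pkg in cpu_node_package.items():
        groups.setdefault(pkg, set()).add(node)

    for mem_node, initiators in mem_node_initiators.items():
        pkgs = set()
        for initiator in initiators:
            if initiator not in cpu_node_package:
                return None  # initiator we know nothing about
            pkgs.add(cpu_node_package[initiator])
        if len(pkgs) != 1:
            return None  # no initiator, or shared across packages
        groups[pkgs.pop()].add(mem_node)
    return groups


def pick_interleave_nodes(local_pkg, all_nodes, groups):
    """Use the package-local group when available, else every node."""
    if groups is None or local_pkg not in groups:
        return set(all_nodes)  # fallback: unchanged, topology-unaware set
    return set(groups[local_pkg])
```

Note that a check like this only catches topologies that look inconsistent.
Firmware that consistently reports the wrong initiator, as in the case
described earlier in this thread, would still produce a grouping, just not
the right one.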


> 
> Thanks again for your time and review.

You are welcome.

Thanks

Jonathan

> 
> Rakie Kim
>


Thread overview: 22+ messages
2026-03-16  5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
2026-03-16  5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
2026-03-16  5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
2026-03-18 12:22   ` Jonathan Cameron
2026-03-16  5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
2026-03-16  5:12 ` [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality Rakie Kim
2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
2026-03-17  9:50   ` Rakie Kim
2026-03-16 15:19 ` Joshua Hahn
2026-03-16 19:45   ` Gregory Price
2026-03-17 11:50     ` Rakie Kim
2026-03-17 11:36   ` Rakie Kim
2026-03-18 12:02 ` Jonathan Cameron
2026-03-19  7:55   ` Rakie Kim
2026-03-20 16:56     ` Jonathan Cameron [this message]
2026-03-24  5:35       ` Rakie Kim
2026-03-25 12:33         ` Jonathan Cameron
2026-03-26  8:54           ` Rakie Kim
2026-03-26 21:41             ` Dave Jiang
2026-03-26 22:19               ` Dave Jiang
2026-03-26 20:13         ` Dan Williams
2026-03-26 22:24         ` Dave Jiang
