Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave

From: Rakie Kim <rakie.kim@sk.com>
To: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: akpm@linux-foundation.org, gourry@gourry.net, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
	ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	byungchul@sk.com, ying.huang@linux.alibaba.com,
	apopple@nvidia.com, david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com, kernel_team@skhynix.com,
	honggyu.kim@sk.com, yunjeong.mun@sk.com,
	Keith Busch <kbusch@kernel.org>, Rakie Kim <rakie.kim@sk.com>
Subject: Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Date: Thu, 26 Mar 2026 17:54:55 +0900
Message-ID: <20260326085501.343-1-rakie.kim@sk.com>
In-Reply-To: <20260325123350.00004d48@huawei.com>

On Wed, 25 Mar 2026 12:33:50 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Tue, 24 Mar 2026 14:35:45 +0900
> Rakie Kim <rakie.kim@sk.com> wrote:
> 
> > On Fri, 20 Mar 2026 16:56:05 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> > >   
> > > > > > 
> > > > > > To make this possible, the system requires a mechanism to understand
> > > > > > the physical topology. The existing NUMA distance model provides only
> > > > > > relative latency values between nodes and lacks any notion of
> > > > > > structural grouping such as socket boundaries. This is especially
> > > > > > problematic for CXL memory nodes, which appear without an explicit
> > > > > > socket association.    
> > > > > 
> > > > > So in a general sense, the missing info here is effectively the same
> > > > > stuff we are missing from the HMAT presentation (it's there in the
> > > > > table and it's there to compute in CXL cases) just because we decided
> > > > > not to surface anything other than distances to memory from nearest
> > > > > > initiator.  I chatted to Joshua and Keith about filling in that stuff
> > > > > at last LSFMM. To me that's just a bit of engineering work that needs
> > > > > doing now we have proven use cases for the data. Mostly it's figuring out
> > > > > the presentation to userspace and kernel data structures as it's a
> > > > > lot of data in a big system (typically at least 32 NUMA nodes).
> > > > >     
> > > > 
> > > > Hearing about the discussion on exposing HMAT data is very welcome news.
> > > > Because this detailed topology information is not yet fully exposed to
> > > > the kernel and userspace, I used a temporary package-based restriction.
> > > > Figuring out how to expose and integrate this data into the kernel data
> > > > structures is indeed a crucial engineering task we need to solve.
> > > > 
> > > > Actually, when I first started this work, I considered fetching the
> > > > topology information from HMAT before adopting the current approach.
> > > > However, I encountered a firmware issue on my test systems
> > > > (Granite Rapids and Sierra Forest).
> > > > 
> > > > Although each socket has its own locally attached CXL device, the HMAT
> > > > only registers node1 (Socket 1) as the initiator for both CXL memory
> > > > nodes (node2 and node3). As a result, the sysfs HMAT initiators for
> > > > both node2 and node3 only expose node1.  
> > > 
> > > Do you mean the Memory Proximity Domain Attributes Structure has
> > > the "Proximity Domain for the Attached Initiator" set wrong?
> > > > Was this for its presentation of the full path to CXL mem nodes, or
> > > to a PXM with a generic port?  Sounds like you have SRAT covering
> > > the CXL mem so ideal would be to have the HMAT data to GP and to
> > > the CXL PXMs that BIOS has set up.
> > > 
> > > Either way having that set at all for CXL memory is fishy as it's about
> > > where the 'memory controller' is and on CXL mem that should be at the
> > > > device end of the link.  My understanding of that is it was only meant
> > > to be set when you have separate memory only Nodes where the physical
> > > controller is in a particular other node (e.g. what you do
> > > if you have a CPU with DRAM and HBM).  Maybe we need to make the
> > > kernel warn + ignore that if it is set to something odd like yours.
> > >   
> > 
> > Hello Jonathan,
> > 
> > Your insight is incredibly accurate. To clarify the situation, here is
> > the actual configuration of my system:
> > 
> > NODE   Type          PXD
> > node0  local memory  0x00
> > node1  local memory  0x01
> > node2  cxl memory    0x0A
> > node3  cxl memory    0x0B
> > 
> > Physically, the node2 CXL is attached to node0 (Socket 0), and the
> > node3 CXL is attached to node1 (Socket 1). However, extracting the
> > HMAT.dsl reveals the following:
> > 
> > - local memory
> >   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x00
> >   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
> >          Attached Initiator Proximity Domain: 0x01
> >          Memory Proximity Domain: 0x01
> > 
> > - cxl memory
> >   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0A
> >   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
> >          Attached Initiator Proximity Domain: 0x00
> >          Memory Proximity Domain: 0x0B
> 
> That's faintly amusing given it conveys no information at all.
> Still, unless we have a bug, it shouldn't cause anything odd.
> 
> > 
> > As you correctly suspected, the flags for the CXL memory are 0000,
> > meaning the Processor Proximity Domain is marked as invalid. But when
> > checking the sysfs initiator configurations, it shows a different story:
> > 
> > Node   access0 Initiator  access1 Initiator
> > node0  node0              node0
> > node1  node1              node1
> > node2  node1              node1
> > node3  node1              node1
> > 
> > Although the Attached Initiator is set to 0 in HMAT with an invalid
> > flag, sysfs strangely registers node1 as the initiator for both CXL
> > nodes.
> Been a while since I looked at the hmat parser..
> 
> If ACPI_HMAT_PROCESSOR_PD_VALID isn't set, hmat_parse_proximity_domain()
> shouldn't set the target. At the end of that function it should be set to PXM_INVALID.
> 
> It should therefore retain the state from alloc_memory_initiator() I think?
> 
> Given I did all my testing without the PD_VALID set (as it wasn't on my
> test system) it should be fine with that.  Anyhow, let's look at the data
> for proximity.
> 
> 

Hello Jonathan,

Thank you for walking through the HMAT parser code. Given that sysfs
still registers node1 as the initiator for both CXL nodes even though
the Processor Proximity Domain Valid flag is 0, it seems likely that
the kernel parser is not handling this specific case as intended.
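
For anyone who wants to cross-check the same thing on their own system,
the sysfs view can be collected with something like the sketch below. It
assumes the accessN/initiators directories described in
Documentation/admin-guide/mm/numaperf.rst; the helper name
list_initiators() and its `root` parameter are only illustrative.

```python
import os
import re

NODE_RE = re.compile(r"node\d+")

def list_initiators(root="/sys/devices/system/node", access="access0"):
    """Return {target_node: [initiator_node, ...]} as exposed by sysfs.

    Hypothetical helper. `root` is a parameter only so the function can
    be pointed at a copied tree instead of the live sysfs.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for entry in sorted(os.listdir(root)):
        if not NODE_RE.fullmatch(entry):
            continue  # skip non-node entries such as "power"
        init_dir = os.path.join(root, entry, access, "initiators")
        if not os.path.isdir(init_dir):
            continue  # this node exposes no such access class
        # Initiators show up as nodeN links next to the attribute files.
        result[entry] = sorted(
            name for name in os.listdir(init_dir) if NODE_RE.fullmatch(name)
        )
    return result

if __name__ == "__main__":
    for access in ("access0", "access1"):
        for target, inits in list_initiators(access=access).items():
            print(access, target, "<-", ",".join(inits) or "(none)")
```

On my systems this is where node2 and node3 both show node1.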

> 
> > Because both HMAT and sysfs are exposing abnormal values, it was
> > impossible for me to determine the true socket connections for CXL
> > using this data.
> > 
> > > > 
> > > > Even though the distance map shows node2 is physically closer to
> > > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > > difficult to determine the exact physical socket connections on these
> > > > systems, I ended up using the current CXL driver-based approach.  
> > > 
> > > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > > and you have to use SLIT (which generally is garbage for historical
> > > reasons of tuning SLIT to particular OS behaviour).
> > >   
> > 
> > The HMAT latencies and bandwidths are present, but the values seem
> > broken. Here is the latency table:
> > 
> > > Init->Target | node0 | node1 | node2  | node3
> > > node0        | 0x38B | 0x89F | 0x9C4  | 0x3AFC
> > > node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
> 
> Yeah. That would do it...  Looks like that final value is garbage.
> 
> > 
> > I used the identical type of DRAM and CXL memory for both sockets.
> > However, looking at the table, the local CXL access latency from
> > node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> > unjustified difference. This asymmetry proves that the table is
> > currently unreliable.
> 
> Poke your favourite bios vendor I guess.
> 
> I asked one of the Intel folk to take a look and see if this is a broader issue
> or just one particular bios.
> 
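
To put numbers on the asymmetry, the quoted latency entries can be
decoded as in the sketch below. The values are the raw entries from the
table above; I deliberately attach no unit, since the table's base unit
is not shown in this thread. The node-to-socket attachment is the
physical one I described earlier, and the function names are made up
for this example.

```python
# Raw HMAT latency entries quoted in this thread (initiator -> target).
LATENCY = {
    "node0": {"node0": 0x38B, "node1": 0x89F, "node2": 0x9C4, "node3": 0x3AFC},
    "node1": {"node0": 0x89F, "node1": 0x38B, "node2": 0x3AFC, "node3": 0x4268},
}

# Physical attachment as described in this thread: CXL node -> socket node.
LOCAL_CXL = {"node2": "node0", "node3": "node1"}

def local_cxl_latencies(latency, local_cxl):
    """Latency entry from each CXL node's own socket to that CXL node."""
    return {cxl: latency[sock][cxl] for cxl, sock in local_cxl.items()}

def suspicious_targets(latency, local_cxl):
    """CXL nodes whose local socket is not the lowest-latency initiator."""
    bad = []
    for cxl, sock in local_cxl.items():
        best = min(latency, key=lambda init: latency[init][cxl])
        if best != sock:
            bad.append(cxl)
    return bad

if __name__ == "__main__":
    local = local_cxl_latencies(LATENCY, LOCAL_CXL)
    print(local)                               # {'node2': 2500, 'node3': 17000}
    print(local["node3"] / local["node2"])     # 6.8
    print(suspicious_targets(LATENCY, LOCAL_CXL))  # ['node3']
```

So the two "local" CXL paths differ by a factor of 6.8, and node3 is
reported as closer to node0 (0x3AFC) than to its own socket (0x4268).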

I really appreciate you reaching out to the Intel contact to check if
this is a broader platform issue. I will also try to find a way to
report this BIOS issue to our system vendor, though I might need to
figure out the proper channel since I am not the system administrator.

Regarding the HMAT dump you requested, how should I provide it to you?
Would a hex dump converted via a utility like `xxd` be acceptable,
something like the snippet below?

00000000: 484d 4154 6806 0000 026a 4742 5420 2020  HMATh....jGBT
00000010: 4742 5455 4143 5049 0920 0701 414d 4920  GBTUACPI. ..AMI
00000020: 2806 2320 0000 0000 0000 0000 2800 0000  (.# ........(...
00000030: 0100 0000 0000 0000 0000 0000 0000 0000  ................
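
In case it helps to sanity-check such a dump before sending it, here is
a minimal sketch that turns the xxd text back into bytes and decodes the
standard ACPI table header plus the first HMAT structure. The field
layout follows my reading of the ACPI specification (36-byte header,
4 reserved bytes, then the Memory Proximity Domain Attributes
structure); the function names are illustrative only.

```python
import struct

XXD_SNIPPET = """\
00000000: 484d 4154 6806 0000 026a 4742 5420 2020  HMATh....jGBT
00000010: 4742 5455 4143 5049 0920 0701 414d 4920  GBTUACPI. ..AMI
00000020: 2806 2320 0000 0000 0000 0000 2800 0000  (.# ........(...
00000030: 0100 0000 0000 0000 0000 0000 0000 0000  ................
"""

# signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision (little-endian, 36 bytes)
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")
# type, reserved, length, flags, reserved, initiator PD, memory PD
MEM_PD_ATTRS = struct.Struct("<HHIHHII")
FIRST_STRUCT_OFFSET = ACPI_HEADER.size + 4  # header + 4 reserved bytes

def xxd_to_bytes(text):
    """Rebuild raw bytes from default `xxd` output (offset: hex  ascii)."""
    out = bytearray()
    for line in text.splitlines():
        if ":" not in line:
            continue
        rest = line.split(":", 1)[1].strip()
        hex_part = rest.split("  ", 1)[0]  # two spaces precede the ASCII column
        out += bytes.fromhex(hex_part.replace(" ", ""))
    return bytes(out)

def parse_acpi_header(data):
    sig, length, rev, _csum, oem_id, oem_tbl, _orev, _cid, _crev = \
        ACPI_HEADER.unpack_from(data, 0)
    return {
        "signature": sig.decode("ascii"),
        "length": length,
        "revision": rev,
        "oem_id": oem_id.decode("ascii").strip(),
        "oem_table_id": oem_tbl.decode("ascii").strip(),
    }

def parse_first_structure(data):
    stype, _r0, length, flags, _r1, init_pd, mem_pd = \
        MEM_PD_ATTRS.unpack_from(data, FIRST_STRUCT_OFFSET)
    return {
        "type": stype,  # 0 = Memory Proximity Domain Attributes
        "length": length,
        "flags": flags,
        "processor_pd_valid": bool(flags & 0x1),
        "initiator_pd": init_pd,
        "memory_pd": mem_pd,
    }
```

On the snippet above this gives signature "HMAT", a table length of
0x668 bytes, and a first structure with Flags 0001 and both proximity
domains 0x00, which matches the [028h] entry from the HMAT.dsl output.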

> > 
> > > > 
> > > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > > If HMAT information becomes more reliable in the future, we could
> > > > build a much more efficient structure.  
> > > 
> > > Given it's being lightly used I suspect there will be many bugs :(
> > > I hope we can assume they will get fixed however!
> > > 
> > > ...
> > >   
> > 
> > The most critical issue caused by this broken initiator setting is that
> > topology analysis tools like `hwloc` are completely misled. Currently,
> > `hwloc` displays both CXL nodes as being attached to Socket 1.
> > 
> > I observed this exact same issue on both Sierra Forest and Granite
> > Rapids systems. I believe this broken topology exposure is a severe
> > problem that must be addressed, though I am not entirely sure what the
> > best fix would be yet. I would love to hear your thoughts on this.
> 
> Fix the bios.  If you don't mind, can you provide dumps of
> cat /sys/firmware/acpi/tables/HMAT  just so we can check there is nothing
> wrong with the parser.
> 
> > 
> > > > 
> > > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > > shared CXL switches, and IO expanders, are very important points.
> > > > I clearly understand that the simple package-level grouping does not fully
> > > > reflect the 1:1 relationship in these future hardware architectures.
> > > > 
> > > > I have also thought about the shared CXL switch scenario you mentioned,
> > > > and I know the current design falls short in addressing it properly.
> > > > While the current implementation starts with a simple socket-local
> > > > restriction, I plan to evolve it into a more flexible node aggregation
> > > > model to properly reflect all the diverse topologies you suggested.  
> > > 
> > > If we can ensure it fails cleanly when it finds a topology that it can't
> > > cope with (and I guess falls back to current) then I'm fine with a partial
> > > solution that evolves.
> > >   
> > 
> > I completely agree with ensuring a clean failure. To stabilize this
> > partial solution, I am currently considering a few options for the
> > next version:
> > 
> > 1. Enable this feature only when a strict 1:1 topology is detected.
> Definitely default to off.  Maybe allow a user to say they want to do it
> anyway. I can see there might be systems that are only a tiny bit off and
> it makes no practical difference.
> 

Your suggestion is very reasonable. I will proceed with this approach
for the next version, keeping the feature disabled by default.

> > 2. Provide a sysfs allowing users to enable/disable it.
> Makes sense.

I will include this sysfs enable/disable feature in the next version.

> > 3. Allow users to manually override/configure the topology via sysfs.
> 
> No.  If people are in this state we should apply fixes to the HMAT table
> either by injection of real data or some quirking.  If we add userspace
> control via simpler means the motivation for people to fix bios goes out
> the window and it never gets resolved.
> 

Your reasoning is absolutely correct. I will not allow users to modify
the topology via sysfs. However, I plan to provide a read-only sysfs
interface so users can at least check the current topology information.

> > 4. Implement dynamic fallback behaviors depending on the detected
> >    topology shape (needs further thought).
> 
> That would be interesting. But maybe not a 1st version thing :)
> 

This is an area I also need to think more deeply about. I will not
include it in the initial version, but will consider implementing it
in the future.
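
To summarize how I currently picture the gating for the next version,
here is a rough sketch in Python rather than kernel code. Everything in
it is an assumption for discussion: the names, the `force` override
(your "do it anyway" suggestion), and my working definition of strict
1:1, namely that every CXL node maps to exactly one socket and every
socket has exactly one CXL node.

```python
def is_strict_one_to_one(cxl_to_sockets, sockets):
    """True if each CXL node has exactly one socket and each socket one CXL node.

    cxl_to_sockets: {cxl_node: set of socket nodes detected for it}
    sockets:        set of all socket (CPU) nodes
    """
    seen = set()
    for socks in cxl_to_sockets.values():
        if len(socks) != 1:
            return False  # unknown, or shared (e.g. behind a switch)
        (sock,) = socks
        if sock not in sockets or sock in seen:
            return False  # unexpected socket, or two CXL nodes on one socket
        seen.add(sock)
    return seen == set(sockets)

def socket_aware_active(user_enabled, strict, force=False):
    """Default off; on only if the user enables it and the topology is
    strict 1:1, or the user explicitly forces it."""
    return bool(user_enabled and (strict or force))

if __name__ == "__main__":
    sockets = {"node0", "node1"}
    physical = {"node2": {"node0"}, "node3": {"node1"}}
    reported = {"node2": {"node1"}, "node3": {"node1"}}  # current sysfs view
    print(is_strict_one_to_one(physical, sockets))  # True
    print(is_strict_one_to_one(reported, sockets))  # False
```

With the initiators as sysfs reports them today, the check fails and the
policy would fall back to the existing weighted interleave behaviour.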

Once again, I deeply appreciate your time, your thorough review, and
your reaching out to Intel for further clarification. It is a huge help.

Rakie Kim
