Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave

From: Dan Williams <dan.j.williams@intel.com>
To: Rakie Kim <rakie.kim@sk.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: <akpm@linux-foundation.org>, <gourry@gourry.net>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>, <ziy@nvidia.com>,
	<matthew.brost@intel.com>, <joshua.hahnjy@gmail.com>,
	<byungchul@sk.com>, <ying.huang@linux.alibaba.com>,
	<apopple@nvidia.com>, <david@kernel.org>,
	<lorenzo.stoakes@oracle.com>, <Liam.Howlett@oracle.com>,
	<vbabka@suse.cz>, <rppt@kernel.org>, <surenb@google.com>,
	<mhocko@suse.com>, <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<ira.weiny@intel.com>, <dan.j.williams@intel.com>,
	<harry.yoo@oracle.com>, <lsf-pc@lists.linux-foundation.org>,
	<kernel_team@skhynix.com>, <honggyu.kim@sk.com>,
	<yunjeong.mun@sk.com>, Keith Busch <kbusch@kernel.org>,
	Rakie Kim <rakie.kim@sk.com>
Subject: Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
Date: Thu, 26 Mar 2026 13:13:40 -0700
Message-ID: <69c5937425273_7ee31005f@dwillia2-mobl4.notmuch>
In-Reply-To: <20260324053549.324-1-rakie.kim@sk.com>

Rakie Kim wrote:
[..]
> Hello Jonathan,
> 
> Your insight is incredibly accurate. To clarify the situation, here is
> the actual configuration of my system:
> 
> NODE   Type          PXD
> node0  local memory  0x00
> node1  local memory  0x01
> node2  cxl memory    0x0A
> node3  cxl memory    0x0B
> 
> Physically, the node2 CXL is attached to node0 (Socket 0), and the
> node3 CXL is attached to node1 (Socket 1). However, extracting the
> HMAT.dsl reveals the following:
> 
> - local memory
>   [028h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x00
>   [050h] Flags: 0001 (Processor Proximity Domain Valid = 1)
>          Attached Initiator Proximity Domain: 0x01
>          Memory Proximity Domain: 0x01
> 
> - cxl memory
>   [078h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0A
>   [0A0h] Flags: 0000 (Processor Proximity Domain Valid = 0)
>          Attached Initiator Proximity Domain: 0x00
>          Memory Proximity Domain: 0x0B

This looks good.

Unless a CPU is directly attached to the memory controller, there is no
attached initiator. For example, if you wanted to run an x86
memory controller configuration instruction like PCONFIG you would issue
an IPI to the CPU attached to the target memory controller. There is no
such connection for a CPU to do the same for a CXL proximity domain.
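
To make the flag semantics concrete, here is a minimal Python sketch of
how a parser might treat the entries quoted above. It assumes bit 0 of
Flags is the "Processor Proximity Domain Valid" bit, as the HMAT dump
labels it. The function name and tuple layout are illustrative only,
not kernel identifiers.

```python
from typing import Optional

PROCESSOR_PD_VALID = 0x0001  # bit 0 of Flags, as labelled in the HMAT dump


def attached_initiator(flags: int, initiator_pxm: int) -> Optional[int]:
    """Return the attached initiator PXM, or None when the field is not valid."""
    if flags & PROCESSOR_PD_VALID:
        return initiator_pxm
    # With the valid bit clear, the 0x00 stored in the field carries no
    # meaning; it does not say the memory is attached to domain 0.
    return None


# (flags, attached initiator PXM, memory PXM) taken from the quoted HMAT.dsl
entries = [
    (0x0001, 0x00, 0x00),
    (0x0001, 0x01, 0x01),
    (0x0000, 0x00, 0x0A),
    (0x0000, 0x00, 0x0B),
]

for flags, init_pxm, mem_pxm in entries:
    print(f"memory PXM {mem_pxm:#04x}: attached initiator = "
          f"{attached_initiator(flags, init_pxm)}")
```

Read this way, the two CXL entries simply carry no attached initiator,
which is the expected result for memory behind a CXL host bridge.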

> As you correctly suspected, the flags for the CXL memory are 0000,
> meaning the Processor Proximity Domain is marked as invalid. But when
> checking the sysfs initiator configurations, it shows a different story:
> 
> Node   access0 Initiator  access1 Initiator
> node0  node0              node0
> node1  node1              node1
> node2  node1              node1
> node3  node1              node1

Two comments. First, HMAT is not a physical topology layout table.
Second, the fallback determination of the "best" initiator when
"Attached Initiator PXM" is not set is just a heuristic, and that
heuristic probably has not been touched since the initial HMAT support
went upstream.
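
As an illustration of what "heuristic" means here, the following Python
sketch ranks initiators for a target purely by the lowest raw latency
entry. This is a deliberately simplified stand-in and an assumption on
my part: the in-kernel selection is more involved and is not reproduced
here, so this sketch should not be expected to match the sysfs output
quoted above. The latency values are the raw entries from the table
quoted later in this message; best_initiators() is a hypothetical name.

```python
def best_initiators(latency, target):
    """Return the initiators with the lowest latency entry for `target`."""
    to_target = {init: row[target]
                 for init, row in latency.items() if target in row}
    best = min(to_target.values())
    return sorted(init for init, value in to_target.items() if value == best)


# Raw HMAT latency entries, initiator -> target, as quoted in this thread.
latency = {
    "node0": {"node0": 0x38B, "node1": 0x89F, "node2": 0x9C4, "node3": 0x3AFC},
    "node1": {"node0": 0x89F, "node1": 0x38B, "node2": 0x3AFC, "node3": 0x4268},
}

for target in ("node0", "node1", "node2", "node3"):
    print(target, "->", best_initiators(latency, target))
```

The point of the sketch is only that the answer comes from comparing
performance numbers, not from any description of physical attachment.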

> Although the Attached Initiator is set to 0 in HMAT with an invalid
> flag, sysfs strangely registers node1 as the initiator for both CXL
> nodes. Because both HMAT and sysfs are exposing abnormal values, it was
> impossible for me to determine the true socket connections for CXL
> using this data.

Yeah, this sounds more like a kernel bug report than a firmware bug
report at this point.
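
For such a report it helps to capture exactly what the kernel exposes.
A small Python sketch, assuming the node sysfs layout with
accessN/initiators/ directories under /sys/devices/system/node (the
node_root parameter exists only so the function can be pointed at a
copy of the tree; list_initiators() is a hypothetical helper name):

```python
import os
import re


def list_initiators(node_root="/sys/devices/system/node"):
    """Map each memory node to the initiator nodes listed per access class."""
    result = {}
    if not os.path.isdir(node_root):
        return result
    for entry in sorted(os.listdir(node_root)):
        if not re.fullmatch(r"node\d+", entry):
            continue  # skip files such as "online" or "possible"
        per_class = {}
        for access in ("access0", "access1"):
            init_dir = os.path.join(node_root, entry, access, "initiators")
            if os.path.isdir(init_dir):
                # Keep only the nodeN links; the directory also holds
                # performance attribute files.
                per_class[access] = sorted(
                    name for name in os.listdir(init_dir)
                    if re.fullmatch(r"node\d+", name))
        result[entry] = per_class
    return result


for node, classes in list_initiators().items():
    print(node, classes)
```

Attaching this output next to the decoded HMAT makes it easier to see
where the kernel's choice diverges from expectations.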


> > > Even though the distance map shows node2 is physically closer to
> > > Socket 0 and node3 to Socket 1, the HMAT incorrectly defines the
> > > routing path strictly through Socket 1. Because the HMAT alone made it
> > > difficult to determine the exact physical socket connections on these
> > > systems, I ended up using the current CXL driver-based approach.
> > 
> > Are the HMAT latencies and bandwidths all there?  Or are some missing
> > and you have to use SLIT (which generally is garbage for historical
> > reasons of tuning SLIT to particular OS behaviour).
> > 
> 
> The HMAT latencies and bandwidths are present, but the values seem
> broken. Here is the latency table:
> 
> Init->Target | node0 | node1 | node2  | node3
> node0        | 0x38B | 0x89F | 0x9C4  | 0x3AFC
> node1        | 0x89F | 0x38B | 0x3AFC | 0x4268
> 
> I used the identical type of DRAM and CXL memory for both sockets.
> However, looking at the table, the local CXL access latency from
> node0->node2 (0x9C4) and node1->node3 (0x4268) shows a massive,
> unjustified difference. This asymmetry proves that the table is
> currently unreliable.

...or it is telling the truth. Would need more data.
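
For reference, here is a short Python sketch that decodes the quoted
table into decimal and computes the ratio in question. The values are
raw HMAT entries; they are scaled by an entry base unit that is not
shown in this thread, so only the relative sizes are meaningful here.

```python
targets = ["node0", "node1", "node2", "node3"]
rows = {
    "node0": [0x38B, 0x89F, 0x9C4, 0x3AFC],
    "node1": [0x89F, 0x38B, 0x3AFC, 0x4268],
}

# initiator -> target -> raw latency entry (decimal)
table = {init: dict(zip(targets, values)) for init, values in rows.items()}

for init, row in table.items():
    print(init, row)

local_cxl_0 = table["node0"]["node2"]  # 2500
local_cxl_1 = table["node1"]["node3"]  # 17000
ratio = local_cxl_1 / local_cxl_0
print(f"node1->node3 / node0->node2 = {ratio:.1f}")  # 6.8
```

In decimal the entries are 907 and 2207 for DRAM, 2500 for
node0->node2, 15100 for both cross-socket CXL paths, and 17000 for
node1->node3. The arithmetic alone cannot say whether that is a
firmware error or a real property of the platform.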

> > > I wonder if others have experienced similar broken HMAT cases with CXL.
> > > If HMAT information becomes more reliable in the future, we could
> > > build a much more efficient structure.
> > 
> > Given it's being lightly used I suspect there will be many bugs :(
> > I hope we can assume they will get fixed however!
> > 
> > ...
> > 
> 
> The most critical issue caused by this broken initiator setting is that
> topology analysis tools like `hwloc` are completely misled. Currently,
> `hwloc` displays both CXL nodes as being attached to Socket 1.
> 
> I observed this exact same issue on both Sierra Forest and Granite
> Rapids systems. I believe this broken topology exposure is a severe
> problem that must be addressed, though I am not entirely sure what the
> best fix would be yet. I would love to hear your thoughts on this.

Before concluding that these numbers are wrong, you would need to redo
the calculation from CDAT data to see if you get a different answer.

The driver currently does this calculation as part of determining a QoS
class. It would be reasonable to also use that same calculation to
double-check the BIOS firmware numbers for CXL proximity domains
established at boot.
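
A rough Python sketch of such a cross-check, under the assumption that
an end-to-end figure is built by adding latencies and taking the
minimum bandwidth along the path from the CPU to the device. Every
number, the 10% tolerance, and the names Coord, combine() and
mismatch() are hypothetical placeholders, not values from this system
and not kernel identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    latency_ps: int      # access latency in picoseconds
    bandwidth_mbps: int  # access bandwidth in MB/s


def combine(*hops: Coord) -> Coord:
    """Latencies add along the path; the slowest hop bounds the bandwidth."""
    return Coord(sum(h.latency_ps for h in hops),
                 min(h.bandwidth_mbps for h in hops))


def mismatch(firmware: Coord, computed: Coord, tolerance: float = 0.10):
    """List the fields where firmware differs from computed by more than tolerance."""
    report = []
    for field in ("latency_ps", "bandwidth_mbps"):
        fw, calc = getattr(firmware, field), getattr(computed, field)
        if abs(fw - calc) > tolerance * calc:
            report.append(f"{field}: firmware={fw} computed={calc}")
    return report


# Hypothetical hops: CPU to host bridge, the link, and the device itself.
computed = combine(Coord(150_000, 64_000),
                   Coord(20_000, 32_000),
                   Coord(80_000, 40_000))
firmware = Coord(1_700_000, 32_000)  # hypothetical firmware-provided entry

print(computed)
print(mismatch(firmware, computed))
```

A non-empty report would be the kind of detail worth logging, since it
separates "firmware disagrees with the device data" from "the platform
really is asymmetric".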

> > > The complex topology cases you presented, such as multi-NUMA per socket,
> > > shared CXL switches, and IO expanders, are very important points.
> > > I clearly understand that the simple package-level grouping does not fully
> > > reflect the 1:1 relationship in these future hardware architectures.
> > > 
> > > I have also thought about the shared CXL switch scenario you mentioned,
> > > and I know the current design falls short in addressing it properly.
> > > While the current implementation starts with a simple socket-local
> > > restriction, I plan to evolve it into a more flexible node aggregation
> > > model to properly reflect all the diverse topologies you suggested.
> > 
> > If we can ensure it fails cleanly when it finds a topology that it can't
> > cope with (and I guess falls back to current) then I'm fine with a partial
> > solution that evolves.
> > 
> 
> I completely agree with ensuring a clean failure. To stabilize this
> partial solution, I am currently considering a few options for the
> next version:
> 
> 1. Enable this feature only when a strict 1:1 topology is detected.
> 2. Provide a sysfs allowing users to enable/disable it.
> 3. Allow users to manually override/configure the topology via sysfs.
> 4. Implement dynamic fallback behaviors depending on the detected
>    topology shape (needs further thought).

The advice is always: start as simple as possible, but no simpler.
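
As a sketch of the simplest option in the quoted list (option 1), the
check below enables the feature only for a strict 1:1 mapping of CXL
nodes to sockets and otherwise falls back. It is written in Python for
brevity; the function names and the policy labels are hypothetical, and
how the node-to-socket mapping is obtained is deliberately left out.

```python
def strict_one_to_one(cxl_to_socket, sockets):
    """True only if every socket has exactly one CXL node attached."""
    counts = {socket: 0 for socket in sockets}
    for node, socket in cxl_to_socket.items():
        if socket not in counts:
            return False  # unknown or undetermined socket: cannot cope
        counts[socket] += 1
    return all(count == 1 for count in counts.values())


def choose_policy(cxl_to_socket, sockets):
    """Pick the socket-aware path only when the topology is strictly 1:1."""
    if strict_one_to_one(cxl_to_socket, sockets):
        return "socket-aware"
    return "existing-weighted-interleave"


sockets = [0, 1]
physical = {"node2": 0, "node3": 1}   # attachment as described in the thread
reported = {"node2": 1, "node3": 1}   # initiators as shown by sysfs

print(choose_policy(physical, sockets))
print(choose_policy(reported, sockets))
```

With the mapping sysfs currently reports, this check fails and the
existing behavior is kept, which is the clean failure asked for earlier
in the thread.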

It may be the case that Linux indeed finds that platform firmware comes
to a different result than expected. When that happens the CXL subsystem
can probably emit the mismatch details, or otherwise validate the HMAT.

As for actual physical topology layout determination, that is out of
scope for HMAT, but the CXL CDAT calculations do consider PCI link
details.

> By the way, when I first posted this RFC, I accidentally missed adding
> lsf-pc@lists.linux-foundation.org to the CC list. I am considering
> re-posting it to ensure it reaches the lsf-pc.

They are on the Cc: now; I expect that is sufficient.
