From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: "Olivi, Matteo" <molivi3@gatech.edu>
Cc: "linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Subject: Re: How to programmatically discover online and offline memory and its latency and bandwidth from user space?
Date: Tue, 26 Aug 2025 14:58:49 +0100 [thread overview]
Message-ID: <20250826145849.000022d7@huawei.com> (raw)
In-Reply-To: <DM5PR07MB3548AFCDDA3DC39F50EECBE0973DA@DM5PR07MB3548.namprd07.prod.outlook.com>
On Fri, 22 Aug 2025 02:38:34 +0000
"Olivi, Matteo" <molivi3@gatech.edu> wrote:
> Thanks for the thorough answer.
>
> Given this part of the answer:
>
> > The BIOS may have configured the CXL memory and done the work for SRAT and HMAT
> > to include that memory. Or it may present HMAT to a generic port entry in SRAT and
> > leave the discovery of performance to the OS when it is setting up the memory
> > mappings etc. For now we present the data for the nearest initiator (cpu / cpu or other)
> > to the CXL memory.
>
> I have three follow-up questions:
>
> 1. Assume the OS, and not the BIOS, does the discovery. Then, the
> HMAT would not list the latency and bandwidth to the memory (only to
> the generic ports).
Correct.
> But the sysfs files with the latency for local
> target-initiators pairs would still have the "complete" latency to
> the memory (as discovered by the OS), right?
Exactly.
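For reference, user space can read those local initiator->target values directly from the numaperf sysfs hierarchy. A minimal Python sketch (paths as documented in Documentation/admin-guide/mm/numaperf.rst; the node number in the example is arbitrary, and attributes are simply skipped if firmware provided no data):

```python
from pathlib import Path

# Attributes exposed per access class under
# /sys/devices/system/node/nodeN/accessM/initiators/.
# Latencies are in nanoseconds, bandwidths in MiB/s.
PERF_ATTRS = ("read_latency", "write_latency",
              "read_bandwidth", "write_bandwidth")

def read_node_perf(node, access=0, root="/sys/devices/system/node"):
    """Return the local initiator->target performance for one NUMA node.

    Only the best-performing (local) initiator pair is exposed here;
    full pairwise data would need the HMAT itself.
    """
    base = Path(root) / f"node{node}" / f"access{access}" / "initiators"
    perf = {}
    for attr in PERF_ATTRS:
        f = base / attr
        if f.exists():  # attribute is absent if firmware gave no data
            perf[attr] = int(f.read_text())
    return perf

if __name__ == "__main__":
    print(read_node_perf(0))
```

Note access0 is relative to any initiator and access1 is restricted to CPU initiators, so on a system with accelerators the two classes can differ.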
>
> 2. If it's the OS which does the discovery, what information does it
> use?
Several sources are combined with the firmware description of the path up
to the generic port:
1 - Estimates of link latencies and bandwidths based on PCI information:
    that is, the number of lanes, the link frequency, and the encoding
    used over the wire.
2 - CDAT tables accessed via DOE (a mailbox in PCI config space). These
    provide latency and bandwidth from port to port on a switch and
    from port to memory on a type 3 device.
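As a rough illustration of (1), the raw link bandwidth is simple arithmetic over the negotiated width, rate, and line encoding. A sketch with illustrative numbers (this is not the kernel's actual calculation, and it ignores protocol overheads such as flit and DLLP framing):

```python
# Per-lane rate (GT/s) -> line-code efficiency by PCIe generation.
# Gen1/2 use 8b/10b; Gen3, Gen4, and Gen5 use 128b/130b.
ENCODING = {2.5: 8 / 10, 5.0: 8 / 10,
            8.0: 128 / 130, 16.0: 128 / 130, 32.0: 128 / 130}

def raw_link_bandwidth_gbps(lanes, gts):
    """Raw one-direction link bandwidth in GB/s (decimal),
    before any protocol overhead."""
    bits_per_s = lanes * gts * 1e9 * ENCODING[gts]
    return bits_per_s / 8 / 1e9

# Example: a x8 Gen5 link, 8 lanes at 32 GT/s with 128b/130b encoding.
print(round(raw_link_bandwidth_gbps(8, 32.0), 1))  # ~31.5 GB/s
```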
> Does it rely on some firmware hardcoded values like the BIOS, or
> does it run some measurements (e.g. perform some memory requests and
> time them)?
Upstream Linux just uses the values that are discoverable from
firmware + device-provided info (which probably comes from device
firmware).
> In case it does measurements, how does that work for
> pooled memory that is physically, but not logically, plugged to the
> host (there's no way to issue memory requests to it)?
>
I gather other OSes sometimes do it by measurement in early boot,
but you are correct in thinking that's tricky if no memory is
present yet.
> 3. Regardless of whether the OS or the BIOS does the discovery,
> assume the memory is from a CXL pool that is external to the host. A
> portion of the latency will depend on the PCIe link that will have
> variable length (and thus latency).
Unless it is a very long link, that doesn't make a significant difference
in practice; the cost of serializing on and off a link with a fixed
maximum frequency is more important.
> There's no way the motherboard
> firmware can know that latency at boot time. Is the latency for the
> link accounted for in the HMAT (and the derived sysfs files)?
The HMAT assumes zero latency with respect to the actual wire time, but
the width and speed of the link are incorporated.
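To put rough numbers on that (back-of-the-envelope arithmetic only, not anything the kernel computes): ignoring the wire time costs on the order of 10 ns for a couple of metres of cable, which is small against typical CXL access latencies of a few hundred nanoseconds, while the serialization time is set by the fixed link width and rate:

```python
# Back-of-the-envelope comparison of serialization delay vs wire
# propagation delay. All numbers are illustrative assumptions.

def serialization_ns(payload_bytes, lanes, gts, efficiency=128 / 130):
    """Time (ns) to clock payload_bytes onto the link."""
    bits = payload_bytes * 8
    lane_bits_per_ns = gts * efficiency  # GT/s is bits/ns per lane
    return bits / (lanes * lane_bits_per_ns)

def propagation_ns(metres, fraction_of_c=0.65):
    """Signal propagation time (ns), assuming ~0.65c in copper."""
    c_m_per_ns = 0.2998
    return metres / (c_m_per_ns * fraction_of_c)

flit = 64  # bytes
print(f"x4 Gen5, serialize 64B: {serialization_ns(flit, 4, 32.0):.1f} ns")
print(f"2 m cable propagation:  {propagation_ns(2.0):.1f} ns")
```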
Jonathan
>
> Thanks,
> Matteo Olivi.
> ________________________________________
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Sent: Friday, January 10, 2025 12:01 PM
> To: Olivi, Matteo <molivi3@gatech.edu>
> Cc: linux-cxl@vger.kernel.org <linux-cxl@vger.kernel.org>
> Subject: Re: How to programmatically discover online and offline
> memory and its latency and bandwidth from user space?
> On Wed, 8 Jan 2025 17:55:41 +0000
> "Olivi, Matteo" <molivi3@gatech.edu> wrote:
>
> > Hello,
> > I'm a PhD student working on orchestrator support for memory
> > disaggregation.
> >
> > I have some questions about how Linux presents CXL memory and its
> > performance characteristics to user space.
> >
> > 1. What is the simplest way for a user space program (with root
> > privileges) to learn the latency and bandwidth between each pair of
> > NUMA nodes (even non-CXL ones)? Are reading the HMAT and shelling
> > out to the cxl cli the only two options? I've
> > read https://docs.kernel.org/admin-guide/mm/numaperf.html but AFAIU given a memory target those sysfs files only report the performance from the local initiators. I care about each pair,
> > not just local ones.
>
> Unfortunately the interface indeed only presents a tiny part of the
> data in a full HMAT table. The original discussion on this a few
> years back concluded that was all that made sense until there was a
> clear use case for more complete data.
>
> HMAT doesn't have to be complete but I'd assume it normally is.
>
> >
> > 2. Is there a way to get the information question 1 asks for for
> > memory that is physically connected to the host, but logically
> > isn't? The ACPI
> > spec https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html#system-resource-affinity-table-definition states that
> > "The SRAT describes the system locality that all processors and
> > memory present in a system belong to at system boot. This includes
> > memory that can be hot-added (that is memory that can be added to
> > the system while it is running, without requiring a reboot)." I
> > interpret that to mean that if (CXL) memory is physically, but not
> > logically, connected to the host, the SRAT will still describe the
> > corresponding NUMA node. But what about the HMAT? The ACPI
> > spec https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html#heterogeneous-memory-attributes-information states that "
> > The static HMAT table provides the boot time description of the
> > memory latency and bandwidth among all memory access Initiator and
> > memory Target System Localities. For hot-added devices and dynamic
> > reconfiguration of the system localities, the _HMA object must be
> > used for runtime update." but it's unclear to me if that applies
> > only to physically hot-plugged memory or to logically hot-plugged
> > memory as well.
>
> The BIOS may have configured the CXL memory and done the work for
> SRAT and HMAT to include that memory. Or it may present HMAT to a
> generic port entry in SRAT and leave the discovery of performance to
> the OS when it is setting up the memory mappings etc. For now we
> present the data for the nearest initiator (cpu / cpu or other) to
> the CXL memory.
>
> >
> > 3. Is there a recommended way for a user space program to tell CXL
> > NUMA nodes from local NUMA nodes (both online and offline ones)?
> > One hack would be to check whether the NUMA node has CPUs or not.
> > Another option would be shelling out to the cxl-cli.
>
> In general, not really. It's just memory; you should never care that
> it is CXL, beyond the fact that its performance characteristics are
> different, and maybe for error-handling reasons. You can indeed use
> cxl-cli, or read the sysfs entries that tool uses, to figure it out.
>
> >
> > 4. Is there a way for a user space program (with root privileges)
> > to learn IDs of CXL NUMA nodes (both online and offline ones) that
> > are globally unique? What I want is: a. if two hosts are both
> > connected to the same CXL memory, they should see that memory with
> > the same ID.
>
> Look at the serial numbers of the devices. Those aren't connected to
> NUMA node IDs, which are local to a given host. They can be obtained
> with lspci and are unique (assuming the manufacturer set them - which
> sometimes doesn't happen in prototype parts).
>
> > b. two different CXL memory pools will never be seen with the same
> > ID by different hosts.
> ID here can't be the NUMA node ID, as those are used to index
> non-sparse structures, so that wouldn't scale.
>
> Once we get upstream support for DCD (the only sensible way to do
> pools and remain compliant with the spec) and tagging of what it
> provides, the globally unique ID will be associated with a particular
> bit of shared memory on the device rather than the whole device. My
> guess is that will take a few kernel cycles though.
>
> >
> > All my questions talk about NUMA nodes. I understand that Linux has
> > multiple layers of abstractions to represent memory, and NUMA nodes
> > are one of the highest ones. If any of the questions above can be
> > answered but at a lower level of abstraction than NUMA nodes,
> > that's fine as long as there's a way to map the entity in the lower
> > level of abstraction to the corresponding NUMA node.
>
> Hope that helps a little!
>
> Jonathan
>
> >
> > Thanks,
> > Matteo Olivi.
> >
Thread overview: 5+ messages
2025-01-08 17:55 How to programmatically discover online and offline memory and its latency and bandwidth from user space? Olivi, Matteo
2025-01-10 17:01 ` Jonathan Cameron
2025-08-22 2:38 ` Olivi, Matteo
2025-08-26 13:58 ` Jonathan Cameron [this message]
2025-08-26 17:31 ` Dave Jiang