Linux CXL
* How to programmatically discover online and offline memory and its latency and bandwidth from user space?
@ 2025-01-08 17:55 Olivi, Matteo
  2025-01-10 17:01 ` Jonathan Cameron
  0 siblings, 1 reply; 5+ messages in thread
From: Olivi, Matteo @ 2025-01-08 17:55 UTC (permalink / raw)
  To: linux-cxl@vger.kernel.org

Hello,
I'm a PhD student working on orchestrator support for memory disaggregation.

I have some questions about how Linux presents CXL memory and its performance
characteristics to user space.

1. What is the simplest way for a user space program (with root privileges) to learn the
latency and bandwidth between each pair of NUMA nodes (even non-CXL ones)? Are
reading the HMAT and shelling out to the cxl cli the only two options? I've read
https://docs.kernel.org/admin-guide/mm/numaperf.html but AFAIU given a memory target
those sysfs files only report the performance from the local initiators. I care about each pair,
not just local ones.
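For concreteness, the per-node local-initiator values that interface exposes can be read with a sketch like the following (paths follow the numaperf doc; treating the units as nanoseconds for latency and MB/s for bandwidth is my reading of that doc, not something verified here):

```python
# Sketch: read per-node performance data from the numaperf sysfs interface
# (access class 0). Note this only reports the best-performing *local*
# initiators for each target node, not every initiator/target pair.
from pathlib import Path

def node_local_perf(sysfs_root="/sys/devices/system/node"):
    perf = {}
    for node in sorted(Path(sysfs_root).glob("node[0-9]*")):
        acc = node / "access0" / "initiators"
        if not acc.is_dir():
            continue  # no access class 0 data for this node
        entry = {}
        for attr in ("read_latency", "write_latency",
                     "read_bandwidth", "write_bandwidth"):
            f = acc / attr
            if f.is_file():
                entry[attr] = int(f.read_text())
        perf[node.name] = entry
    return perf
```

Calling `node_local_perf()` on a real system walks `/sys/devices/system/node`; nodes without the `access0` class (e.g. CPU-only nodes) are simply skipped.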

2. Is there a way to get the information question 1 asks about, but for memory that is physically
connected to the host, but logically isn't? The ACPI spec https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html#system-resource-affinity-table-definition 
states that "The SRAT describes the system locality that all processors and memory
present in a system belong to at system boot. This includes memory that can be hot-added (that
is memory that can be added to the system while it is running, without requiring a reboot)."
I interpret that to mean that if (CXL) memory is physically, but not logically, connected to the host,
the SRAT will still describe the corresponding NUMA node. But what about the HMAT? The ACPI spec
https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html#heterogeneous-memory-attributes-information
states that  "The static HMAT table provides the boot time description of the memory latency and bandwidth
among all memory access Initiator and memory Target System Localities. For hot-added devices and
dynamic reconfiguration of the system localities, the _HMA object must be used for runtime update."
but it's unclear to me if that applies only to physically hot-plugged memory or to logically hot-plugged
memory as well.

3. Is there a recommended way for a user space program to tell CXL NUMA nodes from local NUMA nodes
(both online and offline ones)? One hack would be to check whether the NUMA node has CPUs or not.
Another option would be shelling out to the cxl-cli. 

4. Is there a way for a user space program (with root privileges) to learn IDs of CXL NUMA
nodes (both online and offline ones) that are globally unique? What I want is:
a. if two hosts are both connected to the same CXL memory, they should see that memory
with the same ID.
b. two different CXL memory pools will never be seen with the same ID by different hosts.

All my questions talk about NUMA nodes. I understand that Linux has multiple
layers of abstractions to represent memory, and NUMA nodes are one of the highest ones.
If any of the questions above can be answered but at a lower level of abstraction than NUMA
nodes, that's fine as long as there's a way to map the entity in the lower level of abstraction
to the corresponding NUMA node.

Thanks,
Matteo Olivi.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to programmatically discover online and offline memory and its latency and bandwidth from user space?
  2025-01-08 17:55 How to programmatically discover online and offline memory and its latency and bandwidth from user space? Olivi, Matteo
@ 2025-01-10 17:01 ` Jonathan Cameron
  2025-08-22  2:38   ` Olivi, Matteo
  0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Cameron @ 2025-01-10 17:01 UTC (permalink / raw)
  To: Olivi, Matteo; +Cc: linux-cxl@vger.kernel.org

On Wed, 8 Jan 2025 17:55:41 +0000
"Olivi, Matteo" <molivi3@gatech.edu> wrote:

> Hello,
> I'm a PhD student working on orchestrator support for memory disaggregation.
> 
> I have some questions about how Linux presents CXL memory and its performance
> characteristics to user space.
> 
> 1. What is the simplest way for a user space program (with root privileges) to learn the
> latency and bandwidth between each pair of NUMA nodes (even non-CXL ones)? Are
> reading the HMAT and shelling out to the cxl cli the only two options? I've read
> https://docs.kernel.org/admin-guide/mm/numaperf.html but AFAIU given a memory target
> those sysfs files only report the performance from the local initiators. I care about each pair,
> not just local ones.

Unfortunately the interface indeed only presents a tiny part of the data in a full HMAT table.
The original discussion on this a few years back concluded that was all that made sense
until there was a clear use case for more complete data.

HMAT doesn't have to be complete but I'd assume it normally is.

> 
> 2. Is there a way to get the information question 1 asks for for memory that is physically
> connected to the host, but logically isn't? The ACPI spec https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html#system-resource-affinity-table-definition 
> states that "The SRAT describes the system locality that all processors and memory
> present in a system belong to at system boot. This includes memory that can be hot-added (that
> is memory that can be added to the system while it is running, without requiring a reboot)."
> I interpret that to mean that if (CXL) memory is physically, but not logically, connected to the host,
> the SRAT will still describe the corresponding NUMA node.
> But what about the HMAT? The ACPI spec
> https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/17_NUMA_Architecture_Platforms/NUMA_Architecture_Platforms.html#heterogeneous-memory-attributes-information
> states that  "The static HMAT table provides the boot time description of the memory latency and bandwidth
> among all memory access Initiator and memory Target System Localities. For hot-added devices and
> dynamic reconfiguration of the system localities, the _HMA object must be used for runtime update."
> but it's unclear to me if that applies only to physically hot-plugged memory or to logically hot-plugged
> memory as well.

The BIOS may have configured the CXL memory and done the work for SRAT and HMAT to include
that memory.  Or it may present HMAT data to a generic port entry in SRAT and leave the discovery
of performance to the OS when it is setting up the memory mappings etc.
For now we present the data for the nearest initiator (CPU or other) to the CXL memory.

> 
> 3. Is there a recommended way for a user space program to tell CXL NUMA nodes from local NUMA nodes
> (both online and offline ones)? One hack would be to check whether the NUMA node has CPUs or not.
> Another option would be shelling out to the cxl-cli. 

In general, not really. It's just memory; you should never care that it is CXL, beyond the fact
that its performance characteristics are different, and maybe for error-handling reasons.
You can indeed use cxl-cli, or read the sysfs entries that tool uses, to figure it out.
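A hedged sketch of both approaches follows. The cpulist check is only a heuristic, since CPU-less nodes can also be plain DRAM; `/sys/bus/cxl/devices` is the sysfs hierarchy cxl-cli enumerates, though the exact attributes to inspect beyond device names are an assumption here:

```python
# Sketch of the two heuristics discussed above for telling CXL NUMA nodes
# from local ones. Neither is authoritative on its own.
from pathlib import Path

def cpuless_nodes(sysfs_root="/sys/devices/system/node"):
    """NUMA nodes whose cpulist is empty (candidate memory-only/CXL nodes)."""
    nodes = []
    for node in sorted(Path(sysfs_root).glob("node[0-9]*")):
        if not (node / "cpulist").read_text().strip():
            nodes.append(node.name)
    return nodes

def cxl_memdevs(cxl_root="/sys/bus/cxl/devices"):
    """CXL memory devices known to the kernel (what cxl-cli lists)."""
    root = Path(cxl_root)
    if not root.is_dir():
        return []  # no CXL devices, or driver not loaded
    return sorted(p.name for p in root.glob("mem[0-9]*"))
```

Cross-referencing the two (memory-only nodes on a host that also has CXL memdevs) narrows things down, but a definitive node-to-device mapping still needs the region/decoder attributes cxl-cli parses.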

> 
> 4. Is there a way for a user space program (with root privileges) to learn IDs of CXL NUMA
> nodes (both online and offline ones) that are globally unique? What I want is:
> a. if two hosts are both connected to the same CXL memory, they should see that memory
> with the same ID.

Look at the serial numbers of the devices.  Those are not connected to NUMA node IDs, which are
local to a given host.  Serial numbers can be obtained with lspci and are unique (assuming the
manufacturer set them, which sometimes doesn't happen in prototype parts).
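A sketch of reading those serials from the CXL sysfs hierarchy (assuming the memdev `serial` attribute is present, which should match the PCIe Device Serial Number that lspci reports):

```python
# Sketch: collect per-memdev serial numbers from /sys/bus/cxl/devices.
# sysfs reports the serial as a hex string, e.g. "0x1234abcd".
from pathlib import Path

def memdev_serials(cxl_root="/sys/bus/cxl/devices"):
    serials = {}
    for dev in sorted(Path(cxl_root).glob("mem[0-9]*")):
        f = dev / "serial"
        if f.is_file():
            serials[dev.name] = int(f.read_text().strip(), 16)
    return serials
```

Two hosts attached to the same device should then see the same serial value, even though their local NUMA node IDs for it differ.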

> b. two different CXL memory pools will never be seen with the same ID by different hosts.

The ID here can't be a NUMA node ID, as those are used to index non-sparse structures, so it
wouldn't scale.

Once we get upstream support for DCD (the only sensible way to do pools and remain compliant
with the spec) and the tagging it provides, the globally unique ID will be associated with a
particular bit of shared memory on the device rather than the whole device.
My guess is that will take a few kernel cycles though.

> 
> All my questions talk about NUMA nodes. I understand that Linux has multiple
> layers of abstractions to represent memory, and NUMA nodes are one of the highest ones.
> If any of the questions above can be answered but at a lower level of abstraction than NUMA
> nodes, that's fine as long as there's a way to map the entity in the lower level of abstraction
> to the corresponding NUMA node.

Hope that helps a little!

Jonathan

> 
> Thanks,
> Matteo Olivi.
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to programmatically discover online and offline memory and its latency and bandwidth from user space?
  2025-01-10 17:01 ` Jonathan Cameron
@ 2025-08-22  2:38   ` Olivi, Matteo
  2025-08-26 13:58     ` Jonathan Cameron
  2025-08-26 17:31     ` Dave Jiang
  0 siblings, 2 replies; 5+ messages in thread
From: Olivi, Matteo @ 2025-08-22  2:38 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: linux-cxl@vger.kernel.org

Thanks for the thorough answer.

Given this part of the answer:

> The BIOS may have configured the CXL memory and done the work for SRAT and HMAT
> to include that memory.  Or it may present HMAT to a generic port entry in SRAT and
> leave the discovery of performance to the OS when it is setting up the memory
> mappings etc. For now we present the data for the nearest initiator (cpu / cpu or other)
> to the CXL memory.

I have three follow-up questions:

1. Assume the OS, and not the BIOS, does the discovery. Then the HMAT would not list the latency and bandwidth to the memory (only to the generic ports). But the sysfs
files with the latency for local target-initiator pairs would still have the "complete" latency to the memory (as discovered by the OS), right?

2. If it's the OS which does the discovery, what information does it use? Does it rely on some firmware-hardcoded values like the BIOS does, or does it run some measurements
(e.g. perform some memory requests and time them)? In case it does measurements, how does that work for pooled memory that is physically, but not logically,
plugged into the host (there's no way to issue memory requests to it)?

3. Regardless of whether the OS or the BIOS does the discovery, assume the memory is from a CXL pool that is external to the host. A portion of the latency will depend on the PCIe link, which will have a variable length (and thus latency). There's no way the motherboard firmware can know that latency at boot time. Is the latency for the link accounted for in the HMAT (and the derived sysfs files)?

Thanks,
Matteo Olivi.
^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to programmatically discover online and offline memory and its latency and bandwidth from user space?
  2025-08-22  2:38   ` Olivi, Matteo
@ 2025-08-26 13:58     ` Jonathan Cameron
  2025-08-26 17:31     ` Dave Jiang
  1 sibling, 0 replies; 5+ messages in thread
From: Jonathan Cameron @ 2025-08-26 13:58 UTC (permalink / raw)
  To: Olivi, Matteo; +Cc: linux-cxl@vger.kernel.org

On Fri, 22 Aug 2025 02:38:34 +0000
"Olivi, Matteo" <molivi3@gatech.edu> wrote:

> Thanks for the thorough answer.
> 
> Given this part of the answer:
> 
> > The BIOS may have configured the CXL memory and done the work for SRAT and HMAT
> > to include that memory.  Or it may present HMAT to a generic port entry in SRAT and
> > leave the discovery of performance to the OS when it is setting up the memory
> > mappings etc. For now we present the data for the nearest initiator (cpu / cpu or other)
> > to the CXL memory.  
> 
> I have three follow-up questions:
> 
> 1. Assume the OS, and not the BIOS, does the discovery. Then, the
> HMAT would not list the latency and bandwidth to the memory (only to
> the generic ports).

Correct.

> But the sysfs files with the latency for local
> target-initiators pairs would still have the "complete" latency to
> the memory (as discovered by the OS), right?

Exactly.

> 
> 2. If it's the OS which does the discovery, what information does it
> use? 

Several sources are combined with the firmware-provided description up to the generic port:
1 - Estimates of link latencies and bandwidths based on PCI information:
    how many lanes, the frequency, and the encoding over the wire.
2 - CDAT tables accessed via DOE (a mailbox in PCI config space).  These
    provide latency and bandwidth from port to port on a switch and from
    port to memory on a type 3 device.
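The per-link estimate in point 1 is essentially arithmetic over the negotiated link parameters. A sketch of the method (illustrative only; the kernel's actual calculation also folds in the CDAT values and is not reproduced here):

```python
# Illustrative arithmetic: raw PCIe link bandwidth from lane count,
# signalling rate, and encoding overhead. Gen3-5 use 128b/130b encoding;
# Gen1/2 use 8b/10b. (Gen6 changes the encoding scheme and is omitted.)

GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}

def link_bw_gbytes(gen, lanes):
    rate = GT_PER_S[gen]                      # GT/s per lane, 1 bit/transfer
    eff = 128 / 130 if gen >= 3 else 8 / 10   # encoding efficiency
    return rate * eff * lanes / 8             # bits -> bytes, in GB/s
```

For example, a Gen5 x8 link comes out at roughly 31.5 GB/s raw, before protocol overheads.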


> Does it rely on some firmware hardcoded values like the BIOS, or
> does it run some measurements (e.g. perform some memory requests and
> time them)? 

Upstream Linux just uses the values that are discoverable from
firmware + device provided info (which is probably coming from device
firmware). 

> In case it does measurements, how does that work for
> pooled memory that is physically, but not logically, plugged to the
> host (there's no way to issue memory requests to it)?
> 

I gather other OSes sometimes do it by measurement in early boot, but you are
correct in thinking that's tricky if there's no memory there yet.

> 3. Regardless of whether the OS or the BIOS does the discovery,
> assume the memory is from a CXL pool that is external to the host. A
> portion of the latency will depend on the PCIe link that will have
> variable length (and thus latency).
Unless it is a very long link, that doesn't make a significant difference in
practice; the serialization on and off a link with a fixed maximum frequency
is more important.

> There's no way the motherboard
> firmware can know that latency at boot time. Is the latency for the
> link accounted for in the HMAT (and the derived sysfs files)?

The HMAT assumes zero latency with respect to the actual wire time, but the
width and speed of the link are incorporated.

Jonathan



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: How to programmatically discover online and offline memory and its latency and bandwidth from user space?
  2025-08-22  2:38   ` Olivi, Matteo
  2025-08-26 13:58     ` Jonathan Cameron
@ 2025-08-26 17:31     ` Dave Jiang
  1 sibling, 0 replies; 5+ messages in thread
From: Dave Jiang @ 2025-08-26 17:31 UTC (permalink / raw)
  To: Olivi, Matteo, Jonathan Cameron; +Cc: linux-cxl@vger.kernel.org



On 8/21/25 7:38 PM, Olivi, Matteo wrote:
> Thanks for the thorough answer.
> 
> Given this part of the answer:
> 
>> The BIOS may have configured the CXL memory and done the work for SRAT and HMAT
>> to include that memory.  Or it may present HMAT to a generic port entry in SRAT and
>> leave the discovery of performance to the OS when it is setting up the memory
>> mappings etc. For now we present the data for the nearest initiator (cpu / cpu or other)
>> to the CXL memory.
> 
> I have three follow-up questions:
> 
> 1. Assume the OS, and not the BIOS, does the discovery. Then, the HMAT would not list the latency and bandwidth to the memory (only to the generic ports). But the sysfs
> files with the latency for local target-initiators pairs would still have the "complete" latency to the memory (as discovered by the OS), right?
> 
> 2. If it's the OS which does the discovery, what information does it use? Does it rely on some firmware hardcoded values like the BIOS, or does it run some measurements
> (e.g. perform some memory requests and time them)? In case it does measurements, how does that work for pooled memory that is physically, but not logically,
> plugged to the host (there's no way to issue memory requests to it)?

This document and the associated docs may be helpful.
https://docs.kernel.org/driver-api/cxl/linux/access-coordinates.html

> 
> 3. Regardless of whether the OS or the BIOS does the discovery, assume the memory is from a CXL pool that is external to the host. A portion of the latency will depend on the PCIe link that will have variable length (and thus latency). There's no way the motherboard firmware can know that latency at boot time. Is the latency for the link accounted for in the HMAT (and the derived sysfs files)?
> 
> Thanks,
> Matteo Olivi.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-08-26 17:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-08 17:55 How to programmatically discover online and offline memory and its latency and bandwidth from user space? Olivi, Matteo
2025-01-10 17:01 ` Jonathan Cameron
2025-08-22  2:38   ` Olivi, Matteo
2025-08-26 13:58     ` Jonathan Cameron
2025-08-26 17:31     ` Dave Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox