* [LSF/MM] Linux management of volatile CXL memory devices - boot to bash.
@ 2024-12-26 20:19 Gregory Price
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
` (6 more replies)
0 siblings, 7 replies; 81+ messages in thread
From: Gregory Price @ 2024-12-26 20:19 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
I'd like to propose a discussion about the variety of CXL configuration
decisions made by platforms, BIOS, EFI, and linux that have created a
complex and sometimes subtly confusing administration environment.
In particular, when and how memory configuration occurs has major
implications for feature support (interleave, RAS, hotplug, etc).
For example, treating CXL memory as "conventional" without marking it
"special purpose" may limit the applicability of certain RAS features -
but even marking it "special purpose" may be insufficient for
(coordinated) device hotplug compatibility.
Another example: RAS features like POISON have different end-state
implications (full system crash vs userland crash) that depend on
whether the memory was being used by kernel or userland (which is
actually somewhat controllable, and therefore useful for
administrators!)
Some of this complexity stems from interleave settings and how CXL
memory is distributed among NUMA nodes (1 per device, 1 per homogeneous
set, or a single heterogeneous numa node).
Specifically we'll talk about
- iomem resource allocation
- EFI_CONVENTIONAL_MEMORY, MEMORY_SP, and CONFIG_EFI_SOFT_RESERVE
- e820 & EFI memory map inclusion
- driver-time allocation
- hotplug implications
- Addressing
- SPA == HPA vs SPA != HPA
- Boot-time configuration vs Driver Configuration
- Interleave configuration
- Platform configuration vs Driver configuration
- PRMT-provided translation
- RAS feature implications
- Management implications (hotplug, teardown, etc)
- Linux Memory (Block) Hotplug
- auto-online vs user-policy
- systemd / typical user story
- Zone-assignment and Poison
I'd like to lay out (as best I can, with help!) the current environment in
the linux kernel, the "maintenance implications" of certain configuration
decisions, and discuss where ambiguities are present / challenging.
I'll add some follow-on emails that break down some of these scenarios in
more depth over the next few months, as background reading.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
@ 2025-02-05 2:17 ` Gregory Price
2025-02-18 10:12 ` Yuquan Wang
` (3 more replies)
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
` (5 subsequent siblings)
6 siblings, 4 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-05 2:17 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
Tossing this out as larger documentation of these steps for comment,
not as a representation of what will show up in the talk.
This is trying to cover the minimum needed information to start
reasoning about the growing complexity of configurations.
Platform / BIOS / EFI Configuration
===================================
---------------------------------------
Step 1: BIOS-time hardware programming.
---------------------------------------
I don't want to focus on platform specifics, so really all you need
to know about this phase for the purpose of MM is that platforms may
program the CXL device hierarchy and lock the configuration.
In practice it means you probably can't reconfigure things after boot
without doing major teardowns of the devices and resetting them -
assuming the platform doesn't have major quirks that prevent this.
This has implications for Hotplug, Interleave, and RAS, but we'll
cover those explicitly elsewhere. Otherwise, if something gets mucked
up at this stage - complain to your platform / hardware vendor.
------------------------------------------------------------------
Step 2: BIOS / EFI generates the CEDT (CXL Early Discovery Table).
------------------------------------------------------------------
This table is responsible for reporting each "CXL Host Bridge" and
"CXL Fixed Memory Window" present at boot - which enables early boot
software to manage those devices and the memory capacity presented
by those devices.
Example CEDT Entries (truncated)
Subtable Type : 00 [CXL Host Bridge Structure]
Reserved : 00
Length : 0020
Associated host bridge : 00000005
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 000000C050000000
Window size : 0000003CA0000000
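(For reference - a sketch of how to pull the raw CEDT off a live system,
assuming a reasonably recent acpica-tools package; the table is also
exposed raw at /sys/firmware/acpi/tables/CEDT.)

    acpidump -n CEDT -b     # writes cedt.dat to the current directory
    iasl -d cedt.dat        # disassembles it into human-readable cedt.dsl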
If this memory is NOT marked "Special Purpose" by BIOS (next section),
you should find a matching entry in the EFI Memory Map and /proc/iomem
BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
/proc/iomem: c050000000-fcefffffff : System RAM
Observation: This memory is treated as 100% normal System RAM
1) This memory may be placed in any zone (ZONE_NORMAL, typically)
2) The kernel may use this memory for arbitrary allocations
3) The driver still enumerates CXL devices and memory regions, but
4) The CXL driver CANNOT manage this memory (as of today)
(Caveat: *some* RAS features may still work, possibly)
This creates a nuanced management state.
The memory is online by default and completely usable, AND the driver
appears to be managing the devices - BUT the memory resources and the
management structure are fundamentally separate.
1) CXL Driver manages CXL features
2) Non-CXL SystemRAM mechanisms surface the memory to allocators.
---------------------------------------------------------------
Step 3: EFI_MEMORY_SP - Deferring Management to the CXL Driver.
---------------------------------------------------------------
Assuming you DON'T want CXL memory to default to SystemRAM and prefer
NOT to have your kernel allocate arbitrary resources on CXL, you
probably want to defer managing these memory regions to the CXL driver.
The mechanism for this is setting the EFI_MEMORY_SP bit on CXL memory in BIOS.
This will mark the memory "Special Purpose".
Doing this will result in your memory being marked "Soft Reserved" on
x86 and ARM (presently unknown on other architectures).
You will see Memory Map and iomem entries like so:
BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
/proc/iomem: c050000000-fcefffffff : Soft Reserved
Unless of course:
1) CONFIG_EFI_SOFT_RESERVE=n in your build config, or
2) You set the nosoftreserve boot parameter, or
3) You kexec'd from a kernel where conditions #1 or #2 are met
In which case you'll get SystemRAM as if EFI_MEMORY_SP was never set.
(#3 was fun to debug, for some definition of fun. Ask me over coffee)
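A quick sanity-check sketch if your CXL capacity shows up as SystemRAM
when you expected Soft Reserved (the config file path varies by distro):

    grep CONFIG_EFI_SOFT_RESERVE /boot/config-$(uname -r)
    grep -o nosoftreserve /proc/cmdline      # any output means it's set
    grep -i "soft reserved" /proc/iomem      # empty if nothing was deferred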
------------------------------------------------------------
First bit of nuanced complexity: Early-Boot Resource Re-use.
------------------------------------------------------------
How are MemoryMap resources managed by a driver after being reserved
during early boot? Example: Hot-(un)plugging a device.
What if we replace said Hot-unplugged device with a device with a new
capacity? What if the arch/platform code combines two adjacent
regions with similar attributes before creating resources?
Recent work by Nathan Fontenot [1] looks to address some of the issues
with these Soft Reserved resources by either re-using them or handing
them off entirely to the relevant driver for management.
[1] https://lore.kernel.org/linux-cxl/cover.1737046620.git.nathan.fontenot@amd.com/
--------------------------------------------------------------------
The Complexity story up til now (what's likely to show up in slides)
--------------------------------------------------------------------
Platform and BIOS:
May configure all the devices prior to kernel hand-off.
May or may not support reconfiguring / hotplug.
BIOS and EFI:
EFI_MEMORY_SP - used to defer management to drivers
Kernel Build and Boot:
CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
nosoftreserve - Will always result in CXL as SystemRAM
kexec - SystemRAM configs carry over to target
--------------------------------------------------------------------
Next Up:
Driver Management - Decoders, HPA/SPA, DAX, and RAS.
Memory (Block) Hotplug - Zones, Auto-Online, and User Policy.
RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE.
Interleave - RAS and Region Management (Hotplug-ability)
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* CXL Boot to Bash - Section 2: The Drivers
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
@ 2025-02-05 16:06 ` Gregory Price
2025-02-06 0:47 ` Dan Williams
` (2 more replies)
2025-02-17 20:05 ` CXL Boot to Bash - Section 3: Memory (block) Hotplug Gregory Price
` (4 subsequent siblings)
6 siblings, 3 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-05 16:06 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
(background reading as we build up complexity)
Driver Management - Decoders, HPA/SPA, DAX, and RAS.
The Drivers
===========
----------------------
The Story Up 'til Now.
----------------------
When we left the Platform arena, assuming we've configured with special
purpose memory, we are left with an entry in the memory map like so:
BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
/proc/iomem: c050000000-fcefffffff : Soft Reserved
This resource (see mm/resource.c) is left unused until a driver comes
along to actually surface it to allocators (or some other interface).
In our case, the drivers involved (or at least the ones we'll reference)
drivers/base/ : device probing, memory (block) hotplug
drivers/acpi/ : device hotplug
drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
drivers/pci/ : PCI device probing
drivers/cxl/ : CXL device probing
drivers/dax/ : cxl device to memory resource association
We don't necessarily care about the specifics of each driver; we'll
focus on just the aspects that ultimately affect memory management.
-------------------------------
Step 4: Basic build complexity.
-------------------------------
To make a long story short:
CXL Build Configurations:
CONFIG_CXL_ACPI
CONFIG_CXL_BUS
CONFIG_CXL_MEM
CONFIG_CXL_PCI
CONFIG_CXL_PORT
CONFIG_CXL_REGION
DAX Build Configurations:
CONFIG_DEV_DAX
CONFIG_DEV_DAX_CXL
CONFIG_DEV_DAX_KMEM
Without all of these enabled, your journey will end up cut short because
some piece of the probe process will stop progressing.
The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
being enabled. You end up with memory regions without dax devices.
[/sys/bus/cxl/devices]# ls
dax_region0 decoder0.0 decoder1.0 decoder2.0 .....
dax_region1 decoder0.1 decoder1.1 decoder3.0 .....
^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
surface as dax devices, which can then be converted to system ram.
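A rough way to check a running kernel for the options above (adjust the
config path for your distro, or use /proc/config.gz if IKCONFIG is enabled):

    for opt in CXL_ACPI CXL_BUS CXL_MEM CXL_PCI CXL_PORT CXL_REGION \
               DEV_DAX DEV_DAX_CXL DEV_DAX_KMEM; do
        grep "^CONFIG_${opt}=" /boot/config-$(uname -r) || echo "CONFIG_${opt} not set"
    done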
---------------------------------------------------------------
Step 5: The CXL driver associating devices and iomem resources.
---------------------------------------------------------------
The CXL driver wires up the following devices:
root : CXL root
portN : An intermediate or endpoint destination for accesses
memN : memory devices
Each device in the hierarchy may have one or more decoders
decoderN.M : Address routing and translation devices
The driver will also create additional objects and associations
regionN : device-to-iomem resource mapping
dax_regionN : region-to-dax device mapping
Most associations built by the driver are done by validating decoders
against each other at each point in the hierarchy.
Root decoders describe memory regions and route DMA to ports.
Intermediate decoders route DMA through CXL fabric.
Endpoint decoders translate addresses (Host to device).
A Root port has 1 decoder per associated CFMW in the CEDT
decoder0.0 -> `c050000000-fcefffffff : Soft Reserved`
A region (iomem resource mapping) can be created for these decoders
[/sys/bus/cxl/devices/region0]# cat resource size target0
0xc050000000 0x3ca0000000 decoder5.0
A dax_region surfaces these regions as a dax device
[/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
0xc050000000
So in a simple environment with 1 device, we end up with a mapping
that looks something like this.
root --- decoder0.0 --- region0 -- dax_region0 -- dax0
| | |
port1 --- decoder1.0 |
| | |
endpoint0 --- decoder3.0--------/
Much of the complexity in region creation stems from validating decoder
programming and associating regions with targets (endpoint decoders).
The take-away from this section is the existence of "decoders", of which
there may be an arbitrary number between the root and endpoint.
This will be relevant when we talk about RAS (Poison) and Interleave.
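Two ways to poke at this hierarchy on a live system (the cxl tool ships
with ndctl/cxl-cli; device names below come from the example system and
will differ on yours):

    cxl list -M -D -R -u                      # memdevs, decoders, regions
    ls /sys/bus/cxl/devices/
    cat /sys/bus/cxl/devices/region0/resource /sys/bus/cxl/devices/region0/size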
---------------------------------------------------------------
Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
---------------------------------------------------------------
The last step in surfacing memory to allocators is to convert a dax
device into memory blocks. On most default kernel builds, dax devices
are not automatically converted to SystemRAM.
Policy Choices
userland policy: daxctl
default-online : CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
or
CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
or
memhp_default_state=*
To convert a dax device to SystemRAM utilizing daxctl:
daxctl online-memory dax0.0 [--no-movable]
By default the memory will online into ZONE_MOVABLE
The --no-movable option will online the memory in ZONE_NORMAL
Alternatively, this can be done at Build or Boot time using
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
CONFIG_MHP_DEFAULT_ONLINE_TYPE_* (v6.14 or above)
memhp_default_state=* (boot param predating cxl)
I will save the discussion of ZONE selection to the next section,
which will cover more memory-hotplug specifics.
At this point, the memory blocks are exposed to the kernel mm allocators
and may be used as normal System RAM.
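A minimal end-to-end sketch of that conversion, assuming dax0.0 exists and
is still in devdax mode (device names and the lsmem columns are
illustrative, and reconfigure-device may online the memory for you
depending on version and policy):

    daxctl list -u                                      # find the dax device
    daxctl reconfigure-device --mode=system-ram dax0.0  # bind it to kmem
    daxctl online-memory dax0.0 --no-movable            # ZONE_NORMAL instead of MOVABLE
    lsmem -o RANGE,SIZE,STATE,NODE,ZONES                # confirm the new blocks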
---------------------------------------------------------
Second bit of nuanced complexity: Memory Block Alignment.
---------------------------------------------------------
In section 1, we introduced CEDT / CFMW and how they map to iomem
resources. In this section we discussed how we surface memory blocks
to the kernel allocators.
However, at no time did platform, arch code, and driver communicate
about the expected size of a memory block. In most cases, the size
of a memory block is defined by the architecture - unaware of CXL.
On x86, for example, the heuristic for memory block size is:
1) user boot-arg value
2) Maximize size (up to 2GB) if operating on bare metal
3) Use smallest value that aligns with the end of memory
The problem is that [SOFT RESERVED] memory is not considered in the
alignment calculation - and not all [SOFT RESERVED] memory *should*
be considered for alignment.
In the case of our working example (real system, btw):
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Window base address : 000000C050000000
Window size : 0000003CA0000000
The base is 256MB aligned (the minimum for the CXL Spec), and the window
size is only 512MB aligned. This results in a loss of almost a full memory
block worth of memory (~1280MB on the front, and ~512MB on the back).
This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
[1] has been proposed to allow for drivers (specifically ACPI) to advise
the memory hotplug system on the suggested alignment, and for arch code
to choose how to utilize this advisement.
[1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
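A quick way to eyeball this on your own system (block size is reported in
hex by sysfs; the second line uses the example window base and assumes a
2GB block size):

    cat /sys/devices/system/memory/block_size_bytes    # e.g. 80000000 == 2GB
    printf '%#x\n' $(( 0xC050000000 % 0x80000000 ))    # non-zero => misaligned base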
--------------------------------------------------------------------
The Complexity story up til now (what's likely to show up in slides)
--------------------------------------------------------------------
Platform and BIOS:
May configure all the devices prior to kernel hand-off.
May or may not support reconfiguring / hotplug.
BIOS and EFI:
EFI_MEMORY_SP - used to defer management to drivers
Kernel Build and Boot:
CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
nosoftreserve - Will always result in CXL as SystemRAM
kexec - SystemRAM configs carry over to target
Driver Build Options Required
CONFIG_CXL_ACPI
CONFIG_CXL_BUS
CONFIG_CXL_MEM
CONFIG_CXL_PCI
CONFIG_CXL_PORT
CONFIG_CXL_REGION
CONFIG_DEV_DAX
CONFIG_DEV_DAX_CXL
CONFIG_DEV_DAX_KMEM
User Policy
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
memhp_default_state (boot param)
daxctl online-memory daxN.Y (userland)
Nuances
Early-boot resource re-use
Memory Block Alignment
--------------------------------------------------------------------
Next Up:
Memory (Block) Hotplug - Zones and Kernel Use of CXL
RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
Interleave - RAS and Region Management (Hotplug-ability)
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2: The Drivers
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
@ 2025-02-06 0:47 ` Dan Williams
2025-02-06 15:59 ` Gregory Price
2025-03-04 1:32 ` Gregory Price
2025-03-06 23:56 ` CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming Gregory Price
2 siblings, 1 reply; 81+ messages in thread
From: Dan Williams @ 2025-02-06 0:47 UTC (permalink / raw)
To: Gregory Price, lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
Gregory Price wrote:
> (background reading as we build up complexity)
Thanks for this taxonomy!
>
> Driver Management - Decoders, HPA/SPA, DAX, and RAS.
>
> The Drivers
> ===========
> ----------------------
> The Story Up 'til Now.
> ----------------------
>
> When we left the Platform arena, assuming we've configured with special
> purpose memory, we are left with an entry in the memory map like so:
>
> BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
> /proc/iomem: c050000000-fcefffffff : Soft Reserved
>
> This resource (see mm/resource.c) is left unused until a driver comes
> along to actually surface it to allocators (or some other interface).
>
> In our case, the drivers involved (or at least the ones we'll reference)
>
> drivers/base/ : device probing, memory (block) hotplug
> drivers/acpi/ : device hotplug
> drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
> drivers/pci/ : PCI device probing
> drivers/cxl/ : CXL device probing
> drivers/dax/ : cxl device to memory resource association
>
> We don't necessarily care about the specifics of each driver, we'll
> focus on just the aspects that ultimately affect memory management.
>
> -------------------------------
> Step 4: Basic build complexity.
> -------------------------------
> To make a long story short:
>
> CXL Build Configurations:
> CONFIG_CXL_ACPI
> CONFIG_CXL_BUS
> CONFIG_CXL_MEM
> CONFIG_CXL_PCI
> CONFIG_CXL_PORT
> CONFIG_CXL_REGION
>
> DAX Build Configurations:
> CONFIG_DEV_DAX
> CONFIG_DEV_DAX_CXL
> CONFIG_DEV_DAX_KMEM
>
> Without all of these enabled, your journey will end up cut short because
> some piece of the probe process will stop progressing.
>
> The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
> being enabled. You end up with memory regions without dax devices.
>
> [/sys/bus/cxl/devices]# ls
> dax_region0 decoder0.0 decoder1.0 decoder2.0 .....
> dax_region1 decoder0.1 decoder1.1 decoder3.0 .....
>
> ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> surface as dax devices, which can then be converted to system ram.
At least for this problem the plan is to fall back to
CONFIG_DEV_DAX_HMEM [1] which skips all of the RAS and device
enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.
There is also the panic button of efi=nosoftreserve which is the flag of
surrender if the kernel fails to parse the CXL configuration.
I am otherwise open to suggestions about a better model for how to
handle a type of memory capacity that elicits diverging opinions on
whether it should be treated as System RAM, dedicated application
memory, or some kind of cold-memory swap target.
[1]: http://lore.kernel.org/cover.1737046620.git.nathan.fontenot@amd.com
> ---------------------------------------------------------------
> Step 5: The CXL driver associating devices and iomem resources.
> ---------------------------------------------------------------
>
> The CXL driver wires up the following devices:
> root : CXL root
> portN : An intermediate or endpoint destination for accesses
> memN : memory devices
>
>
> Each device in the heirarchy may have one or more decoders
> decoderN.M : Address routing and translation devices
>
>
> The driver will also create additional objects and associations
> regionN : device-to-iomem resource mapping
> dax_regionN : region-to-dax device mapping
>
>
> Most associations built by the driver are done by validating decoders
> against each other at each point in the heirarchy.
>
> Root decoders describe memory regions and route DMA to ports.
> Intermediate decoders route DMA through CXL fabric.
> Endpoint decoders translate addresses (Host to device).
>
>
> A Root port has 1 decoder per associated CFMW in the CEDT
> decoder0.0 -> `c050000000-fcefffffff : Soft Reserved`
>
>
> A region (iomem resource mapping) can be created for these decoders
> [/sys/bus/cxl/devices/region0]# cat resource size target0
> 0xc050000000 0x3ca0000000 decoder5.0
>
>
> A dax_region surfaces these regions as a dax device
> [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
> 0xc050000000
>
>
> So in a simple environment with 1 device, we end up with a mapping
> that looks something like this.
>
> root --- decoder0.0 --- region0 -- dax_region0 -- dax0
> | | |
> port1 --- decoder1.0 |
> | | |
> endpoint0 --- decoder3.0--------/
>
>
> Much of the complexity in region creation stems from validating decoder
> programming and associating regions with targets (endpoint decoders).
>
> The take-away from this section is the existence of "decoders", of which
> there may be an arbitrary number between the root and endpoint.
>
> This will be relevant when we talk about RAS (Poison) and Interleave.
Good summary. I often look at this pile of objects and wonder "why so
complex", but then I look at the heroics of drivers/edac/. Compared to
that wide range of implementation specific quirks of various memory
controllers, the CXL object hierarchy does not look that bad.
> ---------------------------------------------------------------
> Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> ---------------------------------------------------------------
>
> The last step in surfacing memory to allocators is to convert a dax
> device into memory blocks. On most default kernel builds, dax devices
> are not automatically converted to SystemRAM.
I thought most distributions are shipping with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?
For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.
> Policy Choices
> userland policy: daxctl
> default-online : CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> or
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
> or
> memhp_default_state=*
>
> To convert a dax device to SystemRAM utilizing daxctl:
>
> daxctl online-memory dax0.0 [--no-movable]
On RHEL at least it finds that udev already took care of it.
>
> By default the memory will online into ZONE_MOVABLE
> The --no-movable option will online the memory in ZONE_NORMAL
>
>
> Alternatively, this can be done at Build or Boot time using
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_* (v6.14 or above)
> memhp_default_state=* (boot param predating cxl)
Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.
>
> I will save the discussion of ZONE selection to the next section,
> which will cover more memory-hotplug specifics.
>
> At this point, the memory blocks are exposed to the kernel mm allocators
> and may be used as normal System RAM.
>
>
> ---------------------------------------------------------
> Second bit of nuanced complexity: Memory Block Alignment.
> ---------------------------------------------------------
> In section 1, we introduced CEDT / CFMW and how they map to iomem
> resources. In this section we discussed out we surface memory blocks
> to the kernel allocators.
>
> However, at no time did platform, arch code, and driver communicate
> about the expected size of a memory block. In most cases, the size
> of a memory block is defined by the architecture - unaware of CXL.
>
> On x86, for example, the heuristic for memory block size is:
> 1) user boot-arg value
> 2) Maximize size (up to 2GB) if operating on bare metal
> 3) Use smallest value that aligns with the end of memory
>
> The problem is that [SOFT RESERVED] memory is not considered in the
> alignment calculation - and not all [SOFT RESERVED] memory *should*
> be considered for alignment.
>
> In the case of our working example (real system, btw):
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> The base is 256MB aligned (the minimum for the CXL Spec), and the
> window size is 512MB. This results in a loss of almost a full memory
> block worth of memory (~1280MB on the front, and ~512MB on the back).
>
> This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
This feels like an example, of "hey platform vendors, I understand
that spec grants you the freedom to misalign, please refrain from taking
advantage of that freedom".
>
> [1] has been proposed to allow for drivers (specifically ACPI) to advise
> the memory hotplug system on the suggested alignment, and for arch code
> to choose how to utilize this advisement.
>
> [1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
>
>
> --------------------------------------------------------------------
> The Complexity story up til now (what's likely to show up in slides)
> --------------------------------------------------------------------
> Platform and BIOS:
> May configure all the devices prior to kernel hand-off.
> May or may not support reconfiguring / hotplug.
> BIOS and EFI:
> EFI_MEMORY_SP - used to defer management to drivers
> Kernel Build and Boot:
> CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
> nosoftreserve - Will always result in CXL as SystemRAM
> kexec - SystemRAM configs carry over to target
> Driver Build Options Required
> CONFIG_CXL_ACPI
> CONFIG_CXL_BUS
> CONFIG_CXL_MEM
> CONFIG_CXL_PCI
> CONFIG_CXL_PORT
> CONFIG_CXL_REGION
> CONFIG_DEV_DAX
> CONFIG_DEV_DAX_CXL
> CONFIG_DEV_DAX_KMEM
> User Policy
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
> CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
> memhp_default_state (boot param)
> daxctl online-memory daxN.Y (userland)
memory hotplug udev rule (userland)
> Nuances
> Early-boot resource re-use
> Memory Block Alignment
>
> --------------------------------------------------------------------
> Next Up:
> Memory (Block) Hotplug - Zones and Kernel Use of CXL
> RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
> Interleave - RAS and Region Management (Hotplug-ability)
Really appreciate you organizing all of this information.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2: The Drivers
2025-02-06 0:47 ` Dan Williams
@ 2025-02-06 15:59 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-06 15:59 UTC (permalink / raw)
To: Dan Williams; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Feb 05, 2025 at 04:47:17PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > [/sys/bus/cxl/devices]# ls
> > dax_region0 decoder0.0 decoder1.0 decoder2.0 .....
> > dax_region1 decoder0.1 decoder1.1 decoder3.0 .....
> >
> > ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> > surface as dax devices, which can then be converted to system ram.
>
> At least for this problem the plan is to fall back to
> CONFIG_DEV_DAX_HMEM [1] which skips all of the RAS and device
> enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.
>
Hm, would this actually happen in the scenario where CONFIG_DEV_DAX_CXL
is not enabled but everything else is? The region0 still gets created
and associated with the resource, but the dax_region0 never gets
created.
On one system I have I see the following:
c050000000-fcefffffff : Soft Reserved
c050000000-fcefffffff : CXL Window 0
c050000000-fcefffffff : region0
c050000000-fcefffffff : dax0.0
c050000000-fcefffffff : System RAM (kmem)
fcf0000000-ffffffffff : Reserved
10000000000-1035fffffff : Soft Reserved
10000000000-1035fffffff : CXL Window 1
10000000000-1035fffffff : region1
10000000000-1035fffffff : dax1.0
10000000000-1035fffffff : System RAM (kmem)
I would expect the above HMEM/shunt to only work if everything down
through CXL Window 0 is torn down.
But if CONFIG_DEV_DAX_CXL is not enabled, everything "succeeds", it just
doesn't "Do what you want"(TM) - dax0.0 and RAM entries are absent.
It makes me wonder whether the driver over-componentized the build.
> I am otherwise open to suggestions about a better model for how to
> handle a type of memory capacity that elicits diverging opinions on
> whether it should be treated as System RAM, dedicated application
> memory, or some kind of cold-memory swap target.
>
My gut tells me there's no "elegant solution" here given that user
intent is fairly unknowable - i.e. best we can do is make the build
and boot options easier to understand.
> > ---------------------------------------------------------------
> > Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> > ---------------------------------------------------------------
> >
> > The last step in surfacing memory to allocators is to convert a dax
> > device into memory blocks. On most default kernel builds, dax devices
> > are not automatically converted to SystemRAM.
>
> I thought most distributions are shipping with
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?
> For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.
>
Good point, my biased take showing up in the notes here. I didn't know
RHEL had gotten as far as a udev rule already. I'll adjust my notes.
But this also hides some nuance - the default behavior onlines
memory into ZONE_NORMAL with DEFAULT_ONLINE (next section).
> > Alternatively, this can be done at Build or Boot time using
> > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
> > CONFIG_MHP_DEFAULT_ONLINE_TYPE_* (v6.14 or above)
> > memhp_default_state=* (boot param predating cxl)
>
> Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.
>
It was only just added:
https://lore.kernel.org/linux-mm/20241226182918.648799-1-gourry@gourry.net/
Basically creates parity between memhp_default_state and build options.
> > The base is 256MB aligned (the minimum for the CXL Spec), and the
> > window size is 512MB. This results in a loss of almost a full memory
> > block worth of memory (~1280MB on the front, and ~512MB on the back).
> >
> > This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
>
> This feels like an example, of "hey platform vendors, I understand
> that spec grants you the freedom to misalign, please refrain from taking
> advantage of that freedom".
>
Only x86 appears to actually do this (presently) - so is this a real
constraint or just a quirk of how the x86 arch code has chosen to
"optimize memory block size"?
Granted I'm a platform consumer, not a vendor - but I wouldn't even know
where to look to see where this constraint is defined (if it is).
All I'd know is "CXL Says I can align to 256MB, and minimum memory block
size on linux is 256MB so allons y!"
On the linux side - these platforms are now out there, in the wild.
So the surface impression now appears to be that linux just throws
away ~0.5% of your CXL capacity for no reason on these platforms.
That said, I also understand that more memory blocks might affect
allocation performance when the system is pressured - but losing
gigabytes of memory can also reduce performance.
(Preview of one of my next nuance additions in section 3)
If this (advisement) change is unwelcome, then we should be spewing
a really loud warning somewhere so vendors get signal for consumers.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
@ 2025-02-17 20:05 ` Gregory Price
2025-02-18 16:24 ` David Hildenbrand
2025-02-18 17:49 ` Yang Shi
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
` (3 subsequent siblings)
6 siblings, 2 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-17 20:05 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
The story up to now
-------------------
When we left the driver arena, we had created a dax device - which
connects a Soft Reserved iomem resource to one or more `memory blocks`
via the kmem driver. We also discussed a bit about ZONE selection
and default online behavior.
In this section we'll discuss what actually goes into memory block
creation, how those memory blocks are exposed to kernel allocators
(tl;dr: sparsemem / memmap / struct page), and the implications of
the selected memory zones.
-------------------------------------
Step 7: Hot-(un)plug Memory (Blocks).
-------------------------------------
Memory hotplug refers to surfacing physical memory to kernel
allocators (page, slab, cache, etc) - as opposed to the action of
"physically hotplugging" a device into a system (e.g. USB).
Physical memory is exposed to allocators in the form of memory blocks.
A `memory block` is an abstraction to describe a physically
contiguous region of memory, or more explicitly a collection of physically
contiguous page frames which is described by a physically contiguous
set of `struct page` structures in the system memory-map.
The system memmap is used for pfn-to-page (struct page) and
page-to-pfn conversions. The system memmap has `flat` and
`sparse` modes (configured at build-time). Memory hotplug requires the
use of `sparsemem`, which aptly makes the memory map sparse.
Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
an active memory block, the pages in-use must have their data (and
therefore mappings) migrated to another memory block. Hot-remove must
be specifically enabled, separately from hotplug.
Build configurations affecting memory block hot(un)plug
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
CONFIG_SPARSEMEM
CONFIG_64BIT
CONFIG_MEMORY_HOTPLUG
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
CONFIG_MHP_MEMMAP_ON_MEMORY
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
CONFIG_MIGRATION
CONFIG_MEMORY_HOTREMOVE
During early-boot, the kernel finds all SystemRAM memory regions NOT
marked "Special Purpose" and will create memory blocks for these
regions by default. These blocks are defaulted into ZONE_NORMAL
(more on zones shortly).
Memory regions present at boot marked `EFI_MEMORY_SP` have memory blocks
created and hot-plugged by drivers. The same mechanism is used to
hot-add memory physically hotplugged after system boot (i.e. not present
in the EFI Memory Map at boot time).
The DAX/KMEM driver hotplugs memory blocks via the
`add_memory_driver_managed()`
function.
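For reference, the resulting blocks are visible and individually
controllable via sysfs (memory640 below is just an example block number):

    cat /sys/devices/system/memory/block_size_bytes
    cat /sys/devices/system/memory/memory640/state
    echo offline        > /sys/devices/system/memory/memory640/state
    echo online_movable > /sys/devices/system/memory/memory640/state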
-------------------------------
Step 8: Page Struct allocation.
-------------------------------
A `memory block` is made up of a collection of physical memory pages,
which must have entries in the system Memory Map - which is managed by
sparsemem on systems with memory (block) hotplug. Sparsemem fills the
memory map with `struct page` for hot-plugged memory.
Here is a rough trace through the (current) stack on how page structs
are populated into the system Memory Map on hotplug.
```
add_memory_driver_managed
add_memory_resource
memblock_add_node
arch_add_memory
init_memory_mapping
add_pages
__add_pages
sparse_add_section
section_activate
populate_section_memmap
__populate_section_memmap
memmap_alloc
memblock_alloc_try_nid_raw
memblock_alloc_internal
memblock_alloc_range_nid
kzalloc_node(..., GFP_KERNEL, ...)
```
All allocatable memory requires `struct page` resources to describe the
physical page state. On a system with regular 4KB pages and 256GB
of memory - 4GB is required just to describe/manage the memory.
This is ~1.5% of the new capacity to just surface it (4/256).
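(For concreteness, a back-of-envelope check of that 4/256 figure - 64
bytes of struct page per 4KB page:)

    echo "$(( 256 * 1024 * 1024 * 1024 / 4096 * 64 / 1024 / 1024 / 1024 ))GB"   # -> 4GB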
This becomes an issue if the memory is not intended for kernel-use,
as `struct page` memory must be allocated in non-movable, kernel memory
`zones`. If hot-plugged capacity is designated for a non-kernel zone
(ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.
Matthew Wilcox has a plan to reduce this cost, some details of his plan:
https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/
---------------------
Step 9: Memory Zones.
---------------------
We've alluded to "Memory Zones" in prior sections, with really the only
detail about these concepts being that there are "Kernel-allocation
compatible" and "Movable" zones, as well as some relationship between
memory blocks and memory zones.
The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
For the purpose of this reading we'll consider two basic use-cases:
- memory block hot-unplug
- kernel resource allocation
You can (for the most part) consider these cases incompatible. If the
kernel allocates `struct page` memory from a block, then that block cannot
be hot-unplugged. This memory is typically unmovable (cannot be migrated),
and its pages are unlikely to be removed from the memory map.
There are other scenarios, such as page pinning, that can block hot-unplug.
The individual mechanisms preventing hot-unplug are less important than
their relationship to memory zones.
ZONE_NORMAL basically allows any allocations, including things like page
tables, struct pages, and pinned memory.
ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
The kernel and privileged users can cause long-term pinning to occur -
even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
hot-unplug-ability under normal conditions.
Here's the take-away:
Any capacity marked SystemRAM but not Special Purpose during early boot
will be onlined into ZONE_NORMAL by default - making it available for
kernel-use during boot. There is no guarantee of being hot-unpluggable.
Any capacity marked Special Purpose at boot, or hot-added (physically),
will be onlined into a user-selected zone (Normal or Movable).
There are (at least) 4 ways to select which zone memory blocks are onlined into.
Build Time:
CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
Boot Time:
memhp_default_state (boot parameter)
udev / daxctl:
user policy explicitly requesting the zone
memory sysfs
online_movable > /sys/bus/memory/devices/memoryN/online
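As a concrete sketch of the "udev / daxctl" option above - an illustrative
one-line rule (the file name is arbitrary) that onlines newly added blocks
straight into ZONE_MOVABLE:

    echo 'SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online_movable"' \
        > /etc/udev/rules.d/10-online-movable-memory.rules
    udevadm control --reload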
------------------------------------------
Nuance: memmap_on_memory and ZONE_MOVABLE.
------------------------------------------
As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
will consume ZONE_NORMAL capacity for its kernel resources. This can
be problematic if vast amounts of ZONE_MOVABLE are added on a system
with limited ZONE_NORMAL capacity.
For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
be consumed to allocate `struct page` resources for the ZONE_MOVABLE
capacity - leaving no working memory for the rest of the kernel.
The `memmap_on_memory` configuration option allows for hotplugged memory
blocks to host their own `struct page` allocations...
if they're placed in ZONE_NORMAL.
To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.
Sparsemem allocation of memory map resources ultimately uses a
`kzalloc_node` call, which simply allocates memory from ZONE_NORMAL with
a *suggested* node.
```
memmap_alloc
memblock_alloc_try_nid_raw
memblock_alloc_internal
memblock_alloc_range_nid
kzalloc_node(..., GFP_KERNEL, ...)
```
The node ID passed in as an argument is a "preferred node", which means
that if insufficient space exists on that node to service the GFP_KERNEL
allocation, it will fall back to another node.
If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
1) A portion of the memory block is carved out to host the memmap
   data (reducing usable size by 64B * nr_pages)
2) The memmap is allocated from ZONE_NORMAL on another node anyway.
Result: Lost capacity from the unused carve-out area, for no value.
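To check whether memmap_on_memory is actually in effect (requires
CONFIG_MHP_MEMMAP_ON_MEMORY; the paths below are the usual ones, but
verify against your kernel):

    cat /sys/module/memory_hotplug/parameters/memmap_on_memory
    grep -o 'memory_hotplug\.memmap_on_memory=[^ ]*' /proc/cmdline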
--------------------------------
The Complexity Story up til now.
--------------------------------
Platform and BIOS:
May configure all the devices prior to kernel hand-off.
May or may not support reconfiguring / hotplug.
BIOS and EFI:
EFI_MEMORY_SP - used to defer management to drivers
Kernel Build and Boot:
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
CONFIG_SPARSEMEM
CONFIG_64BIT
CONFIG_MEMORY_HOTPLUG
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
CONFIG_MHP_MEMMAP_ON_MEMORY
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
CONFIG_MIGRATION
CONFIG_MEMORY_HOTREMOVE
CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
nosoftreserve - Will always result in CXL as SystemRAM
kexec - SystemRAM configs carry over to target
memory_hotplug.memmap_on_memory
Driver Build Options Required
CONFIG_CXL_ACPI
CONFIG_CXL_BUS
CONFIG_CXL_MEM
CONFIG_CXL_PCI
CONFIG_CXL_PORT
CONFIG_CXL_REGION
CONFIG_DEV_DAX
CONFIG_DEV_DAX_CXL
CONFIG_DEV_DAX_KMEM
User Policy
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
memhp_default_state (boot param)
daxctl online-memory daxN.Y (userland)
Nuances
Early-boot resource re-use
Memory Block Alignment
memmap_on_memory + ZONE_MOVABLE
----------------------------------------------------
Next up:
RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
Interleave - RAS and Region Management
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
@ 2025-02-18 10:12 ` Yuquan Wang
2025-02-18 16:11 ` Gregory Price
2025-02-20 16:30 ` Jonathan Cameron
` (2 subsequent siblings)
3 siblings, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-02-18 10:12 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
>
> ------------------------------------------------------------------
> Step 2: BIOS / EFI generates the CEDT (CXL Early Detection Table).
> ------------------------------------------------------------------
>
> This table is responsible for reporting each "CXL Host Bridge" and
> "CXL Fixed Memory Window" present at boot - which enables early boot
> software to manage those devices and the memory capacity presented
> by those devices.
>
> Example CEDT Entries (truncated)
> Subtable Type : 00 [CXL Host Bridge Structure]
> Reserved : 00
> Length : 0020
> Associated host bridge : 00000005
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> If this memory is NOT marked "Special Purpose" by BIOS (next section),
> you should find a matching entry EFI Memory Map and /proc/iomem
>
> BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
> /proc/iomem: c050000000-fcefffffff : System RAM
>
>
> Observation: This memory is treated as 100% normal System RAM
>
> 1) This memory may be placed in any zone (ZONE_NORMAL, typically)
> 2) The kernel may use this memory for arbitrary allocations
> 4) The driver still enumerates CXL devices and memory regions, but
> 3) The CXL driver CANNOT manage this memory (as of today)
> (Caveat: *some* RAS features may still work, possibly)
>
Hi, Gregory
Thanks for the in-depth introduction and analysis.
Here I have some confusion:
1) In this scenario, does it mean users could not create a CXL region
dynamically after OS boot?
2) A CXL region (interleave set) would influence the real used memory
in this memory range. Therefore, apart from devices, does platforms
have to configure CXL regions in this stage?
3) How bios/EFI to describe a CXL region?
> This creates an nuanced management state.
>
> The memory is online by default and completely usable, AND the driver
> appears to be managing the devices - BUT the memory resources and the
> management structure are fundamentally separate.
> 1) CXL Driver manages CXL features
> 2) Non-CXL SystemRAM mechanisms surface the memory to allocators.
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-02-18 10:12 ` Yuquan Wang
@ 2025-02-18 16:11 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-18 16:11 UTC (permalink / raw)
To: Yuquan Wang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 18, 2025 at 06:12:47PM +0800, Yuquan Wang wrote:
> On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
> >
> > 1) This memory may be placed in any zone (ZONE_NORMAL, typically)
> > 2) The kernel may use this memory for arbitrary allocations
> > 4) The driver still enumerates CXL devices and memory regions, but
> > 3) The CXL driver CANNOT manage this memory (as of today)
> > (Caveat: *some* RAS features may still work, possibly)
> >
>
> Hi, Gregory
>
> Thanks for the in-depth introduction and analysis.
>
> Here I have some confusion:
>
> 1) In this scenario, does it mean users could not create a CXL region
> dynamically after OS boot?
>
It helps to be a bit more precise here.
"A CXL Region" is a device managed by the CXL driver:
/sys/bus/cxl/devices/regionN
In this setup, a "CXL region" is not required, because the memory has
already been associated with memory blocks in ZONE_NORMAL by the kernel.
The blocks themselves are not managed by the driver, they are created
during early boot - and are not related to driver operation at all.
The driver can still enumerate the fabric and the devices that back this
memory, but presently it does not manage the memory blocks themselves.
more explicitly: There is no link between a memdev and memory blocks,
which would normally be created via a region+dax_region+dax device.
> 2) A CXL region (interleave set) would influence the real used memory
> in this memory range. Therefore, apart from devices, does platforms
> have to configure CXL regions in this stage?
>
Again, you need to be more explicit about "CXL region". A "CXL region
device" is a construct created by the driver. In this scenario, the
platform configures the CXL memory for use as normal system RAM by
marking it EFI_CONVENTIONAL_MEMORY without EFI_MEMORY_SP.
Some platforms configure interleave in BIOS - how this is done is
platform specific but ultimately constrained by the CXL specification on
programming decoders throughout the fabric.
> 3) How bios/EFI to describe a CXL region?
>
You would have to discuss this with the individual platform folks.
The main mechanism to communicate CXL configuration from BIOS/EFI to
kernel is the CEDT/CFMW and HMAT.
~Gregory.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-17 20:05 ` CXL Boot to Bash - Section 3: Memory (block) Hotplug Gregory Price
@ 2025-02-18 16:24 ` David Hildenbrand
2025-02-18 17:03 ` Gregory Price
2025-02-18 17:49 ` Yang Shi
1 sibling, 1 reply; 81+ messages in thread
From: David Hildenbrand @ 2025-02-18 16:24 UTC (permalink / raw)
To: Gregory Price, lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
> ---------------------
> Step 9: Memory Zones.
> ---------------------
> We've alluded to "Memory Zones" in prior sections, with really the only
> detail about these concepts being that there are "Kernel-allocation
> compatible" and "Movable" zones, as well as some relationship between
> memory blocks and memory zones.
>
> The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
>
> For the purpose of this reading we'll consider two basic use-cases:
> - memory block hot-unplug
> - kernel resource allocation
>
> You can (for the most part) consider these cases incompatible. If the
> kernel allocates `struct page` memory from a block, then that block cannot
> be hot-unplugged. This memory is typically unmovable (cannot be migrated),
> and its pages unlikely to be removed from the memory map.
>
> There are other scenarios, such as page pinning, that can block hot-unplug.
> The individual mechanisms preventing hot-unplug are less important than
> their relationship to memory zones.
>
> ZONE_NORMAL basically allows any allocations, including things like page
> tables, struct pages, and pinned memory.
>
> ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
>
In essence, only movable allocations (some kernel allocations are movable).
> ZONE_MOVABLE does NOT make a *strong* guarantee of hut-unplug-ability.
> The kernel and privileged users can cause long-term pinning to occur -
> even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
> hot-unplug-ability under normal conditions.
Yes and no; actual long-term pinning is disallowed (FOLL_LONGTERM), but
we have a bunch of cases that need fixing. [1]
Of course, new cases keep popping up. It's a constant fight to make
hot-unplug as reliable as possible. So yes, we cannot give "strong"
guarantees, but make it as reliable as possible in sane configurations.
[1]
https://lkml.kernel.org/r/882b566c-34d6-4e68-9447-6c74a0693f18@redhat.com
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 16:24 ` David Hildenbrand
@ 2025-02-18 17:03 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-18 17:03 UTC (permalink / raw)
To: David Hildenbrand; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 18, 2025 at 05:24:30PM +0100, David Hildenbrand wrote:
> >
> > ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
> >
>
> In essence, only movable allocations (some kernel allcoations are movable).
>
> > ZONE_MOVABLE does NOT make a *strong* guarantee of hut-unplug-ability.
> > The kernel and privileged users can cause long-term pinning to occur -
> > even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
> > hot-unplug-ability under normal conditions.
>
> Yes and no; actual long-term pinning is disallowed (FOLL_LONGTERM), but we
> have a bunch of cases that need fixing. [1]
>
> Of course, new cases keep popping up. It's a constant fight to make
> hot-unplug as reliable as possible. So yes, we cannot give "strong"
> guarantees, but make it as reliable as possible in sane configurations.
>
> [1]
> https://lkml.kernel.org/r/882b566c-34d6-4e68-9447-6c74a0693f18@redhat.com
>
Appreciate the additional context, I missed your topic proposal. I was
trying to be conservative about the claims ZONE_MOVABLE makes so that I
don't present it as a "This will fix all your hotplug woes" solution.
Looking forward to this LSF :]
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-17 20:05 ` CXL Boot to Bash - Section 3: Memory (block) Hotplug Gregory Price
2025-02-18 16:24 ` David Hildenbrand
@ 2025-02-18 17:49 ` Yang Shi
2025-02-18 18:04 ` Gregory Price
1 sibling, 1 reply; 81+ messages in thread
From: Yang Shi @ 2025-02-18 17:49 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Mon, Feb 17, 2025 at 12:05 PM Gregory Price <gourry@gourry.net> wrote:
>
>
> The story up to now
> -------------------
> When we left the driver arena, we had created a dax device - which
> connects a Soft Reserved iomem resource to one or more `memory blocks`
> via the kmem driver. We also discussed a bit about ZONE selection
> and default online behavior.
>
> In this section we'll discuss what actually goes into memory block
> creation, how those memory blocks are exposed to kernel allocators
> (tl;dr: sparsemem / memmap / struct page), and the implications of
> the selected memory zones.
>
>
> -------------------------------------
> Step 7: Hot-(un)plug Memory (Blocks).
> -------------------------------------
> Memory hotplug refers to surfacing physical memory to kernel
> allocators (page, slab, cache, etc) - as opposed to the action of
> "physically hotplugging" a device into a system (e.g. USB).
>
> Physical memory is exposed to allocators in the form of memory blocks.
>
> A `memory block` is an abstraction to describe a physically
> contiguous region memory, or more explicitly a collection of physically
> contiguous page frames which is described by a physically contiguous
> set of `struct page` structures in the system memory-map.
>
> The system memmap is what is used for pfn-to-page (struct) and
> page(struct)-to-pfn conversions. The system memmap has `flat` and
> `sparse` modes (configured at build-time). Memory hotplug requires the
> use of `sparsemem`, which aptly makes the memory map sparse.
>
> Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
> an active memory block, the pages in-use must have their data (and
> therefore mappings) migrated to another memory block. Hot-remove must
> be specifically enabled separate from hotplug.
>
>
> Build configurations affecting memory block hot(un)plug
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
> CONFIG_SPARSEMEM
> CONFIG_64BIT
> CONFIG_MEMORY_HOTPLUG
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
> CONFIG_MHP_MEMMAP_ON_MEMORY
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> CONFIG_MIGRATION
> CONFIG_MEMORY_HOTREMOVE
>
> During early-boot, the kernel finds all SystemRAM memory regions NOT
> marked "Special Purpose" and will create memory blocks for these
> regions by default. These blocks are defaulted into ZONE_NORMAL
> (more on zones shortly).
>
> Memory regions present at boot marked `EFI_MEMORY_SP` have memory blocks
> created and hot-plugged by drivers. The same mechanism is used to
> hot-add memory physically hotplugged after system boot (i.e. not present
> in the EFI Memory Map at boot time).
>
> The DAX/KMEM driver hotplugs memory blocks via the
> `add_memory_driver_managed()`
> function.
>
>
> -------------------------------
> Step 8: Page Struct allocation.
> -------------------------------
> A `memory block` is made up of a collection of physical memory pages,
> which must have entries in the system Memory Map - which is managed by
> sparsemem on systems with memory (block) hotplug. Sparsemem fills the
> memory map with `struct page` for hot-plugged memory.
>
> Here is a rough trace through the (current) stack on how page structs
> are populated into the system Memory Map on hotplug.
>
> ```
> add_memory_driver_managed
> add_memory_resource
> memblock_add_node
> arch_add_memory
> init_memory_mapping
> add_pages
> __add_pages
> sparse_add_section
> section_activate
> populate_section_memmap
> __populate_section_memmap
> memmap_alloc
> memblock_alloc_try_nid_raw
> memblock_alloc_internal
> memblock_alloc_range_nid
> kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> All allocatable-memory requires `struct page` resources to describe the
> physical page state. On a system with regular 4kb size pages and 256GB
> of memory - 4GB is required just to describe/manage the memory.
>
> This is ~1.5% of the new capacity to just surface it (4/256).
>
> This becomes an issue if the memory is not intended for kernel-use,
> as `struct page` memory must be allocated in non-movable, kernel memory
> `zones`. If hot-plugged capacity is designated for a non-kernel zone
> (ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
> ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.
>
> Matthew Wilcox has a plan to reduce this cost, some details of his plan:
> https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
> https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/
>
>
> ---------------------
> Step 9: Memory Zones.
> ---------------------
> We've alluded to "Memory Zones" in prior sections, with really the only
> detail about these concepts being that there are "Kernel-allocation
> compatible" and "Movable" zones, as well as some relationship between
> memory blocks and memory zones.
>
> The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
>
> For the purpose of this reading we'll consider two basic use-cases:
> - memory block hot-unplug
> - kernel resource allocation
>
> You can (for the most part) consider these cases incompatible. If the
> kernel allocates `struct page` memory from a block, then that block cannot
> be hot-unplugged. This memory is typically unmovable (cannot be migrated),
> and its pages are unlikely to be removed from the memory map.
>
> There are other scenarios, such as page pinning, that can block hot-unplug.
> The individual mechanisms preventing hot-unplug are less important than
> their relationship to memory zones.
>
> ZONE_NORMAL basically allows any allocations, including things like page
> tables, struct pages, and pinned memory.
>
> ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
>
> ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
> The kernel and privileged users can cause long-term pinning to occur -
> even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
> hot-unplug-ability under normal conditions.
>
>
> Here's the take-away:
>
> Any capacity marked SystemRAM but not Special Purpose during early boot
> will be onlined into ZONE_NORMAL by default - making it available for
> kernel-use during boot. There is no guarantee of being hot-unpluggable.
>
> Any capacity marked Special Purpose at boot, or hot-added (physically),
> will be onlined into a user-selected zone (Normal or Movable).
>
> There are (at least) 4 ways to select which zone memory blocks are onlined into.
>
> Build Time:
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
> Boot Time:
> memhp_default_state (boot parameter)
> udev / daxctl:
> user policy explicitly requesting the zone
> memory sysfs
> echo online_movable > /sys/bus/memory/devices/memoryN/state
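>
> You can check which zones a given block may currently be onlined into,
> e.g. (block number and output illustrative):
>
> ```
> $ cat /sys/bus/memory/devices/memory42/valid_zones
> Movable Normal
> ```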
>
>
> ------------------------------------------
> Nuance: memmap_on_memory and ZONE_MOVABLE.
> ------------------------------------------
> As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
> will consume ZONE_NORMAL capacity for its kernel resources. This can
> be problematic if vast amounts of ZONE_MOVABLE is added on a system
> with limited ZONE_NORMAL capacity.
>
> For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
> ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
> be consumed to allocate `struct page` resources for the ZONE_MOVABLE
> capacity - leaving no working memory for the rest of the kernel.
>
> The `memmap_on_memory` configuration option allows for hotplugged memory
> blocks to host their own `struct page` allocations...
>
> if they're placed in ZONE_NORMAL.
>
> To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.
>
> Sparsemem allocation of memory map resources ultimately uses a
> `kzalloc_node` call, which simply allocates memory from ZONE_NORMAL with
> a *suggested* node.
>
> ```
> memmap_alloc
> memblock_alloc_try_nid_raw
> memblock_alloc_internal
> memblock_alloc_range_nid
> kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> The node ID passed in as an argument is a "preferred node", which means
> that if insufficient space exists on that node to service the GFP_KERNEL
> allocation, it will fall back to another node.
>
> If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
>
> 1) A portion of the memory block is carved out to allocate memmap
> data (reducing usable size by 64B * nr_pages)
>
> 2) The memory is allocated on ZONE_NORMAL on another node..
Nice write-up, thanks for putting everything together. A follow up
question on this. Do you mean the memmap memory will show up as a new
node with ZONE_NORMAL only besides other hot-plugged memory blocks? So
we will actually see two nodes are hot-plugged?
Thanks,
Yang
>
> Result: Lost capacity due to the unused carve-out area for no value.
>
> --------------------------------
> The Complexity Story up til now.
> --------------------------------
> Platform and BIOS:
> May configure all the devices prior to kernel hand-off.
> May or may not support reconfiguring / hotplug.
>
> BIOS and EFI:
> EFI_MEMORY_SP - used to defer management to drivers
>
> Kernel Build and Boot:
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
> CONFIG_SPARSEMEM
> CONFIG_64BIT
> CONFIG_MEMORY_HOTPLUG
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
> CONFIG_MHP_MEMMAP_ON_MEMORY
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> CONFIG_MIGRATION
> CONFIG_MEMORY_HOTREMOVE
> CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
> nosoftreserve - Will always result in CXL as SystemRAM
> kexec - SystemRAM configs carry over to target
> memory_hotplug.memmap_on_memory
>
> Driver Build Options Required
> CONFIG_CXL_ACPI
> CONFIG_CXL_BUS
> CONFIG_CXL_MEM
> CONFIG_CXL_PCI
> CONFIG_CXL_PORT
> CONFIG_CXL_REGION
> CONFIG_DEV_DAX
> CONFIG_DEV_DAX_CXL
> CONFIG_DEV_DAX_KMEM
>
> User Policy
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
> CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
> memhp_default_state (boot param)
> daxctl online-memory daxN.Y (userland)
>
> Nuances
> Early-boot resource re-use
> Memory Block Alignment
> memmap_on_memory + ZONE_MOVABLE
>
> ----------------------------------------------------
> Next up:
> RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
> Interleave - RAS and Region Management
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 17:49 ` Yang Shi
@ 2025-02-18 18:04 ` Gregory Price
2025-02-18 19:25 ` David Hildenbrand
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-02-18 18:04 UTC (permalink / raw)
To: Yang Shi; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 18, 2025 at 09:49:28AM -0800, Yang Shi wrote:
> On Mon, Feb 17, 2025 at 12:05 PM Gregory Price <gourry@gourry.net> wrote:
> > The node ID passed in as an argument is a "preferred node", which means
> > is insufficient space on that node exists to service the GFP_KERNEL
> > allocation, it will fall back to another node.
> >
> > If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
> >
> > 1) A portion of the memory block is carved out for to allocate memmap
> > data (reducing usable size by 64b*nr_pages)
> >
> > 2) The memory is allocated on ZONE_NORMAL on another node..
>
> Nice write-up, thanks for putting everything together. A follow up
> question on this. Do you mean the memmap memory will show up as a new
> node with ZONE_NORMAL only besides other hot-plugged memory blocks? So
> we will actually see two nodes are hot-plugged?
>
No, it creates 1 ZONE_MOVABLE memory block of size
(BLOCK_SIZE - memmap_size)
and as far as i can tell the actual memory map allocations still
occur on ZONE_NORMAL (i.e. not CXL).
So you just lose the capacity, it's just stranded and unused.
> Thanks,
> Yang
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 18:04 ` Gregory Price
@ 2025-02-18 19:25 ` David Hildenbrand
2025-02-18 20:25 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: David Hildenbrand @ 2025-02-18 19:25 UTC (permalink / raw)
To: Gregory Price, Yang Shi; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On 18.02.25 19:04, Gregory Price wrote:
> On Tue, Feb 18, 2025 at 09:49:28AM -0800, Yang Shi wrote:
>> On Mon, Feb 17, 2025 at 12:05 PM Gregory Price <gourry@gourry.net> wrote:
>>> The node ID passed in as an argument is a "preferred node", which means
>>> is insufficient space on that node exists to service the GFP_KERNEL
>>> allocation, it will fall back to another node.
>>>
>>> If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
>>>
>>> 1) A portion of the memory block is carved out for to allocate memmap
>>> data (reducing usable size by 64b*nr_pages)
>>>
>>> 2) The memory is allocated on ZONE_NORMAL on another node..
>>
>> Nice write-up, thanks for putting everything together. A follow up
>> question on this. Do you mean the memmap memory will show up as a new
>> node with ZONE_NORMAL only besides other hot-plugged memory blocks? So
>> we will actually see two nodes are hot-plugged?
>>
>
> No, it creates 1 ZONE_MOVABLE memory block of size
>
> (BLOCK_SIZE - memmap_size)
>
> and as far as i can tell the actual memory map allocations still
> occur on ZONE_NORMAL (i.e. not CXL).
>
> So you just lose the capacity, it's just stranded and unused.
Hm?
If you enable memmap_on_memory, we will place the memmap on that
carved-out region, independent of ZONE_NORMAL/ZONE_MOVABLE etc. It's the
"altmap".
Reason that we can place the memmap on a ZONE_MOVABLE is because,
although it is "unmovable", we told memory offlining code that it
doesn't have to care about offlining that memmap carveout, there is no
migration to be done. Just offline the block (memmap gets stale) and
remove that block (memmap gets removed).
If there is a reason where we carve out the memmap and *not* use it,
that case must be fixed.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 19:25 ` David Hildenbrand
@ 2025-02-18 20:25 ` Gregory Price
2025-02-18 20:57 ` David Hildenbrand
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-02-18 20:25 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 18, 2025 at 08:25:59PM +0100, David Hildenbrand wrote:
> On 18.02.25 19:04, Gregory Price wrote:
>
> Hm?
>
> If you enable memmap_on_memory, we will place the memmap on that carved-out
> region, independent of ZONE_NORMAL/ZONE_MOVABLE etc. It's the "altmap".
>
> Reason that we can place the memmap on a ZONE_MOVABLE is because, although
> it is "unmovable", we told memory offlining code that it doesn't have to
> care about offlining that memmap carveout, there is no migration to be done.
> Just offline the block (memmap gets stale) and remove that block (memmap
> gets removed).
>
> If there is a reason where we carve out the memmap and *not* use it, that
> case must be fixed.
>
Hm, I managed to trace down the wrong path on this particular code.
I will go back and redo my tests to sanity check, but here's what I
would expect to see:
1) if memmap_on_memory is off, and hotplug capacity (node1) is
zone_movable - then zone_normal (node0) should have N pages
accounted in nr_memmap_pages
1a) when dropping these memory blocks, I should see node0 memory
use drop by 4GB - since this is just GFP_KERNEL pages.
2) if memmap_on_memory is on, and hotplug capacity (node1) is
zone_movable - then each memory block (256MB) should appear
as 252MB (-4MB of 64-byte page structs). For 256GB (my system)
I should see a total of 252GB of onlined memory (-4GB of page struct)
2a) when dropping these memory blocks, I should see node0 memory use
stay the same - since it was vmemmap usage.
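
(For reference, roughly what I'm checking - a sketch; counter names and
availability vary by kernel version:)

    grep -i memmap /proc/vmstat
    grep MemTotal /proc/meminfo
    cat /sys/devices/system/node/node0/meminfo
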
I will double check that this isn't working as expected, and i'll double
check for a build option as well.
stupid question - it sorta seems like you'd want this as the default
setting for driver-managed hotplug memory blocks, but I suppose for
very small blocks there's problems (as described in the docs).
:thinking: - is it silly to suggest maybe a per-driver memmap_on_memory
setting rather than just a global setting? For CXL capacity, this seems
like a no-brainer since blocks can't be smaller than 256MB (per spec).
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 20:25 ` Gregory Price
@ 2025-02-18 20:57 ` David Hildenbrand
2025-02-19 1:10 ` Gregory Price
2025-02-20 17:50 ` Yang Shi
0 siblings, 2 replies; 81+ messages in thread
From: David Hildenbrand @ 2025-02-18 20:57 UTC (permalink / raw)
To: Gregory Price; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On 18.02.25 21:25, Gregory Price wrote:
> On Tue, Feb 18, 2025 at 08:25:59PM +0100, David Hildenbrand wrote:
>> On 18.02.25 19:04, Gregory Price wrote:
>>
>> Hm?
>>
>> If you enable memmap_on_memory, we will place the memmap on that carved-out
>> region, independent of ZONE_NORMAL/ZONE_MOVABLE etc. It's the "altmap".
>>
>> Reason that we can place the memmap on a ZONE_MOVABLE is because, although
>> it is "unmovable", we told memory offlining code that it doesn't have to
>> care about offlining that memmap carveout, there is no migration to be done.
>> Just offline the block (memmap gets stale) and remove that block (memmap
>> gets removed).
>>
>> If there is a reason where we carve out the memmap and *not* use it, that
>> case must be fixed.
>>
>
> Hm, I managed to trace down the wrong path on this particular code.
>
> I will go back and redo my tests to sanity check, but here's what I
> would expect to see:
>
> 1) if memmap_on_memory is off, and hotplug capacity (node1) is
> zone_movable - then zone_normal (node0) should have N pages
> accounted in nr_memmap_pages
Right, we'll allocate the memmap from the buddy, which ends up
allocating from ZONE_NORMAL on that node.
>
> 1a) when dropping these memory blocks, I should see node0 memory
> use drop by 4GB - since this is just GFP_KERNEL pages.
I assume you mean "when hotunplugging them". Yes, we should be freeing the memmap back to the buddy.
>
> 2) if memmap_on_memory is on, and hotplug capacity (node1) is
> zone_movable - then each memory block (256MB) should appear
> as 252MB (-4MB of 64-byte page structs). For 256GB (my system)
> I should see a total of 252GB of onlined memory (-4GB of page struct)
In memory_block_online(), we have:
        /*
         * Account once onlining succeeded. If the zone was unpopulated, it is
         * now already properly populated.
         */
        if (nr_vmemmap_pages)
                adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
                                          nr_vmemmap_pages);
So we'd add the vmemmap pages to
* zone->present_pages
* zone->zone_pgdat->node_present_pages
(mhp_init_memmap_on_memory() moved the vmemmap pages to ZONE_MOVABLE)
However, we don't add them to
* zone->managed_pages
* totalram pages
/proc/zoneinfo would show them as present but not managed.
/proc/meminfo would not include them in MemTotal
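
For example, an illustrative /proc/zoneinfo excerpt for such a node
(numbers made up for a single 256 MiB block):

Node 1, zone  Movable
  ...
        present  65536
        managed  64512
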
We could adjust the latter two, if there is a problem.
(just needs some adjust_managed_page_count() calls)
So yes, staring at MemTotal, you should see an increase by 252 MiB right now.
>
> 2a) when dropping these memory blocks, I should see node0 memory use
> stay the same - since it was vmemmap usage.
Yes.
>
> I will double check that this isn't working as expected, and i'll double
> check for a build option as well.
>
> stupid question - it sorta seems like you'd want this as the default
> setting for driver-managed hotplug memory blocks, but I suppose for
> very small blocks there's problems (as described in the docs).
The issue is that it is per-memblock. So you'll never have 1 GiB ranges
of consecutive usable memory (e.g., 1 GiB hugetlb page).
>
> :thinking: - is it silly to suggest maybe a per-driver memmap_on_memory
> setting rather than just a global setting? For CXL capacity, this seems
> like a no-brainer since blocks can't be smaller than 256MB (per spec).
I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
IIRC, the global toggle must be enabled for the driver option to be considered.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 20:57 ` David Hildenbrand
@ 2025-02-19 1:10 ` Gregory Price
2025-02-19 8:53 ` David Hildenbrand
2025-02-20 17:50 ` Yang Shi
1 sibling, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-02-19 1:10 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 18, 2025 at 09:57:06PM +0100, David Hildenbrand wrote:
> >
> > 2) if memmap_on_memory is on, and hotplug capacity (node1) is
> > zone_movable - then each memory block (256MB) should appear
> > as 252MB (-4MB of 64-byte page structs). For 256GB (my system)
> > I should see a total of 252GB of onlined memory (-4GB of page struct)
>
> In memory_block_online(), we have:
>
> /*
> * Account once onlining succeeded. If the zone was unpopulated, it is
> * now already properly populated.
> */
> if (nr_vmemmap_pages)
> adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
> nr_vmemmap_pages);
>
I've validated the behavior on my system, I just mis-read my results.
memmap_on_memory works as suggested.
What's mildly confusing is for pages used for altmap to be accounted for
as if it's an allocation in vmstat - but for that capacity to be chopped
out of the memory-block (it "makes sense" it's just subtly misleading).
I thought the system was saying i'd allocated memory (from the 'free'
capacity) instead of just reducing capacity.
Thank you for clearing this up.
> >
> > stupid question - it sorta seems like you'd want this as the default
> > setting for driver-managed hotplug memory blocks, but I suppose for
> > very small blocks there's problems (as described in the docs).
>
> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>
That makes sense, i had not considered this. Although it only applies
for small blocks - which is basically an indictment of this suggestion:
https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
So I'll have to consider this and whether this should be a default.
It's probably enough to nak this entirely.
... that said ....
Interestingly, when I tried allocating 1GiB hugetlb pages on a dax device
in ZONE_MOVABLE (without memmap_on_memory) - the allocation fails silently
regardless of block size (tried both 2GB and 256MB). I can't find a reason
why this would be the case in the existing documentation.
(note: hugepage migration is enabled in build config, so it's not that)
If I enable one block (256MB) into ZONE_NORMAL, and the remainder in
movable (with memmap_on_memory=n) the allocation still fails, and:
nr_slab_unreclaimable 43
in node1/vmstat - where previously there was nothing.
Onlining the dax devices into ZONE_NORMAL successfully allowed 1GiB huge
pages to allocate.
This used the /sys/bus/node/devices/node1/hugepages/* interfaces to test
Using the /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages with
interleave mempolicy - all hugepages end up on ZONE_NORMAL.
(v6.13 base kernel)
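
(For completeness, a sketch of the commands used - node numbers and
output illustrative:)

    echo 8 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
    cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
    0    <- silently failed while node1 was ZONE_MOVABLE
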
This behavior is *curious* to say the least. Not sure if bug, or some
nuance missing from the documentation - but certainly glad I caught it.
> I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
>
> IIRC, the global toggle must be enabled for the driver option to be considered.
Oh, well, that's an extra layer I missed. So there's:
build:
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
global:
/sys/module/memory_hotplug/parameters/memmap_on_memory
device:
/sys/bus/dax/devices/dax0.0/memmap_on_memory
And looking at it - this does seem to be the default for dax.
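
e.g. (illustrative output):

    cat /sys/module/memory_hotplug/parameters/memmap_on_memory
    Y
    cat /sys/bus/dax/devices/dax0.0/memmap_on_memory
    1
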
So I can drop the existing `nuance movable/memmap` section and just
replace it with the hugetlb subtleties x_x.
I appreciate the clarifications here, sorry for the incorrect info and
the increasing confusion.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-19 1:10 ` Gregory Price
@ 2025-02-19 8:53 ` David Hildenbrand
2025-02-19 16:14 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: David Hildenbrand @ 2025-02-19 8:53 UTC (permalink / raw)
To: Gregory Price; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
> What's mildly confusing is for pages used for altmap to be accounted for
> as if it's an allocation in vmstat - but for that capacity to be chopped
> out of the memory-block (it "makes sense" it's just subtly misleading).
Would the following make it better or worse?
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4765f2928725c..17a4432427051 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -237,9 +237,12 @@ static int memory_block_online(struct memory_block *mem)
 	 * Account once onlining succeeded. If the zone was unpopulated, it is
 	 * now already properly populated.
 	 */
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  nr_vmemmap_pages);
+	}
 
 	mem->zone = zone;
 	mem_hotplug_done();
@@ -273,17 +276,23 @@ static int memory_block_offline(struct memory_block *mem)
 		nr_vmemmap_pages = mem->altmap->free;
 
 	mem_hotplug_begin();
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  -nr_vmemmap_pages);
+	}
 
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
 			    nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
-		if (nr_vmemmap_pages)
+		if (nr_vmemmap_pages) {
 			adjust_present_page_count(pfn_to_page(start_pfn),
 						  mem->group, nr_vmemmap_pages);
+			adjust_managed_page_count(pfn_to_page(start_pfn),
+						  nr_vmemmap_pages);
+		}
 		goto out;
 	}
Then, it would look "just like allocated memory" from that node/zone.
As if, the memmap was allocated immediately when we onlined the memory
(see below).
>
> I thought the system was saying i'd allocated memory (from the 'free'
> capacity) instead of just reducing capacity.
The question is whether you want that memory to be hidden from MemTotal
(carveout?) or treated just like allocated memory (allocation?).
If you treat the memmap as "just a memory allocation after early boot"
and "memap_on_memory" telling you to allocate that memory from the
hotplugged memory instead of the buddy, then "carveout"
might be more of an internal implementation detail to achieve that memory
allocation.
>>> stupid question - it sorta seems like you'd want this as the default
>>> setting for driver-managed hotplug memory blocks, but I suppose for
>>> very small blocks there's problems (as described in the docs).
>>
>> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
>> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>>
>
> That makes sense, i had not considered this. Although it only applies
> for small blocks - which is basically an indictment of this suggestion:
>
> https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
>
> So I'll have to consider this and whether this should be a default.
> It's probably this is enough to nak this entirely.
>
>
> ... that said ....
>
> Interestingly, when I tried allocating 1GiB hugetlb pages on a dax device
> in ZONE_MOVABLE (without memmap_on_memory) - the allocation fails silently
> regardless of block size (tried both 2GB and 256MB). I can't find a reason
> why this would be the case in the existing documentation.
Right, it only currently works with ZONE_NORMAL, because 1 GiB pages are
considered unmovable in practice (try reliably finding a 1 GiB area to
migrate the memory to during memory unplug ... when these hugetlb things are
unswappable etc.).
I documented it under https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html
"Gigantic pages are unmovable, resulting in user space consuming a lot of unmovable memory."
If we ever support THP in that size range, we might consider them movable
because we can just split/swapout them when allocating a migration target
fails.
>
> (note: hugepage migration is enabled in build config, so it's not that)
>
> If I enable one block (256MB) into ZONE_NORMAL, and the remainder in
> movable (with memmap_on_memory=n) the allocation still fails, and:
>
> nr_slab_unreclaimable 43
>
> in node1/vmstat - where previously there was nothing.
>
> Onlining the dax devices into ZONE_NORMAL successfully allowed 1GiB huge
> pages to allocate.
>
> This used the /sys/bus/node/devices/node1/hugepages/* interfaces to test
>
> Using the /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages with
> interleave mempolicy - all hugepages end up on ZONE_NORMAL.
>
> (v6.13 base kernel)
>
> This behavior is *curious* to say the least. Not sure if bug, or some
> nuance missing from the documentation - but certainly glad I caught it.
See above :)
>
>
>> I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
>>
>> IIRC, the global toggle must be enabled for the driver option to be considered.
>
> Oh, well, that's an extra layer I missed. So there's:
>
> build:
> CONFIG_MHP_MEMMAP_ON_MEMORY=y
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
> global:
> /sys/module/memory_hotplug/parameters/memmap_on_memory
> device:
> /sys/bus/dax/devices/dax0.0/memmap_on_memory
>
> And looking at it - this does seem to be the default for dax.
>
> So I can drop the existing `nuance movable/memmap` section and just
> replace it with the hugetlb subtleties x_x.
>
> I appreciate the clarifications here, sorry for the incorrect info and
> the increasing confusion.
No worries. If you have ideas on what to improve in the memory hotplug
docs, please let me know!
--
Cheers,
David / dhildenb
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-19 8:53 ` David Hildenbrand
@ 2025-02-19 16:14 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-19 16:14 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Feb 19, 2025 at 09:53:07AM +0100, David Hildenbrand wrote:
... snip ...
>
> Would the following make it better or worse?
>
> The question is whether you want that memory to be hidden from MemTotal
> (carveout?) or treated just like allocated memory (allocation?).
>
> If you treat the memmap as "just a memory allocation after early boot"
> and "memap_on_memory" telling you to allocate that memory from the
> hotplugged memory instead of the buddy, then "carveout"
> might be more of an internal implementation detail to achieve that memory
> allocation.
>
Probably this is fine either way, but I'll poke at the patch you
provided and let you know.
> Right, it only currently works with ZONE_NORMAL, because 1 GiB pages are
> considered unmovable in practice (try reliably finding a 1 GiB area to
> migrate the memory to during memory unplug ... when these hugetlb things are
> unswappable etc.).
>
> I documented it under https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html
>
> "Gigantic pages are unmovable, resulting in user space consuming a lot of unmovable memory."
>
Ah, I'm so used to seeing "1GiB Huge Pages" I missed that parts of the
kernel refer to them as "gigantic". Completely missed this line in the
docs. So a subtle choice being made by onlining memory into zone
movable is the exclusion of this memory region from being used for
gigantic pages (for now, assuming it never changes).
This is good to know.
> > I appreciate the clarifications here, sorry for the incorrect info and
> > the increasing confusion.
>
> No worries. If you have ideas on what to improve in the memory hotplug
> docs, please let me know!
>
I think that clears up most of my misconceptions.
The end-goal of this series is essentially 2-4 "basic choices" for
onlining CXL memory - the implicit decisions being made - while
identifying some ambiguity.
For example: driver-setup into ZONE_MOVABLE w/o memmap_on_memory means
1) ZONE_NORMAL page-struct use
2) no gigantic page support
3) limited kernel allocation support
4) changeable configuration without reboot (in theory)
5) Additional RAS capabilities.
Thanks again for your patience and help as I work through this.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
2025-02-18 10:12 ` Yuquan Wang
@ 2025-02-20 16:30 ` Jonathan Cameron
2025-02-20 16:52 ` Gregory Price
2025-03-04 0:32 ` Gregory Price
2025-03-10 10:45 ` Yuquan Wang
3 siblings, 1 reply; 81+ messages in thread
From: Jonathan Cameron @ 2025-02-20 16:30 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
>
> Example CEDT Entries (truncated)
> Subtable Type : 00 [CXL Host Bridge Structure]
> Reserved : 00
> Length : 0020
> Associated host bridge : 00000005
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> If this memory is NOT marked "Special Purpose" by BIOS (next section),
Specific purpose. You don't want to know how long that term took to
agree on...
> you should find a matching entry EFI Memory Map and /proc/iomem
>
> BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
Trivial but that's not the EFI memory map, that's the e820.
On some architectures this really will be coming from the EFI memory map.
> /proc/iomem: c050000000-fcefffffff : System RAM
>
>
> ---------------------------------------------------------------
> Step 3: EFI_MEMORY_SP - Deferring Management to the CXL Driver.
> ---------------------------------------------------------------
>
> Assuming you DON'T want CXL memory to default to SystemRAM and prefer
> NOT to have your kernel allocate arbitrary resources on CXL, you
> probably want to defer managing these memory regions to the CXL driver.
>
> The mechanism for this is setting the EFI_MEMORY_SP bit on CXL memory in BIOS.
> This will mark the memory "Special Purpose".
Specific purpose if we are keeping the UEFI naming.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-02-20 16:30 ` Jonathan Cameron
@ 2025-02-20 16:52 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-20 16:52 UTC (permalink / raw)
To: Jonathan Cameron; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Feb 20, 2025 at 04:30:43PM +0000, Jonathan Cameron wrote:
>
> >
> > Example CEDT Entries (truncated)
> > Subtable Type : 00 [CXL Host Bridge Structure]
> > Reserved : 00
> > Length : 0020
> > Associated host bridge : 00000005
> >
> > Subtable Type : 01 [CXL Fixed Memory Window Structure]
> > Reserved : 00
> > Length : 002C
> > Reserved : 00000000
> > Window base address : 000000C050000000
> > Window size : 0000003CA0000000
> >
> > If this memory is NOT marked "Special Purpose" by BIOS (next section),
>
> Specific purpose. You don't want to know how long that term took to
> agree on...
>
Oh man how'd i muck that up. Thanks for the correction.
I suppose I can/should re-issue all of these with corrections
accordingly. Maybe even convert this into a documentation somewhere
:sweat:
> > you should find a matching entry EFI Memory Map and /proc/iomem
> >
> > BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
>
> Trivial but that's not the EFI memory map, that's the e820.
> On some architectures this really will be coming from the EFI memory map.
>
You're right, though on my system they're equivalent, just plucked the
wrong thing out of dmesg. Will correct.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-18 20:57 ` David Hildenbrand
2025-02-19 1:10 ` Gregory Price
@ 2025-02-20 17:50 ` Yang Shi
2025-02-20 18:43 ` Gregory Price
1 sibling, 1 reply; 81+ messages in thread
From: Yang Shi @ 2025-02-20 17:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: Gregory Price, lsf-pc, linux-mm, linux-cxl, linux-kernel
>
> >
> > 2) if memmap_on_memory is on, and hotplug capacity (node1) is
> > zone_movable - then each memory block (256MB) should appear
> > as 252MB (-4MB of 64-byte page structs). For 256GB (my system)
> > I should see a total of 252GB of onlined memory (-4GB of page struct)
>
> In memory_block_online(), we have:
>
> /*
> * Account once onlining succeeded. If the zone was unpopulated, it is
> * now already properly populated.
> */
> if (nr_vmemmap_pages)
> adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
> nr_vmemmap_pages);
>
> So we'd add the vmemmap pages to
> * zone->present_pages
> * zone->zone_pgdat->node_present_pages
>
> (mhp_init_memmap_on_memory() moved the vmemmap pages to ZONE_MOVABLE)
>
> However, we don't add them to
> * zone->managed_pages
> * totalram pages
>
> /proc/zoneinfo would show them as present but not managed.
> /proc/meminfo would not include them in MemTotal
>
> We could adjust the latter two, if there is a problem.
> (just needs some adjust_managed_page_count() calls)
>
> So yes, staring at MemTotal, you should see an increase by 252 MiB right now.
>
> >
> > 2a) when dropping these memory blocks, I should see node0 memory use
> > stay the same - since it was vmemmap usage.
>
> Yes.
>
> >
> > I will double check that this isn't working as expected, and i'll double
> > check for a build option as well.
> >
> > stupid question - it sorta seems like you'd want this as the default
> > setting for driver-managed hotplug memory blocks, but I suppose for
> > very small blocks there's problems (as described in the docs).
>
> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
> of consecutive usable memory (e.g., 1 GiB hugetlb page).
Regardless of ZONE_MOVABLE or ZONE_NORMAL, right?
Thanks,
Yang
>
> >
> > :thinking: - is it silly to suggest maybe a per-driver memmap_on_memory
> > setting rather than just a global setting? For CXL capacity, this seems
> > like a no-brainer since blocks can't be smaller than 256MB (per spec).
>
> I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
>
> IIRC, the global toggle must be enabled for the driver option to be considered.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-20 17:50 ` Yang Shi
@ 2025-02-20 18:43 ` Gregory Price
2025-02-20 19:26 ` David Hildenbrand
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-02-20 18:43 UTC (permalink / raw)
To: Yang Shi; +Cc: David Hildenbrand, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Feb 20, 2025 at 09:50:07AM -0800, Yang Shi wrote:
> > > I will double check that this isn't working as expected, and i'll double
> > > check for a build option as well.
> > >
> > > stupid question - it sorta seems like you'd want this as the default
> > > setting for driver-managed hotplug memory blocks, but I suppose for
> > > very small blocks there's problems (as described in the docs).
> >
> > The issue is that it is per-memblock. So you'll never have 1 GiB ranges
> > of consecutive usable memory (e.g., 1 GiB hugetlb page).
>
> Regardless of ZONE_MOVABLE or ZONE_NORMAL, right?
>
> Thanks,
> Yang
From my testing, yes.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-20 18:43 ` Gregory Price
@ 2025-02-20 19:26 ` David Hildenbrand
2025-02-20 19:35 ` Gregory Price
2025-03-11 14:53 ` Zi Yan
0 siblings, 2 replies; 81+ messages in thread
From: David Hildenbrand @ 2025-02-20 19:26 UTC (permalink / raw)
To: Gregory Price, Yang Shi; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On 20.02.25 19:43, Gregory Price wrote:
> On Thu, Feb 20, 2025 at 09:50:07AM -0800, Yang Shi wrote:
>>>> I will double check that this isn't working as expected, and i'll double
>>>> check for a build option as well.
>>>>
>>>> stupid question - it sorta seems like you'd want this as the default
>>>> setting for driver-managed hotplug memory blocks, but I suppose for
>>>> very small blocks there's problems (as described in the docs).
>>>
>>> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
>>> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>>
>> Regardless of ZONE_MOVABLE or ZONE_NORMAL, right?
>>
>> Thanks,
>> Yang
>
> From my testing, yes.
Yes, the only way to get some 1 GiB pages is by using larger memory
blocks (e.g., 2 GiB on x86-64), which comes with a different set of
issues (esp. hotplug granularity).
Of course, only 1x usable 1 GiB page for each 2 GiB block.
There were ideas in how to optimize that (e.g., requiring a new sysfs
interface to expose variable-sized blocks), if anybody is interested,
please reach out.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-20 19:26 ` David Hildenbrand
@ 2025-02-20 19:35 ` Gregory Price
2025-02-20 19:44 ` David Hildenbrand
2025-03-11 14:53 ` Zi Yan
1 sibling, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-02-20 19:35 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Feb 20, 2025 at 08:26:21PM +0100, David Hildenbrand wrote:
> On 20.02.25 19:43, Gregory Price wrote:
>
> There were ideas in how to optimize that (e.g., requiring a new sysfs
> interface to expose variable-sized blocks), if anybody is interested, please
> reach out.
>
Multiple block sizes would also allow managing the misalignment issues I
discussed earlier (where a memory region is not aligned to the
arch-preferred block size). At least in that scenario, the misaligned
regions could be brought online as minimal block size (256MB) while the
aligned portions could be brought online in larger (2GB) chunks.
Do you think this would require a lot of churn to enable? Last I dug
through hotplug, it seemed to assume block sizes would be uniform. I was
hesitant to start ripping through it.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-20 19:35 ` Gregory Price
@ 2025-02-20 19:44 ` David Hildenbrand
2025-02-20 20:06 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: David Hildenbrand @ 2025-02-20 19:44 UTC (permalink / raw)
To: Gregory Price; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On 20.02.25 20:35, Gregory Price wrote:
> On Thu, Feb 20, 2025 at 08:26:21PM +0100, David Hildenbrand wrote:
>> On 20.02.25 19:43, Gregory Price wrote:
>>
>> There were ideas in how to optimize that (e.g., requiring a new sysfs
>> interface to expose variable-sized blocks), if anybody is interested, please
>> reach out.
>>
>
> Multiple block sizes would allow managing the misaligned issues I
> discussed earlier as well (where a memory region is not aligned to the
> arch-preferred block size). At least in that scenario, the misaligned
> regions could be brought online as minimal block size (256MB) while the
> aligned portions could be brought online in larger (2GB) chunks.
>
> Do you think this would this require a lot of churn to enable? Last I dug
> through hotplug, it seemed to assume block sizes would be uniform. I was
> hesitant to start ripping through it.
There were some discussions around variable-sized memory blocks. But we
cannot rip out the old stuff, because it breaks API and thereby Linux
user space. There would have to be a world switch (e.g., kernel cmdline,
config option)
... and some corner cases are rather nasty (e.g., hotunplugging boot
memory, where we only learn from ACPI which regions actually belong to a
single DIMM; dlpar hotunplugging individual memory blocks of boot memory
IIRC).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-20 19:44 ` David Hildenbrand
@ 2025-02-20 20:06 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-02-20 20:06 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Yang Shi, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Feb 20, 2025 at 08:44:36PM +0100, David Hildenbrand wrote:
> On 20.02.25 20:35, Gregory Price wrote:
>
> There were some discussions around variable-sized memory blocks. But we
> cannot rip out the old stuff, because it breaks API and thereby Linux user
> space. There would have to be a world switch (e.g., kernel cmdline, config
> option)
>
> ... and some corner cases are rather nasty (e.g., hotunplugging boot memory,
> where we only learn from ACPI which regions actually belong to a single
> DIMM; dlpar hotunplugging individual memory blocks of boot memory IIRC).
>
Probably we just end back up between a rock and a hard place then. Let
me finish this documentation series and cleaning up corrections and I'll
come back around on this and figure out if it's feasible.
If variable sized memory blocks in infeasible, then the only solution to
misaligned memory regions seems to be to wag a finger at hardware vendors
as Dan suggests.
Thinking about my current acpi/block size extension, it does seem bad
that the user can't choose to force a larger block size without a
boot parameter - but I hesitate to add *yet another* switch on top of
that. The problem is complicated enough.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
2025-02-18 10:12 ` Yuquan Wang
2025-02-20 16:30 ` Jonathan Cameron
@ 2025-03-04 0:32 ` Gregory Price
2025-03-13 16:12 ` Jonathan Cameron
2025-03-10 10:45 ` Yuquan Wang
3 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-04 0:32 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
> ------------------------------------------------------------------
> Step 2: BIOS / EFI generates the CEDT (CXL Early Discovery Table).
> ------------------------------------------------------------------
>
> This table is responsible for reporting each "CXL Host Bridge" and
> "CXL Fixed Memory Window" present at boot - which enables early boot
> software to manage those devices and the memory capacity presented
> by those devices.
>
> Example CEDT Entries (truncated)
> Subtable Type : 00 [CXL Host Bridge Structure]
> Reserved : 00
> Length : 0020
> Associated host bridge : 00000005
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> If this memory is NOT marked "Special Purpose" by BIOS (next section),
> you should find a matching entry in the EFI Memory Map and /proc/iomem
>
> BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
> /proc/iomem: c050000000-fcefffffff : System RAM
>
>
> Observation: This memory is treated as 100% normal System RAM
>
> 1) This memory may be placed in any zone (ZONE_NORMAL, typically)
> 2) The kernel may use this memory for arbitrary allocations
> 3) The driver still enumerates CXL devices and memory regions, but
> 4) The CXL driver CANNOT manage this memory (as of today)
> (Caveat: *some* RAS features may still work, possibly)
>
> This creates a nuanced management state.
>
> The memory is online by default and completely usable, AND the driver
> appears to be managing the devices - BUT the memory resources and the
> management structure are fundamentally separate.
> 1) CXL Driver manages CXL features
> 2) Non-CXL SystemRAM mechanisms surface the memory to allocators.
>
Adding some additional context here
-------------------------------------
Nuance X: NUMA Nodes and ACPI Tables.
-------------------------------------
ACPI Table parsing is partially architecture/platform dependent, but
there is common code that affects boot-time creation of NUMA nodes.
NUMA-nodes are not a dynamic resource. They are (presently, Feb 2025)
statically configured during kernel init, and the number of possible
NUMA nodes (N_POSSIBLE) may not change during runtime.
CEDT/CFMW and SRAT/Memory Affinity entries describe memory regions
associated with CXL devices. These tables are used to allocate NUMA
node IDs during _init.
The "System Resource Affinity Table" has "Memory Affinity" entries
which associate memory regions with a "Proximity Domain"
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001
Reserved1 : 0000
Base Address : 000000C050000000
Address Length : 0000003CA0000000
The "Proximity Domain" utilized by the kernel ACPI driver to match this
region with a NUMA node (in most cases, the proximity domains here will
directly translate to a NUMA node ID - but not always).
CEDT/CFMWS do not have a proximity domain - so the kernel will assign it
a NUMA node association IFF no SRAT Memory Affinity entry is present.
SRAT entries are optional, CFMWS are required for each host bridge.
If SRAT entries are present, one NUMA node is created for each detected
proximity domain in the SRAT. Additional NUMA nodes are created for each
CFMWS without a matching SRAT entry.
CFMWS describes host-bridge information, and so if SRAT is missing - all
devices behind the host bridge will become naturally associated with the
same NUMA node.
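
A quick way to sanity-check how these tables were parsed (illustrative
output; addresses match the example SRAT entry above):

    $ dmesg | grep "SRAT: Node"
    ACPI: SRAT: Node 1 PXM 1 [mem 0xc050000000-0xfcefffffff] hotplug
    $ cat /sys/devices/system/node/possible
    0-1
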
big long TL;DR:
This creates the subtle assumption that each host-bridge will have
devices with similar performance characteristics if they're intended
for use as general purpose memory and/or interleave.
This means you should expect to have to reboot your machine if a
different NUMA topology is needed (for example, if you are physically
hotunplugging a volatile device to plug in a non-volatile device).
Stay tuned for more Fun and Profit with ACPI tables :]
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2: The Drivers
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
2025-02-06 0:47 ` Dan Williams
@ 2025-03-04 1:32 ` Gregory Price
2025-03-06 23:56 ` CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming Gregory Price
2 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-04 1:32 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
On Wed, Feb 05, 2025 at 11:06:08AM -0500, Gregory Price wrote:
> ---------------------------------------------------------
> Second bit of nuanced complexity: Memory Block Alignment.
> ---------------------------------------------------------
> In section 1, we introduced CEDT / CFMW and how they map to iomem
> resources. In this section we discussed out we surface memory blocks
> to the kernel allocators.
>
> However, at no time did platform, arch code, and driver communicate
> about the expected size of a memory block. In most cases, the size
> of a memory block is defined by the architecture - unaware of CXL.
>
> On x86, for example, the heuristic for memory block size is:
> 1) user boot-arg value
> 2) Maximize size (up to 2GB) if operating on bare metal
> 3) Use smallest value that aligns with the end of memory
>
> The problem is that [SOFT RESERVED] memory is not considered in the
> alignment calculation - and not all [SOFT RESERVED] memory *should*
> be considered for alignment.
>
> In the case of our working example (real system, btw):
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> The base is 256MB aligned (the minimum for the CXL Spec), and the
> window size is 512MB-aligned. This results in a loss of almost a full memory
> block worth of memory (~1280MB on the front, and ~512MB on the back).
>
> This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
>
Some additional nuance worth adding here: Dynamic Capacity Devices will also
have problems with this issue unless the fabric manager is aware of the
host memory block size configuration.
Variable sized or sparse memory blocks may be possible in the future,
but as of today they are not.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
` (2 preceding siblings ...)
2025-02-17 20:05 ` CXL Boot to Bash - Section 3: Memory (block) Hotplug Gregory Price
@ 2025-03-05 22:20 ` Gregory Price
2025-03-05 22:44 ` Dave Jiang
` (3 more replies)
2025-03-12 0:09 ` [LSF/MM] CXL Boot to Bash - Section 4: Interleave Gregory Price
` (2 subsequent siblings)
6 siblings, 4 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-05 22:20 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
--------------------
Part 0: ACPI Tables.
--------------------
I considered publishing this section first, or at least under
"Platform", but I've found this information largely useful in
debugging interleave configurations and tiering mechanisms -
which are higher level concepts.
Much of the information in this section is most relevant to
Interleave (yet to be published Section 4).
I promise not to simply regurgitate the entire ACPI specification, and
will limit this to the commentary necessary to describe how these tables
relate to actual Linux resources (like NUMA nodes and tiers).
At the very least, if you find yourself trying to figure out why
your CXL system isn't producing NUMA nodes, memory tiers, root
decoders, memory regions - etc - I would check these tables
first for aberrations. Almost all my personal strife has been
associated with ACPI table misconfiguration.
ACPI tables can be inspected with the `acpica-tools` package.
mkdir acpi_tables && cd acpi_tables
acpidump -b
iasl -d *
-- inspect the *.dsl files
====
CEDT
====
The CXL Early Discovery Table is generated by BIOS to describe
the CXL devices present and configured (to some extent) at boot
by the BIOS.
# CHBS
The CXL Host Bridge Structure describes CXL host bridges. Other
than describing device register information, it reports the specific
host bridge UID for this host bridge. These host bridge ID's will
be referenced in other tables.
Debug hint: check that the host bridge IDs between tables are
consistent - stuff breaks oddly if they're not!
```
Subtable Type : 00 [CXL Host Bridge Structure]
Reserved : 00
Length : 0020
Associated host bridge : 00000007 <- Host bridge _UID
Specification version : 00000001
Reserved : 00000000
Register base : 0000010370400000
Register length : 0000000000010000
```
# CFMWS
The CXL Fixed Memory Window structure describes a memory region
associated with one or more CXL host bridges (as described by the
CHBS). It additionally describes any inter-host-bridge interleave
configuration that may have been programmed by BIOS. (Section 4)
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 000000C050000000 <- Memory Region
Window size : 0000003CA0000000
Interleave Members (2^n) : 01 <- Interleave configuration
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Next Target : 00000006 <- Host Bridge _UID
```
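
In this example, "Interleave Members (2^n) : 01" means 2^1 = 2 targets,
and the granularity encoding of 0 corresponds to 256B chunks (the spec
encodes granularity as 2^(n+8) bytes). With standard modulo arithmetic,
consecutive 256B chunks of the window alternate between host bridge 7
and host bridge 6 - i.e. HPA bit 8 selects the target host bridge.
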
INTRA-host-bridge interleave (multiple devices on one host bridge) is
NOT reported in this structure, and is solely defined via CXL device
decoder programming (host bridge and endpoint decoders). This will be
described later (Section 4 - Interleave)
====
SRAT
====
The System/Static Resource Affinity Table describes resource (CPU,
Memory) affinity to "Proximity Domains". This table is technically
optional, but for performance information (see "HMAT") to be enumerated
by linux it must be present.
# Proximity Domain
A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
1-to-1 mapping is not guaranteed. There are scenarios where "Proximity
Domain 4" may map to "NUMA Node 3", for example. (See "NUMA Node Creation")
# Memory Affinity
Generally speaking, if a host does any amount of CXL fabric (decoder)
programming in BIOS - an SRAT entry for that memory needs to be present.
```
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
Reserved1 : 0000
Base Address : 000000C050000000 <- Physical Memory Region
Address Length : 0000003CA0000000
Reserved2 : 00000000
Flags (decoded below) : 0000000B
Enabled : 1
Hot Pluggable : 1
Non-Volatile : 0
```
# Generic Initiator / Port
In the scenario where CXL devices are not present or configured by
BIOS, we may still want to generate proximity domain configurations
for those devices. The Generic Initiator interfaces are intended to
fill this gap, so that performance information can still be utilized
when the devices become available at runtime.
I won't cover the details here, for now, but I will link to the
proposal from Dan Williams and Jonathan Cameron if you would like
more information.
https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
====
HMAT
====
The Heterogeneous Memory Attributes Table contains information such as
cache attributes and bandwidth and latency details for memory proximity
domains. For the purpose of this document, we will only discuss the
SLLBI entry.
# SLLBI
The System Locality Latency and Bandwidth Information records latency
and bandwidth information for proximity domains. This table is used by
Linux to configure interleave weights and memory tiers.
```
Heavily truncated for brevity
Structure Type : 0001 [SLLBI]
Data Type : 00 <- Latency
Target Proximity Domain List : 00000000
Target Proximity Domain List : 00000001
Entry : 0080 <- DRAM LTC
Entry : 0100 <- CXL LTC
Structure Type : 0001 [SLLBI]
Data Type : 03 <- Bandwidth
Target Proximity Domain List : 00000000
Target Proximity Domain List : 00000001
Entry : 1200 <- DRAM BW
Entry : 0200 <- CXL BW
```
---------------------------------
Part 00: Linux Resource Creation.
---------------------------------
==================
NUMA node creation
===================
NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
identified at `__init` time, more specifically during `mm_init`.
What this means is that the CEDT and SRAT must contain sufficient
`proximity domain` information for linux to identify how many NUMA
nodes are required (and what memory regions to associate with them).
The relevant code exists in: linux/drivers/acpi/numa/srat.c
```
static int __init
acpi_parse_memory_affinity(union acpi_subtable_headers *header,
const unsigned long table_end)
{
... heavily truncated for brevity
pxm = ma->proximity_domain;
node = acpi_map_pxm_to_node(pxm);
if (numa_add_memblk(node, start, end) < 0)
....
node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
}
static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
void *arg, const unsigned long table_end)
{
... heavily truncated for brevity
/*
* The SRAT may have already described NUMA details for all,
* or a portion of, this CFMWS HPA range. Extend the memblks
* found for any portion of the window to cover the entire
* window.
*/
if (!numa_fill_memblks(start, end))
return 0;
/* No SRAT description. Create a new node. */
node = acpi_map_pxm_to_node(*fake_pxm);
if (numa_add_memblk(node, start, end) < 0)
....
node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
}
int __init acpi_numa_init(void)
{
...
if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
acpi_parse_memory_affinity, 0);
}
/* fake_pxm is the next unused PXM value after SRAT parsing */
acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
&fake_pxm);
```
Basically, the heuristic is as follows:
1) Add one NUMA node per Proximity Domain described in SRAT
2) If the SRAT describes all memory described by all CFMWS
- do not create nodes for CFMWS
3) If SRAT does not describe all memory described by CFMWS
- create a node for that CFMWS
Generally speaking, you will see one NUMA node per Host bridge, unless
inter-host-bridge interleave is in use (see Section 4 - Interleave).
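To make the heuristic concrete, below is a small self-contained sketch of
the decision (illustrative only - the real logic is the srat.c code above;
the second CFMWS window in the sketch is hypothetical):
```
/* Illustrative sketch of the SRAT/CFMWS node heuristic - not kernel code. */
#include <stdio.h>

struct range { unsigned long long start, end; };

/* hypothetical helper: does any SRAT memory-affinity entry overlap w? */
static int srat_overlaps(struct range w, const struct range *srat, int n)
{
        for (int i = 0; i < n; i++)
                if (srat[i].start < w.end && w.start < srat[i].end)
                        return 1;
        return 0;
}

int main(void)
{
        /* first window matches the SRAT/CFMWS sample above; second is made up */
        struct range srat[]  = { { 0xC050000000ULL, 0xFCF0000000ULL } };
        struct range cfmws[] = { { 0xC050000000ULL, 0xFCF0000000ULL },
                                 { 0x10000000000ULL, 0x10080000000ULL } };
        int next_fake_node = 2;  /* assumes nodes 0 and 1 came from SRAT */

        for (int i = 0; i < 2; i++) {
                if (srat_overlaps(cfmws[i], srat, 1))
                        printf("CFMWS %d: SRAT coverage - extend existing node\n", i);
                else
                        printf("CFMWS %d: no SRAT coverage - create node %d\n",
                               i, next_fake_node++);
        }
        return 0;
}
```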
============
Memory Tiers
============
The `abstract distance` of a node dictates what tier it lands in (and
therefore, what tiers are created). This is calculated based on the
following heuristic, using HMAT data:
```
int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
{
...
/*
* The abstract distance of a memory node is in direct proportion to
* its memory latency (read + write) and inversely proportional to its
* memory bandwidth (read + write). The abstract distance, memory
* latency, and memory bandwidth of the default DRAM nodes are used as
* the base.
*/
*adist = MEMTIER_ADISTANCE_DRAM *
(perf->read_latency + perf->write_latency) /
(default_dram_perf.read_latency + default_dram_perf.write_latency) *
(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
(perf->read_bandwidth + perf->write_bandwidth);
return 0;
}
```
Debugging hint: If you have DRAM and CXL memory in separate numa nodes
but only find 1 memory tier, validate the HMAT!
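Plugging the sample SLLBI entries from Part 0 into that formula gives a
feel for the scale. This treats the entries as relative values and assumes
read == write (which the truncated table doesn't actually state);
MEMTIER_ADISTANCE_DRAM is left symbolic:
```
/* Back-of-envelope version of the adistance math, using the sample SLLBI
 * numbers (DRAM latency 0x80, CXL latency 0x100, DRAM BW 0x1200, CXL BW
 * 0x200), treated as relative values with read == write assumed. */
#include <stdio.h>

int main(void)
{
        const long ADIST_DRAM = 1;   /* stand-in for MEMTIER_ADISTANCE_DRAM */
        long dram_lat = 0x0080, cxl_lat = 0x0100;
        long dram_bw  = 0x1200, cxl_bw  = 0x0200;

        long adist = ADIST_DRAM *
                     (cxl_lat + cxl_lat) / (dram_lat + dram_lat) *
                     (dram_bw + dram_bw) / (cxl_bw + cxl_bw);

        /* 2x the latency and 1/9th the bandwidth => 18x the abstract
         * distance, so the CXL node lands in a slower tier than DRAM. */
        printf("adist = %ld * MEMTIER_ADISTANCE_DRAM\n", adist);
        return 0;
}
```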
============================
Memory Tier Demotion Targets
============================
When `demotion` is enabled (see Section 5 - allocation), the reclaim
system may opportunistically demote a page from one memory tier to
another. The selection of a `demotion target` is partially based on
Abstract Distance and Performance Data.
```
An example of demotion targets from memory-tiers.c
/* Example 1:
*
* Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
*
* node distances:
* node 0 1 2 3
* 0 10 20 30 40
* 1 20 10 40 30
* 2 30 40 10 40
* 3 40 30 40 10
*
* memory_tiers0 = 0-1
* memory_tiers1 = 2-3
*
* node_demotion[0].preferred = 2
* node_demotion[1].preferred = 3
* node_demotion[2].preferred = <empty>
* node_demotion[3].preferred = <empty>
*/
```
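A minimal sketch of how those preferred targets fall out of the distance
matrix (closest node in the next lower tier). Illustrative only - this is
not the actual memory-tiers.c selection code:
```
/* Pick, for each top-tier node, the nearest node in the lower tier,
 * using the example distance matrix above. */
#include <stdio.h>

int main(void)
{
        int dist[4][4] = {
                { 10, 20, 30, 40 },
                { 20, 10, 40, 30 },
                { 30, 40, 10, 40 },
                { 40, 30, 40, 10 },
        };
        int upper[] = { 0, 1 };   /* memory_tier0: CPU + DRAM nodes */
        int lower[] = { 2, 3 };   /* memory_tier1: PMEM (or CXL) nodes */

        for (int i = 0; i < 2; i++) {
                int best = -1;
                for (int j = 0; j < 2; j++)
                        if (best < 0 || dist[upper[i]][lower[j]] < dist[upper[i]][best])
                                best = lower[j];
                printf("node_demotion[%d].preferred = %d\n", upper[i], best);
        }
        return 0;
}
```
This reproduces the node_demotion[0] = 2 and node_demotion[1] = 3 result
from the comment above.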
=============================
Mempolicy Weighted Interleave
=============================
The `weighted interleave` functionality of `mempolicy` uses per-node
weights to distribute memory allocations across NUMA nodes.
There is a proposal to auto-configure these weights based on HMAT data.
https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@gmail.com/T/#u
See Section 4 - Interleave, for more information on weighted interleave.
--------------
Build Options.
--------------
We can add these build configurations to our complexity picture.
CONFIG_NUMA      -- required for ACPI NUMA, mempolicy, and memory tiers
CONFIG_ACPI_NUMA -- enables SRAT and CEDT parsing
CONFIG_ACPI_HMAT -- enables HMAT parsing
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
@ 2025-03-05 22:44 ` Dave Jiang
2025-03-05 23:34 ` Gregory Price
2025-03-06 1:37 ` Yuquan Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 81+ messages in thread
From: Dave Jiang @ 2025-03-05 22:44 UTC (permalink / raw)
To: Gregory Price, lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
On 3/5/25 3:20 PM, Gregory Price wrote:
> --------------------
> Part 0: ACPI Tables.
> --------------------
> I considered publishing this section first, or at least under
> "Platform", but I've found this information largely useful in
> debugging interleave configurations and tiering mechanisms -
> which are higher level concepts.
Hi Gregory,
Thanks for detailing all this information. It has been a really good read.
Do you intend to also add CDAT information and device performance data calculation related to that? The SRAT/HMAT info only covers CXL memory that is already set up by the BIOS as system memory. Otherwise it only contains performance data for the Generic Port and not the rest of the path to the endpoint.
DJ
>
> Much of the information in this section is most relevant to
> Interleave (yet to be published Section 4).
>
> I promise not to simply regurgitate the entire ACPI specification
> and limit this to necessary commentary to describe how these tables
> relate to actual Linux resources (like numa nodes and tiers).
>
> At the very least, if you find yourself trying to figure out why
> your CXL system isn't producing NUMA nodes, memory tiers, root
> decoders, memory regions - etc - I would check these tables
> first for aberrations. Almost all my personal strife has been
> associated with ACPI table misconfiguration.
>
>
> ACPI tables can be inspected with the `acpica-tools` package.
> mkdir acpi_tables && cd acpi_tables
> acpidump -b
> iasl -d *
> -- inpect the *.dsl files
>
> ====
> CEDT
> ====
> The CXL Early Discovery Table is generated by BIOS to describe
> the CXL devices present and configured (to some extent) at boot
> by the BIOS.
>
> # CHBS
> The CXL Host Bridge Structure describes CXL host bridges. Other
> than describing device register information, it reports the specific
> host bridge UID for this host bridge. These host bridge ID's will
> be referenced in other tables.
>
> Debug hint: check that the host bridge IDs between tables are
> consistent - stuff breaks oddly if they're not!
>
> ```
> Subtable Type : 00 [CXL Host Bridge Structure]
> Reserved : 00
> Length : 0020
> Associated host bridge : 00000007 <- Host bridge _UID
> Specification version : 00000001
> Reserved : 00000000
> Register base : 0000010370400000
> Register length : 0000000000010000
> ```
>
> # CFMWS
> The CXL Fixed Memory Window structure describes a memory region
> associated with one or more CXL host bridges (as described by the
> CHBS). It additionally describes any inter-host-bridge interleave
> configuration that may have been programmed by BIOS. (Section 4)
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 000000C050000000 <- Memory Region
> Window size : 0000003CA0000000
> Interleave Members (2^n) : 01 <- Interleave configuration
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
> Next Target : 00000006 <- Host Bridge _UID
> ```
>
> INTRA-host-bridge interleave (multiple devices on one host bridge) is
> NOT reported in this structure, and is solely defined via CXL device
> decoder programming (host bridge and endpoint decoders). This will be
> described later (Section 4 - Interleave)
>
>
> ====
> SRAT
> ====
> The System/Static Resource Affinity Table describes resource (CPU,
> Memory) affinity to "Proximity Domains". This table is technically
> optional, but for performance information (see "HMAT") to be enumerated
> by linux it must be present.
>
>
> # Proximity Domain
> A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
> 1-to-1 mapping is not guaranteed. There are scenarios where "Proximity
> Domain 4" may map to "NUMA Node 3", for example. (See "NUMA Node Creation")
>
> # Memory Affinity
> Generally speaking, if a host does any amount of CXL fabric (decoder)
> programming in BIOS - an SRAT entry for that memory needs to be present.
>
> ```
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001 <- NUMA Node 1
> Reserved1 : 0000
> Base Address : 000000C050000000 <- Physical Memory Region
> Address Length : 0000003CA0000000
> Reserved2 : 00000000
> Flags (decoded below) : 0000000B
> Enabled : 1
> Hot Pluggable : 1
> Non-Volatile : 0
> ```
>
> # Generic Initiator / Port
> In the scenario where CXL devices are not present or configured by
> BIOS, we may still want to generate proximity domain configurations
> for those devices. The Generic Initiator interfaces are intended to
> fill this gap, so that performance information can still be utilized
> when the devices become available at runtime.
>
> I won't cover the details here, for now, but I will link to the
> proosal from Dan Williams and Jonathan Cameron if you would like
> more information.
> https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
>
> ====
> HMAT
> ====
> The Heterogeneous Memory Attributes Table contains information such as
> cache attributes and bandwidth and latency details for memory proximity
> domains. For the purpose of this document, we will only discuss the
> SSLIB entry.
>
> # SLLBI
> The System Locality Latency and Bandwidth Information records latency
> and bandwidth information for proximity domains. This table is used by
> Linux to configure interleave weights and memory tiers.
>
> ```
> Heavily truncated for brevity
> Structure Type : 0001 [SLLBI]
> Data Type : 00 <- Latency
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 0080 <- DRAM LTC
> Entry : 0100 <- CXL LTC
>
> Structure Type : 0001 [SLLBI]
> Data Type : 03 <- Bandwidth
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 1200 <- DRAM BW
> Entry : 0200 <- CXL BW
> ```
>
>
> ---------------------------------
> Part 00: Linux Resource Creation.
> ---------------------------------
>
> ==================
> NUMA node creation
> ===================
> NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> identified at `__init` time, more specifically during `mm_init`.
>
> What this means is that the CEDT and SRAT must contain sufficient
> `proximity domain` information for linux to identify how many NUMA
> nodes are required (and what memory regions to associate with them).
>
> The relevant code exists in: linux/drivers/acpi/numa/srat.c
> ```
> static int __init
> acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> const unsigned long table_end)
> {
> ... heavily truncated for brevity
> pxm = ma->proximity_domain;
> node = acpi_map_pxm_to_node(pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> void *arg, const unsigned long table_end)
> {
> ... heavily truncated for brevity
> /*
> * The SRAT may have already described NUMA details for all,
> * or a portion of, this CFMWS HPA range. Extend the memblks
> * found for any portion of the window to cover the entire
> * window.
> */
> if (!numa_fill_memblks(start, end))
> return 0;
>
> /* No SRAT description. Create a new node. */
> node = acpi_map_pxm_to_node(*fake_pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> int __init acpi_numa_init(void)
> {
> ...
> if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> acpi_parse_memory_affinity, 0);
> }
> /* fake_pxm is the next unused PXM value after SRAT parsing */
> acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> &fake_pxm);
>
> ```
>
> Basically, the heuristic is as follows:
> 1) Add one NUMA node per Proximity Domain described in SRAT
> 2) If the SRAT describes all memory described by all CFMWS
> - do not create nodes for CFMWS
> 3) If SRAT does not describe all memory described by CFMWS
> - create a node for that CFMWS
>
> Generally speaking, you will see one NUMA node per Host bridge, unless
> inter-host-bridge interleave is in use (see Section 4 - Interleave).
>
>
> ============
> Memory Tiers
> ============
> The `abstract distance` of a node dictates what tier it lands in (and
> therefore, what tiers are created). This is calculated based on the
> following heuristic, using HMAT data:
>
> ```
> int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
> {
> ...
> /*
> * The abstract distance of a memory node is in direct proportion to
> * its memory latency (read + write) and inversely proportional to its
> * memory bandwidth (read + write). The abstract distance, memory
> * latency, and memory bandwidth of the default DRAM nodes are used as
> * the base.
> */
> *adist = MEMTIER_ADISTANCE_DRAM *
> (perf->read_latency + perf->write_latency) /
> (default_dram_perf.read_latency + default_dram_perf.write_latency) *
> (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
> (perf->read_bandwidth + perf->write_bandwidth);
> return 0;
> }
> ```
>
> Debugging hint: If you have DRAM and CXL memory in separate numa nodes
> but only find 1 memory tier, validate the HMAT!
>
>
> ============================
> Memory Tier Demotion Targets
> ============================
> When `demotion` is enabled (see Section 5 - allocation), the reclaim
> system may opportunistically demote a page from one memory tier to
> another. The selection of a `demotion target` is partially based on
> Abstract Distance and Performance Data.
>
> ```
> An example of demotion targets from memory-tiers.c
> /* Example 1:
> *
> * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> *
> * node distances:
> * node 0 1 2 3
> * 0 10 20 30 40
> * 1 20 10 40 30
> * 2 30 40 10 40
> * 3 40 30 40 10
> *
> * memory_tiers0 = 0-1
> * memory_tiers1 = 2-3
> *
> * node_demotion[0].preferred = 2
> * node_demotion[1].preferred = 3
> * node_demotion[2].preferred = <empty>
> * node_demotion[3].preferred = <empty>
> */
> ```
>
> =============================
> Mempolicy Weighted Interleave
> =============================
> The `weighted interleave` functionality of `mempolicy` utilizes weights
> to distribute memory across NUMA nodes according to some set weight.
> There is a proposal to auto-configure these weights based on HMAT data.
>
> https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@gmail.com/T/#u
>
> See Section 4 - Interleave, for more information on weighted interleave.
>
>
>
> --------------
> Build Options.
> --------------
> We can add these build configurations to our complexity picture.
>
> CONFIG_NUMA - req for ACPI numa, mempolicy, and memory tiers
> CONFIG_ACPI_NUMA -- enables srat and cedt parsing
> CONFIG_ACPI_HMAT -- enables hmat parsing
>
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-05 22:44 ` Dave Jiang
@ 2025-03-05 23:34 ` Gregory Price
2025-03-05 23:41 ` Dave Jiang
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-05 23:34 UTC (permalink / raw)
To: Dave Jiang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Mar 05, 2025 at 03:44:13PM -0700, Dave Jiang wrote:
>
>
> On 3/5/25 3:20 PM, Gregory Price wrote:
> > --------------------
> > Part 0: ACPI Tables.
> > --------------------
> > I considered publishing this section first, or at least under
> > "Platform", but I've found this information largely useful in
> > debugging interleave configurations and tiering mechanisms -
> > which are higher level concepts.
>
> Hi Gregory,
> Thanks for detailing all this information. It has been a really good read.
>
> Do you intend to also add CDAT information and device performance data calculation related to that? The SRAT/HMAT info only covers CXL memory that are already setup by the BIOS as system memory. Otherwise it only contains performance data for the Generic Port and not the rest of the path to the endpoint.
>
Probably CDAT should land in here as well, though in the context of
simple volatile memory devices it seemed a bit overkill to include it.
I also don't have a ton of exposure to the GenPort flow of operations,
so I didn't want to delay what I do have here. If you have a
recommended addition - I do intend to go through and edit/reformat most
of this series after LSF/MM into a friendlier format of documentation.
I wanted to avoid dropping a 50-page writeup all at once with hopes of
getting feedback on each chunk to correct inaccuracies (see hotplug). So
I'm certainly open to adding whatever folks think is missing/important.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-05 23:34 ` Gregory Price
@ 2025-03-05 23:41 ` Dave Jiang
2025-03-06 0:09 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Dave Jiang @ 2025-03-05 23:41 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On 3/5/25 4:34 PM, Gregory Price wrote:
> On Wed, Mar 05, 2025 at 03:44:13PM -0700, Dave Jiang wrote:
>>
>>
>> On 3/5/25 3:20 PM, Gregory Price wrote:
>>> --------------------
>>> Part 0: ACPI Tables.
>>> --------------------
>>> I considered publishing this section first, or at least under
>>> "Platform", but I've found this information largely useful in
>>> debugging interleave configurations and tiering mechanisms -
>>> which are higher level concepts.
>>
>> Hi Gregory,
>> Thanks for detailing all this information. It has been a really good read.
>>
>> Do you intend to also add CDAT information and device performance data calculation related to that? The SRAT/HMAT info only covers CXL memory that are already setup by the BIOS as system memory. Otherwise it only contains performance data for the Generic Port and not the rest of the path to the endpoint.
>>
>
> Probably CDAT should land in here as well, though in the context of
> simple volatile memory devices it seemed a bit overkill to include it.
>
> I also don't have a ton of exposure to the GenPort flow of operations,
> so i didn't want to delay what I do have here. If you have a
> recommended addition - I do intend to go through and edit/reformat most
> of this series after LSF/MM into a friendlier format of documentation.
I can help write it if there's no great urgency. I'll try to find some time to create a draft and send it your way.
DJ
>
> I wanted to avoid dropping a 50 page writeup all at once with hopes of
> getting feedback on each chunk to correct inaccuracies (see hotplug). So
> I'm certainly open to adding whatever folks think is missing/important.
>
> ~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-05 23:41 ` Dave Jiang
@ 2025-03-06 0:09 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-06 0:09 UTC (permalink / raw)
To: Dave Jiang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Mar 05, 2025 at 04:41:05PM -0700, Dave Jiang wrote:
> I can help write it if there's no great urgency. I'll try to find some time creating a draft and send it your way.
>
No major urgency, this is a best-attempt at background info before LSF.
My goal at LSF is to simply talk about some of the rough edges and
misunderstandings and subtleties associated with "bringing up CXL
memory in XYZ ways". (e.g. what set of constraints you're actually
agreeing to when you bring up CXL in ZONE_MOVABLE or ZONE_NORMAL).
On the back half, I imagine re-working this whole thing. The structure
here is getting a bit unwieldy - which is to be expected, lots of
moving parts from boot to bash :P
Maybe we can look at converting this into a set of driver docs or
something - though I've been careful not to aggressively define
driver-internal operation since it's fluid.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
2025-03-05 22:44 ` Dave Jiang
@ 2025-03-06 1:37 ` Yuquan Wang
2025-03-06 17:08 ` Gregory Price
2025-03-08 3:23 ` [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity Gregory Price
2025-03-13 16:55 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Jonathan Cameron
3 siblings, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-06 1:37 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Mar 05, 2025 at 05:20:52PM -0500, Gregory Price wrote:
> ====
> SRAT
> ====
> The System/Static Resource Affinity Table describes resource (CPU,
> Memory) affinity to "Proximity Domains". This table is technically
> optional, but for performance information (see "HMAT") to be enumerated
> by linux it must be present.
>
>
> # Proximity Domain
> A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
> 1-to-1 mapping is not guaranteed. There are scenarios where "Proximity
> Domain 4" may map to "NUMA Node 3", for example. (See "NUMA Node Creation")
>
> # Memory Affinity
> Generally speaking, if a host does any amount of CXL fabric (decoder)
> programming in BIOS - an SRAT entry for that memory needs to be present.
>
> ```
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001 <- NUMA Node 1
> Reserved1 : 0000
> Base Address : 000000C050000000 <- Physical Memory Region
> Address Length : 0000003CA0000000
> Reserved2 : 00000000
> Flags (decoded below) : 0000000B
> Enabled : 1
> Hot Pluggable : 1
> Non-Volatile : 0
> ```
>
> # Generic Initiator / Port
> In the scenario where CXL devices are not present or configured by
> BIOS, we may still want to generate proximity domain configurations
> for those devices. The Generic Initiator interfaces are intended to
> fill this gap, so that performance information can still be utilized
> when the devices become available at runtime.
>
> I won't cover the details here, for now, but I will link to the
> proosal from Dan Williams and Jonathan Cameron if you would like
> more information.
> https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
>
> ====
> HMAT
> ====
> The Heterogeneous Memory Attributes Table contains information such as
> cache attributes and bandwidth and latency details for memory proximity
> domains. For the purpose of this document, we will only discuss the
> SSLIB entry.
>
> # SLLBI
> The System Locality Latency and Bandwidth Information records latency
> and bandwidth information for proximity domains. This table is used by
> Linux to configure interleave weights and memory tiers.
>
> ```
> Heavily truncated for brevity
> Structure Type : 0001 [SLLBI]
> Data Type : 00 <- Latency
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 0080 <- DRAM LTC
> Entry : 0100 <- CXL LTC
>
> Structure Type : 0001 [SLLBI]
> Data Type : 03 <- Bandwidth
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 1200 <- DRAM BW
> Entry : 0200 <- CXL BW
> ```
>
>
> ---------------------------------
> Part 00: Linux Resource Creation.
> ---------------------------------
>
> ==================
> NUMA node creation
> ===================
> NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> identified at `__init` time, more specifically during `mm_init`.
>
> What this means is that the CEDT and SRAT must contain sufficient
> `proximity domain` information for linux to identify how many NUMA
> nodes are required (and what memory regions to associate with them).
>
Hi, Gregory.
Recently, I found a corner case in CXL numa node creation.
Condition:
1) A UMA/NUMA system where the SRAT is absent, but the CEDT.CFMWS is present
2) CONFIG_ACPI_NUMA is enabled
Results:
1) acpi_numa_init: the fake_pxm will be 0 and is sent to acpi_parse_cfmws()
2) If a cxl ram region is created dynamically, the cxl memory is assigned
to node0 rather than a new fake node.
Confusions:
1) Does CXL memory usage require a NUMA system with an SRAT? As you
mentioned in the SRAT section:
"This table is technically optional, but for performance information
to be enumerated by linux it must be present."
Hence, as I understand it, this seems to be a bug in the kernel.
2) If it is a bug, could we forbid this situation by adding a fake_pxm
check and returning an error in acpi_numa_init()?
3) If not, maybe we can add some kernel logic to allow creating these fake
nodes on a system without an SRAT?
Yuquan
> The relevant code exists in: linux/drivers/acpi/numa/srat.c
> ```
> static int __init
> acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> const unsigned long table_end)
> {
> ... heavily truncated for brevity
> pxm = ma->proximity_domain;
> node = acpi_map_pxm_to_node(pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> void *arg, const unsigned long table_end)
> {
> ... heavily truncated for brevity
> /*
> * The SRAT may have already described NUMA details for all,
> * or a portion of, this CFMWS HPA range. Extend the memblks
> * found for any portion of the window to cover the entire
> * window.
> */
> if (!numa_fill_memblks(start, end))
> return 0;
>
> /* No SRAT description. Create a new node. */
> node = acpi_map_pxm_to_node(*fake_pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> int __init acpi_numa_init(void)
> {
> ...
> if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> acpi_parse_memory_affinity, 0);
> }
> /* fake_pxm is the next unused PXM value after SRAT parsing */
> acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> &fake_pxm);
>
> ```
>
> Basically, the heuristic is as follows:
> 1) Add one NUMA node per Proximity Domain described in SRAT
> 2) If the SRAT describes all memory described by all CFMWS
> - do not create nodes for CFMWS
> 3) If SRAT does not describe all memory described by CFMWS
> - create a node for that CFMWS
>
> Generally speaking, you will see one NUMA node per Host bridge, unless
> inter-host-bridge interleave is in use (see Section 4 - Interleave).
>
>
> ============
> Memory Tiers
> ============
> The `abstract distance` of a node dictates what tier it lands in (and
> therefore, what tiers are created). This is calculated based on the
> following heuristic, using HMAT data:
>
> ```
> int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
> {
> ...
> /*
> * The abstract distance of a memory node is in direct proportion to
> * its memory latency (read + write) and inversely proportional to its
> * memory bandwidth (read + write). The abstract distance, memory
> * latency, and memory bandwidth of the default DRAM nodes are used as
> * the base.
> */
> *adist = MEMTIER_ADISTANCE_DRAM *
> (perf->read_latency + perf->write_latency) /
> (default_dram_perf.read_latency + default_dram_perf.write_latency) *
> (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
> (perf->read_bandwidth + perf->write_bandwidth);
> return 0;
> }
> ```
>
> Debugging hint: If you have DRAM and CXL memory in separate numa nodes
> but only find 1 memory tier, validate the HMAT!
>
>
> ============================
> Memory Tier Demotion Targets
> ============================
> When `demotion` is enabled (see Section 5 - allocation), the reclaim
> system may opportunistically demote a page from one memory tier to
> another. The selection of a `demotion target` is partially based on
> Abstract Distance and Performance Data.
>
> ```
> An example of demotion targets from memory-tiers.c
> /* Example 1:
> *
> * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> *
> * node distances:
> * node 0 1 2 3
> * 0 10 20 30 40
> * 1 20 10 40 30
> * 2 30 40 10 40
> * 3 40 30 40 10
> *
> * memory_tiers0 = 0-1
> * memory_tiers1 = 2-3
> *
> * node_demotion[0].preferred = 2
> * node_demotion[1].preferred = 3
> * node_demotion[2].preferred = <empty>
> * node_demotion[3].preferred = <empty>
> */
> ```
>
> =============================
> Mempolicy Weighted Interleave
> =============================
> The `weighted interleave` functionality of `mempolicy` utilizes weights
> to distribute memory across NUMA nodes according to some set weight.
> There is a proposal to auto-configure these weights based on HMAT data.
>
> https://lore.kernel.org/linux-mm/20250305200506.2529583-1-joshua.hahnjy@gmail.com/T/#u
>
> See Section 4 - Interleave, for more information on weighted interleave.
>
>
>
> --------------
> Build Options.
> --------------
> We can add these build configurations to our complexity picture.
>
> CONFIG_NUMA - req for ACPI numa, mempolicy, and memory tiers
> CONFIG_ACPI_NUMA -- enables srat and cedt parsing
> CONFIG_ACPI_HMAT -- enables hmat parsing
>
>
> ~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-06 1:37 ` Yuquan Wang
@ 2025-03-06 17:08 ` Gregory Price
2025-03-07 2:20 ` Yuquan Wang
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-06 17:08 UTC (permalink / raw)
To: Yuquan Wang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 06, 2025 at 09:37:49AM +0800, Yuquan Wang wrote:
> On Wed, Mar 05, 2025 at 05:20:52PM -0500, Gregory Price wrote:
First, thank you for bringing this up, this is exactly the type of
ambiguity I was hoping others would contribute. It's difficult to
figure out if the ACPI tables are "Correct", if there are unimplemented
features, or if we're doing something wrong - because some of this is
undocumented theory of operation.
> > ==================
> > NUMA node creation
> > ===================
> > NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> > identified at `__init` time, more specifically during `mm_init`.
> >
> > What this means is that the CEDT and SRAT must contain sufficient
> > `proximity domain` information for linux to identify how many NUMA
> > nodes are required (and what memory regions to associate with them).
> >
> Condition:
> 1) A UMA/NUMA system that SRAT is absence, but it keeps CEDT.CFMWS
> 2)Enable CONFIG_ACPI_NUMA
>
> Results:
> 1) acpi_numa_init: the fake_pxm will be 0 and send to acpi_parse_cfmws()
> 2)If dynamically create cxl ram region, the cxl memory would be assigned
> to node0 rather than a fake new node.
>
This is very interesting. Can I ask a few questions:
1) is this real hardware or a VM?
2) By `dynamic creation` you mean leveraging cxl-cli (ndctl)?
2a) Is the BIOS programming decoders, or are you programming the
decoder after boot?
> Confusions:
> 1) Does CXL memory usage require a numa system with SRAT? As you
> mentioned in SRAT section:
>
> "This table is technically optional, but for performance information
> to be enumerated by linux it must be present."
>
> Hence, as I understand it, it seems a bug in kernel.
>
It's hard to say if this is a bug yet. It's either a bug, or your
system should have an SRAT to describe what the BIOS has done.
> 2) If it is a bug, could we forbid this situation by adding fake_pxm
> check and returning error in acpi_numa_init()?
>
> 3)If not, maybe we can add some kernel logic to allow create these fake
> nodes on a system without SRAT?
>
I think we should at least provide a warning (if the SRAT is expected
but missing) - but let's get some more information first.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
2025-02-06 0:47 ` Dan Williams
2025-03-04 1:32 ` Gregory Price
@ 2025-03-06 23:56 ` Gregory Price
2025-03-07 0:57 ` Zhijian Li (Fujitsu)
2025-04-02 6:45 ` Zhijian Li (Fujitsu)
2 siblings, 2 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-06 23:56 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
I decided to dig into decoder programming as an addendum to the
Driver section - where I said I *wouldn't* do this. It's important
though, when discussing interleave. So alas, we should at least have
some base understanding of what the heck decoders are actually doing.
This is not a regurgitation of the spec; you can think of it as closer to
a "Theory of Operation" or whatever. I will show discrete examples of
how ACPI tables, system memory map, and decoders relate.
----------------------------------------
Definitions: Addresses and HDM Decoders.
----------------------------------------
An HDM Decoder can be thought of, in shorthand, as a "routing" mechanism,
where a Physical Address is used to determine one of:
1) Fabric routing (i.e. which pipe to send a request down)
2) Address translation (Host to Device Physical Address)
In section 2, I referenced a simple device-to-decoder mapping:
root --- decoder0.0 -- Root Port Decoder
| |
port1 --- decoder1.0 -- Host Bridge Decoder
| |
endpoint0 --- decoder2.0 -- Endpoint Decoder
Barring any special innovations (cough) - endpoint decoders should
be the only decoders that actually "translate" addresses - at least
for basic volatile memory devices.
All other decoders (Root, Host Bridge, Switch, etc) should simply
forward DMA requests with the original Physical Address intact to
the correct downstream device.
For extra confusion, there are now 3 "Physical Address" domains
System Physical Address (SPA)
The physical address of some location according to linux.
This is the address you see in the system memory map.
Host Physical Address (HPA)
An abstract address used by decoders (I'll explain later)
Device Physical Address (DPA)
A device-local physical address (e.g. if a device has 1TB of
memory, its DPA range might be 0-0x10000000000)
----------------------------
DMA Routing (No Interleave).
----------------------------
Ok, we have some decoders and confusing physical address definitions,
how does a DMA actually go from processor to DRAM via these decoders?
Let's consider our simple fabric with 256MB of memory at SPA base 4GB.
Let's assume this was all set up statically by BIOS. We'd have the
following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
```
CEDT
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region
Window size : 0000000010000000 <- 256MB
Interleave Members (2^n) : 00 <- Not interleaved
Memory Map:
[mem 0x0000000100000000-0x0000000110000000] usable <- SPA
Decoders
root --- decoder0.0 -- range=[0x100000000, 0x110000000]
| |
port1 --- decoder1.0 -- range=[0x100000000, 0x110000000]
| |
endpoint0 --- decoder2.0 -- range=[0x100000000, 0x110000000]
```
When the CPU accesses an address in this range, the memory controller
will send the request down the CXL fabric. The following steps occur:
0) CPU accesses SPA(0x101234567)
1) root decoder identifies HPA(0x101234567) is valid and forwards
to host bridge associated with that address (port 1)
2) host bridge decoder identifies HPA(0x101234567) is valid and
forwards to endpoint associated with that address (endpoint0)
3) endpoint decoder identifies HPA(0x101234567) is valid and
translates that address to DPA(0x01234567).
4) The endpoint device uses DPA(0x01234567) to fulfill the request.
In this scenario, our endpoint has a DPA range of (0, 0x10000000),
but technically DPA address space is device-defined and may be sparse.
As you can see, the root and host bridge decoders simply "route" the
access to the next appropriate hop, while the endpoint decoder actually
does the translation work.
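For the non-interleaved case, the translation in step 3 is just a base
offset. A minimal sketch, assuming (as in the example) that the device's
DPA range starts at 0:
```
/* Sketch of the endpoint decoder's HPA -> DPA math for the example above.
 * Real decoders also account for interleave ways/granularity and a DPA
 * skip; both are zero here by assumption. */
#include <stdio.h>

int main(void)
{
        unsigned long long hpa_base = 0x100000000ULL;  /* decoder2.0 range start */
        unsigned long long dpa_base = 0x0ULL;          /* assumed device DPA base */
        unsigned long long hpa      = 0x101234567ULL;  /* the access from step 0 */

        unsigned long long dpa = dpa_base + (hpa - hpa_base);
        printf("HPA 0x%llx -> DPA 0x%llx\n", hpa, dpa);  /* 0x1234567 */
        return 0;
}
```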
What if instead, we had two 256MB endpoints on the same host bridge?
```
CEDT
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region
Window size : 0000000020000000 <- 512MB
Interleave Members (2^n) : 00 <- Not interleaved
Memory Map:
[mem 0x0000000100000000-0x0000000120000000] usable <- SPA
Decoders
decoder0.0
range=[0x100000000, 0x120000000]
|
decoder1.0
range=[0x100000000, 0x120000000]
/ \
decoder2.0                       decoder3.0
range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
```
We still only have a single root decoder and a single host bridge decoder,
each covering the entire 512MB range, but there are now 2 differently
programmed endpoint decoders.
This makes the routing a little more obvious. The root and host bridge
decoders cover the entire SPA space (512MB), while the endpoint decoders
only cover their own address space (256MB).
The host bridge in this case is responsible for routing the request to
the correct endpoint.
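That routing decision is a simple range match against the host bridge's
downstream endpoint decoders. A sketch (illustrative only, with a
hypothetical access address):
```
/* Range-match routing for the two-endpoint example above. */
#include <stdio.h>

struct decoder { const char *name; unsigned long long base, end; };

int main(void)
{
        struct decoder eps[] = {
                { "decoder2.0 (endpoint0)", 0x100000000ULL, 0x110000000ULL },
                { "decoder3.0 (endpoint1)", 0x110000000ULL, 0x120000000ULL },
        };
        unsigned long long hpa = 0x115000000ULL;   /* hypothetical access */

        for (int i = 0; i < 2; i++)
                if (hpa >= eps[i].base && hpa < eps[i].end) {
                        printf("HPA 0x%llx -> %s\n", hpa, eps[i].name);
                        break;
                }
        return 0;
}
```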
What if we had 2 endpoints, each attached to their own host bridges?
In this case we'd have 2 root decoders and 2 host bridge decoders.
```
CEDT
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region 1
Window size : 0000000010000000 <- 256MB
Interleave Members (2^n) : 00 <- Not interleaved
First Target : 00000007 <- Host Bridge _UID
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000110000000 <- Memory Region 2
Window size : 0000000010000000 <- 256MB
Interleave Members (2^n) : 00 <- Not interleaved
First Target : 00000006 <- Host Bridge _UID
Memory Map - this may or may not be collapsed depending on Linux arch
[mem 0x0000000100000000-0x0000000110000000] usable <- System Phys Address
[mem 0x0000000110000000-0x0000000120000000] usable <- System Phys Address
Decoders
decoder0.0 decoder1.0 - roots
[0x100000000, 0x110000000] [0x110000000, 0x120000000]
| |
decoder2.0 decoder3.0 - host bridges
[0x100000000, 0x110000000] [0x110000000, 0x120000000]
| |
decoder4.0 decoder5.0 - endpoints
[0x100000000, 0x110000000] [0x110000000, 0x120000000]
```
This scenario looks functionally the same as the first - with two distinct,
non-overlapping sets of decoders (any given SPA may only be serviced by
one device). The platform memory controller is responsible for routing
the address to the correct root decoder.
In Section 4 (Interleave) we'll discuss a bit how the interleave is
accomplished - as this depends on whether you are interleaving across
host bridges (aggregation) or within a host bridge (bifurcation).
---------------------------------------------
Nuance: Host Physical Address... translation?
---------------------------------------------
You might have noticed that all the addresses in the examples I showed
are direct subsets of their parent decoder address ranges. The root is
assigned a System Physical Address according to the system memory map,
and all decoders under it are a subset of that range.
You may have even noticed routing steps suddenly change from SPA to HPA
0) CPU accesses SPA(0x101234567)
1) root decoder identifies HPA(0x101234567) is valid and forwards
to host bridge associated with that address (port 1)
So what the heck is a "Host Physical Address"?
Why isn't everything just described as a "System Physical Address"?
CXL HDM decoders *definitionally* handle HPA to DPA translations.
That's it, that's the definition of an HPA.
On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
so all the decoders will appear to be programmed with SPA. The platform
MAY perform translation before a request is routed to the decoder complex.
I will cover an example of this in-depth in an interleave addendum.
So the answer is that some ambiguity exists regarding whether platforms
can/should do translation prior to HDM decoders even being utilized. So
for the sake of making everything more complicated and confusing for very
little value:
1) decoders definitionally do "HPA to DPA" translation
2) most of the time "SPA=HPA"
3) so decoders mostly do "SPA to DPA" translation
If you're confused, that's ok, I was too - and still am. But hopefully
between this section and Section 4 (Interleave) we can be marginally
less confused together.
-----------------------------------------------
Nuance: Memory Holes and Hotplug Memory Blocks!
-----------------------------------------------
Help, BIOS split my memory device across non-contiguous memory regions!
```
CEDT
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region 1
Window size : 0000000008000000 <- 128MB
Interleave Members (2^n) : 00 <- Not interleaved
First Target : 00000007 <- Host Bridge _UID
CEDT
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000110000000 <- Memory Region 2
Window size : 0000000008000000 <- 128MB
Interleave Members (2^n) : 00 <- Not interleaved
First Target : 00000007 <- Host Bridge _UID
Memory Map
[mem 0x0000000100000000-0x0000000107FFFFFF] usable <- SPA
[mem 0x0000000108000000-0x000000010FFFFFFF] reserved
[mem 0x0000000110000000-0x0000000117FFFFFF] usable <- SPA
```
Take a breath. Everything will be ok.
You can have multiple decoders at each point in the decoder complex!
(Most devices should implement multiple decoders).
```
Decoders
Root Port 0
/ \
decoder0.0 decoder0.1
[0x100000000, 0x108000000] [0x110000000, 0x118000000]
\ /
Host Bridge 7
/ \
decoder1.0 decoder1.1
[0x100000000, 0x108000000] [0x110000000, 0x118000000]
\ /
Endpoint 0
/ \
decoder2.0 decoder2.1
[0x100000000, 0x108000000] [0x110000000, 0x118000000]
```
If your BIOS adds a memory hole, it better also use multiple decoders.
Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
having size and alignment issues!
If your BIOS adds a memory hole, it better also do it on Linux hotplug
memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
block of capacity per CFMWS.
Oi, talk about some rough edges, right? :[
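Rough arithmetic behind that capacity-loss claim: only memory blocks fully
contained in a window can be hot-added/onlined, so a misaligned CFMWS wastes
the partial blocks at its edges. The window base and 4GB size below are
hypothetical; the 2GB block size is taken from the x86 note above:
```
/* Illustrative only: how much of a non-block-aligned window is onlineable. */
#include <stdio.h>

int main(void)
{
        unsigned long long blk   = 2ULL << 30;        /* 2GB hotplug memory block */
        unsigned long long start = 0x108000000ULL;    /* hypothetical window base */
        unsigned long long size  = 4ULL << 30;        /* hypothetical 4GB window  */
        unsigned long long end   = start + size;

        unsigned long long first = (start + blk - 1) / blk * blk;  /* align up   */
        unsigned long long last  = end / blk * blk;                /* align down */
        unsigned long long usable = last > first ? last - first : 0;

        printf("window %lluMB -> onlineable %lluMB (lost %lluMB)\n",
               size >> 20, usable >> 20, (size - usable) >> 20);
        return 0;
}
```
With these numbers the window loses one block's worth (2GB) of capacity,
which is the "1 hotplug memory block per CFMWS" rough edge above.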
---------------------------------------
Nuance: BIOS vs OS Programmed Decoders.
---------------------------------------
The driver can (and does) program these decoders. However, it's
entirely normal for BIOS/EFI to program decoders prior to OS init.
Earlier in section 2 I said:
Most associations built by the driver are done by validating decoders
What I meant by this is the driver does one of two things with decoders:
1) Detects BIOS programmed decoders and sanity checks them.
If an unexpected configuration is found, it bails out.
If it bails out and EFI_MEMORY_SP is set, that memory ends up inaccessible.
2) Provide an interface for user policy configuration of the decoders
For the most part, the mechanism is the same. This carve-out is to tell
you that if something isn't working, you should check whether the BIOS/EFI
or the driver programmed the decoders. It will help you debug the issue quicker.
In my experience, it's USUALLY a bad ACPI table.
This distinction will be more important in Section 4 (Interleave) when
we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-03-06 23:56 ` CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming Gregory Price
@ 2025-03-07 0:57 ` Zhijian Li (Fujitsu)
2025-03-07 15:07 ` Gregory Price
2025-04-02 6:45 ` Zhijian Li (Fujitsu)
1 sibling, 1 reply; 81+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-03-07 0:57 UTC (permalink / raw)
To: Gregory Price, lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org
Hey Gregory,
Thank you so much for your detailed introduction to the entire CXL
software ecosystem, which I have thoroughly read. You are truly excellent.
On 07/03/2025 07:56, Gregory Price wrote:
> I decided to dig into decoder programming as as an addendum to the
> Driver section - where I said I *wouldn't* do this. It's important
> though, when discussing interleave. So alas, we should at least have
> some base understanding of what the heck decoders are actually doing.
>
> This is not a regutitation of the spec, you can think of it closer to
> a "Theory of Operation" or whatever. I will show discrete examples of
> how ACPI tables, system memory map, and decoders relate.
>
> ----------------------------------------
> Definitions: Addresses and HDM Decoders.
> ----------------------------------------
>
> An HDM Decoder can be thought shorthand as a "routing" mechanism,
> where the a Physical Address is used to determine one of:
>
> 1) Fabric routing (i.e. which pipe to send a request down)
> 2) Address translation (Host to Device Physical Address)
>
> In section 2, I referenced a simple device-to-decoder mapping:
>
> root --- decoder0.0 -- Root Port Decoder
> | |
> port1 --- decoder1.0 -- Host Bridge Decoder
> | |
> endpoint0 --- decoder2.0 -- Endpoint Decoder
Here, I noticed something that differs slightly from my understanding:
"root --- decoder0.0 -- Root Port Decoder."
From the perspective of the Linux driver, decoder0.0 is usually associated
with a CFMWS. Moreover, according to Spec r3.1 Table 8-22 CXL HDM Decoder Capability,
the CXL Root Port (also known as R in the table) is not permitted to implement
the HDM decoder.
If I have misunderstood something, please let me know.
Thanks
Zhijian
>
> Barring any special innovations (cough) - endpoint decoders should
> be the only decoders that actually "Translation" addresses - at least
> for basic volatile memory devices.
>
> All other decoders (Root, Host Bridge, Switch, etc) should simply
> forward DMA requests with the original Physical Address intact to
> the correct downstream device.
>
> For extra confusion, there are now 3 "Physical Address" domains
>
> System Physical Address (SPA)
> The physical address of some location according to linux.
> This is the address you see in the system memory map.
>
> Host Physical Address (HPA)
> An abstract address used by decoders (I'll explain later)
>
> Device Physical Address (DPA)
> A device-local physical address (e.g. if a device has 1TB of
> memory, it's DPA range might be 0-0x10000000000)
>
>
> ----------------------------
> DMA Routing (No Interleave).
> ----------------------------
> Ok, we have some decoders and confusing physical address definitions,
> how does a DMA actually go from processor to DRAM via these decoders?
>
> Lets consider our simple fabric with 256MB of memory at SPA base 4GB.
>
> Lets assume this was all set up statically by BIOS. We'd have the
> following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000110000000] usable <- SPA
>
> Decoders
> root --- decoder0.0 -- range=[0x100000000, 0x110000000]
> | |
> port1 --- decoder1.0 -- range=[0x100000000, 0x110000000]
> | |
> endpoint0 --- decoder2.0 -- range=[0x100000000, 0x110000000]
> ```
>
> When the CPU accessed an address in this range, the memory controller
> will send the request down the CXL fabric. The following steps occur:
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
> to host bridge associated with that address (port 1)
>
> 2) host bridge decoder identifies HPA(0x101234567) is valid and
> forwards to endpoint associated with that address (endpoint0)
>
> 3) endpoint decoder identifies HPA(0x101234567) is valid and
> translates that address to DPA(0x01234567).
>
> 4) The endpoint device uses DPA(0x01234567) to fulfill the request.
>
> In this scenario, our endpoint has a DPA range of (0, 0x10000000),
> but technically DPA address space is device-defined and may be sparse.
>
> As you can see, the root and host bridge decoders simply "route" the
> access to the next appropriate hop, while the endpoint decoder actually
> does the translation work.
>
>
> What if instead, we had two 256MB endpoints on the same host bridge?
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000020000000 <- 512MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
>
> Decoders
> decoder0.0
> range=[0x100000000, 0x120000000]
> |
> decoder1.0
> range=[0x100000000, 0x120000000]
> / \
> decoded2.0 decoder3.0
> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> ```
>
> We still only have a single root port and host bridge decoder that
> covers the entire 512MB range, but there are now 2 differently
> programmed endpoint decoders.
>
> This makes the routing a little more obvious. The root and host bridge
> decoders cover the entire SPA space (512MB), while the endpoint decoders
> only cover their own address space (256MB).
>
> The host bridge in this case is responsible for routing the request to
> the correct endpoint.
>
>
> What if we had 2 endpoints, each attached to their own host bridges?
> In this case We'd have 2 root ports and host bridge decoders.
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000110000000 <- Memory Region 1
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000006 <- Host Bridge _UID
>
> Memory Map - this may or may not be collapsed depending on Linux arch
> [mem 0x0000000100000000-0x0000000110000000] usable <- System Phys Address
> [mem 0x0000000110000000-0x0000000120000000] usable <- System Phys Address
>
> Decoders
> decoder0.0 decoder1.0 - roots
> [0x100000000, 0x110000000] [0x110000000, 0x120000000]
> | |
> decoder2.0 decoder3.0 - host bridges
> [0x100000000, 0x110000000] [0x110000000, 0x120000000]
> | |
> decoder4.0 decoder5.0 - endpoints
> [0x100000000, 0x110000000] [0x110000000, 0x120000000]
> ```
>
> This scenario looks functionally same as the first - with two distinct,
> non-overlapping sets of decoders (any given SPA may only be services by
> one device). The platform memory controller is responsible for routing
> the address to the correct root decoder.
>
> In Section 4 (Interleave) we'll discuss a bit how the interleave is
> accomplished - as this depends whether you are interleaving across
> host bridges (aggregation) or within a host bridge (bifurcation).
>
>
>
> ---------------------------------------------
> Nuance: Host Physical Address... translation?
> ---------------------------------------------
>
> You might have noticed that all the addresses in the examples I showed
> are direct subsets of their parent decoder address ranges. The root is
> assigned a System Physical Address according to the system memory map,
> and all decoders under it are a subset of that range.
>
> You may have even noticed routing steps suddenly change from SPA to HPA
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
> to host bridge associated with that address (port 1)
>
> So what the heck is a "Host Physical Address"?
> Why isn't everything just described as a "System Physical Address"?
>
> CXL HDM decoders *definitionally* handle HPA to DPA translations.
>
> That's it, that's the definition of an HPA.
>
> On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
> so all the decoders will appear to be programmed with SPA. The platform
> MAY perform translation before a request is routed to decoder complex.
>
> I will cover an example of this in-depth in an interleave addendum.
>
> So the answer is that some ambiguity exists regarding whether platforms
> can/should do translation prior to HDM decoders even being utilized. So
> for the sake of making everything more complicated and confusing for very
> little value:
>
> 1) decoders definitionally do "HPA to DPA" translation
> 2) most of the time "SPA=HPA"
> 3) so decoders mostly do "SPA to DPA" translation
>
> If you're confused, that's ok, I was too - and still am. But Hopefully
> between this section and Section 4 (Interleave) we can be marginally
> less confused together.
>
>
> -----------------------------------------------
> Nuance: Memory Holes and Hotplug Memory Blocks!
> -----------------------------------------------
> Help, BIOS split my memory device across non-contiguous memory regions!
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000080000000 <- 128MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000110000000 <- Memory Region 1
> Window size : 0000000080000000 <- 128MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> Memory Map
> [mem 0x0000000100000000-0x0000000107FFFFFF] usable <- SPA
> [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
> [mem 0x0000000110000000-0x0000000118000000] usable <- SPA
> ```
>
> Take a breath. Everything will be ok.
>
> You can have multiple decoders at each point in the decoder complex!
> (Most devices should implement for multiple decoders).
>
> ```
> Decoders
> Root Port 0
> / \
> decoder0.0 decoder0.1
> [0x100000000, 0x108000000] [0x110000000, 0x118000000]
> \ /
> Host Bridge 7
> / \
> decoder1.0 decoder1.1
> [0x100000000, 0x108000000] [0x110000000, 0x118000000]
> \ /
> Endpoint 0
> / \
> decoder2.0 decoder2.1
> [0x100000000, 0x108000000] [0x110000000, 0x118000000]
> ```
>
> If your BIOS adds a memory hole, it better also use multiple decoders.
>
> Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
> having size and alignment issues!
>
> If your BIOS adds a memory hole, it better also do it on Linux hotplug
> memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
> block of capacity per CFMWS.
>
> Oi, talk about some rough edges, right? :[
>
> ---------------------------------------
> Nuance: BIOS vs OS Programmed Decoders.
> ---------------------------------------
> The driver can (and does) program these decoders. However, it's
> entirely normal for BIOS/EFI to program decoders prior to OS init.
>
> Earlier in section 2 I said:
> Most associations built by the driver are done by validating decoders
>
> What I meant by this is the driver does one of two things with decoders:
>
> 1) Detects BIOS programmed decoders and sanity checks them.
> If an unexpected configuration is found, it bails out.
> This memory is not accessible if EFI_MEMORY_SP is set.
>
> 2) Provide an interface for user policy configuration of the decoders
>
> For the most part, the mechanism is the same. This carve-out is to tell
> you if something isn't working, you should check whether the BIOS/EFI or
> driver programmed the decoders. It will help debug the issue quicker.
>
> In my experience, it's USUALLY a bad ACPI table.
>
> This distinction will be more important in Section 4 (Interleave) when
> we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-06 17:08 ` Gregory Price
@ 2025-03-07 2:20 ` Yuquan Wang
2025-03-07 15:12 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-07 2:20 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 06, 2025 at 12:08:49PM -0500, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 09:37:49AM +0800, Yuquan Wang wrote:
> > On Wed, Mar 05, 2025 at 05:20:52PM -0500, Gregory Price wrote:
>
> First, thank you for bringing this up, this is exactly the type of
> ambiguiuty i was hoping others would contribute. It's difficult to
> figure out if the ACPI tables are "Correct", if there's unimplemented
> features, or we're doing something wrong - because some of this is
> undocumented theory of operation.
>
Thank you for your patience in replying to my questions. :)
> > > ==================
> > > NUMA node creation
> > > ===================
> > > NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> > > identified at `__init` time, more specifically during `mm_init`.
> > >
> > > What this means is that the CEDT and SRAT must contain sufficient
> > > `proximity domain` information for linux to identify how many NUMA
> > > nodes are required (and what memory regions to associate with them).
> > >
> > Condition:
> > 1) A UMA/NUMA system that SRAT is absence, but it keeps CEDT.CFMWS
> > 2)Enable CONFIG_ACPI_NUMA
> >
> > Results:
> > 1) acpi_numa_init: the fake_pxm will be 0 and is passed to acpi_parse_cfmws()
> > 2) If a CXL ram region is created dynamically, the CXL memory is assigned
> > to node0 rather than to a fake new node.
> >
>
> This is very interesting. Can I ask a few questions:
>
> 1) is this real hardware or a VM?
Qemu VM (arm64 virt).
> 2) By `dynamic creation` you mean leveraging cxl-cli (ndctl)?
Yes. After boot, I used "cxl create-region".
> 2a) Is the BIOS programming decoders, or are you programming the
> decoder after boot?
Program the decoder after boot. It seems like the current BIOS for QEMU
cannot program CXL on either x86 (q35) or arm64 (virt). I am trying to
find a CXL-enabled BIOS for QEMU virt to do some tests.
>
>
> > Confusions:
> > 1) Does CXL memory usage require a numa system with SRAT? As you
> > mentioned in SRAT section:
> >
> > "This table is technically optional, but for performance information
> > to be enumerated by linux it must be present."
> >
> > Hence, as I understand it, it seems like a bug in the kernel.
> >
>
> It's hard to say if this is a bug yet. It's either a bug, or your
> system should have an SRAT to describe what the BIOS has done.
>
> > 2) If it is a bug, could we forbid this situation by adding a fake_pxm
> > check and returning an error in acpi_numa_init()?
> >
>
> > 3) If not, maybe we can add some kernel logic to allow creating these fake
> > nodes on a system without SRAT?
> >
>
> I think we should at least provide a warning (if the SRAT is expected
> but missing) - but let's get some more information first.
>
> ~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-03-07 0:57 ` Zhijian Li (Fujitsu)
@ 2025-03-07 15:07 ` Gregory Price
2025-03-11 2:48 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-07 15:07 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
On Fri, Mar 07, 2025 at 12:57:18AM +0000, Zhijian Li (Fujitsu) wrote:
> > In section 2, I referenced a simple device-to-decoder mapping:
> >
> > root --- decoder0.0 -- Root Port Decoder
> > | |
> > port1 --- decoder1.0 -- Host Bridge Decoder
> > | |
> > endpoint0 --- decoder2.0 -- Endpoint Decoder
>
> Here, I noticed something that differs slightly from my understanding:
> "root --- decoder0.0 -- Root Port Decoder."
>
> From the perspective of the Linux driver, decoder0.0 usually refers to
> an associated CFMWS. Moreover, according to Spec r3.1 Table 8-22 CXL HDM Decoder Capability,
> the CXL Root Port (also known as R in the table) is not permitted to implement
> the HDM decoder.
>
> If I have misunderstood something, please let me know.
You're indeed right that in the spec it says root ports do not have
decoder capability. What we have here may be some jumbling of CXL
spec language and Linux CXL driver language.
The decoder0.0 is a `root decoder`.
The `root_port` is a logical construct belonging to the CXL root
struct cxl_root {
struct cxl_port port; <--- root_port
}
A root_decoder is added to the CXL driver's `root_port` when
we parse the cfmws:
static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
struct cxl_cfmws_context *ctx)
{
...
struct cxl_root_decoder *cxlrd __free(put_cxlrd) =
cxl_root_decoder_alloc(root_port, ways);
...
}
And the `root_port` is a port with downstream ports - which are
presumably the host bridges
static int cxl_acpi_probe(struct platform_device *pdev)
{
cxl_root = devm_cxl_add_root(host, &acpi_root_ops);
^^^^^^^^ - Create "The CXL Root"
root_port = &cxl_root->port;
^^^^^^^^^ - The Root's "Port"
rc = bus_for_each_dev(adev->dev.bus, NULL, root_port,
add_host_bridge_dport);
^^^^^^^^^ - Add host bridges "downstream" of The Root's "Port"
...
ctx = (struct cxl_cfmws_context) {
.dev = host,
.root_port = root_port,
.cxl_res = cxl_res,
};
rc = acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, cxl_parse_cfmws, &ctx);
^^^^^^^^^ - Add "root decoders" to The Root's "Port"
}
If we look at what a root decoder is defined as in cxl/cxl.h:
* struct cxl_root_decoder - Static platform CXL address decoder
So this is just some semantic confusion - and the reality is the driver
simply refers to the first device in the fabric as "The Root", and every
device has "A Port", and so the "Root Port" just means it's the Root's
"Port" not a "CXL Specification Root Port".
Whatever the case, from the snippets above, you can see the CFMWS adds 1
"root decoder" per CFMWS - which makes sense, as a CFMWS can describe
multi-host-bridge interleave - i.e. whatever the actual root device is
must be upstream of the host bridges themselves.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-07 2:20 ` Yuquan Wang
@ 2025-03-07 15:12 ` Gregory Price
2025-03-13 17:00 ` Jonathan Cameron
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-07 15:12 UTC (permalink / raw)
To: Yuquan Wang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Fri, Mar 07, 2025 at 10:20:31AM +0800, Yuquan Wang wrote:
> > 2a) Is the BIOS programming decoders, or are you programming the
> > decoder after boot?
> Program the decoder after boot. It seems like currently bios for qemu could
> not programm cxl both on x86(q35) and arm64(virt). I am trying to find a
> cxl-enable bios for qemu virt to do some test.
What's likely happening here then is that QEMU is not emitting an SRAT
(either because the logic is missing or by design).
From other discussions, this may be the intention of the GenPort work,
which is intended to have placeholders in the SRAT for the Proximity
Domains for devices to be initialized later (i.e. dynamically).
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexibility
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
2025-03-05 22:44 ` Dave Jiang
2025-03-06 1:37 ` Yuquan Wang
@ 2025-03-08 3:23 ` Gregory Price
2025-03-13 17:20 ` Jonathan Cameron
2025-03-13 16:55 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Jonathan Cameron
3 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-08 3:23 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
In the last section we discussed how the CEDT CFMWS and SRAT Memory
Affinity structures are used by linux to "create" NUMA nodes (or at
least mark them as possible). However, the examples I used suggested
that there was a 1-to-1 relationship between CFMWS and devices or
host bridges.
This is not true - in fact, a CFMWS is simply a carve-out of System
Physical Address space which may be used to map any number of endpoint
devices behind the associated Host Bridge(s).
The limiting factor is what your platform vendor BIOS supports.
This section describes a handful of *possible* configurations, what NUMA
structure they will create, and what flexibility this provides.
All of these CFMWS configurations are made up, and may or may not exist
in real machines. They are a conceptual teaching tool, not a roadmap.
(When discussing interleave in this section, please note that I am
intentionally omitting details about decoder programming, as this
will be covered later.)
-------------------------------
One 2GB Device, Multiple CFMWS.
-------------------------------
Let's imagine we have one 2GB device attached to a host bridge.
In this example, the device hosts 2GB of persistent memory - but we
might want the flexibility to map capacity as volatile or persistent.
The platform vendor may decide that they want to reserve two entirely
separate system physical address ranges to represent the capacity.
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 000A <- Bit(3) - Persistent
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```
You might have a CEDT with two CFMWS as above, where the base addresses
are `0x100000000` and `0x200000000` respectively, but whose window sizes
cover the entire 2GB capacity of the device. This affords the user
flexibility in where the memory is mapped depending on whether it is
mapped as volatile or persistent, while keeping the two SPA ranges
separate.
This is allowed because the endpoint decoders commit device physical
address space *in order*, meaning a given region of device physical
address space can never back more than one system physical address range.
i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
(See Section 2a - decoder programming).
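To make that rule concrete, here is a minimal sketch (plain C, with
illustrative names only - this is not the driver's actual allocator) of
what "committed in order" means for device capacity:
```
#include <stdio.h>

/* Illustrative only: device-physical capacity is handed out as a strictly
 * increasing offset, so a given DPA range can back at most one SPA window. */
struct dpa_state {
        unsigned long long size;        /* total device capacity */
        unsigned long long next_free;   /* first unclaimed DPA offset */
};

/* Hand out the next DPA offset for a new region, or -1 if out of capacity. */
static long long claim_dpa(struct dpa_state *dev, unsigned long long len)
{
        if (dev->next_free + len > dev->size)
                return -1;
        long long dpa = (long long)dev->next_free;
        dev->next_free += len;          /* strictly increasing - never reused */
        return dpa;
}

int main(void)
{
        struct dpa_state dev = { .size = 2ULL << 30, .next_free = 0 };

        /* The first region claims DPA 0; it may be mapped into the volatile
         * *or* the persistent SPA window above, but never both at once. */
        printf("region0 dpa=%lld\n", claim_dpa(&dev, 1ULL << 30));
        printf("region1 dpa=%lld\n", claim_dpa(&dev, 1ULL << 30));
        return 0;
}
```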
-------------------------------
Two Devices On One Host Bridge.
-------------------------------
Let's say we have two CXL 2GB devices behind a single host bridge, and we
may or may not want to interleave some or all of those devices.
There are (at least) 2 ways to provide this flexibility.
First, we might simply have two CFMWS.
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region
Window size : 0000000080000000
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```
These CFMWS target the same host bridge, but are NOT necessarily limited
to mapping memory from any one device. We could program decoders in
either of the following ways.
Example: Host bridge and endpoints are programmed WITHOUT interleave.
```
Decoders
CXL Root
/ \
decoder0.0 decoder0.1
[0x100000000, 0x17FFFFFFF] [0x200000000, 0x27FFFFFFF]
\ /
Host Bridge
/ \
decoder1.0 decoder1.1
[0x100000000, 0x17FFFFFFF] [0x200000000, 0x27FFFFFFF]
| |
Endpoint 0 Endpoint 1
| |
decoder2.0 decoder3.0
[0x100000000, 0x17FFFFFFF] [0x200000000, 0x27FFFFFFF]
NUMA effect:
All of Endpoint 0 memory is on NUMA node A
All of Endpoint 1 memory is on NUMA node B
```
Alternatively, these decoders could be programmed to interleave memory
accesses across endpoints. We'll cover this configuration in-depth
later. For now, just know the above structure means that each endpoint
has its own NUMA node - but this is not required.
-------------------------------------------------------------
Two Devices On One Host Bridge - With and Without Interleave.
-------------------------------------------------------------
What if we wanted some capacity on each endpoint hosted on its own NUMA
node, and wanted to interleave a portion of each device capacity?
We could produce the following CFMWS configuration.
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region 1
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region 2
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region 3
Window size : 0000000100000000 <- 4GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
```
In this configuration, we could still do what we did with the prior
configuration (2 CFMWS), but we could also use the third root decoder
to simplify decoder programming of interleave.
Since the third region has sufficient capacity (4GB) to cover both
devices (2GB each), we can map the entire capacity of both devices
into that region.
We'll discuss this decoder structure in-depth in Section 4.
-------------------------------------
Two devices on separate host bridges.
-------------------------------------
We may have placed the devices on separate host bridges.
In this case we may naturally have one CFMWS per host bridge.
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region 1
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region 2
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000006 <- Host Bridge _UID
NUMA Effects: 2 NUMA nodes marked POSSIBLE
```
But we may also want to interleave *across* host bridges. To do this,
the platform vendor may add the following CFMWS (either by itself if
done statically, or in addition to the above two for flexibility).
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000100000000 <- 4GB
Interleave Members (2^n) : 01 <- 2-way interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge 7
Next Target : 00000006 <- Host Bridge 6
NUMA Effect: an additional NUMA node marked POSSIBLE
```
This greatly simplifies the decoder programming structure, and allows
us to aggregate bandwidth across host bridges. The decoder programming
might look as follows in this setup.
```
Decoders:
CXL Root
|
decoder0.0
[0x300000000, 0x3FFFFFFFF]
/ \
Host Bridge 7 Host Bridge 6
/ \
decoder1.0 decoder2.0
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
| |
Endpoint 0 Endpoint 1
| |
decoder3.0 decoder4.0
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
```
We'll discuss this more in depth in Section 4 - but you can see how
straightforward this is. All the decoders are programmed the same.
----------
SRAT Note.
----------
If you remember from the first portion of Section 0, the SRAT may be
used to statically assign memory regions to specific proximity domains.
```
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
Reserved1 : 0000
Base Address : 000000C050000000 <- Physical Memory Region
Address Length : 0000003CA0000000
```
There is a careful dance between the CEDT and SRAT tables and how NUMA
nodes are created. If things don't look quite the way you expect - check
the SRAT Memory Affinity entries and CEDT CFMWS to determine what your
platform actually supports in terms of flexible topologies.
--------
Summary.
--------
In the first part of Section 0 we showed how CFMWS and SRAT affect how
Linux creates NUMA nodes. Here we demonstrated that CFMWS do not have a
1-to-1 relationship with either CXL devices or Host Bridges.
Instead, a CFMWS is simply a System Physical Address carve-out which can
be used in a number of ways to define your memory topology in software.
This is a core piece of the "Software Defined Memory" puzzle.
How your platform vendor decides to program the CEDT will dictate how
flexibly you can manage CXL devices in software.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
` (2 preceding siblings ...)
2025-03-04 0:32 ` Gregory Price
@ 2025-03-10 10:45 ` Yuquan Wang
2025-03-10 14:19 ` Gregory Price
3 siblings, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-10 10:45 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
>
> Platform / BIOS / EFI Configuration
> ===================================
> ---------------------------------------
> Step 1: BIOS-time hardware programming.
> ---------------------------------------
>
> I don't want to focus on platform specifics, so really all you need
> to know about this phase for the purpose of MM is that platforms may
> program the CXL device hierarchy and lock the configuration.
>
This question might be naive: what's the meaning of 'MM' here?
Since I am not familiar with CXL BIOS configuration, and based on my
understanding of the resulting ACPI tables, I see roughly two
configuration schemes: a) the user enters some configuration information
manually (like region base/size), or b) the BIOS provides a recommended
configuration based on device information.
Is my understanding right?
> In practice it means you probably can't reconfigure things after boot
> without doing major teardowns of the devices and resetting them -
> assuming the platform doesn't have major quirks that prevent this.
>
> This has implications for Hotplug, Interleave, and RAS, but we'll
> cover those explicitly elsewhere. Otherwise, if something gets mucked
> up at this stage - complain to your platform / hardware vendor.
>
>
> ------------------------------------------------------------------
> Step 2: BIOS / EFI generates the CEDT (CXL Early Discovery Table).
> ------------------------------------------------------------------
>
> This table is responsible for reporting each "CXL Host Bridge" and
> "CXL Fixed Memory Window" present at boot - which enables early boot
> software to manage those devices and the memory capacity presented
> by those devices.
>
> Example CEDT Entries (truncated)
> Subtable Type : 00 [CXL Host Bridge Structure]
> Reserved : 00
> Length : 0020
> Associated host bridge : 00000005
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> If this memory is NOT marked "Special Purpose" by BIOS (next section),
> you should find a matching entry EFI Memory Map and /proc/iomem
>
> BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
> /proc/iomem: c050000000-fcefffffff : System RAM
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-03-10 10:45 ` Yuquan Wang
@ 2025-03-10 14:19 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-10 14:19 UTC (permalink / raw)
To: Yuquan Wang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Mon, Mar 10, 2025 at 06:45:12PM +0800, Yuquan Wang wrote:
> On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
> >
> > Platform / BIOS / EFI Configuration
> > ===================================
> > ---------------------------------------
> > Step 1: BIOS-time hardware programming.
> > ---------------------------------------
> >
> > I don't want to focus on platform specifics, so really all you need
> > to know about this phase for the purpose of MM is that platforms may
> > program the CXL device hierarchy and lock the configuration.
> >
> This question can be very naive, what's the meaning of 'MM' here?
>
Memory Management - linux/mm
> And since I am not familiar with cxl bios configurations, based on my
> understanding of its acpi results, there are roughly two configuration
> schemes in my analysis: a) users should enter some configuration
> information manually (like region base/size). b) bios could provide a
> recommendatory configuration by device information.
>
The BIOS must produce ACPI tables to set aside system physical memory
address space. *How* BIOS produces these ACPI tables (CEDT + SRAT vs
CEDT only) dictates whether this configuration is static or dynamic.
The devices will provide a CDAT (coherent device attribute table) used
by BIOS to generate these ACPI tables.
All of this dictates how linux configures its NUMA topology, programs
CXL HDM decoders, and how it associates device physical memory with numa
nodes and such.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-03-07 15:07 ` Gregory Price
@ 2025-03-11 2:48 ` Zhijian Li (Fujitsu)
0 siblings, 0 replies; 81+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-03-11 2:48 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
On 07/03/2025 23:07, Gregory Price wrote:
> So this is just some semantic confusion - and the reality is the driver
> simply refers to the first device in the fabric as "The Root", and every
> device has "A Port", and so the "Root Port" just means it's the Root's
> "Port" not a "CXL Specification Root Port".
>
> Whatever the case, from the snippets above, you can see the CFMWS adds 1
> "root decoder" per CFMWS - which makes sense, as a CFMWS can describe
> multi-host-bridge interleave - i.e. whatever the actual root device is
> must be upstream of the host bridges themselves.
Gregory,
Many thanks for your detailed explanation. This aligns with what I understood before.
Thanks
Zhijian
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-02-20 19:26 ` David Hildenbrand
2025-02-20 19:35 ` Gregory Price
@ 2025-03-11 14:53 ` Zi Yan
2025-03-11 15:58 ` Gregory Price
1 sibling, 1 reply; 81+ messages in thread
From: Zi Yan @ 2025-03-11 14:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: Gregory Price, Yang Shi, lsf-pc, linux-mm, linux-cxl,
linux-kernel
On 20 Feb 2025, at 14:26, David Hildenbrand wrote:
> On 20.02.25 19:43, Gregory Price wrote:
>> On Thu, Feb 20, 2025 at 09:50:07AM -0800, Yang Shi wrote:
>>>>> I will double check that this isn't working as expected, and i'll double
>>>>> check for a build option as well.
>>>>>
>>>>> stupid question - it sorta seems like you'd want this as the default
>>>>> setting for driver-managed hotplug memory blocks, but I suppose for
>>>>> very small blocks there's problems (as described in the docs).
>>>>
>>>> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
>>>> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>>>
>>> Regardless of ZONE_MOVABLE or ZONE_NORMAL, right?
>>>
>>> Thanks,
>>> Yang
>>
>> From my testing, yes.
>
> Yes, the only way to get some 1 GiB pages is by using larger memory blocks (e.g., 2 GiB on x86-64), which comes with a different set of issues (esp. hotplug granularity).
An alternative I can think of is to mark a hot-plugged memory block dedicated
to memmap and use it for new memory block’s memmap provision. In this way,
a 256MB memory block can be used for 256MB*(256MB/4MB)=16GB hot plugged memory.
Yes, it will waste memory before 256MB+16GB is online, but that might be
easier to handle than variable sized memory block, I suppose?
>
> Of course, only 1x usable 1 GiB page for each 2 GiB block.
>
> There were ideas in how to optimize that (e.g., requiring a new sysfs interface to expose variable-sized blocks), if anybody is interested, please reach out.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-03-11 14:53 ` Zi Yan
@ 2025-03-11 15:58 ` Gregory Price
2025-03-11 16:08 ` Zi Yan
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-11 15:58 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Yang Shi, lsf-pc, linux-mm, linux-cxl,
linux-kernel
On Tue, Mar 11, 2025 at 10:53:41AM -0400, Zi Yan wrote:
> On 20 Feb 2025, at 14:26, David Hildenbrand wrote:
>
> > Yes, the only way to get some 1 GiB pages is by using larger memory blocks (e.g., 2 GiB on x86-64), which comes with a different set of issues (esp. hotplug granularity).
>
> An alternative I can think of is to mark a hot-plugged memory block dedicated
> to memmap and use it for new memory block’s memmap provision. In this way,
> a 256MB memory block can be used for 256MB*(256MB/4MB)=16GB hot plugged memory.
> Yes, it will waste memory before 256MB+16GB is online, but that might be
> easier to handle than variable sized memory block, I suppose?
>
> >
The devil is in the details here. We'd need a way for the driver to
tell hotplug "use this for memmap for some yet-to-be-mapped region" -
rather than having that allocate naturally. Either this, or a special
ZONE specifically for memmap allocations.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-03-11 15:58 ` Gregory Price
@ 2025-03-11 16:08 ` Zi Yan
2025-03-11 16:15 ` Gregory Price
2025-03-11 16:35 ` Oscar Salvador
0 siblings, 2 replies; 81+ messages in thread
From: Zi Yan @ 2025-03-11 16:08 UTC (permalink / raw)
To: Gregory Price
Cc: David Hildenbrand, Yang Shi, lsf-pc, linux-mm, linux-cxl,
linux-kernel
On 11 Mar 2025, at 11:58, Gregory Price wrote:
> On Tue, Mar 11, 2025 at 10:53:41AM -0400, Zi Yan wrote:
>> On 20 Feb 2025, at 14:26, David Hildenbrand wrote:
>>
>>> Yes, the only way to get some 1 GiB pages is by using larger memory blocks (e.g., 2 GiB on x86-64), which comes with a different set of issues (esp. hotplug granularity).
>>
>> An alternative I can think of is to mark a hot-plugged memory block dedicated
>> to memmap and use it for new memory block’s memmap provision. In this way,
>> a 256MB memory block can be used for 256MB*(256MB/4MB)=16GB hot plugged memory.
>> Yes, it will waste memory before 256MB+16GB is online, but that might be
>> easier to handle than variable sized memory block, I suppose?
>>
>>>
>
> The devil is in the details here. We'd need a way for the driver to
> tell hotplug "use this for memmap for some yet-to-be-mapped region" -
> rather than having that allocate naturally. Either this, or a special
> ZONE specifically for memmap allocations.
Or a new option for memmap_on_memory like “use_whole_block”, then hotplug
code checks altmap is NULL or not when a memory block is plugged.
If altmap is NULL, the hot plugged memory block is used as memmap,
otherwise, the memory block is plugged as normal memory. The code also
needs to maintain which part of the altmap is used to tell whether
the memmap’d memory block can be offline or not.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-03-11 16:08 ` Zi Yan
@ 2025-03-11 16:15 ` Gregory Price
2025-03-11 16:35 ` Oscar Salvador
1 sibling, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-11 16:15 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand, Yang Shi, lsf-pc, linux-mm, linux-cxl,
linux-kernel
On Tue, Mar 11, 2025 at 12:08:03PM -0400, Zi Yan wrote:
> On 11 Mar 2025, at 11:58, Gregory Price wrote:
>
> > On Tue, Mar 11, 2025 at 10:53:41AM -0400, Zi Yan wrote:
> >> On 20 Feb 2025, at 14:26, David Hildenbrand wrote:
> >>
> >>> Yes, the only way to get some 1 GiB pages is by using larger memory blocks (e.g., 2 GiB on x86-64), which comes with a different set of issues (esp. hotplug granularity).
> >>
> >> An alternative I can think of is to mark a hot-plugged memory block dedicated
> >> to memmap and use it for new memory block’s memmap provision. In this way,
> >> a 256MB memory block can be used for 256MB*(256MB/4MB)=16GB hot plugged memory.
> >> Yes, it will waste memory before 256MB+16GB is online, but that might be
> >> easier to handle than variable sized memory block, I suppose?
> >>
> >>>
> >
> > The devil is in the details here. We'd need a way for the driver to
> > tell hotplug "use this for memmap for some yet-to-be-mapped region" -
> > rather than having that allocate naturally. Either this, or a special
> > ZONE specifically for memmap allocations.
>
> Or a new option for memmap_on_memory like “use_whole_block”, then hotplug
> code checks altmap is NULL or not when a memory block is plugged.
> If altmap is NULL, the hot plugged memory block is used as memmap,
> otherwise, the memory block is plugged as normal memory. The code also
> needs to maintain which part of the altmap is used to tell whether
> the memmap’d memory block can be offline or not.
>
Also just to be clear, this is only an issue if you intend to use CXL
memory for something like 1GB Gigantic pages - which do not support
ZONE_MOVABLE anyway. So for this to matter your system must:
1) Require smaller than 1GB alignment for memblocks, and
2) Use ZONE_NORMAL CXL memory.
The whole thing is mitigated by telling your platform vendor to align
the memory base to 2GB and having DCDs allocate in 2GB-aligned extents.
Both of these are reasonable requirements for hardware; requiring a
major kernel reconfiguration seems less reasonable.
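For a rough sense of what misalignment costs, here is a small sketch
(plain C, made-up numbers) of how much of a window survives Linux
memory-block alignment:
```
#include <stdio.h>

static unsigned long long align_up(unsigned long long x, unsigned long long a)
{
        return (x + a - 1) / a * a;
}

static unsigned long long align_down(unsigned long long x, unsigned long long a)
{
        return x / a * a;
}

int main(void)
{
        unsigned long long bs = 2ULL << 30;       /* 2GB memory block size */
        /* A 32GB window whose base is off by 512MB from block alignment. */
        unsigned long long base = 0x100000000ULL + (512ULL << 20);
        unsigned long long size = 32ULL << 30;

        unsigned long long usable = align_down(base + size, bs) -
                                    align_up(base, bs);
        printf("lost %llu MB to memory block alignment\n",
               (size - usable) >> 20);
        return 0;
}
```
With a 2GB-aligned base (and 2GB-aligned extents) nothing is lost - which
is the argument for pushing the requirement onto the platform rather than
onto the kernel.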
We should continue talking, but I've reached the conclusion that simply
telling platform vendors to fix their alignment is a better overall
solution.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
2025-03-11 16:08 ` Zi Yan
2025-03-11 16:15 ` Gregory Price
@ 2025-03-11 16:35 ` Oscar Salvador
1 sibling, 0 replies; 81+ messages in thread
From: Oscar Salvador @ 2025-03-11 16:35 UTC (permalink / raw)
To: Zi Yan
Cc: Gregory Price, David Hildenbrand, Yang Shi, lsf-pc, linux-mm,
linux-cxl, linux-kernel
On Tue, Mar 11, 2025 at 12:08:03PM -0400, Zi Yan wrote:
> Or a new option for memmap_on_memory like “use_whole_block”, then hotplug
> code checks altmap is NULL or not when a memory block is plugged.
> If altmap is NULL, the hot plugged memory block is used as memmap,
> otherwise, the memory block is plugged as normal memory. The code also
> needs to maintain which part of the altmap is used to tell whether
> the memmap’d memory block can be offline or not.
One of the first versions of memmap_on_memory did not have the restriction of
working only per memblock, so one could hot-plug more than a memory block's
worth of memory using memmap_on_memory, meaning that we could end up with
memblocks containing only memmap pages.
Now, we decided to go with only one memblock at a time because of simplicity
and also because we did not have any real-world scenarios that needed it,
besides being able to have larger contiguous memory for e.g. hugetlb.
If people think that there is more to it, we could revisit that and see how it
would look nowadays. Maybe it is not too much of a surgery. (or maybe it is :-))
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 81+ messages in thread
* [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
` (3 preceding siblings ...)
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
@ 2025-03-12 0:09 ` Gregory Price
2025-03-13 8:31 ` Yuquan Wang
2025-03-26 9:28 ` Yuquan Wang
2025-03-14 3:21 ` [LSF/MM] CXL Boot to Bash - Section 6: Page allocation Gregory Price
2025-03-18 17:09 ` [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices Gregory Price
6 siblings, 2 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-12 0:09 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
In this section we'll cover a few different interleave mechanisms, some
of which require CXL decoder programming. We'll discuss some of the
platform implications of hardware-interleave and how that affects driver
support, as well as software-based interleave.
Terminology
- Interleave Ways (IW): Number of downstream targets in the interleave
- Interleave Granularity (IG): The size of the interleaved data
(typically 256B-16KB, or 1 page)
- Hardware Interleave: Interleave done in CXL decoders
- Software Interleave: Interleave done by Linux (mempolicy, libnuma).
--------------------------------
Hardware Vs Software Interleave.
--------------------------------
CXL hardware interleave is a memory interleave mechanism which utilizes
hardware decoders to spread System Physical Address accesses across a
number of devices, transparently to the operating system. A similar
technique is used on DRAM attached to a single socket.
We imagine physical memory as a linear construct, where a physical address
implies the use of a specific piece of hardware. In reality, hardware
interleave spreads accesses (typically at some multiple of cache line
granularity) across many devices to make better use of the available
bandwidth.
Imagine a system with 4GB of RAM in the address range 0x0-0xFFFFFFFF.
We often think of this memory linearly, where the first 2GB might be
the first DIMM and the second 2GB belongs to the next. But in reality,
when hardware interleave is in use, it may spread cache lines across
the DIMMs.
Simple Model Reality
--------------- ---------------
| DIMM0 | 0x00000000 | DIMM0 |
| DIMM0 | | DIMM1 |
| DIMM0 | ... | DIMM0 |
| DIMM0 | | DIMM1 |
| DIMM1 | 0x80000000 | DIMM0 |
| DIMM1 | | DIMM1 |
| DIMM1 | | DIMM0 |
| DIMM1 | | DIMM1 |
--------------- ---------------
Software interleave, by contrast, concerns itself with managing
interleave among multiple NUMA nodes - where each node has different
performance characteristics. This is typically done on a page-boundary
and is enforced by the kernel allocation and mempolicy system.
You can visualize this as a series of allocation calls returning pages
on different nodes. In reality this occurs on fault (first access)
instead of malloc, but this is an easier way to think about it.
1:1 Interleave between two nodes.
malloc(4096) -> node0
malloc(4096) -> node1
malloc(4096) -> node0
malloc(4096) -> node1
... and so on ...
These techniques are not mutually exclusive, and the granularity/ways
of interleave may differ between hardware and software interleave.
-----------------------------
Inter-Host-Bridge Interleave.
-----------------------------
Imagine a system configuration where we've placed 2 CXL devices, each
on its own dedicated Host Bridge. Maybe each CXL device is capable of
a full x16 PCIE link, and we want to aggregate the bandwidth of these
devices by interleaving across host bridges.
This setup requires the BIOS to create a CEDT CFMWS which reports the
intent to interleave across host bridges. This is typically because the
chipset memory controller needs to be made aware of how to route accesses
to each host bridge, which is platform specific.
In the following case, the BIOS has configured a single 4GB memory region
which interleaves capacity across two Host Bridges (7 and 6).
```
CFMWS:
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000100000000 <- 4GB
Interleave Members (2^n) : 01 <- 2-way interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Next Target : 00000006 <- Host Bridge _UID
```
Assuming no other CEDT or SRAT entries exist, this will result in Linux
creating the following NUMA topology, where all CXL memory is in Node 1.
```
NUMA Structure:
---------- -------- | ----------
| cpu0 |-----| DRAM |---|----| Node 0 |
---------- -------- | ----------
/ \ |
------- ------- | ----------
| HB0 |-----| HB1 |------------|----| Node 1 |
------- ------- | ----------
| | |
CXL Dev CXL Dev |
```
In this scenario, we program the decoders like so:
```
Decoders:
CXL Root
|
decoder0.0
IW:2 IG:256
[0x300000000, 0x3FFFFFFFF]
/ \
Host Bridge 7 Host Bridge 6
/ \
decoder1.0 decoder2.0
IW:1 IG:512 IW:1 IG:512
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
| |
Endpoint 0 Endpoint 1
| |
decoder3.0 decoder4.0
IW:2 IG:256 IW:2 IG:256
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
```
Notice the Host Bridge ways and granularity differ from the root and
endpoints. In the fabric (root through everything but endpoints),
Interleave ways are *target-count per-leg* and the granularity is the
parent's (IW * IG).
Host Bridge Decoder:
IW = 1 = number of targets
IG = 512 = Parent IW * Parent IG (2 * 256)
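To make the IW/IG arithmetic concrete, here is a minimal sketch in plain C
(not driver code; the helper names are invented) of how modulo-arithmetic
interleave routes an address in the 2-way, 256B example above:
```
#include <stdio.h>

/* Which interleave position (0..ways-1) services a given offset into the
 * window? Assumes modulo arithmetic and power-of-2 ways/granularity. */
static unsigned int interleave_target(unsigned long long offset,
                                      unsigned int ways, unsigned int gran)
{
        return (offset / gran) % ways;
}

/* Device-physical offset the selected endpoint sees for that window offset. */
static unsigned long long interleave_dpa(unsigned long long offset,
                                         unsigned int ways, unsigned int gran)
{
        return (offset / (ways * gran)) * gran + (offset % gran);
}

int main(void)
{
        /* HPA 0x300000300 is 0x300 bytes into the window above. */
        unsigned long long offset = 0x300000300ULL - 0x300000000ULL;

        /* Fourth 256B chunk -> target 1, and it is that endpoint's second
         * chunk, so the endpoint sees DPA offset 0x100. */
        printf("target=%u dpa=0x%llx\n",
               interleave_target(offset, 2, 256),
               interleave_dpa(offset, 2, 256));
        return 0;
}
```
The endpoint decoders perform the reverse of this translation in hardware,
and the same arithmetic extends to the 4-way combination example later in
this section.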
-----------------------------
Intra-Host-Bridge Interleave.
-----------------------------
Now let's consider a system where we've placed 2 CXL devices on the same
Host Bridge. Maybe each CXL device is only capable of x8 PCIE, and we
want to make full use of a single x16 link.
This setup only requires the BIOS to create a CEDT CFMWS which reports
the entire capacity of all devices under the host bridge, but does not
need to set up any interleaving.
In the following case, the BIOS has configured a single 4GB memory region
which only targets the single host bridge, but maps the entire memory
capacity of both devices (2GB).
```
CFMWS:
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00 <- No host bridge interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
```
Assuming no other CEDT or SRAT entries exist, this will result in linux
creating the following NUMA topology, where all CXL memory is in Node 1.
```
NUMA Structure:
--------- -------- | ----------
| cpu0 |-----| DRAM |---|----| Node 0 |
--------- -------- | ----------
| |
------- | ----------
| HB0 |-----------------|----| Node 1 |
------- | ----------
/ \ |
CXL Dev CXL Dev |
```
In this scenario, we program the decoders like so:
```
Decoders
CXL Root
|
decoder0.0
IW:1 IG:256
[0x300000000, 0x3FFFFFFFF]
|
Host Bridge
|
decoder1.0
IW:2 IG:256
[0x300000000, 0x3FFFFFFFF]
/ \
Endpoint 0 Endpoint 1
| |
decoder2.0 decoder3.0
IW:2 IG:256 IW:2 IG:256
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
```
The root decoder in this scenario does not participate in interleave;
it simply forwards all accesses in this range to the host bridge.
The host bridge then applies the interleave across its connected devices,
and the endpoint decoders apply translation accordingly.
-----------------------
Combination Interleave.
-----------------------
Let's now consider a system where 2 Host Bridges have 2 CXL devices each,
and we want to interleave the entire set. This requires us to make use
of both inter and intra host bridge interleave.
First, we can interleave this with a single CEDT entry, the same as
the first inter-host-bridge CEDT (now assuming 1GB per device).
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000100000000 <- 4GB
Interleave Members (2^n) : 01 <- 2-way interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID
Next Target : 00000006 <- Host Bridge _UID
```
This gives us a NUMA structure as follows:
```
NUMA Structure:
---------- -------- | ----------
| cpu0 |-----| DRAM |----|---| Node 0 |
---------- -------- | ----------
/ \ |
------- ------- | ----------
| HB0 |-----| HB1 |-------------|---| Node 1 |
------- ------- | ----------
/ \ / \ |
CXL0 CXL1 CXL2 CXL3 |
```
And the respective decoder programming looks as follows
```
Decoders:
CXL Root
|
decoder0.0
IW:2 IG:256
[0x300000000, 0x3FFFFFFFF]
/ \
Host Bridge 7 Host Bridge 6
/ \
decoder1.0 decoder2.0
IW:2 IG:512 IW:2 IG:512
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
/ \ / \
endpoint0 endpoint1 endpoint2 endpoint3
| | | |
decoder3.0 decoder4.0 decoder5.0 decoder6.0
IW:4 IG:256 IW:4 IG:256
[0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
```
Notice at both the root and the host bridge, the Interleave Ways is 2.
There are two targets at each level. The host bridge has a granularity
of 512 to capture its parent's ways and granularity (`2*256`).
Each endpoint decoder is programmed with the total number of targets (4)
and the
overall granularity (256B).
We might use this setup if each CXL device is capable of x8 PCIE, and
we have 2 Host Bridges capable of full x16 - utilizing all bandwidth
available.
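As a sanity check on the programming above, here's a hedged sketch (plain
C, illustrative names) that walks the first four 256B chunks of the window,
applying the root selection (IW:2 IG:256) and then the host bridge
selection (IW:2 IG:512):
```
#include <stdio.h>

/* Modulo interleave selector: which of `ways` legs gets this offset? */
static unsigned int pick(unsigned long long offset, unsigned int ways,
                         unsigned int gran)
{
        return (offset / gran) % ways;
}

int main(void)
{
        /* From the diagram: HB7 -> {endpoint0, endpoint1},
         *                   HB6 -> {endpoint2, endpoint3}. */
        const char *eps[2][2] = { { "endpoint0", "endpoint1" },
                                  { "endpoint2", "endpoint3" } };

        for (unsigned long long off = 0; off < 4 * 256; off += 256) {
                unsigned int hb = pick(off, 2, 256);   /* root: IW:2 IG:256 */
                unsigned int ep = pick(off, 2, 512);   /* HB:   IW:2 IG:512 */
                printf("offset 0x%03llx -> HB%u -> %s\n",
                       off, hb ? 6 : 7, eps[hb][ep]);
        }
        return 0;
}
```
Each endpoint ends up owning every 4th 256B chunk - exactly what the
endpoints' IW:4 IG:256 programming describes for translation.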
---------------------------------------------
Nuance: Hardware Interleave and Memory Holes.
---------------------------------------------
You may encounter a system which cannot place the entire memory capacity
into a single contiguous System Physical Address range. That's ok,
because we can just use multiple decoders to capture this nuance.
Most CXL devices allow for multiple decoders.
This may require an SRAT entry to keep these regions on the same node.
(Obviously this relies on your platform vendor's BIOS.)
```
CFMWS:
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00 <- No host bridge interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge 7
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000400000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00 <- No host bridge interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge 7
SRAT:
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
Reserved1 : 0000
Base Address : 0000000300000000 <- Physical Memory Region
Address Length : 0000000080000000 <- first 2GB
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
Reserved1 : 0000
Base Address : 0000000400000000 <- Physical Memory Region
Address Length : 0000000080000000 <- second 2GB
```
The SRAT entries allow us to keep the regions attached to the same node.
```
NUMA Structure:
--------- -------- | ----------
| cpu0 |-----| DRAM |---|----| Node 0 |
--------- -------- | ----------
| |
------- | ----------
| HB0 |-----------------|----| Node 1 |
------- | ----------
/ \ |
CXL Dev CXL Dev |
```
And the decoder programming would look like so
```
Decoders:
CXL Root
/ \
decoder0.0 decoder0.1
IW:1 IG:256 IW:1 IG:256
[0x300000000, 0x37FFFFFFF] [0x400000000, 0x47FFFFFFF]
\ /
Host Bridge
/ \
decoder1.0 decoder1.1
IW:2 IG:256 IW:2 IG:256
[0x300000000, 0x37FFFFFFF] [0x400000000, 0x47FFFFFFF]
/ \ / \
Endpoint 0 Endpoint 1 Endpoint 0 Endpoint 1
| | | |
decoder2.0 decoder3.0 decoder2.1 decoder3.1
IW:2 IG:256 IW:2 IG:256
[0x300000000, 0x37FFFFFFF] [0x400000000, 0x47FFFFFFF]
```
Linux manages decoders in relation to the associated component, so
decoders are N.M where N is the component and M is the decoder number.
If you look, you'll see each side of this tree looks individually
equivalent to the intra-host-bridge interleave example, just with one
half of the total memory each (matching the CFMWS ranges).
Each of the root decoders still has an interleave width of 1 because
they both only target one host bridge (despite it being the same one).
--------------------------------
Software Interleave (Mempolicy).
--------------------------------
Linux provides a software mechanism to allow a task to interleave its
memory across NUMA nodes - which may have different performance
characteristics. This component is called `mempolicy`, and is primarily
operated on with the `set_mempolicy()` and `mbind()` syscalls.
These syscalls take a nodemask (bitmask representing NUMA node ids) as
an argument to describe the intended allocation policy of the task.
The following policies are presently supported (as of v6.13)
```
enum {
MPOL_DEFAULT,
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
MPOL_PREFERRED_MANY,
MPOL_WEIGHTED_INTERLEAVE,
};
```
Let's look at `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE`.
To quote the man page:
```
MPOL_INTERLEAVE
This mode interleaves page allocations across the nodes specified
in nodemask in numeric node ID order. This optimizes for bandwidth
instead of latency by spreading out pages and memory accesses to those
pages across multiple nodes. However, accesses to a single page will
still be limited to the memory bandwidth of a single node.
MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
This mode interleaves page allocations across the nodes specified in
nodemask according to the weights in
/sys/kernel/mm/mempolicy/weighted_interleave
For example, if bits 0, 2, and 5 are set in nodemask and the contents of
/sys/kernel/mm/mempolicy/weighted_interleave/node0
/sys/ ... /node2
/sys/ ... /node5
are 4, 7, and 9, respectively, then pages in this region will be
allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
```
To put it simply, MPOL_INTERLEAVE will interleave allocations at a page
granularity (4KB, 2MB, etc) across nodes in a 1:1 ratio, while
MPOL_WEIGHTED_INTERLEAVE takes into account weights - which presumably
map to the bandwidth of each respective node.
Or more concretely:
MPOL_INTERLEAVE
1:1 Interleave between two nodes.
malloc(4096) -> node0
malloc(4096) -> node1
malloc(4096) -> node0
malloc(4096) -> node1
... and so on ...
MPOL_WEIGHTED_INTERLEAVE
2:1 Interleave between two nodes.
malloc(4096) -> node0
malloc(4096) -> node0
malloc(4096) -> node1
malloc(4096) -> node0
malloc(4096) -> node0
malloc(4096) -> node1
... and so on ...
This is the preferred mechanism for *heterogeneous interleave* on Linux,
as it allows for predictable performance based on the explicit (and
visible) placement of memory.
It also allows for memory ZONE restrictions to enable better performance
predictability (e.g. keeping kernel locks out of CXL while allowing
workloads to leverage it for expansion or bandwidth).
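As a usage sketch (assuming libnuma's <numaif.h> wrapper for
set_mempolicy() is available, the system actually has nodes 0 and 1, and
you build with -lnuma), a task can opt in to MPOL_INTERLEAVE for its
future allocations like so:
```
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        /* Interleave future allocations for this task across nodes 0 and 1. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          sizeof(nodemask) * 8) != 0) {
                perror("set_mempolicy");
                return 1;
        }

        /* Placement happens at fault time, so touch the memory to see the
         * pages actually land alternately on node0/node1 (numastat -p <pid>). */
        size_t sz = 64UL << 20;
        char *buf = malloc(sz);
        if (!buf)
                return 1;
        memset(buf, 0, sz);
        puts("future anonymous allocations now interleave across nodes 0,1");
        free(buf);
        return 0;
}
```
MPOL_WEIGHTED_INTERLEAVE works the same way from the task's perspective;
the ratio simply comes from the sysfs weights rather than being 1:1.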
======================
Mempolicy Limitations.
======================
Mempolicy is a *per-task* allocation policy that is inherited by
child-tasks on clone/fork. It can only be changed by the task itself,
though cgroups may affect the effective nodemask via cpusets.
This means once a task has been launched, an external actor cannot
change the policy of a running task - except possibly by migrating that
task between cgroups or changing the cpusets.mems value of the cgroup
the task lives in.
Additionally, if capacity on a given node is not available, allocations
will fall back to another node in the node mask - which may cause
interleave to become unbalanced.
================================
Hardware Interleave Limitations.
================================
Granularities:
Granularities are limited by hardware
(typically 256B up to 16KB, in powers of 2)
Ways:
Ways are limited by the CXL configuration to:
2,4,8,16,3,6,12
Balance:
Linux does not allow imbalanced interleave configurations
(e.g. 3-way interleave where 2 targets are on 1 HB and 1 on another)
Depending on your platform vendor and type of interleave, you may not
be able to deconstruct an interleave region at all (decoders may be
locked). In this case, you may not have the flexibility to convert
from interleaved to non-interleaved operation via the driver interface.
In the scenario where your interleave configuration is entirely driver
managed, you cannot adjust the size of an interleave set without
deconstructing the entire set.
------------------------------------------------------------------------
Next we'll discuss how memory allocations occur in a CXL-enabled system,
which may be affected by things like Reclaim and Tiering systems.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-12 0:09 ` [LSF/MM] CXL Boot to Bash - Section 4: Interleave Gregory Price
@ 2025-03-13 8:31 ` Yuquan Wang
2025-03-13 16:48 ` Gregory Price
2025-03-26 9:28 ` Yuquan Wang
1 sibling, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-13 8:31 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Mar 11, 2025 at 08:09:02PM -0400, Gregory Price wrote:
>
> -----------------------------
> Intra-Host-Bridge Interleave.
> -----------------------------
> Now lets consider a system where we've placed 2 CXL devices on the same
> Host Bridge. Maybe each CXL device is only capable of x8 PCIE, and we
> want to make full use of a single x16 link.
>
> This setup only requires the BIOS to create a CEDT CFMWS which reports
> the entire capacity of all devices under the host bridge, but does not
> need to set up any interleaving.
>
> In the follow case, the BIOS has configured as single 4GB memory region
> which only targets the single host bridge, but maps the entire memory
> capacity of both devices (2GB).
>
> ```
> CFMWS:
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
I think it should be "Window size : 0000000100000000 <- 4GB" here.
> Interleave Members (2^n) : 00 <- No host bridge interleave
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
> ```
>
> Assuming no other CEDT or SRAT entries exist, this will result in linux
> creating the following NUMA topology, where all CXL memory is in Node 1.
>
> ```
> NUMA Structure:
> --------- -------- | ----------
> | cpu0 |-----| DRAM |---|----| Node 0 |
> --------- -------- | ----------
> | |
> ------- | ----------
> | HB0 |-----------------|----| Node 1 |
> ------- | ----------
> / \ |
> CXL Dev CXL Dev |
> ```
>
> In this scenario, we program the decoders like so:
> ```
> Decoders
> CXL Root
> |
> decoder0.0
> IW:1 IG:256
> [0x300000000, 0x3FFFFFFFF]
> |
> Host Bridge
> |
> decoder1.0
> IW:2 IG:256
> [0x300000000, 0x3FFFFFFFF]
> / \
> Endpoint 0 Endpoint 1
> | |
> decoder2.0 decoder3.0
> IW:2 IG:256 IW:2 IG:256
> [0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
> ```
>
> The root decoder in this scenario does not participate in interleave,
> it simply forwards all accesses in this range to the host bridge.
>
> The host bridge then applies the interleave across its connected devices
> and the decodes apply translation accordingly.
>
> -----------------------
> Combination Interleave.
> -----------------------
> Lets consider now a system where 2 Host Bridges have 2 CXL devices each,
> and we want to interleave the entire set. This requires us to make use
> of both inter and intra host bridge interleave.
>
> First, we can interleave this with the a single CEDT entry, the same as
> the first inter-host-bridge CEDT (now assuming 1GB per device).
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region
> Window size : 0000000100000000 <- 4GB
> Interleave Members (2^n) : 01 <- 2-way interleave
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
> Next Target : 00000006 <- Host Bridge _UID
> ```
>
> This gives us a NUMA structure as follows:
> ```
> NUMA Structure:
>
> ---------- -------- | ----------
> | cpu0 |-----| DRAM |----|---| Node 0 |
> ---------- -------- | ----------
> / \ |
> ------- ------- | ----------
> | HB0 |-----| HB1 |-------------|---| Node 1 |
> ------- ------- | ----------
> / \ / \ |
> CXL0 CXL1 CXL2 CXL3 |
> ```
>
> And the respective decoder programming looks as follows
> ```
> Decoders:
> CXL Root
> |
> decoder0.0
> IW:2 IG:256
> [0x300000000, 0x3FFFFFFFF]
> / \
> Host Bridge 7 Host Bridge 6
> / \
> decoder1.0 decoder2.0
> IW:2 IG:512 IW:2 IG:512
> [0x300000000, 0x3FFFFFFFFF] [0x300000000, 0x3FFFFFFFF]
> / \ / \
> endpoint0 endpoint1 endpoint2 endpoint3
> | | | |
> decoder3.0 decoder4.0 decoder5.0 decoder6.0
> IW:4 IG:256 IW:4 IG:256
> [0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
> ```
>
> Notice at both the root and the host bridge, the Interleave Ways is 2.
> There are two targets at each level. The host bridge has a granularity
> of 512 to capture its parent's ways and granularity (`2*256`).
>
> Each decoder is programmed with the total number of targets (4) and the
> overall granularity (256B).
Is there any relationship between the endpoints' decoder setup (IW & IG)
and the other decoders' setup?
>
> We might use this setup if each CXL device is capable of x8 PCIE, and
> we have 2 Host Bridges capable of full x16 - utilizing all bandwidth
> available.
>
> ---------------------------------------------
> Nuance: Hardware Interleave and Memory Holes.
> ---------------------------------------------
> You may encounter a system which cannot place the entire memory capacity
> into a single contiguous System Physical Address range. That's ok,
> because we can just use multiple decoders to capture this nuance.
>
> Most CXL devices allow for multiple decoders.
>
> This may require an SRAT entry to keep these regions on the same node.
> (Obviously the relies on your platform vendor's BIOS)
>
> ```
> CFMWS:
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00 <- No host bridge interleave
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge 7
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000400000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00 <- No host bridge interleave
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge 7
>
> SRAT:
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001 <- NUMA Node 1
> Reserved1 : 0000
> Base Address : 0000000300000000 <- Physical Memory Region
> Address Length : 0000000080000000 <- first 2GB
>
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001 <- NUMA Node 1
> Reserved1 : 0000
> Base Address : 0000000400000000 <- Physical Memory Region
> Address Length : 0000000080000000 <- second 2GB
> ```
>
> The SRAT entries allow us to keep the regions attached to the same node.
> ```
>
> NUMA Structure:
> --------- -------- | ----------
> | cpu0 |-----| DRAM |---|----| Node 0 |
> --------- -------- | ----------
> | |
> ------- | ----------
> | HB0 |-----------------|----| Node 1 |
> ------- | ----------
> / \ |
> CXL Dev CXL Dev |
> ```
>
Hi, Gregory
Seeing this, I have a scenario to discuss.
If the same system uses tables like the ones below:
CFMWS:
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00 <- No host bridge interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge 7
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000400000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00 <- No host bridge interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge 7
SRAT:
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000000 <- NUMA Node 0
Reserved1 : 0000
Base Address : 0000000300000000 <- Physical Memory Region
Address Length : 0000000080000000 <- first 2GB
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
Reserved1 : 0000
Base Address : 0000000400000000 <- Physical Memory Region
Address Length : 0000000080000000 <- second 2GB
The first 2GB CXL memory region would be located on node0, alongside DRAM.
NUMA Structure:
--------- -------- | ----------
| cpu0 |-----| DRAM |---|------------| Node 0 |
--------- -------- | / ----------
| | /first 2GB
------- | / ----------
| HB0 |-----------------|------------| Node 1 |
------- |second 2GB ----------
/ \ |
CXL Dev CXL Dev |
Is the above configuration and structure valid?
Yuquan
> And the decoder programming would look like so
> ```
> Decoders:
> CXL Root
> / \
> decoder0.0 decoder0.1
> IW:1 IG:256 IW:1 IG:256
> [0x300000000, 0x37FFFFFFF] [0x400000000, 0x47FFFFFFF]
> \ /
> Host Bridge
> / \
> decoder1.0 decoder1.1
> IW:2 IG:256 IW:2 IG:256
> [0x300000000, 0x37FFFFFFF] [0x400000000, 0x47FFFFFFF]
> / \ / \
> Endpoint 0 Endpoint 1 Endpoint 0 Endpoint 1
> | | | |
> decoder2.0 decoder3.0 decoder2.1 decoder3.1
> IW:2 IG:256 IW:2 IG:256
> [0x300000000, 0x37FFFFFFF] [0x400000000, 0x47FFFFFFF]
> ```
>
> Linux manages decoders in relation to the associated component, so
> decoders are N.M where N is the component and M is the decoder number.
>
> If you look, you'll see each side of this tree looks individually
> equivalent to the intra-host-bridge interleave example, just with one
> half of the total memory each (matching the CFMWS ranges).
>
> Each of the root decoders still has an interleave ways (IW) value of 1 because
> they both only target one host bridge (despite it being the same one).
>
>
> --------------------------------
> Software Interleave (Mempolicy).
> --------------------------------
> Linux provides a software mechanism to allow a task to interleave its
> memory across NUMA nodes - which may have different performance
> characteristics. This component is called `mempolicy`, and is primarily
> operated on with the `set_mempolicy()` and `mbind()` syscalls.
>
> These syscalls take a nodemask (bitmask representing NUMA node ids) as
> an argument to describe the intended allocation policy of the task.
>
> The following policies are presently supported (as of v6.13)
> ```
> enum {
> MPOL_DEFAULT,
> MPOL_PREFERRED,
> MPOL_BIND,
> MPOL_INTERLEAVE,
> MPOL_LOCAL,
> MPOL_PREFERRED_MANY,
> MPOL_WEIGHTED_INTERLEAVE,
> };
> ```
>
> Let's look at `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE`.
>
> To quote the man page:
> ```
> MPOL_INTERLEAVE
> This mode interleaves page allocations across the nodes specified
> in nodemask in numeric node ID order. This optimizes for bandwidth
> instead of latency by spreading out pages and memory accesses to those
> pages across multiple nodes. However, accesses to a single page will
> still be limited to the memory bandwidth of a single node.
>
> MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
> This mode interleaves page allocations across the nodes specified in
> nodemask according to the weights in
> /sys/kernel/mm/mempolicy/weighted_interleave
> For example, if bits 0, 2, and 5 are set in nodemask and the contents of
> /sys/kernel/mm/mempolicy/weighted_interleave/node0
> /sys/ ... /node2
> /sys/ ... /node5
> are 4, 7, and 9, respectively, then pages in this region will be
> allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
> ```
>
> To put it simply, MPOL_INTERLEAVE will interleave allocations at a page
> granularity (4KB, 2MB, etc) across nodes in a 1:1 ratio, while
> MPOL_WEIGHTED_INTERLEAVE takes into account weights - which presumably
> map to the bandwidth of each respective node.
>
> Or more concretely:
>
> MPOL_INTERLEAVE
> 1:1 Interleave between two nodes.
> malloc(4096) -> node0
> malloc(4096) -> node1
> malloc(4096) -> node0
> malloc(4096) -> node1
> ... and so on ...
>
> MPOL_WEIGHTED_INTERLEAVE
> 2:1 Interleave between two nodes.
> malloc(4096) -> node0
> malloc(4096) -> node0
> malloc(4096) -> node1
> malloc(4096) -> node0
> malloc(4096) -> node0
> malloc(4096) -> node1
> ... and so on ...
>
> This is the preferred mechanism for *heterogeneous interleave* on Linux,
> as it allows for predictable performance based on the explicit (and
> visible) placement of memory.
>
> It also allows for memory ZONE restrictions to enable better performance
> predictability (e.g. keeping kernel locks out of CXL while allowing
> workloads to leverage it for expansion or bandwidth).
>
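> As a quick illustration of the task-side interface, here is roughly what
> opting into weighted interleave looks like from userspace. This is only a
> sketch: the raw syscall is used so it does not depend on a libnuma new
> enough to know the mode, the MPOL_WEIGHTED_INTERLEAVE value simply mirrors
> the enum above, and nodes 0 and 2 are assumed to exist (e.g. local DRAM
> plus a CXL node).
>
> ```
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/syscall.h>
>
> #define MPOL_WEIGHTED_INTERLEAVE 6    /* mirrors the mempolicy enum above */
>
> int main(void)
> {
>     /* interleave this task's future allocations across nodes 0 and 2 */
>     unsigned long nodemask = (1UL << 0) | (1UL << 2);
>
>     if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
>                 &nodemask, 8 * sizeof(nodemask) + 1)) {
>         perror("set_mempolicy");
>         return 1;
>     }
>
>     /* page faults below are spread according to the sysfs weights */
>     char *buf = malloc(1 << 20);
>     memset(buf, 0x42, 1 << 20);
>     return 0;
> }
> ```
>
> The per-node weights themselves still come from
> /sys/kernel/mm/mempolicy/weighted_interleave/nodeN, so an administrator
> sets the ratio and the task only selects the mode and nodemask.
>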
> ======================
> Mempolicy Limitations.
> ======================
> Mempolicy is a *per-task* allocation policy that is inherited by
> child-tasks on clone/fork. It can only be changed by the task itself,
> though cgroups may affect the effective nodemask via cpusets.
>
> This means once a task has been launched, an external actor cannot
> change the policy of a running task - except possibly by migrating that
> task between cgroups or changing the cpusets.mems value of the cgroup
> the task lives in.
>
> Additionally, if capacity on a given node is not available, allocations
> will fall back to another node in the node mask - which may cause
> interleave to become unbalanced.
>
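> For completeness, a sketch of the cgroup-side lever mentioned above:
> shrinking a running group's allowed nodes by writing cpuset.mems. The path
> and group name are illustrative and assume cgroup v2 with the cpuset
> controller enabled; this constrains the tasks' effective nodemask rather
> than rewriting their mempolicy.
>
> ```
> #include <stdio.h>
>
> int main(void)
> {
>     /* "workload" is a placeholder cgroup; restrict it to node 0 only */
>     const char *path = "/sys/fs/cgroup/workload/cpuset.mems";
>     FILE *f = fopen(path, "w");
>
>     if (!f) { perror(path); return 1; }
>     if (fputs("0", f) == EOF)
>         perror("write cpuset.mems");
>     fclose(f);
>     return 0;
> }
> ```
>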
> ================================
> Hardware Interleave Limitations.
> ================================
> Granularities:
> granularities are limited on hardware
> (typically 256B up to 16KB by power of 2)
>
> Ways:
> Ways are limited by the CXL configuration to:
> 2,4,8,16,3,6,12
>
> Balance:
> Linux does not allow imbalanced interleave configurations
> (e.g. 3-way interleave where 2 targets are on 1 HB and 1 on another)
>
> Depending on your platform vendor and type of interleave, you may not
> be able to deconstruct an interleave region at all (decoders may be
> locked). In this case, you may not have the flexibility to convert
> operation from interleaved to non-interleaved via the driver interface.
>
> In the scenario where your interleave configuration is entirely driver
> managed, you cannot adjust the size of an interleave set without
> deconstructing the entire set.
>
> ------------------------------------------------------------------------
>
> Next we'll discuss how memory allocations occur in a CXL-enabled system,
> which may be affected by things like Reclaim and Tiering systems.
>
> ~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-03-04 0:32 ` Gregory Price
@ 2025-03-13 16:12 ` Jonathan Cameron
2025-03-13 17:20 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Jonathan Cameron @ 2025-03-13 16:12 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Mon, 3 Mar 2025 19:32:43 -0500
Gregory Price <gourry@gourry.net> wrote:
> On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
> > ------------------------------------------------------------------
> > Step 2: BIOS / EFI generates the CEDT (CXL Early Discovery Table).
> > ------------------------------------------------------------------
> >
> > This table is responsible for reporting each "CXL Host Bridge" and
> > "CXL Fixed Memory Window" present at boot - which enables early boot
> > software to manage those devices and the memory capacity presented
> > by those devices.
> >
> > Example CEDT Entries (truncated)
> > Subtable Type : 00 [CXL Host Bridge Structure]
> > Reserved : 00
> > Length : 0020
> > Associated host bridge : 00000005
> >
> > Subtable Type : 01 [CXL Fixed Memory Window Structure]
> > Reserved : 00
> > Length : 002C
> > Reserved : 00000000
> > Window base address : 000000C050000000
> > Window size : 0000003CA0000000
> >
> > If this memory is NOT marked "Special Purpose" by BIOS (next section),
> > you should find a matching entry EFI Memory Map and /proc/iomem
> >
> > BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
> > /proc/iomem: c050000000-fcefffffff : System RAM
> >
> >
> > Observation: This memory is treated as 100% normal System RAM
> >
> > 1) This memory may be placed in any zone (ZONE_NORMAL, typically)
> > 2) The kernel may use this memory for arbitrary allocations
> > 3) The driver still enumerates CXL devices and memory regions, but
> > 4) The CXL driver CANNOT manage this memory (as of today)
> > (Caveat: *some* RAS features may still work, possibly)
> >
> > This creates a nuanced management state.
> >
> > The memory is online by default and completely usable, AND the driver
> > appears to be managing the devices - BUT the memory resources and the
> > management structure are fundamentally separate.
> > 1) CXL Driver manages CXL features
> > 2) Non-CXL SystemRAM mechanisms surface the memory to allocators.
> >
>
> Adding some additional context here
>
> -------------------------------------
> Nuance X: NUMA Nodes and ACPI Tables.
> -------------------------------------
>
> ACPI Table parsing is partially architecture/platform dependent, but
> there is common code that affects boot-time creation of NUMA nodes.
>
> NUMA-nodes are not a dynamic resource. They are (presently, Feb 2025)
> statically configured during kernel init, and the number of possible
> NUMA nodes (N_POSSIBLE) may not change during runtime.
>
> CEDT/CFMWS and SRAT/Memory Affinity entries describe memory regions
> associated with CXL devices. These tables are used to allocate NUMA
> node IDs during _init.
>
> The "System Resource Affinity Table" has "Memory Affinity" entries
> which associate memory regions with a "Proximity Domain"
>
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001
> Reserved1 : 0000
> Base Address : 000000C050000000
> Address Length : 0000003CA0000000
>
> The "Proximity Domain" utilized by the kernel ACPI driver to match this
> region with a NUMA node (in most cases, the proximity domains here will
> directly translate to a NUMA node ID - but not always).
>
> CEDT/CFMWS do not have a proximity domain - so the kernel will assign it
> a NUMA node association IFF no SRAT Memory Affinity entry is present.
>
> SRAT entries are optional, CFMWS are required for each host bridge.
They aren't required for each HB. You could have multiple host bridges and one CFMWS
as long as you have decided to only support interleave.
I would only expect to see this where the bios is instantiating CFMWS
entries to match a specific locked down config though.
>
> If SRAT entries are present, one NUMA node is created for each detected
> proximity domain in the SRAT. Additional NUMA nodes are created for each
> CFMWS without a matching SRAT entry.
Don't forget the fun of CFMWS covering multiple SRAT entries (I think
we just go with the first one?)
>
> CFMWS describes host-bridge information, and so if SRAT is missing - all
> devices behind the host bridge will become naturally associated with the
> same NUMA node.
I wouldn't go with naturally for the reason below. It happens, but maybe
not natural :)
>
>
> big long TL;DR:
>
> This creates the subtle assumption that each host-bridge will have
> devices with similar performance characteristics if they're intended
> for use as general purpose memory and/or interleave.
Not just devices, also topologies. Could well have switches below some
ports and direct connected devices on others.
>
> This means you should expect to have to reboot your machine if a
> different NUMA topology is needed (for example, if you are physically
> hotunplugging a volatile device to plug in a non-volatile device).
If the bios is friendly you should be able to map that to a different
CFMWS, but sure what bios is that nice?
>
>
>
> Stay tuned for more Fun and Profit with ACPI tables :]
:)
> ~Gregory
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-13 8:31 ` Yuquan Wang
@ 2025-03-13 16:48 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-13 16:48 UTC (permalink / raw)
To: Yuquan Wang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 13, 2025 at 04:31:31PM +0800, Yuquan Wang wrote:
> On Tue, Mar 11, 2025 at 08:09:02PM -0400, Gregory Price wrote:
> > Window size : 0000000080000000 <- 2GB
>
> I think is "Window size : 0000000100000000 <- 4GB" here.
>
Quite right. I am planning to migrate this all to a github somewhere
after LSF for edits, so i'll take all the feedback and incorporate it
then.
> > There are two targets at each level. The host bridge has a granularity
> > of 512 to capture its parent's ways and granularity (`2*256`).
> >
> > Each decoder is programmed with the total number of targets (4) and the
> > overall granularity (256B).
>
> Is there any relationship between endpoints'decoder setup(IW&IG) and
> others decoder?
>
I'm sure there's a mathematical relationship that dictates this up the
hierarchy, but each endpoint decoder needs to be programmed with the
same interleave ways and granularity as all other endpoints.
Technically unbalanced configurations are possible, but Linux does not
support them.
> Hi, Gregory
>
> Seeing this, I have an assumption to discuss.
>
> If the same system uses tables like below:
>
> CFMWS:
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Window base address : 0000000300000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> First Target : 00000007 <- Host Bridge 7
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Window base address : 0000000400000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> First Target : 00000007 <- Host Bridge 7
>
> SRAT:
> Subtable Type : 01 [Memory Affinity]
> Proximity Domain : 00000000 <- NUMA Node 0
> Base Address : 0000000300000000 <- Physical Memory Region
>
> Subtable Type : 01 [Memory Affinity]
> Proximity Domain : 00000001 <- NUMA Node 1
> Base Address : 0000000400000000 <- Physical Memory Region
>
>
> The first 2GB cxl memory region would locate at node0 with DRAM.
>
> NUMA Structure:
>
> --------- -------- | ----------
> | cpu0 |-----| DRAM |---|------------| Node 0 |
> --------- -------- | / ----------
> | | /first 2GB
> ------- | / ----------
> | HB0 |-----------------|------------| Node 1 |
> ------- |second 2GB ----------
> / \ |
> CXL Dev CXL Dev |
>
> Is above configuration and structure valid?
>
This is correct, the association between memory and numa node is pretty
much purely logical.
I'm not sure WHY you'd want to do this, but yeah you can do this
(assuming you can get the BIOS to produce that SRAT).
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
` (2 preceding siblings ...)
2025-03-08 3:23 ` [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity Gregory Price
@ 2025-03-13 16:55 ` Jonathan Cameron
2025-03-13 17:30 ` Gregory Price
2025-03-27 9:34 ` Yuquan Wang
3 siblings, 2 replies; 81+ messages in thread
From: Jonathan Cameron @ 2025-03-13 16:55 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, 5 Mar 2025 17:20:52 -0500
Gregory Price <gourry@gourry.net> wrote:
> --------------------
> Part 0: ACPI Tables.
> --------------------
> I considered publishing this section first, or at least under
> "Platform", but I've found this information largely useful in
> debugging interleave configurations and tiering mechanisms -
> which are higher level concepts.
>
> Much of the information in this section is most relevant to
> Interleave (yet to be published Section 4).
>
> I promise not to simply regurgitate the entire ACPI specification
> and limit this to necessary commentary to describe how these tables
> relate to actual Linux resources (like numa nodes and tiers).
>
> At the very least, if you find yourself trying to figure out why
> your CXL system isn't producing NUMA nodes, memory tiers, root
> decoders, memory regions - etc - I would check these tables
> first for aberrations. Almost all my personal strife has been
> associated with ACPI table misconfiguration.
>
>
> ACPI tables can be inspected with the `acpica-tools` package.
> mkdir acpi_tables && cd acpi_tables
> acpidump -b
> iasl -d *
> -- inspect the *.dsl files
>
> ====
> CEDT
> ====
> The CXL Early Discovery Table is generated by BIOS to describe
> the CXL devices present and configured (to some extent) at boot
> by the BIOS.
>
> # CHBS
> The CXL Host Bridge Structure describes CXL host bridges. Other
> than describing device register information, it reports the specific
> host bridge UID for this host bridge. These host bridge ID's will
> be referenced in other tables.
>
> Debug hint: check that the host bridge IDs between tables are
> consistent - stuff breaks oddly if they're not!
>
> ```
> Subtable Type : 00 [CXL Host Bridge Structure]
> Reserved : 00
> Length : 0020
> Associated host bridge : 00000007 <- Host bridge _UID
> Specification version : 00000001
> Reserved : 00000000
> Register base : 0000010370400000
> Register length : 0000000000010000
> ```
>
> # CFMWS
> The CXL Fixed Memory Window structure describes a memory region
> associated with one or more CXL host bridges (as described by the
> CHBS). It additionally describes any inter-host-bridge interleave
> configuration that may have been programmed by BIOS. (Section 4)
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 000000C050000000 <- Memory Region
> Window size : 0000003CA0000000
> Interleave Members (2^n) : 01 <- Interleave configuration
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
> Next Target : 00000006 <- Host Bridge _UID
> ```
>
> INTRA-host-bridge interleave (multiple devices on one host bridge) is
> NOT reported in this structure, and is solely defined via CXL device
> decoder programming (host bridge and endpoint decoders). This will be
> described later (Section 4 - Interleave)
>
>
> ====
> SRAT
> ====
> The System/Static Resource Affinity Table describes resource (CPU,
> Memory) affinity to "Proximity Domains". This table is technically
> optional, but for performance information (see "HMAT") to be enumerated
> by linux it must be present.
>
>
> # Proximity Domain
> A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a
> 1-to-1 mapping is not guaranteed. There are scenarios where "Proximity
> Domain 4" may map to "NUMA Node 3", for example. (See "NUMA Node Creation")
>
> # Memory Affinity
> Generally speaking, if a host does any amount of CXL fabric (decoder)
> programming in BIOS - an SRAT entry for that memory needs to be present.
>
> ```
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001 <- NUMA Node 1
> Reserved1 : 0000
> Base Address : 000000C050000000 <- Physical Memory Region
> Address Length : 0000003CA0000000
> Reserved2 : 00000000
> Flags (decoded below) : 0000000B
> Enabled : 1
> Hot Pluggable : 1
> Non-Volatile : 0
> ```
>
> # Generic Initiator / Port
> In the scenario where CXL devices are not present or configured by
> BIOS, we may still want to generate proximity domain configurations
> for those devices. The Generic Initiator interfaces are intended to
> fill this gap, so that performance information can still be utilized
> when the devices become available at runtime.
>
> I won't cover the details here, for now, but I will link to the
> proosal from Dan Williams and Jonathan Cameron if you would like
proposal
> more information.
> https://lore.kernel.org/all/e1a52da9aec90766da5de51b1b839fd95d63a5af.camel@intel.com/
I know you said you'd not say more, but this is a bit misleading.
Maybe ignore Generic Initiators for this doc. They are relevant for
CXL but in the fabric they only matter for type 1 / 2 devices not
memory and only if the BIOS wants to do HMAT for end to end. Gets
more fun when they are in the host side of the root bridge.
# Generic Port
In the scenario where CXL memory devices are not present at boot, or
> not configured by the BIOS, or the BIOS has not provided full HMAT
descriptions for the configured memory, we may still want to
generate proximity domain configurations for those devices.
The Generic Port structures are intended to fill this gap, so
that performance information can still be utilized when the
devices are available at runtime by combining host information
with that discovered from devices.
Or just
# Generic Ports
These are fun ;)
>
> ====
> HMAT
> ====
> The Heterogeneous Memory Attributes Table contains information such as
> cache attributes and bandwidth and latency details for memory proximity
> domains. For the purpose of this document, we will only discuss the
> SLLBI entry.
No fun. You miss Intel's extensions to memory-side caches ;)
(which is wise!)
>
> # SLLBI
> The System Locality Latency and Bandwidth Information records latency
> and bandwidth information for proximity domains. This table is used by
> Linux to configure interleave weights and memory tiers.
>
> ```
> Heavily truncated for brevity
> Structure Type : 0001 [SLLBI]
> Data Type : 00 <- Latency
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 0080 <- DRAM LTC
> Entry : 0100 <- CXL LTC
>
> Structure Type : 0001 [SLLBI]
> Data Type : 03 <- Bandwidth
> Target Proximity Domain List : 00000000
> Target Proximity Domain List : 00000001
> Entry : 1200 <- DRAM BW
> Entry : 0200 <- CXL BW
> ```
>
>
> ---------------------------------
> Part 00: Linux Resource Creation.
> ---------------------------------
>
> ==================
> NUMA node creation
> ===================
> NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> identified at `__init` time, more specifically during `mm_init`.
>
> What this means is that the CEDT and SRAT must contain sufficient
> `proximity domain` information for linux to identify how many NUMA
> nodes are required (and what memory regions to associate with them).
Is it worth talking about what is effectively a constraint of the spec
and what is a Linux current constraint?
> SRAT is the only ACPI-defined way of getting proximity domains. Linux chooses
> to at most map those 1:1 with NUMA nodes.
> CEDT adds a description of SPA ranges where there might be memory that Linux
> might want to map to 1 or more NUMA nodes
>
> The relevant code exists in: linux/drivers/acpi/numa/srat.c
> ```
> static int __init
> acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> const unsigned long table_end)
> {
> ... heavily truncated for brevity
> pxm = ma->proximity_domain;
> node = acpi_map_pxm_to_node(pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> void *arg, const unsigned long table_end)
> {
> ... heavily truncated for brevity
> /*
> * The SRAT may have already described NUMA details for all,
> * or a portion of, this CFMWS HPA range. Extend the memblks
> * found for any portion of the window to cover the entire
> * window.
> */
> if (!numa_fill_memblks(start, end))
> return 0;
>
> /* No SRAT description. Create a new node. */
> node = acpi_map_pxm_to_node(*fake_pxm);
> if (numa_add_memblk(node, start, end) < 0)
> ....
> node_set(node, numa_nodes_parsed); <--- mark node N_POSSIBLE
> }
>
> int __init acpi_numa_init(void)
> {
> ...
> if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> acpi_parse_memory_affinity, 0);
> }
> /* fake_pxm is the next unused PXM value after SRAT parsing */
> acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> &fake_pxm);
>
> ```
>
> Basically, the heuristic is as follows:
> 1) Add one NUMA node per Proximity Domain described in SRAT
if it contains memory, a CPU, or a generic initiator.
> 2) If the SRAT describes all memory described by all CFMWS
> - do not create nodes for CFMWS
> 3) If SRAT does not describe all memory described by CFMWS
> - create a node for that CFMWS
>
> Generally speaking, you will see one NUMA node per Host bridge, unless
> inter-host-bridge interleave is in use (see Section 4 - Interleave).
I just love corners: QoS concerns might mean multiple CFMWS and hence
multiple nodes per host bridge (feel free to ignore this one - has
anyone seen this in the wild yet?) Similar mess for properties such
as persistence, sharing etc.
J
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-07 15:12 ` Gregory Price
@ 2025-03-13 17:00 ` Jonathan Cameron
0 siblings, 0 replies; 81+ messages in thread
From: Jonathan Cameron @ 2025-03-13 17:00 UTC (permalink / raw)
To: Gregory Price; +Cc: Yuquan Wang, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Fri, 7 Mar 2025 10:12:57 -0500
Gregory Price <gourry@gourry.net> wrote:
> On Fri, Mar 07, 2025 at 10:20:31AM +0800, Yuquan Wang wrote:
> > > 2a) Is the BIOS programming decoders, or are you programming the
> > > decoder after boot?
> > Program the decoder after boot. It seems like currently the BIOS for QEMU
> > cannot program CXL on either x86 (q35) or arm64 (virt). I am trying to find a
> > CXL-enabled BIOS for QEMU virt to do some tests.
>
> What's likely happening here then is that QEMU is not emitting an SRAT
> (either because the logic is missing or by design).
>
> From other discussions, this may be the intention of the GenPort work,
> which is intended to have placeholders in the SRAT for the Proximity
> Domains for devices to be initialized later (i.e. dynamically).
>
For QEMU you need to provide a whole bunch of config to get SRAT / HMAT
etc. (and the BIOS never configures the stuff; it's all OS-first).
I wrote a slightly pathological test case that should give the general idea
https://elixir.bootlin.com/qemu/v9.2.2/source/tests/qtest/bios-tables-test.c#L1940
It flushed out a few bugs :)
test_acpi_one(" -machine hmat=on,cxl=on"
" -smp 3,sockets=3"
" -m 128M,maxmem=384M,slots=2"
" -device pcie-root-port,chassis=1,id=pci.1"
" -device pci-testdev,bus=pci.1,"
"multifunction=on,addr=00.0"
" -device pci-testdev,bus=pci.1,addr=00.1"
" -device pci-testdev,bus=pci.1,id=gidev,addr=00.2"
" -device pxb-cxl,bus_nr=64,bus=pcie.0,id=cxl.1"
" -object memory-backend-ram,size=64M,id=ram0"
" -object memory-backend-ram,size=64M,id=ram1"
" -numa node,nodeid=0,cpus=0,memdev=ram0"
" -numa node,nodeid=1"
" -object acpi-generic-initiator,id=gi0,pci-dev=gidev,node=1"
" -numa node,nodeid=2"
" -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2"
" -numa node,nodeid=3,cpus=1"
" -numa node,nodeid=4,memdev=ram1"
" -numa node,nodeid=5,cpus=2"
" -numa hmat-lb,initiator=0,target=0,hierarchy=memory,"
"data-type=access-latency,latency=10"
" -numa hmat-lb,initiator=0,target=0,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=800M"
" -numa hmat-lb,initiator=0,target=2,hierarchy=memory,"
"data-type=access-latency,latency=100"
" -numa hmat-lb,initiator=0,target=2,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=200M"
" -numa hmat-lb,initiator=0,target=4,hierarchy=memory,"
"data-type=access-latency,latency=100"
" -numa hmat-lb,initiator=0,target=4,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=200M"
" -numa hmat-lb,initiator=0,target=5,hierarchy=memory,"
"data-type=access-latency,latency=200"
" -numa hmat-lb,initiator=0,target=5,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=400M"
" -numa hmat-lb,initiator=1,target=0,hierarchy=memory,"
"data-type=access-latency,latency=500"
" -numa hmat-lb,initiator=1,target=0,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=100M"
" -numa hmat-lb,initiator=1,target=2,hierarchy=memory,"
"data-type=access-latency,latency=50"
" -numa hmat-lb,initiator=1,target=2,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=400M"
" -numa hmat-lb,initiator=1,target=4,hierarchy=memory,"
"data-type=access-latency,latency=50"
" -numa hmat-lb,initiator=1,target=4,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=800M"
" -numa hmat-lb,initiator=1,target=5,hierarchy=memory,"
"data-type=access-latency,latency=500"
" -numa hmat-lb,initiator=1,target=5,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=100M"
" -numa hmat-lb,initiator=3,target=0,hierarchy=memory,"
"data-type=access-latency,latency=20"
" -numa hmat-lb,initiator=3,target=0,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=400M"
" -numa hmat-lb,initiator=3,target=2,hierarchy=memory,"
"data-type=access-latency,latency=80"
" -numa hmat-lb,initiator=3,target=2,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=200M"
" -numa hmat-lb,initiator=3,target=4,hierarchy=memory,"
"data-type=access-latency,latency=80"
" -numa hmat-lb,initiator=3,target=4,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=200M"
" -numa hmat-lb,initiator=3,target=5,hierarchy=memory,"
"data-type=access-latency,latency=20"
" -numa hmat-lb,initiator=3,target=5,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=400M"
" -numa hmat-lb,initiator=5,target=0,hierarchy=memory,"
"data-type=access-latency,latency=20"
" -numa hmat-lb,initiator=5,target=0,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=400M"
" -numa hmat-lb,initiator=5,target=2,hierarchy=memory,"
"data-type=access-latency,latency=80"
" -numa hmat-lb,initiator=5,target=4,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=200M"
" -numa hmat-lb,initiator=5,target=4,hierarchy=memory,"
"data-type=access-latency,latency=80"
" -numa hmat-lb,initiator=5,target=2,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=200M"
" -numa hmat-lb,initiator=5,target=5,hierarchy=memory,"
"data-type=access-latency,latency=10"
" -numa hmat-lb,initiator=5,target=5,hierarchy=memory,"
"data-type=access-bandwidth,bandwidth=800M",
&data);
Jonathan
> ~Gregory
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
2025-03-08 3:23 ` [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity Gregory Price
@ 2025-03-13 17:20 ` Jonathan Cameron
2025-03-13 18:17 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Jonathan Cameron @ 2025-03-13 17:20 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Fri, 7 Mar 2025 22:23:05 -0500
Gregory Price <gourry@gourry.net> wrote:
> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
>
> This is not true - in fact, CFMWS are simply a carve-out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
>
> The limiting factor is what your platform vendor BIOS supports.
>
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
>
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teaching tool, not a roadmap.
>
> (When discussing interleave in this section, please note that I am
> intentionally omitting details about decoder programming, as this
> will be covered later.)
>
>
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Lets imagine we have one 2GB device attached to a host bridge.
>
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.
Fairly sure we block persistent in a volatile CFMWS in the kernel.
Does any BIOS actually do this?
You might have a variable partition device but I thought in kernel at
least we decided that no one was building that crazy?
Maybe a QoS split is a better example to motivate one range, two places?
>
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000200000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 000A <- Bit(3) - Persistent
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> cover the entire 2GB capacity of the device. This affords the user
> flexibility in where the memory is mapped depending on if it is mapped
> as volatile or persistent while keeping the two SPA ranges separate.
>
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning a given region of device physical
> address space cannot be mapped to more than one system physical address.
>
> i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)
>
> (See Section 2a - decoder programming).
>
> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device capacity?
If anyone hits the lock on commit (i.e. annoying BIOS) the ordering
checks on HPA kick in here and restrict flexibility a lot
(assuming I understand them correctly that is)
This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS. We have to burn SPA space just
to get nodes. From a spec point of view all that is needed here
is a single CFMWS.
>
> We could produce the following CFMWS configuration.
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000200000000 <- Memory Region 2
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region 3
> Window size : 0000000100000000 <- 4GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify decoder programming of interleave.
>
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB/each), we can actually associate the entire capacity of
> both devices in that region.
>
> We'll discuss this decoder structure in-depth in Section 4.
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot
2025-03-13 16:12 ` Jonathan Cameron
@ 2025-03-13 17:20 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-13 17:20 UTC (permalink / raw)
To: Jonathan Cameron; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 13, 2025 at 04:12:26PM +0000, Jonathan Cameron wrote:
> On Mon, 3 Mar 2025 19:32:43 -0500
> Gregory Price <gourry@gourry.net> wrote:
>
> >
> > SRAT entries are optional, CFMWS are required for each host bridge.
>
> They aren't required for each HB. You could have multiple host bridge and one CFMWS
> as long as you have decided to only support interleave.
> I would only expect to see this where the bios is instantiating CFMWS
> entries to match a specific locked down config though.
>
The further I dived into this, the more I realized CFMWS are the
opposite of required lol. Platform vendors can kind of do whatever they
want here.
I'll be taking another pass at this section since i've done more diving
in to write the interleave section. I probably got a handful of
comments here subtly wrong.
> >
> > If SRAT entries are present, one NUMA node is created for each detected
> > proximity domain in the SRAT. Additional NUMA nodes are created for each
> > CFMWS without a matching SRAT entry.
>
> Don't forget the fun of CFMWS covering multiple SRAT entries (I think
> we just go with the first one?)
>
Oh yeah, I guess that's technically possible. And technically each SRAT
could have a different proximity domain, because you know - value.
The dance between CFMWS and SRAT is quite intricate isn't it.
> >
> > CFMWS describes host-bridge information, and so if SRAT is missing - all
> > devices behind the host bridge will become naturally associated with the
> > same NUMA node.
>
> I wouldn't go with naturally for the reason below. It happens, but maybe
> not natural :)
>
Yeah as above, I got this subtly wrong. Thanks for the notes.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-13 16:55 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Jonathan Cameron
@ 2025-03-13 17:30 ` Gregory Price
2025-03-14 11:14 ` Jonathan Cameron
2025-03-27 9:34 ` Yuquan Wang
1 sibling, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-13 17:30 UTC (permalink / raw)
To: Jonathan Cameron; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 13, 2025 at 04:55:39PM +0000, Jonathan Cameron wrote:
>
> Maybe ignore Generic Initiators for this doc. They are relevant for
> CXL but in the fabric they only matter for type 1 / 2 devices not
> memory and only if the BIOS wants to do HMAT for end to end. Gets
> more fun when they are in the host side of the root bridge.
>
Fair, I wanted to reference the proposals but I personally don't have a
strong understanding of this yet. Dave Jiang mentioned wanting to write
some info on CDAT with some reference to the Generic Port work as well.
Some help understanding this a little better would be very much
appreciated, but I like your summary below. Noted for updated version.
> # Generic Port
>
> In the scenario where CXL memory devices are not present at boot, or
> not configured by the BIOS or he BIOS has not provided full HMAT
> descriptions for the configured memory, we may still want to
> generate proximity domain configurations for those devices.
> The Generic Port structures are intended to fill this gap, so
> that performance information can still be utilized when the
> devices are available at runtime by combining host information
> with that discovered from devices.
>
> Or just
> # Generic Ports
>
> These are fun ;)
>
> >
> > ====
> > HMAT
> > ====
> > The Heterogeneous Memory Attributes Table contains information such as
> > cache attributes and bandwidth and latency details for memory proximity
> > domains. For the purpose of this document, we will only discuss the
> > SSLIB entry.
>
> No fun. You miss Intel's extensions to memory-side caches ;)
> (which is wise!)
>
Yes yes, but I'm trying to be nice. I'm debating on writing the Section
4 interleave addendum on Zen5 too :P
> > ==================
> > NUMA node creation
> > ===================
> > NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> > identified at `__init` time, more specifically during `mm_init`.
> >
> > What this means is that the CEDT and SRAT must contain sufficient
> > `proximity domain` information for linux to identify how many NUMA
> > nodes are required (and what memory regions to associate with them).
>
> Is it worth talking about what is effectively a constraint of the spec
> and what is a Linux current constraint?
>
> SRAT is only ACPI defined way of getting Proximity nodes. Linux chooses
> to at most map those 1:1 with NUMA nodes.
> CEDT adds on description of SPA ranges where there might be memory that Linux
> might want to map to 1 or more NUMA nodes
>
Rather than asking if it's worth talking about, I'll spin that around
and ask what value the distinction adds. The source of the constraint
seems less relevant than "All nodes must be defined during mm_init by
something - be it ACPI or CXL source data".
Maybe if this turns into a book, it's worth breaking it out for
referential purposes (pointing to each point in each spec).
> >
> > Basically, the heuristic is as follows:
> > 1) Add one NUMA node per Proximity Domain described in SRAT
>
> if it contains, memory, CPU or generic initiator.
>
noted
> > 2) If the SRAT describes all memory described by all CFMWS
> > - do not create nodes for CFMWS
> > 3) If SRAT does not describe all memory described by CFMWS
> > - create a node for that CFMWS
> >
> > Generally speaking, you will see one NUMA node per Host bridge, unless
> > inter-host-bridge interleave is in use (see Section 4 - Interleave).
>
> I just love corners: QoS concerns might mean multiple CFMWS and hence
> multiple nodes per host bridge (feel free to ignore this one - has
> anyone seen this in the wild yet?) Similar mess for properties such
> as persistence, sharing etc.
This actually came up as a result of me writing this - this does exist
in the wild and is causing all kinds of fun on the weighted_interleave
functionality.
I plan to come back and add this as an addendum, but probably not until
after LSF.
We'll probably want to expand this into a library of case studies that
cover these different choices - in hopes of getting some set of
*suggested* configurations for platform vendors to help play nice with
linux (especially for things that actually consume these blasted nodes).
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
2025-03-13 17:20 ` Jonathan Cameron
@ 2025-03-13 18:17 ` Gregory Price
2025-03-14 11:09 ` Jonathan Cameron
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-13 18:17 UTC (permalink / raw)
To: Jonathan Cameron; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 13, 2025 at 05:20:04PM +0000, Jonathan Cameron wrote:
> Gregory Price <gourry@gourry.net> wrote:
>
> > -------------------------------
> > One 2GB Device, Multiple CFMWS.
> > -------------------------------
> > Lets imagine we have one 2GB device attached to a host bridge.
> >
> > In this example, the device hosts 2GB of persistent memory - but we
> > might want the flexibility to map capacity as volatile or persistent.
>
> Fairly sure we block persistent in a volatile CFMWS in the kernel.
> Any bios actually does this?
>
> You might have a variable partition device but I thought in kernel at
> least we decided that no one was building that crazy?
>
This was an example I pulled from Dan's notes elsewhere (i think).
I was unaware that we blocked mapping persistent as volatile. I was
working off the assumption that it could be flexibly mapped, similar to...
er... older, non-cxl hardware... cough.
> Maybe a QoS split is a better example to motivate one range, two places?
>
That probably makes sense?
> > -------------------------------------------------------------
> > Two Devices On One Host Bridge - With and Without Interleave.
> > -------------------------------------------------------------
> > What if we wanted some capacity on each endpoint hosted on its own NUMA
> > node, and wanted to interleave a portion of each device capacity?
>
> If anyone hits the lock on commit (i.e. annoying BIOS) the ordering
> checks on HPA kick in here and restrict flexibility a lot
> (assuming I understand them correctly that is)
>
> This is a good illustration of why we should at some point revisit
> multiple NUMA nodes per CFMWS. We have to burn SPA space just
> to get nodes. From a spec point of view all that is needed here
> is a single CFMWS.
>
Along with the above note, and as mentioned on discord, I think this
whole section naturally evolves into a library of "Sane configurations"
and "We promise nothing for `reasons`" configurations.
Maybe that turns into a kernel doc section that requires updating if
a platform disagrees / comes up with new sane configurations. This is
certainly the most difficult area to lock down because we have no idea
who is going to `innovate` and how.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* [LSF/MM] CXL Boot to Bash - Section 6: Page allocation
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
` (4 preceding siblings ...)
2025-03-12 0:09 ` [LSF/MM] CXL Boot to Bash - Section 4: Interleave Gregory Price
@ 2025-03-14 3:21 ` Gregory Price
2025-03-18 17:09 ` [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices Gregory Price
6 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-14 3:21 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel
Of course, the whole purpose of using CXL memory is to allocate it.
So lets talk about a real use case!
char* a_page = malloc(4096);
a_page[0] = '\x42'; /* Page fault and allocation occurs */
Congrats, you may or may not have allocated CXL memory!
Fin.
-----------------------------------------------------------------------
Ok, in all seriousness, the intent is hopefully to make this whole thing
as transparent as possible - but there's quite a bit of complexity in
how this is done while keeping things reasonably performant.
This section won't cover the general intricacies of how page allocation
works in the kernel; for that I would recommend Lorenzo Stoakes' book:
The Linux Memory Manager.
Notably, however, much of the content of this book concerned with Nodes
and Zones was written pre-CXL. For the sake of this section we'll focus
on how additional nodes and tiers *affect* allocations - whatever their
mechanism (faults, file access, explicit, etc).
That means I expect you'll at least have a base-level understanding of
virtual memory and allocation-on-fault behavior. Most of what we're
talking about here is reclaim and migration - not page faults.
--------------------------------
Nodes, Tiers, and Zones - Oh My!
--------------------------------
==========
NUMA Nodes
==========
A NUMA node can *tacitly* be thought of as a "collection of homogeneous
resources". This is a fancy way of saying "All the memory on a given
node should have the same performance characteristics."
As discussed in Sections 0 and 1, however, we saw how nodes are
constructed quite arbitrarily. All that truly matters is how your
platform vendor has chosen to associate devices with "Proximity Domains"
in the various ACPI tables.
I'll stick with my moderately sane, and moderately wrong, definition.
Lets consider a system with 2 sockets and 1 CXL device attached to a
host bridge on each socket.
```
socket-interconnect
|
DRAM -- CPU0--------------CPU1 -- DRAM
| |
CXL0 CXL1
```
The "Locality" information for these devices is built in the ACPI SLIT
(System Locality Information Table).
For example (caveat - fake!):
```
Signature : "SLIT" [System Locality Information Table]
...
Localities : 0000000000000004
Locality 0 : 10 20 20 30
Locality 1 : 20 10 30 20
Locality 2 : FF FF 0A FF
Locality 3 : FF FF FF 0A
```
This is what shows up via the `numactl -H` command
```
$ numactl -H
node distances:
node 0 1 2 3
0: 16 32 32 48
1: 32 16 48 32
2: 255 255 16 255
3: 255 255 255 16
^^^ 255 typically means a node can't initiate access... typically
i.e. "has no processors"
```
These "Locality" values are "Abstract Distances" - i.e. fancy lies from
a black box piece of code that purports to describe something useful.
You may think NUMA nodes have a clean topological relationship like so:
```
node0--------------node1
| |
node2              node3
```
In reality, all Linux knows is that these are "relative distances".
```
Node 0 distance from other nodes:
0->[1,2]->3
Node 1 distance from other nodes:
1->[0,3]->2
```
Why does this matter?
Lets imagine a Node 0 CPU allocates a page, but Node 0 is out of memory.
Which node should Node 0 fall back to allocate from?
In our example above, nodes `[1,2]` seem like equally good options. In
reality, the cross-socket interconnect will usually be classified as
"closer" than a CXL device.
You can expect the following to be more realistic.
```
$ numactl -H
node distances:
node 0 1 2 3
0: 16 32 48 64
1: 32 16 64 48
2: 255 255 16 255
3: 255 255 255 16
Node 0 distance perspective:
0->1->2->3
Node 1 distance perspective
1->0->3->2
```
Which makes sense, because typically the cross-socket interconnect will
be faster than a CXL link, and for Node0 to access Node3, it must cross
both interconnects.
`Memory Tiers`, however, are quite a bit different.
=============
Memory Tiers.
=============
Memory tiers collect all similar-performance devices into a single "Tier".
These tiers can be inspected via sysfs:
```
[/sys/devices/virtual/memory_tiering/]# ls
memory_tier4  memory_tier961
[/sys/devices/virtual/memory_tiering/]# cat memory_tier4/nodelist
0-1
[/sys/devices/virtual/memory_tiering/]# cat memory_tier961/nodelist
2-3
```
On our example 2-socket system, both sockets hosting local DRAM Would
get lumped into the same tier, while both CXL devices would get lumped
into the same tier (lets assume they have the same latency/bandwidth).
```
memory_tier4--------------------
/ \ |
node0 node1 |
memory_tier961
/ \
node2 node3
```
This relationship, ostensibly, provides a quick and easy way to
determine a rough performance-based relationship between nodes.
This is, ostensibly, useful if you want to do memory tiering (demotion
and/or promotion). More on this in a bit.
Tiers are created based on performance data - typically provided by the
HMAT or CXL CDAT data. The memory-tiers component treats
socket-attached DRAM as the baseline, and generates its own abstract
distance (different from the SLIT abstract distance!).
```
int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
{
... snip ...
/*
* The abstract distance of a memory node is in direct proportion to
* its memory latency (read + write) and inversely proportional to its
* memory bandwidth (read + write). The abstract distance, memory
* latency, and memory bandwidth of the default DRAM nodes are used as
* the base.
*/
*adist = MEMTIER_ADISTANCE_DRAM *
(perf->read_latency + perf->write_latency) /
(default_dram_perf.read_latency + default_dram_perf.write_latency) *
(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
(perf->read_bandwidth + perf->write_bandwidth);
}
```
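
To get a feel for the proportionality, here's a back-of-the-envelope version
of that formula with made-up numbers (the 576 baseline stands in for
MEMTIER_ADISTANCE_DRAM, and the latency/bandwidth figures are purely
illustrative):

```
#include <stdio.h>

int main(void)
{
    int adist_dram = 576;  /* stand-in for the DRAM baseline adistance */

    /* read+write latency (ns) and bandwidth (GB/s) - made-up numbers */
    int dram_lat = 100 + 200, dram_bw = 100 + 100;
    int cxl_lat  = 300 + 400, cxl_bw  = 25 + 25;

    /* same chained integer math as mt_perf_to_adistance() above */
    int adist_cxl = adist_dram * cxl_lat / dram_lat * dram_bw / cxl_bw;

    printf("CXL adistance: %d (DRAM baseline: %d)\n", adist_cxl, adist_dram);
    return 0;
}
```

With those numbers the CXL node lands at roughly 9x the DRAM baseline, which
is what pushes it into its own (slower) tier.
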
The memory-tier component also provides a demotion-target mechanism,
which creates a recommended demotion-target based on a given node.
```
/**
* next_demotion_node() - Get the next node in the demotion path
* @node: The starting node to lookup the next node
*
* Return: node id for next memory node in the demotion path hierarchy
* from @node; NUMA_NO_NODE if @node is terminal. This does not keep
* @node online or guarantee that it *continues* to be the next demotion
* target.
*/
int next_demotion_node(int node);
```
The node_demotion map uses... SLIT provided node abstract distances to
determine the target!
```
/*
* node_demotion[] examples:
*
* Example 1:
*
* Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
*
* node distances:
* node 0 1 2 3
* 0 10 20 30 40
* 1 20 10 40 30
* 2 30 40 10 40
* 3 40 30 40 10
*
* memory_tiers0 = 0-1
* memory_tiers1 = 2-3
*
* node_demotion[0].preferred = 2
* node_demotion[1].preferred = 3
* node_demotion[2].preferred = <empty>
* node_demotion[3].preferred = <empty>
* ...
 */
```
As of 03/13/2025, there is no `next_promotion_node()` counterpart to
this function. As we'll probably learn in a later section:
Promotion Is Hard. (TM)
There is at least an interface to tell you whether a node is toptier:
bool node_is_toptier(int node);
The grand total of interfaces you need to know for the remainder of
this section is exactly one:
next_demotion_node(int node)
I may consider another section on Memory Tiering in the future, but
this is sufficient for now.
< unwarranted snark >
"But Greg", you say, "it seems to me that memory-tiers as designed are
of dubious value considering hardware interleave as described in
Section 4 already combines multiple devices into a single node - and
lumping remote nodes regardless of socket-relationship is at best
misleading and at worst may actively cause performance regressions!"
Very astute observation. Maybe we should rethink this component a bit.
< / unwarranted snark >
=============
Memory Zones.
=============
In Section 3 (Memory Hotplug) we briefly discussed memory zones. For
the purpose of this section, all we need to know is how these zones
impact allocation. I will largely quote the Linux kernel docs here.
```
* ZONE_NORMAL is for normal memory that can be accessed by the kernel
all the time. DMA operations can be performed on pages in this zone
if the DMA devices support transfers to all addressable memory.
ZONE_NORMAL is always enabled.
* ZONE_MOVABLE is for normal accessible memory, just like ZONE_NORMAL.
The difference is that the contents of most pages in ZONE_MOVABLE are
movable. That means that while virtual addresses of these pages do
not change, their content may move between different physical pages.
```
note: ZONE_NORMAL allocations MAY be movable. ZONE_MOVABLE must be.
(for some definition of `must`, suspend disbelief for now)
We generally don't want kernel resources incidentally finding themselves
on CXL memory (a highly contended lock landing on far-memory by complete
happenstance would be absolutely tragic). However, there aren't many
mechanisms to prevent this from occurring.
While the kernel may *explicitly* allocate ZONE_MOVABLE memory via
special interfaces, a typical `kmalloc()` call will utilize ZONE_NORMAL
memory, as most kernel allocations are not guaranteed to be movable.
That means most kernel allocations, should they happen to land on a
remote node, are *stuck there*. For most use cases, then, we will want
CXL memory onlined into ZONE_MOVABLE, because we'd like the option to
migrate memory off of these devices for a variety of reasons.
The most obvious mechanism to prevent the kernel from using CXL memory
is to online CXL memory in ZONE_MOVABLE.
However, ZONE_MOVABLE isn't without drawbacks. For example...
Gigantic (1GB) pages are not allocable from ZONE_MOVABLE. Many
hypervisors utilize Gigantic pages to limit TLB pressure. That means,
for now, VM use cases must pick between incidental kernel use, and
Gigantic page use.
This is all to say: Memory zone configuration affects your performance.
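
As a concrete (and hedged) illustration of that knob: each hotpluggable
memory block advertises which zones it may be onlined into, and the zone is
chosen at online time. The block number below is a placeholder, and in
practice udev rules, daxctl, or the memhp_default_state= /
auto_online_blocks policies do this rather than hand-written code.

```
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* memory42 is a placeholder; pick a real offline memory block */
    const char *base = "/sys/devices/system/memory/memory42";
    char path[128], zones[128] = "(unknown)";
    FILE *f;

    /* which zones this block is allowed to be onlined into */
    snprintf(path, sizeof(path), "%s/valid_zones", base);
    f = fopen(path, "r");
    if (f) {
        if (!fgets(zones, sizeof(zones), f))
            strcpy(zones, "(unknown)");
        else
            zones[strcspn(zones, "\n")] = '\0';
        fclose(f);
    }
    printf("valid zones: %s\n", zones);

    /* "online_movable" -> ZONE_MOVABLE, "online_kernel" -> ZONE_NORMAL */
    snprintf(path, sizeof(path), "%s/state", base);
    f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    if (fputs("online_movable", f) == EOF)
        perror("online_movable");
    fclose(f);
    return 0;
}
```
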
----------------
Page Allocation.
----------------
There are a variety of ways pages can be allocated, for this section
we'll just focus on a plain old allocate-on-fault interaction.
```
char* page = malloc(4096);
page[0] = '\x42'; /* Page fault, kernel allocates a page */
```
Assuming no special conditions (memory pressure, mempolicies, special
prefetchers - whatever), the default allocation policy in linux is to
allocate a page from the same node as the accessing CPU.
So if, in our example, we are running on a node1 CPU and hit the above
page fault, we'll allocate from the node1 DRAM.
```
Default allocation policy:
access
/ \- allocation
DRAM -- CPU0------CPU1 -- DRAM
| |
| |
CXL0 CXL1
```
Simple and, dare I say, elegant - really.
This is, of course, assuming we have no special conditions - of which
there are, of course, many.
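If you want to watch the default local-allocation behavior from userland,
here is a short hedged sketch (an illustrative program, not an existing
tool) that faults a page in and then asks the kernel where it landed via
move_pages() in query mode (libnuma headers; build with `gcc -lnuma`):
```
#define _GNU_SOURCE
#include <numaif.h>	/* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        void *page;
        int status = -1;

        if (posix_memalign(&page, psz, psz))
                return 1;

        ((char *)page)[0] = 0x42;	/* page fault, kernel allocates a page */

        /* nodes == NULL puts move_pages() in query mode: report placement */
        if (move_pages(0 /* self */, 1, &page, NULL, &status, 0))
                perror("move_pages");
        else
                printf("page landed on node %d\n", status);
        return 0;
}
```
Run it pinned to different CPUs (e.g. via taskset) and the reported node
should follow the local node of whichever CPU took the fault.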
================
Memory Pressure.
================
Let's assume DRAM on node1 is pressured, and there is insufficient
headroom to allocate a page on node1. What should we do?
We have a few options.
1) Fall back to another node
2) Attempt to steal a physical page from someone else. ("reclaim")
Let's assume reclaim doesn't exist for a moment.
What node should fall back to? One might assume we would consider
attempting to allocate based on the NUMA node topology.
For example:
```
* node distances:
* node 0 1 2 3
* 0 10 20 30 40
* 1 20 10 40 30
* 2 30 40 10 40
* 3 40 30 40 10
```
In this topology, Node1 would prefer allocating from Node0 as a
secondary source, and subsequently fall back to Node3 and Node2 as
those nodes become pressured.
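As an aside, you can dump the distance matrix of a live system with a
short hedged libnuma sketch (build with `-lnuma`; `numactl --hardware`
prints the same information):
```
#include <numa.h>
#include <stdio.h>

int main(void)
{
        int i, j, max;

        if (numa_available() < 0)
                return 1;	/* no NUMA support on this system */

        max = numa_max_node();
        printf("node");
        for (j = 0; j <= max; j++)
                printf("%5d", j);
        printf("\n");
        for (i = 0; i <= max; i++) {
                printf("%4d", i);
                for (j = 0; j <= max; j++)
                        printf("%5d", numa_distance(i, j));
                printf("\n");
        }
        return 0;
}
```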
This is basically what happens. But is that what we want?
If a page is being allocated, it is almost by definition "hot", and
so this has led the kernel to conclude that - generally speaking - new
allocations should be local unless explicitly requested otherwise.
So instead, by default, we will start engaging the reclaim system.
================================
Reclaim - LRU, Zones, and Nodes.
================================
In the scenario where memory is pressured and reclaim is in use, Linux
will go through a variety of phases based on watermarks (Min, Low,
High memory availability). These watermarks are used to determine
when reclaim should run and when the system should block further
allocations to ensure the kernel has sufficient headroom to make
forward progress.
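The shape of those checks is roughly the following (a simplified, hedged
kernel-side sketch; `zone_has_headroom()` is an illustrative helper, the
real logic lives in mm/page_alloc.c and is considerably more involved):
```
#include <linux/mmzone.h>	/* struct zone, low_wmark_pages() */
#include <linux/vmstat.h>	/* zone_page_state(), NR_FREE_PAGES */

/* Illustrative only: does this zone have headroom for a 2^order alloc? */
static bool zone_has_headroom(struct zone *z, unsigned int order)
{
        unsigned long free = zone_page_state(z, NR_FREE_PAGES);

        /* below the low watermark: wake kswapd; below min: direct reclaim */
        return free > low_wmark_pages(z) + (1UL << order);
}
```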
An allocation may cause a kernel daemon to start moving pages through
the LRU (least-recently-used) mechanism, or it may cause the task
itself to engage in the process.
The reclaim system may choose to swap pages to disk or to demote pages
from the local node to a remote node.
The key piece here is understanding the main LRU types and their
relationship to zones and nodes.
```
______node______
/ \
ZONE_NORMAL ZONE_MOVABLE
/ \ / \
active LRU inactive LRU active LRU inactive LRU
```
Typically reclaim is engaged when attempting an allocation and the
requested zone hits a low or min watermark. On our imaginary system,
let's assume we've set up the following structure.
```
node0 - DRAM node2 - CXL
| |
ZONE_NORMAL ZONE_MOVABLE
/ \ / \
active_lru inactive_lru active_lru inactive_lru
```
node0 (local) has no ZONE_MOVABLE, and node2 has no ZONE_NORMAL. Since
we always prefer allocations from the local node, we'll be evicting pages
from ZONE_NORMAL on node0 - that's the only zone we can allocate from.
Specifically, reclaim will prefer to evict pages from the inactive lru
and "age off" pages from the active_lru to the inactive_lru. If
reclaim fails, it may then fail to allocate from the requested node and
fall back to another node to continue forward progress.
(or maybe OOM, or some other nebulous conditions - it's all really quite
complex and well documented in Lorenzo's book, highly recommended!)
==================
Swap vs. Demotion.
==================
By default, the reclaim system will only age pages from active to
inactive LRUs, and then move to evict pages from the inactive LRU
(possibly engaging swap or just nixing read-only page mappings).
However, reclaim can be configured to *demote* pages as well via
the sysfs option:
$ echo 1 > /sys/kernel/mm/numa/demotion_enabled
In this scenario, rather than evict from inactive LRU to swap, we
can demote a page from its current node to its closest demotion target.
```
mm/vmscan.c:

static bool can_demote(int nid, struct scan_control *sc)
{
        if (!numa_demotion_enabled)
                return false;
        if (sc && sc->no_demotion)
                return false;
        if (next_demotion_node(nid) == NUMA_NO_NODE)
                return false;
        return true;
}

/*
 * Take folios on @demote_folios and attempt to demote them to another node.
 * Folios which are not demoted are left on @demote_folios.
 */
static unsigned int demote_folio_list(struct list_head *demote_folios,
                                      struct pglist_data *pgdat)
{
        ...
        /* Demotion ignores all cpuset and mempolicy settings */
        migrate_pages(demote_folios, alloc_migrate_folio, NULL,
                      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
                      &nr_succeeded);
}
```
Notice that the comment here says "Demotion ignores cpuset". That means
if you turn this setting on, and you require strong cpuset.mems
isolation, you're in for a surprise! Another fun nuance to trip over.
======================================
Kernel Memory Tiering - A Short Story.
======================================
The story so far:
In the beginning [Memory Tiering] was created.
This has made a lot of people very angry and
been widely regarded as a bad move.
~ Douglas Adams, The Restaurant at the End of the [CXL Fabric]
There isn't a solid consensus on how memory tiering should be
implemented in the kernel, so I will refrain from commenting on
the various proposals for now.
This likely deserves its own section which tumbles over
6 or 7 different RFCs in varying states - and ever so slightly
misrepresents the work enough to confuse everyone.
Let's not, for now.
So I will leave it here:
Most of these systems aim at 3 goals:
1) create space on the local nodes for new allocations
2) demote cold memory from local nodes to make room for hot memory
3) promote hot memory from remote nodes to reduce average latencies.
No one (largely) agrees on what the best approach for this is, yet.
If I were to make one request before anyone proposes yet *another*
tiering mechanism, I would ask that you take a crack at implementing
`next_promotion_node()` first.
```
/**
* next_promotion_node() - Get the next node in the promotion path
* @node: The starting node to lookup the next node
*
* Return: node id for next memory node in the promotion path hierarchy
* from @node; NUMA_NO_NODE if @node is top tier. This does not keep
* @node online or guarantee that it *continues* to be the next promotion
* target.
*/
int next_promotion_node(int node);
```
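Purely as a thought experiment (and emphatically *not* an existing kernel
interface), one naive hedged sketch might just pick the nearest top-tier
node by SLIT distance - ignoring capacity, tier ordering, and hotplug:
```
/* Hypothetical sketch only - no such helper exists in the kernel today. */
#include <linux/memory-tiers.h>	/* node_is_toptier() */
#include <linux/nodemask.h>	/* for_each_node_state(), N_MEMORY */
#include <linux/numa.h>		/* NUMA_NO_NODE */
#include <linux/topology.h>	/* node_distance() */

static int next_promotion_node(int node)
{
        int nid, best = NUMA_NO_NODE;

        if (node_is_toptier(node))
                return NUMA_NO_NODE;

        for_each_node_state(nid, N_MEMORY) {
                if (!node_is_toptier(nid))
                        continue;
                if (best == NUMA_NO_NODE ||
                    node_distance(node, nid) < node_distance(node, best))
                        best = nid;
        }
        return best;
}
```
Whether that behavior is what anyone actually wants is exactly the open
question.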
That's the whole ballgame.
Fin.
-----------------------------------------------------------------------
Yes, that's basically it. The kernel prefers to allocate new pages from
the local node, and it tries to demote memory to make sure this can
happen. Otherwise - incidental direct allocation can occur on fallback.
But how you configure your CXL memory dictates all this behavior. So
it's extremely important that we get the configuration part right.
This will be the end of the Boot to Bash series for the purpose of
LSFMM 2025 background. We will likely continue in a github repo or
something from here on.
See you all in Montreal.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
2025-03-13 18:17 ` Gregory Price
@ 2025-03-14 11:09 ` Jonathan Cameron
2025-03-14 13:46 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Jonathan Cameron @ 2025-03-14 11:09 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, 13 Mar 2025 14:17:57 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Thu, Mar 13, 2025 at 05:20:04PM +0000, Jonathan Cameron wrote:
> > Gregory Price <gourry@gourry.net> wrote:
> >
> > > -------------------------------
> > > One 2GB Device, Multiple CFMWS.
> > > -------------------------------
> > > Lets imagine we have one 2GB device attached to a host bridge.
> > >
> > > In this example, the device hosts 2GB of persistent memory - but we
> > > might want the flexibility to map capacity as volatile or persistent.
> >
> > Fairly sure we block persistent in a volatile CFMWS in the kernel.
> > Any bios actually does this?
> >
> > You might have a variable partition device but I thought in kernel at
> > least we decided that no one was building that crazy?
> >
>
> This was an example I pulled from Dan's notes elsewhere (i think).
>
> I was unaware that we blocked mapping persistent as volatile. I was
> working off the assumption that could be flexible mapped similar to...
> er... older, non-cxl hardware... cough.
You can use it as volatile, but that doesn't mean we allow it in a CFMWS
that says the host PA range is not suitable for persistent.
A BIOS might though I think.
>
> > Maybe a QoS split is a better example to motivate one range, two places?
> >
>
> That probably makes sense?
>
> > > -------------------------------------------------------------
> > > Two Devices On One Host Bridge - With and Without Interleave.
> > > -------------------------------------------------------------
> > > What if we wanted some capacity on each endpoint hosted on its own NUMA
> > > node, and wanted to interleave a portion of each device capacity?
> >
> > If anyone hits the lock on commit (i.e. annoying BIOS) the ordering
> > checks on HPA kick in here and restrict flexibility a lot
> > (assuming I understand them correctly that is)
> >
> > This is a good illustration of why we should at some point revisit
> > multiple NUMA nodes per CFMWS. We have to burn SPA space just
> > to get nodes. From a spec point of view all that is needed here
> > is a single CFMWS.
> >
>
> Along with the above note, and as mentioned on discord, I think this
> whole section naturally evolves into a library of "Sane configurations"
> and "We promise nothing for `reasons`" configurations.
:) Snag is that as Dan pointed out on discord we assume this applies
even without the lock. So it is possible to have device and host
hardware combinations where things are forced to be very non-intuitive.
>
> Maybe that turns into a kernel doc section that requires updating if
> a platform disagrees / comes up with new sane configurations. This is
> certainly the most difficult area to lock down because we have no idea
> who is going to `innovate` and how.
Yup. It gets much more 'fun' once DCD partitions/ regions enter the game
as there are many more types of memory.
Jonathan
>
> ~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-13 17:30 ` Gregory Price
@ 2025-03-14 11:14 ` Jonathan Cameron
0 siblings, 0 replies; 81+ messages in thread
From: Jonathan Cameron @ 2025-03-14 11:14 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, 13 Mar 2025 13:30:58 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Thu, Mar 13, 2025 at 04:55:39PM +0000, Jonathan Cameron wrote:
> >
> > Maybe ignore Generic Initiators for this doc. They are relevant for
> > CXL but in the fabric they only matter for type 1 / 2 devices not
> > memory and only if the BIOS wants to do HMAT for end to end. Gets
> > more fun when they are in the host side of the root bridge.
> >
>
> Fair, I wanted to reference the proposals but I personally don't have a
> strong understanding of this yet. Dave Jiang mentioned wanting to write
> some info on CDAT with some reference to the Generic Port work as well.
>
> Some help understanding this a little better would be very much
> appreciated, but I like your summary below. Noted for updated version.
>
> > # Generic Port
> >
> > In the scenario where CXL memory devices are not present at boot, or
> > not configured by the BIOS or the BIOS has not provided full HMAT
> > descriptions for the configured memory, we may still want to
> > generate proximity domain configurations for those devices.
> > The Generic Port structures are intended to fill this gap, so
> > that performance information can still be utilized when the
> > devices are available at runtime by combining host information
> > with that discovered from devices.
> >
> > Or just
> > # Generic Ports
> >
> > These are fun ;)
> >
>
> > >
> > > ====
> > > HMAT
> > > ====
> > > The Heterogeneous Memory Attributes Table contains information such as
> > > cache attributes and bandwidth and latency details for memory proximity
> > > domains. For the purpose of this document, we will only discuss the
> > > SSLIB entry.
> >
> > No fun. You miss Intel's extensions to memory-side caches ;)
> > (which is wise!)
> >
>
> Yes yes, but I'm trying to be nice. I'm debating on writing the Section
> 4 interleave addendum on Zen5 too :P
What do they get up to? I've not seen that one yet!
May be a case of 'Hold my beer' for these crazies.
>
> > > ==================
> > > NUMA node creation
> > > ===================
> > > NUMA nodes are *NOT* hot-pluggable. All *POSSIBLE* NUMA nodes are
> > > identified at `__init` time, more specifically during `mm_init`.
> > >
> > > What this means is that the CEDT and SRAT must contain sufficient
> > > `proximity domain` information for linux to identify how many NUMA
> > > nodes are required (and what memory regions to associate with them).
> >
> > Is it worth talking about what is effectively a constraint of the spec
> > and what is a Linux current constraint?
> >
> > SRAT is only ACPI defined way of getting Proximity nodes. Linux chooses
> > to at most map those 1:1 with NUMA nodes.
> > CEDT adds on description of SPA ranges where there might be memory that Linux
> > might want to map to 1 or more NUMA nodes
> >
>
> Rather than asking if it's worth talking about, I'll spin that around
> and ask what value the distinction adds. The source of the constraint
> seems less relevant than "All nodes must be defined during mm_init by
> something - be it ACPI or CXL source data".
>
> Maybe if this turns into a book, it's worth breaking it out for
> referential purposes (pointing to each point in each spec).
Fair point. It doesn't add much.
>
> > >
> > > Basically, the heuristic is as follows:
> > > 1) Add one NUMA node per Proximity Domain described in SRAT
> >
> > if it contains, memory, CPU or generic initiator.
> >
>
> noted
>
> > > 2) If the SRAT describes all memory described by all CFMWS
> > > - do not create nodes for CFMWS
> > > 3) If SRAT does not describe all memory described by CFMWS
> > > - create a node for that CFMWS
> > >
> > > Generally speaking, you will see one NUMA node per Host bridge, unless
> > > inter-host-bridge interleave is in use (see Section 4 - Interleave).
> >
> > I just love corners: QoS concerns might mean multiple CFMWS and hence
> > multiple nodes per host bridge (feel free to ignore this one - has
> > anyone seen this in the wild yet?) Similar mess for properties such
> > as persistence, sharing etc.
>
> This actually come up as a result of me writing this - this does exist
> in the wild and is causing all kinds of fun on the weighted_interleave
> functionality.
>
> I plan to come back and add this as an addendum, but probably not until
> after LSF.
>
> We'll probably want to expand this into a library of case studies that
> cover these different choices - in hopes of getting some set of
> *suggested* configurations for platform vendors to help play nice with
> linux (especially for things that actually consume these blasted nodes).
Agreed. We'll be looking back on this in a year or so and thinking, wasn't
life nice and simple back then!
Jonathan
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity
2025-03-14 11:09 ` Jonathan Cameron
@ 2025-03-14 13:46 ` Gregory Price
0 siblings, 0 replies; 81+ messages in thread
From: Gregory Price @ 2025-03-14 13:46 UTC (permalink / raw)
To: Jonathan Cameron; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Fri, Mar 14, 2025 at 11:09:42AM +0000, Jonathan Cameron wrote:
> >
> > I was unaware that we blocked mapping persistent as volatile. I was
> > working off the assumption that could be flexible mapped similar to...
> > er... older, non-cxl hardware... cough.
>
> You can use it as volatile, but that doesn't mean we allow it in a CFMWS
> that says the host PA range is not suitable for persistent.
> A BIOS might though I think.
>
aaaaaaaaaaaaah this helps. Ok, we can repurpose the hardware, but not
the CFMWS. Even more pressure on platforms to get it right :P.
> >
> > Along with the above note, and as mentioned on discord, I think this
> > whole section naturally evolves into a library of "Sane configurations"
> > and "We promise nothing for `reasons`" configurations.
>
> :) Snag is that as Dan pointed out on discord we assume this applies
> even without the lock. So it is possible to have device and host
> hardware combinations where things are forced to be very non-intuitive.
>
Right, but i think that falls into "We promise nothing, for `reasons`".
At the very least it would give us a communication tool that helps
bridge the gap between platform, linux, and end-users.
Or it'd just make it all worse, one of the two.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
` (5 preceding siblings ...)
2025-03-14 3:21 ` [LSF/MM] CXL Boot to Bash - Section 6: Page allocation Gregory Price
@ 2025-03-18 17:09 ` Gregory Price
2025-04-02 4:49 ` Gregory Price
6 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-18 17:09 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel, mhocko
Updating the summary to this because the old one no longer describes the
goal of the session well enough.
---
I'd like to propose a discussion of how CXL volatile memory device
configuration is creating a complex memory initialization/allocation
process. This is making it difficult for users to get "what they want".
Example issues for discussion:
- Linux-arch defined memory block size and alignment requirements
- CXL as ZONE_NORMAL vs ZONE_MOVABLE impacts on memmap and performance
- NUMA node and memory tier allocations, and new evolutions
a) multiple NUMA-nodes per device
b) confusing state of memory-tier component
- cma and 1GB hugepage allocation from ZONE_MOVABLE on CXL
Most of this complexity is being generated by how BIOS, ACPI, build, and
boot config options interact with each other. The best documented
description of these interactions is: "Complicated, Subtle, and Scary".
Having better documentation on "Linux Expectations for CXL Platforms"
will help guide innovations in this space - but requires some agreement
on limitations to keep existing components functional. This is becoming
increasingly important as more CXL hardware is placed into production.
The live session will not cover the full initialization process, which
I've attempted to describe and discuss in [1][2][3][4][5][6][7][8].
The intent is to evolve this series into more explicit documentation for
platform developers and help bridge the gap between maintainers that
need background on "Why the heck does CXL need this?".
[1] Section 0: ACPI Tables
https://lore.kernel.org/linux-mm/Z8jORKIWC3ZwtzI4@gourry-fedora-PF4VCD3F/
[2] Section 0a: CFMWS and NUMA Flexibility
https://lore.kernel.org/linux-mm/Z8u4GTrr-UytqXCB@gourry-fedora-PF4VCD3F/
[3] Section 1: BIOS, EFI, and Early Boot
https://lore.kernel.org/linux-mm/Z6LKJZkcdjuit2Ck@gourry-fedora-PF4VCD3F/
[4] Section 2: The Drivers
https://lore.kernel.org/linux-mm/Z6OMcLt3SrsZjgvw@gourry-fedora-PF4VCD3F/
[5] Section 2a: CXL Decoder Programming
https://lore.kernel.org/linux-mm/Z8o2HfVd0P_tMhV2@gourry-fedora-PF4VCD3F/
[6] Section 3: Memory (block) Hotplug
https://lore.kernel.org/linux-mm/Z7OWmDXEYhT0BB0X@gourry-fedora-PF4VCD3F/
[7] Section 4: Interleave
https://lore.kernel.org/linux-mm/Z9DQnjPWbkjqrI9n@gourry-fedora-PF4VCD3F/
[8] Section 5: Page Allocation
https://lore.kernel.org/linux-mm/Z9Ogp9fCEPoORfnh@gourry-fedora-PF4VCD3F/
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-12 0:09 ` [LSF/MM] CXL Boot to Bash - Section 4: Interleave Gregory Price
2025-03-13 8:31 ` Yuquan Wang
@ 2025-03-26 9:28 ` Yuquan Wang
2025-03-26 12:53 ` Gregory Price
1 sibling, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-26 9:28 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Tue, Mar 11, 2025 at 08:09:02PM -0400, Gregory Price wrote:
> -----------------------
> Combination Interleave.
> -----------------------
> Lets consider now a system where 2 Host Bridges have 2 CXL devices each,
> and we want to interleave the entire set. This requires us to make use
> of both inter and intra host bridge interleave.
>
> First, we can interleave this with a single CEDT entry, the same as
> the first inter-host-bridge CEDT (now assuming 1GB per device).
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region
> Window size : 0000000100000000 <- 4GB
> Interleave Members (2^n) : 01 <- 2-way interleave
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
> Next Target : 00000006 <- Host Bridge _UID
> ```
>
> This gives us a NUMA structure as follows:
> ```
> NUMA Structure:
>
> ---------- -------- | ----------
> | cpu0 |-----| DRAM |----|---| Node 0 |
> ---------- -------- | ----------
> / \ |
> ------- ------- | ----------
> | HB0 |-----| HB1 |-------------|---| Node 1 |
> ------- ------- | ----------
> / \ / \ |
> CXL0 CXL1 CXL2 CXL3 |
> ```
>
> And the respective decoder programming looks as follows
> ```
> Decoders:
> CXL Root
> |
> decoder0.0
> IW:2 IG:256
> [0x300000000, 0x3FFFFFFFF]
> / \
> Host Bridge 7 Host Bridge 6
> / \
> decoder1.0 decoder2.0
> IW:2 IG:512 IW:2 IG:512
> [0x300000000, 0x3FFFFFFFFF] [0x300000000, 0x3FFFFFFFF]
> / \ / \
> endpoint0 endpoint1 endpoint2 endpoint3
> | | | |
> decoder3.0 decoder4.0 decoder5.0 decoder6.0
> IW:4 IG:256 IW:4 IG:256
> [0x300000000, 0x3FFFFFFFF] [0x300000000, 0x3FFFFFFFF]
> ```
>
> Notice at both the root and the host bridge, the Interleave Ways is 2.
> There are two targets at each level. The host bridge has a granularity
> of 512 to capture its parent's ways and granularity (`2*256`).
>
> Each decoder is programmed with the total number of targets (4) and the
> overall granularity (256B).
>
Sorry, I tried to set this topology on Qemu Virt and used:
"cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3"
but it failed with:
"cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4"
It seems like the CFMWs IW should be 4?
Yuquan
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-26 9:28 ` Yuquan Wang
@ 2025-03-26 12:53 ` Gregory Price
2025-03-27 2:20 ` Yuquan Wang
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-26 12:53 UTC (permalink / raw)
To: Yuquan Wang; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Mar 26, 2025 at 05:28:00PM +0800, Yuquan Wang wrote:
> >
> > Notice at both the root and the host bridge, the Interleave Ways is 2.
> > There are two targets at each level. The host bridge has a granularity
> > of 512 to capture its parent's ways and granularity (`2*256`).
> >
> > Each decoder is programmed with the total number of targets (4) and the
> > overall granularity (256B).
> >
>
> Sorry, I tried to set this topology on Qemu Virt and used:
> "cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3"
>
> but it failed with:
> "cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4"
>
> It seems like the CFMWs IW should be 4?
>
It has been a while since i've interacted with QEMU's interleave stuff,
but IIRC (at least back when I was working on it) most configurations
had 1 device per host bridge - in which case the CFMWS IW should be 4
with each of the host bridges described in it.
I'm not sure you can do multiple devices per host bridge without a
switch setup.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-26 12:53 ` Gregory Price
@ 2025-03-27 2:20 ` Yuquan Wang
2025-03-27 2:51 ` [Lsf-pc] " Dan Williams
0 siblings, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-27 2:20 UTC (permalink / raw)
To: Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On Wed, Mar 26, 2025 at 08:53:14AM -0400, Gregory Price wrote:
> On Wed, Mar 26, 2025 at 05:28:00PM +0800, Yuquan Wang wrote:
> > >
> > > Notice at both the root and the host bridge, the Interleave Ways is 2.
> > > There are two targets at each level. The host bridge has a granularity
> > > of 512 to capture its parent's ways and granularity (`2*256`).
> > >
> > > Each decoder is programmed with the total number of targets (4) and the
> > > overall granularity (256B).
> > >
> >
> > Sorry, I tried to set this topology on Qemu Virt and used:
> > "cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3"
> >
> > but it failed with:
> > "cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4"
> >
> > It seems like the CFMWs IW should be 4?
> >
>
> It has been a while since i've interacted with QEMU's interleave stuff,
> but IIRC (at least back when I was working on it) most configurations
> had 1 device per host bridge - in which case the CFMWS IW should be 4
> with each of the host bridges described in it.
>
> I'm not sure you can do multiple devices per host bridge without a
> switch setup.
>
QEMU could add 'cxl-rp' under a CXL host bridge. Below is my QEMU
command:
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
-device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
-device cxl-type3,bus=root_port0,volatile-memdev=mem2,id=cxl-mem1 \
-device cxl-type3,bus=root_port1,volatile-memdev=mem3,id=cxl-mem2 \
-device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2 \
-device cxl-rp,port=2,bus=cxl.2,id=root_port2,chassis=0,slot=2 \
-device cxl-rp,port=3,bus=cxl.2,id=root_port3,chassis=0,slot=3 \
-device cxl-type3,bus=root_port2,volatile-memdev=mem4,id=cxl-mem3 \
-device cxl-type3,bus=root_port3,volatile-memdev=mem5,id=cxl-mem4 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=2G \
My lspci shows:
-+-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
| +-01.0 Red Hat, Inc. Virtio network device
| +-02.0 Red Hat, Inc. Virtio block device
| +-03.0 Red Hat, Inc. QEMU PCIe Expander bridge
| \-04.0 Red Hat, Inc. QEMU PCIe Expander bridge
+-[0000:0c]-+-00.0-[0d]----00.0 Intel Corporation Device 0d93 (CXL)
| \-01.0-[0e]----00.0 Intel Corporation Device 0d93 (CXL)
\-[0000:14]-+-00.0-[15]----00.0 Intel Corporation Device 0d93 (CXL)
\-01.0-[16]----00.0 Intel Corporation Device 0d93 (CXL)
My cxl list shows:
[
{
"memdev":"mem1",
"ram_size":268435456,
"serial":0,
"host":"0000:15:00.0"
},
{
"memdev":"mem0",
"ram_size":268435456,
"serial":0,
"host":"0000:16:00.0"
},
{
"memdev":"mem2",
"ram_size":268435456,
"serial":0,
"host":"0000:0e:00.0"
},
{
"memdev":"mem3",
"ram_size":268435456,
"serial":0,
"host":"0000:0d:00.0"
}
]
Then:
# cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3
cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4
cxl region: cmd_create_region: created 0 regions
This case confused me :(
Yuquan
> ~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [Lsf-pc] [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-27 2:20 ` Yuquan Wang
@ 2025-03-27 2:51 ` Dan Williams
2025-03-27 6:29 ` Yuquan Wang
0 siblings, 1 reply; 81+ messages in thread
From: Dan Williams @ 2025-03-27 2:51 UTC (permalink / raw)
To: Yuquan Wang, Gregory Price; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
Yuquan Wang wrote:
> On Wed, Mar 26, 2025 at 08:53:14AM -0400, Gregory Price wrote:
> > On Wed, Mar 26, 2025 at 05:28:00PM +0800, Yuquan Wang wrote:
> > > >
> > > > Notice at both the root and the host bridge, the Interleave Ways is 2.
> > > > There are two targets at each level. The host bridge has a granularity
> > > > of 512 to capture its parent's ways and granularity (`2*256`).
> > > >
> > > > Each decoder is programmed with the total number of targets (4) and the
> > > > overall granularity (256B).
> > > >
> > >
> > > Sorry, I tried to set this topology on Qemu Virt and used:
> > > "cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3"
> > >
> > > but it failed with:
> > > "cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4"
> > >
> > > It seems like the CFMWs IW should be 4?
> > >
> >
> > It has been a while since i've interacted with QEMU's interleave stuff,
> > but IIRC (at least back when I was working on it) most configurations
> > had 1 device per host bridge - in which case the CFMWS IW should be 4
> > with each of the host bridges described in it.
> >
> > I'm not sure you can do multiple devices per host bridge without a
> > switch setup.
> >
> Qemu counld add 'cxl-rp' under a cxl host bridge. Below is my qemu
> command:
>
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
> -device cxl-type3,bus=root_port0,volatile-memdev=mem2,id=cxl-mem1 \
> -device cxl-type3,bus=root_port1,volatile-memdev=mem3,id=cxl-mem2 \
> -device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2 \
> -device cxl-rp,port=2,bus=cxl.2,id=root_port2,chassis=0,slot=2 \
> -device cxl-rp,port=3,bus=cxl.2,id=root_port3,chassis=0,slot=3 \
> -device cxl-type3,bus=root_port2,volatile-memdev=mem4,id=cxl-mem3 \
> -device cxl-type3,bus=root_port3,volatile-memdev=mem5,id=cxl-mem4 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=2G \
>
> My lspci shows:
>
> -+-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
> | +-01.0 Red Hat, Inc. Virtio network device
> | +-02.0 Red Hat, Inc. Virtio block device
> | +-03.0 Red Hat, Inc. QEMU PCIe Expander bridge
> | \-04.0 Red Hat, Inc. QEMU PCIe Expander bridge
> +-[0000:0c]-+-00.0-[0d]----00.0 Intel Corporation Device 0d93 (CXL)
> | \-01.0-[0e]----00.0 Intel Corporation Device 0d93 (CXL)
> \-[0000:14]-+-00.0-[15]----00.0 Intel Corporation Device 0d93 (CXL)
> \-01.0-[16]----00.0 Intel Corporation Device 0d93 (CXL)
>
> My cxl list shows:
> [
> {
> "memdev":"mem1",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:15:00.0"
> },
> {
> "memdev":"mem0",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:16:00.0"
> },
> {
> "memdev":"mem2",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:0e:00.0"
> },
> {
> "memdev":"mem3",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:0d:00.0"
> }
> ]
>
> Then:
>
> # cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3
> cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4
> cxl region: cmd_create_region: created 0 regions
>
> This case confuesed me :(
What is the output of:
cxl list -M -d decoder0.0
...my expectation is that it only finds 2 potential endpoint devices
that are mapped by that decoder, I.e. QEMU did not produce a CFMWS that
interleaves the 2 host-bridges.
The error message could be improved to clarify that only endpoints
mapped by the given decoder are candidates to be members of a region.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [Lsf-pc] [LSF/MM] CXL Boot to Bash - Section 4: Interleave
2025-03-27 2:51 ` [Lsf-pc] " Dan Williams
@ 2025-03-27 6:29 ` Yuquan Wang
0 siblings, 0 replies; 81+ messages in thread
From: Yuquan Wang @ 2025-03-27 6:29 UTC (permalink / raw)
To: dan.j.williams, gourry; +Cc: lsf-pc, linux-mm, linux-cxl, linux-kernel
On 2025-03-27 10:51, dan.j.williams, gourry wrote:
Yuquan Wang wrote:
> On Wed, Mar 26, 2025 at 08:53:14AM -0400, Gregory Price wrote:
> > On Wed, Mar 26, 2025 at 05:28:00PM +0800, Yuquan Wang wrote:
> > > >
> > > > Notice at both the root and the host bridge, the Interleave Ways is 2.
> > > > There are two targets at each level. The host bridge has a granularity
> > > > of 512 to capture its parent's ways and granularity (`2*256`).
> > > >
> > > > Each decoder is programmed with the total number of targets (4) and the
> > > > overall granularity (256B).
> > > >
> > >
> > > Sorry, I tried to set this topology on Qemu Virt and used:
> > > "cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3"
> > >
> > > but it failed with:
> > > "cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4"
> > >
> > > It seems like the CFMWs IW should be 4?
> > >
> >
> > It has been a while since i've interacted with QEMU's interleave stuff,
> > but IIRC (at least back when I was working on it) most configurations
> > had 1 device per host bridge - in which case the CFMWS IW should be 4
> > with each of the host bridges described in it.
> >
> > I'm not sure you can do multiple devices per host bridge without a
> > switch setup.
> >
> Qemu counld add 'cxl-rp' under a cxl host bridge. Below is my qemu
> command:
>
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
> -device cxl-type3,bus=root_port0,volatile-memdev=mem2,id=cxl-mem1 \
> -device cxl-type3,bus=root_port1,volatile-memdev=mem3,id=cxl-mem2 \
> -device pxb-cxl,bus_nr=20,bus=pcie.0,id=cxl.2 \
> -device cxl-rp,port=2,bus=cxl.2,id=root_port2,chassis=0,slot=2 \
> -device cxl-rp,port=3,bus=cxl.2,id=root_port3,chassis=0,slot=3 \
> -device cxl-type3,bus=root_port2,volatile-memdev=mem4,id=cxl-mem3 \
> -device cxl-type3,bus=root_port3,volatile-memdev=mem5,id=cxl-mem4 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=2G \
>
> My lspci shows:
>
> -+-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
> | +-01.0 Red Hat, Inc. Virtio network device
> | +-02.0 Red Hat, Inc. Virtio block device
> | +-03.0 Red Hat, Inc. QEMU PCIe Expander bridge
> | \-04.0 Red Hat, Inc. QEMU PCIe Expander bridge
> +-[0000:0c]-+-00.0-[0d]----00.0 Intel Corporation Device 0d93 (CXL)
> | \-01.0-[0e]----00.0 Intel Corporation Device 0d93 (CXL)
> \-[0000:14]-+-00.0-[15]----00.0 Intel Corporation Device 0d93 (CXL)
> \-01.0-[16]----00.0 Intel Corporation Device 0d93 (CXL)
>
> My cxl list shows:
> [
> {
> "memdev":"mem1",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:15:00.0"
> },
> {
> "memdev":"mem0",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:16:00.0"
> },
> {
> "memdev":"mem2",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:0e:00.0"
> },
> {
> "memdev":"mem3",
> "ram_size":268435456,
> "serial":0,
> "host":"0000:0d:00.0"
> }
> ]
>
> Then:
>
> # cxl create-region -d decoder0.0 -t ram -m mem0,mem1,mem2,mem3
> cxl region: validate_ways: Interleave ways 2 is less than number of memdevs specified: 4
> cxl region: cmd_create_region: created 0 regions
>
> This case confuesed me :(
What is the output of:
cxl list -M -d decoder0.0
...my expectation is that it only finds 2 potential endpoint devices
that are mapped by that decoder, I.e. QEMU did not produce a CFMWS that
interleaves the 2 host-bridges.
The error message could be improved to clarify that only endpoints
mapped by the given decoder are candidates to be members of a region.
# cxl list -d decoder0.0
[
{
"decoder":"decoder0.0",
"resource":1099511627776,
"size":2147483648,
"interleave_ways":2,
"interleave_granularity":256,
"max_available_extent":2147483648,
"pmem_capable":true,
"volatile_capable":true,
"accelmem_capable":true,
"qos_class":0,
"nr_targets":2
}
]
CFMWS:
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000010000000000
Window size : 0000000080000000
Interleave Members : 01
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 000F
QtgId : 0000
First Target : 0000000C
Next Target : 00000014
Yuquan
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-13 16:55 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Jonathan Cameron
2025-03-13 17:30 ` Gregory Price
@ 2025-03-27 9:34 ` Yuquan Wang
2025-03-27 12:36 ` Gregory Price
1 sibling, 1 reply; 81+ messages in thread
From: Yuquan Wang @ 2025-03-27 9:34 UTC (permalink / raw)
To: Jonathan Cameron; +Cc: Gregory Price, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 13, 2025 at 04:55:39PM +0000, Jonathan Cameron wrote:
> >
> > Basically, the heuristic is as follows:
> > 1) Add one NUMA node per Proximity Domain described in SRAT
>
> if it contains, memory, CPU or generic initiator.
In the future, srat.c would add one separate NUMA node for each
Generic Port in SRAT.
System firmware should know the performance characteristics between
CPU/GI to the GP, and the static HMAT should include this coordinate.
Is my understanding right?
Yuquan
>
> > 2) If the SRAT describes all memory described by all CFMWS
> > - do not create nodes for CFMWS
> > 3) If SRAT does not describe all memory described by CFMWS
> > - create a node for that CFMWS
> >
> > Generally speaking, you will see one NUMA node per Host bridge, unless
> > inter-host-bridge interleave is in use (see Section 4 - Interleave).
>
> I just love corners: QoS concerns might mean multiple CFMWS and hence
> multiple nodes per host bridge (feel free to ignore this one - has
> anyone seen this in the wild yet?) Similar mess for properties such
> as persistence, sharing etc.
>
> J
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-27 9:34 ` Yuquan Wang
@ 2025-03-27 12:36 ` Gregory Price
2025-03-27 13:21 ` Dan Williams
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-27 12:36 UTC (permalink / raw)
To: Yuquan Wang; +Cc: Jonathan Cameron, lsf-pc, linux-mm, linux-cxl, linux-kernel
On Thu, Mar 27, 2025 at 05:34:54PM +0800, Yuquan Wang wrote:
>
> In the future, srat.c would add one seperate NUMA node for each
> Generic Port in SRAT.
>
> System firmware should know the performance characteristics between
> CPU/GI to the GP, and the static HMAT should include this coordinate.
>
> Is my understanding right?
>
>
HMAT is static configuration data. A GI/GP might not have its
performance data known until the device is added. In this case, the
HMAT may only contain generic (inaccurate) information - or more likely
just nothing. It's up to the platform vendor.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-27 12:36 ` Gregory Price
@ 2025-03-27 13:21 ` Dan Williams
2025-03-27 16:36 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Dan Williams @ 2025-03-27 13:21 UTC (permalink / raw)
To: Gregory Price, Yuquan Wang
Cc: Jonathan Cameron, lsf-pc, linux-mm, linux-cxl, linux-kernel
Gregory Price wrote:
> On Thu, Mar 27, 2025 at 05:34:54PM +0800, Yuquan Wang wrote:
> >
> > In the future, srat.c would add one seperate NUMA node for each
> > Generic Port in SRAT.
> >
> > System firmware should know the performance characteristics between
> > CPU/GI to the GP, and the static HMAT should include this coordinate.
> >
> > Is my understanding right?
> >
> >
>
> HMAT is static configuration data. A GI/GP might not have its
> performance data known until the device is added.
The GP data is static and expected to be valid for all host bridges in
advance of any devices arriving.
> In this case, the HMAT may only contain generic (inaccurate)
> information - or more likely just nothing.
It may contain incomplete information, but not dynamic information.
Now, technically ACPI does allow for a notification to replace the
static HMAT, i.e. acpi_evaluate_object(..., "_HMA", ...), but Linux
does not support that. I am not aware of any platform that implements
that.
> It's up to the platform vendor.
Hopefully saying the above out loud manifests that Linux never needs to
worry about dynamic GP information in the future and all the other
dynamics remain in the physical hotplug domain.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-27 13:21 ` Dan Williams
@ 2025-03-27 16:36 ` Gregory Price
2025-03-31 23:49 ` [Lsf-pc] " Dan Williams
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-03-27 16:36 UTC (permalink / raw)
To: Dan Williams
Cc: Yuquan Wang, Jonathan Cameron, lsf-pc, linux-mm, linux-cxl,
linux-kernel
On Thu, Mar 27, 2025 at 09:21:55AM -0400, Dan Williams wrote:
> Gregory Price wrote:
> > On Thu, Mar 27, 2025 at 05:34:54PM +0800, Yuquan Wang wrote:
> > >
> > > In the future, srat.c would add one seperate NUMA node for each
> > > Generic Port in SRAT.
> > >
> > > System firmware should know the performance characteristics between
> > > CPU/GI to the GP, and the static HMAT should include this coordinate.
> > >
> > > Is my understanding right?
> > >
> > >
> >
> > HMAT is static configuration data. A GI/GP might not have its
> > performance data known until the device is added.
>
> The GP data is static and expected to be valid for all host bridges in
> advance of any devices arriving.
>
Sorry, just shuffling words here for clarity. Making sure I understand:
The GP data is static and enables Linux to do things like reserve numa
nodes for any devices that might arrive in the future (i.e. create static
objects that cannot be created post-__init).
If there's no device, there should not be any HMAT data. If / when a
device arrives, it's up to the OS to acquire that information from the
device (e.g. CDAT). At this point the ACPI tables are not (shouldn't
be) involved - it's all OS/device interactions.
I should note that I don't have a full grasp of the GP ACPI stuff yet,
so doing my best to grok it as I go here.
~Gregory
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [Lsf-pc] [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources
2025-03-27 16:36 ` Gregory Price
@ 2025-03-31 23:49 ` Dan Williams
0 siblings, 0 replies; 81+ messages in thread
From: Dan Williams @ 2025-03-31 23:49 UTC (permalink / raw)
To: Gregory Price, Dan Williams
Cc: Yuquan Wang, Jonathan Cameron, lsf-pc, linux-mm, linux-cxl,
linux-kernel
Gregory Price wrote:
> On Thu, Mar 27, 2025 at 09:21:55AM -0400, Dan Williams wrote:
> > Gregory Price wrote:
> > > On Thu, Mar 27, 2025 at 05:34:54PM +0800, Yuquan Wang wrote:
> > > >
> > > > In the future, srat.c would add one seperate NUMA node for each
> > > > Generic Port in SRAT.
> > > >
> > > > System firmware should know the performance characteristics between
> > > > CPU/GI to the GP, and the static HMAT should include this coordinate.
> > > >
> > > > Is my understanding right?
> > > >
> > > >
> > >
> > > HMAT is static configuration data. A GI/GP might not have its
> > > performance data known until the device is added.
> >
> > The GP data is static and expected to be valid for all host bridges in
> > advance of any devices arriving.
> >
>
> Sorry, just shuffling words here for clarity. Making sure I understand:
>
> The GP data is static and enables Linux to do things like reserve numa
> nodes for any devices might arrive in the future (i.e. create static
> objects that cannot be created post-__init).
Small nuance, the CFMWS is what Linux uses to reserve numa nodes, the GP
data is there to dynamically craft the equivalent of HMAT data for those
nodes when the device shows up.
Recall that the CFMWS enumerates a QoS class for each CXL window. That
QoS class is decided by some (waves hands) coordination between host
platform and device vendors. So there is some, opaque to the OS,
decisions about which devices should be mapped in what window.
See "9.17.3.1 _DSM Function for Retrieving QTG ID" for that opaque
process.
Linux today just reports whether a device has any memory capacity that
matches any free-capacity-window QoS class, but does not enforce that
they must be compatible. This follows the assumption that it is better
to make capacity available than perfectly match performance
characteristics.
> If there's no device, there should not be any HMAT data.
...beyond GP data.
> If / when a device arrives, it's up to the OS to acquire that
> information from the device (e.g. CDAT). At this point the ACPI
> tables are not (shouldn't be) involved - it's all OS/device
> interactions.
>
> I should note that I don't have a full grasp of the GP ACPI stuff yet,
> so doing my best to grok it as I go here.
Again, no worries, this documentation is pulling this all together into
one story.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices
2025-03-18 17:09 ` [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices Gregory Price
@ 2025-04-02 4:49 ` Gregory Price
[not found] ` <CGME20250407161445uscas1p19322b476cafd59f9d7d6e1877f3148b8@uscas1p1.samsung.com>
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-04-02 4:49 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-cxl, linux-kernel, mhocko, a.manzanares
Slides up at:
https://github.com/gourryinverse/cxl-boot-to-bash/blob/main/presentations/lsfmm25.pdf
Docs Transitioning to:
https://github.com/gourryinverse/cxl-boot-to-bash/tree/main
Browsable Page:
https://gourryinverse.github.io/cxl-boot-to-bash/
All help hacking on this is welcome. I imagine we'll convert some
portion of this over to kernel docs, but not sure what.
~Gregory Price
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-03-06 23:56 ` CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming Gregory Price
2025-03-07 0:57 ` Zhijian Li (Fujitsu)
@ 2025-04-02 6:45 ` Zhijian Li (Fujitsu)
2025-04-02 14:18 ` Gregory Price
1 sibling, 1 reply; 81+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-04-02 6:45 UTC (permalink / raw)
To: Gregory Price, lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org
Hi Gregory,
On 07/03/2025 07:56, Gregory Price wrote:
> What if instead, we had two 256MB endpoints on the same host bridge?
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000020000000 <- 512MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
>
> Decoders
> decoder0.0
> range=[0x100000000, 0x120000000]
> |
> decoder1.0
> range=[0x100000000, 0x120000000]
> / \
> decoded2.0 decoder3.0
> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> ```
It reminds me that during construct_region(), the decoder range in the
switch/host-bridge is required to exactly match the endpoint decoder's; see
match_switch_decoder_by_range()
If so, does following decoders make sense?
Decoders
decoder0.0
range=[0x100000000, 0x120000000]
|
+------------+-----------+
/ \
| Host-bridge contains |
decoder1.0 2 decoders decoder1.1
range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
/ \
decoded2.0 decoder3.0
range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
Thanks
Zhijian
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-04-02 6:45 ` Zhijian Li (Fujitsu)
@ 2025-04-02 14:18 ` Gregory Price
2025-04-08 3:10 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-04-02 14:18 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed, Apr 02, 2025 at 06:45:33AM +0000, Zhijian Li (Fujitsu) wrote:
> Hi Gregory,
>
>
> On 07/03/2025 07:56, Gregory Price wrote:
> > What if instead, we had two 256MB endpoints on the same host bridge?
> >
> > ```
> > CEDT
> > Subtable Type : 01 [CXL Fixed Memory Window Structure]
> > Reserved : 00
> > Length : 002C
> > Reserved : 00000000
> > Window base address : 0000000100000000 <- Memory Region
> > Window size : 0000000020000000 <- 512MB
> > Interleave Members (2^n) : 00 <- Not interleaved
> >
> > Memory Map:
> > [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
> >
> > Decoders
> > decoder0.0
> > range=[0x100000000, 0x120000000]
> > |
> > decoder1.0
> > range=[0x100000000, 0x120000000]
> > / \
> > decoded2.0 decoder3.0
> > range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> > ```
>
> It reminds me that during construct_region(), it requires decoder range in the
> switch/host-bridge is exact same with the endpoint decoder. see
> match_switch_decoder_by_range()
>
> If so, does following decoders make sense?
>
>
> Decoders
> decoder0.0
> range=[0x100000000, 0x120000000]
> |
> +------------+-----------+
> / \
> | Host-bridge contains |
> decoder1.0 2 decoders decoder1.1
> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> / \
> decoded2.0 decoder3.0
> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
>
You are correct, i'll update this.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices
[not found] ` <CGME20250407161445uscas1p19322b476cafd59f9d7d6e1877f3148b8@uscas1p1.samsung.com>
@ 2025-04-07 16:14 ` Adam Manzanares
0 siblings, 0 replies; 81+ messages in thread
From: Adam Manzanares @ 2025-04-07 16:14 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
mhocko@suse.com
On Wed, Apr 02, 2025 at 12:49:55AM -0400, Gregory Price wrote:
> Slides up at:
> https://github.com/gourryinverse/cxl-boot-to-bash/blob/main/presentations/lsfmm25.pdf
>
> Docs Transitioning to:
> https://github.com/gourryinverse/cxl-boot-to-bash/tree/main
>
> Browsable Page:
> https://gourryinverse.github.io/cxl-boot-to-bash/
>
> All help hacking on this is welcome. I imagine we'll convert some
> portion of this over to kernel docs, but not sure what.
Thanks for presenting at LSF/MM and posting links
>
> ~Gregory Price
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-04-02 14:18 ` Gregory Price
@ 2025-04-08 3:10 ` Zhijian Li (Fujitsu)
2025-04-08 4:14 ` Gregory Price
0 siblings, 1 reply; 81+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-04-08 3:10 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Hi Gregory and CXL
One more (silly) question as below
On 02/04/2025 22:18, Gregory Price wrote:
> On Wed, Apr 02, 2025 at 06:45:33AM +0000, Zhijian Li (Fujitsu) wrote:
>> Hi Gregory,
>>
>>
>> On 07/03/2025 07:56, Gregory Price wrote:
>>> What if instead, we had two 256MB endpoints on the same host bridge?
>>>
>>> ```
>>> CEDT
>>> Subtable Type : 01 [CXL Fixed Memory Window Structure]
>>> Reserved : 00
>>> Length : 002C
>>> Reserved : 00000000
>>> Window base address : 0000000100000000 <- Memory Region
>>> Window size : 0000000020000000 <- 512MB
>>> Interleave Members (2^n) : 00 <- Not interleaved
>>>
>>> Memory Map:
>>> [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
>>>
>>> Decoders
>>> decoder0.0
>>> range=[0x100000000, 0x120000000]
>>> |
>>> decoder1.0
>>> range=[0x100000000, 0x120000000]
>>> / \
>>> decoded2.0 decoder3.0
>>> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
>>> ```
>>
>> It reminds me that during construct_region(), it requires decoder range in the
>> switch/host-bridge is exact same with the endpoint decoder. see
>> match_switch_decoder_by_range()
From the code, we can infer this point. However, is this just a solution implemented in software,
or is it explicitly mandated by the CXL SPEC or elsewhere? If you are aware, please let me know.
I have been trying for days to find documentary evidence to persuade our firmware team that,
during device provisioning, the programming of the HDM decoder should adhere to this principle:
The range in the HDM decoder should be exactly the same between the device and its upstream switch.
Thanks
Zhijian
>>
>> If so, does following decoders make sense?
>>
>>
>> Decoders
>> decoder0.0
>> range=[0x100000000, 0x120000000]
>> |
>> +------------+-----------+
>> / \
>> | Host-bridge contains |
>> decoder1.0 2 decoders decoder1.1
>> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
>> / \
>> decoded2.0 decoder3.0
>> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
>>
>
> You are correct, i'll update this.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-04-08 3:10 ` Zhijian Li (Fujitsu)
@ 2025-04-08 4:14 ` Gregory Price
2025-04-08 5:37 ` Zhijian Li (Fujitsu)
0 siblings, 1 reply; 81+ messages in thread
From: Gregory Price @ 2025-04-08 4:14 UTC (permalink / raw)
To: Zhijian Li (Fujitsu)
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
On Tue, Apr 08, 2025 at 03:10:24AM +0000, Zhijian Li (Fujitsu) wrote:
> >> On 07/03/2025 07:56, Gregory Price wrote:
> >>> What if instead, we had two 256MB endpoints on the same host bridge?
> >>>
> >>> ```
> >>> CEDT
> >>> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> >>> Reserved : 00
> >>> Length : 002C
> >>> Reserved : 00000000
> >>> Window base address : 0000000100000000 <- Memory Region
> >>> Window size : 0000000020000000 <- 512MB
> >>> Interleave Members (2^n) : 00 <- Not interleaved
> >>>
> >>> Memory Map:
> >>> [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
> >>>
> >>> Decoders
> >>> decoder0.0
> >>> range=[0x100000000, 0x120000000]
> >>> |
> >>> decoder1.0
> >>> range=[0x100000000, 0x120000000]
> >>> / \
> >>> decoded2.0 decoder3.0
> >>> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> >>> ```
> >>
> >> It reminds me that during construct_region(), it requires decoder range in the
> >> switch/host-bridge is exact same with the endpoint decoder. see
> >> match_switch_decoder_by_range()
>
>
> From the code, we can infer this point. However, is this just a solution implemented in software,
> or is it explicitly mandated by the CXL SPEC or elsewhere? If you are aware, please let me know.
>
The description you've quoted here is incorrect, as I didn't fully
understand the correct interleave configuration. I plan on re-writing
this portion with correct configurations over the next month.
Linux does expect all decoders from root to endpoint to be programmed
with the same range*[2].
please keep an eye on [1] for updates, i won't be updating this thread
with further edits.
> I have been trying for days to find documentary evidence to persuade our firmware team that,
> during device provisioning, the programming of the HDM decoder should adhere to this principle:
> The range in the HDM decoder should be exactly the same between the device and its upstream switch.
>
In general, everything included in this guide does not care about what
the spec says is possible - it only concerns itself with what linux
supports. If there is a mechanism described in the spec that isn't
supported, it is expected that an interested vendor will come along to
help support it.
However, the current Linux driver absolutely expects the range in the
HDM decoders should be exactly the same from root to endpoint*.
My reading of the 3.1 spec suggests this is also defined by implication
of the "Implementation Notes" at the end of section
8.2.4.20 CXL HDM Decoder Capability Structure
IMPLEMENTATION NOTE
CXL Host Bridge and Upstream Switch Port Decode Flow
IMPLEMENTATION NOTE
Device Decode Logic
The host bridge/USP implementation note describes extracting bits for
routing, while the device decode logic describes active translation from
HPA to DPA.
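As a rough sketch of that difference (my own reading of the decode math,
assuming power-of-2 interleave with the spec's 256B-based granularity
encoding; the function names below are made up for illustration): the
host bridge/USP only inspects the selector bits to pick a downstream
port, while the endpoint strips those bits out of the HPA offset to form
a DPA offset.
```
#include <stdint.h>
#include <stdio.h>

/* Assumptions (illustrative): granularity = 256 << ig bytes, ways = 1 << iw,
 * and the interleave selector sits at bit (8 + ig) of the HPA offset. */

/* Host bridge / USP: extract bits for routing - which downstream port? */
static unsigned int route_target(uint64_t hpa, uint64_t base,
				 unsigned int iw, unsigned int ig)
{
	return (unsigned int)(((hpa - base) >> (8 + ig)) & ((1u << iw) - 1));
}

/* Device decode logic: drop the selector bits to translate an HPA offset
 * into a DPA offset (active translation). */
static uint64_t hpa_to_dpa_offset(uint64_t hpa, uint64_t base,
				  unsigned int iw, unsigned int ig)
{
	uint64_t offset = hpa - base;
	uint64_t low  = offset & ((1ull << (8 + ig)) - 1);  /* bytes within a granule */
	uint64_t high = offset >> (8 + ig + iw);            /* granule index on this device */

	return (high << (8 + ig)) | low;
}

int main(void)
{
	/* Example only: 2-way interleave (iw = 1), 256B granularity (ig = 0). */
	uint64_t base = 0x100000000ULL;
	uint64_t hpa  = base + 0x300;   /* the fourth 256B granule in the window */

	printf("routes to port %u\n", route_target(hpa, base, 1, 0));    /* -> 1 */
	printf("DPA offset 0x%llx\n",
	       (unsigned long long)hpa_to_dpa_offset(hpa, base, 1, 0));  /* -> 0x100 */
	return 0;
}
```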
~Gregory
[1] https://gourryinverse.github.io/cxl-boot-to-bash/
^ with the exception of Zen5 [2], which I don't recommend you replicate
[2] https://lore.kernel.org/linux-cxl/20250218132356.1809075-1-rrichter@amd.com/T/#t
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
2025-04-08 4:14 ` Gregory Price
@ 2025-04-08 5:37 ` Zhijian Li (Fujitsu)
0 siblings, 0 replies; 81+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-04-08 5:37 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org
Gregory,
On 08/04/2025 12:14, Gregory Price wrote:
> My reading of the 3.1 spec suggests this is also defined by implication
> of the "Implementation Notes" at the end of section
>
> 8.2.4.20 CXL HDM Decoder Capability Structure
>
> IMPLEMENTATION NOTE
> CXL Host Bridge and Upstream Switch Port Decode Flow
>
> IMPLEMENTATION NOTE
> Device Decode Logic
Great, I am grateful for your enlightening guidance.
Many many thanks again.
^ permalink raw reply [flat|nested] 81+ messages in thread
end of thread
Thread overview: 81+ messages
2024-12-26 20:19 [LSF/MM] Linux management of volatile CXL memory devices - boot to bash Gregory Price
2025-02-05 2:17 ` [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot Gregory Price
2025-02-18 10:12 ` Yuquan Wang
2025-02-18 16:11 ` Gregory Price
2025-02-20 16:30 ` Jonathan Cameron
2025-02-20 16:52 ` Gregory Price
2025-03-04 0:32 ` Gregory Price
2025-03-13 16:12 ` Jonathan Cameron
2025-03-13 17:20 ` Gregory Price
2025-03-10 10:45 ` Yuquan Wang
2025-03-10 14:19 ` Gregory Price
2025-02-05 16:06 ` CXL Boot to Bash - Section 2: The Drivers Gregory Price
2025-02-06 0:47 ` Dan Williams
2025-02-06 15:59 ` Gregory Price
2025-03-04 1:32 ` Gregory Price
2025-03-06 23:56 ` CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming Gregory Price
2025-03-07 0:57 ` Zhijian Li (Fujitsu)
2025-03-07 15:07 ` Gregory Price
2025-03-11 2:48 ` Zhijian Li (Fujitsu)
2025-04-02 6:45 ` Zhijian Li (Fujitsu)
2025-04-02 14:18 ` Gregory Price
2025-04-08 3:10 ` Zhijian Li (Fujitsu)
2025-04-08 4:14 ` Gregory Price
2025-04-08 5:37 ` Zhijian Li (Fujitsu)
2025-02-17 20:05 ` CXL Boot to Bash - Section 3: Memory (block) Hotplug Gregory Price
2025-02-18 16:24 ` David Hildenbrand
2025-02-18 17:03 ` Gregory Price
2025-02-18 17:49 ` Yang Shi
2025-02-18 18:04 ` Gregory Price
2025-02-18 19:25 ` David Hildenbrand
2025-02-18 20:25 ` Gregory Price
2025-02-18 20:57 ` David Hildenbrand
2025-02-19 1:10 ` Gregory Price
2025-02-19 8:53 ` David Hildenbrand
2025-02-19 16:14 ` Gregory Price
2025-02-20 17:50 ` Yang Shi
2025-02-20 18:43 ` Gregory Price
2025-02-20 19:26 ` David Hildenbrand
2025-02-20 19:35 ` Gregory Price
2025-02-20 19:44 ` David Hildenbrand
2025-02-20 20:06 ` Gregory Price
2025-03-11 14:53 ` Zi Yan
2025-03-11 15:58 ` Gregory Price
2025-03-11 16:08 ` Zi Yan
2025-03-11 16:15 ` Gregory Price
2025-03-11 16:35 ` Oscar Salvador
2025-03-05 22:20 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Gregory Price
2025-03-05 22:44 ` Dave Jiang
2025-03-05 23:34 ` Gregory Price
2025-03-05 23:41 ` Dave Jiang
2025-03-06 0:09 ` Gregory Price
2025-03-06 1:37 ` Yuquan Wang
2025-03-06 17:08 ` Gregory Price
2025-03-07 2:20 ` Yuquan Wang
2025-03-07 15:12 ` Gregory Price
2025-03-13 17:00 ` Jonathan Cameron
2025-03-08 3:23 ` [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexiblity Gregory Price
2025-03-13 17:20 ` Jonathan Cameron
2025-03-13 18:17 ` Gregory Price
2025-03-14 11:09 ` Jonathan Cameron
2025-03-14 13:46 ` Gregory Price
2025-03-13 16:55 ` [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources Jonathan Cameron
2025-03-13 17:30 ` Gregory Price
2025-03-14 11:14 ` Jonathan Cameron
2025-03-27 9:34 ` Yuquan Wang
2025-03-27 12:36 ` Gregory Price
2025-03-27 13:21 ` Dan Williams
2025-03-27 16:36 ` Gregory Price
2025-03-31 23:49 ` [Lsf-pc] " Dan Williams
2025-03-12 0:09 ` [LSF/MM] CXL Boot to Bash - Section 4: Interleave Gregory Price
2025-03-13 8:31 ` Yuquan Wang
2025-03-13 16:48 ` Gregory Price
2025-03-26 9:28 ` Yuquan Wang
2025-03-26 12:53 ` Gregory Price
2025-03-27 2:20 ` Yuquan Wang
2025-03-27 2:51 ` [Lsf-pc] " Dan Williams
2025-03-27 6:29 ` Yuquan Wang
2025-03-14 3:21 ` [LSF/MM] CXL Boot to Bash - Section 6: Page allocation Gregory Price
2025-03-18 17:09 ` [LSFMM] Updated: Linux Management of Volatile CXL Memory Devices Gregory Price
2025-04-02 4:49 ` Gregory Price
[not found] ` <CGME20250407161445uscas1p19322b476cafd59f9d7d6e1877f3148b8@uscas1p1.samsung.com>
2025-04-07 16:14 ` Adam Manzanares