* [PATCH v3 00/17] CXL Boot to Bash Documentation
@ 2025-05-12 16:21 Gregory Price
2025-05-12 16:21 ` [PATCH v3 01/17] cxl: update documentation structure in prep for new docs Gregory Price
` (17 more replies)
0 siblings, 18 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet, Joshua Hahn
v3:
- Cross-links (Bagas)
- Grammar and spelling (Randy)
- added fixups to access-coordinates (Bagas)
- Drop TODO sections (use-case, memory-tiering, CDAT/UEFI, SRAT Genport)
I unfortunately won't be able to come back around to this for
a while, so I'd rather not let this rot.
---
This series converts CXL Boot to Bash Docs from LSFMM '25 to Linux
Kernel Docs. In brief, this document covers (almost) everything Linux
expects from platforms to successfully bring volatile CXL memory
capacity online as a DAX device and/or SystemRAM.
It covers:
- Platform configuration data (ACPI Tables, EFI Memory Map, EFI Configs)
- Linux Build and Boot Parameters
- Linux consumption of Platform, Build, and Boot params
- Linux creation of base resources (NUMA nodes, memory tiers, etc)
- CXL Driver probe process and sysfs structure
- DAX Driver interactions between the CXL driver and memory hotplug
- Memory hotplug interactions
- Page allocator interactions (NUMA nodes, Memory Zones, Reclaim, etc).
Included are example platform configurations (ACPI tables) and cxl
decoder configurations to guide platform developers on expected
configurations (which may be more strict than the CXL spec).
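As one illustration of such an expectation, the decoder programming
guidance in the bios-and-efi document says Linux assumes every decoder
between the CPU and the endpoint covers a subset of its parent's address
range. The sketch below is only a reviewer aid written in Python, not
driver code: the Decoder class and both helper names are hypothetical,
and the addresses are taken from the memory-hole example in that
document.

```python
from dataclasses import dataclass


@dataclass
class Decoder:
    """Hypothetical stand-in for a root, host-bridge, or endpoint decoder."""
    name: str
    base: int
    size: int

    @property
    def end(self) -> int:
        # Exclusive end of the host physical address range.
        return self.base + self.size


def is_subset(child: Decoder, parent: Decoder) -> bool:
    """True if the child's range lies entirely inside the parent's range."""
    return parent.base <= child.base and child.end <= parent.end


def chain_is_nested(chain: list[Decoder]) -> bool:
    """Check a root -> endpoint chain: each decoder must nest in its parent."""
    return all(is_subset(c, p) for p, c in zip(chain, chain[1:]))


# First half of the memory-hole example: 3GB at 0x100000000.
good_chain = [
    Decoder("root-decoder-0", 0x100000000, 0xC0000000),
    Decoder("HB-decoder-0", 0x100000000, 0xC0000000),
    Decoder("ep-decoder-0", 0x100000000, 0xC0000000),
]

# Assumed bad case: the endpoint decoder also tries to span the hole.
bad_chain = good_chain[:2] + [Decoder("ep-decoder-0", 0x100000000, 0x140000000)]

print(chain_is_nested(good_chain), chain_is_nested(bad_chain))
```

A platform could run a check of this shape over its intended decoder
layout before committing it to firmware; the real validation happens in
the CXL driver during probe.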
Co-developed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Gregory Price (17):
cxl: update documentation structure in prep for new docs
cxl: docs - access-coordinates doc fixups
cxl: docs/devices - add cxl device and protocol reference
cxl: docs/platform/bios-and-efi documentation
cxl: docs/platform/acpi reference documentation
cxl: docs/platform/example-configs documentation
cxl: docs/linux - overview
cxl: docs/linux - early boot configuration
cxl: docs/linux - add cxl-driver theory of operation
cxl: docs/linux/cxl-driver - add example configurations
cxl: docs/linux/dax-driver documentation
cxl: docs/linux/memory-hotplug
cxl: docs/allocation/dax
cxl: docs/allocation/page-allocator
cxl: docs/allocation/reclaim
cxl: docs/allocation/hugepages
cxl: docs - add self-referencing cross-links
.../driver-api/cxl/allocation/dax.rst | 60 ++
.../driver-api/cxl/allocation/hugepages.rst | 32 +
.../cxl/allocation/page-allocator.rst | 85 +++
.../driver-api/cxl/allocation/reclaim.rst | 51 ++
.../driver-api/cxl/devices/device-types.rst | 165 +++++
Documentation/driver-api/cxl/index.rst | 45 +-
.../cxl/{ => linux}/access-coordinates.rst | 35 +-
.../driver-api/cxl/linux/cxl-driver.rst | 630 ++++++++++++++++++
.../driver-api/cxl/linux/dax-driver.rst | 43 ++
.../driver-api/cxl/linux/early-boot.rst | 137 ++++
.../example-configurations/hb-interleave.rst | 314 +++++++++
.../intra-hb-interleave.rst | 291 ++++++++
.../multi-interleave.rst | 401 +++++++++++
.../example-configurations/single-device.rst | 246 +++++++
.../driver-api/cxl/linux/memory-hotplug.rst | 78 +++
.../driver-api/cxl/linux/overview.rst | 103 +++
.../driver-api/cxl/platform/acpi.rst | 76 +++
.../driver-api/cxl/platform/acpi/cedt.rst | 62 ++
.../driver-api/cxl/platform/acpi/dsdt.rst | 28 +
.../driver-api/cxl/platform/acpi/hmat.rst | 32 +
.../driver-api/cxl/platform/acpi/slit.rst | 21 +
.../driver-api/cxl/platform/acpi/srat.rst | 44 ++
.../driver-api/cxl/platform/bios-and-efi.rst | 262 ++++++++
.../cxl/platform/example-configs.rst | 13 +
.../example-configurations/flexible.rst | 296 ++++++++
.../example-configurations/hb-interleave.rst | 107 +++
.../multi-dev-per-hb.rst | 90 +++
.../example-configurations/one-dev-per-hb.rst | 136 ++++
...ry-devices.rst => theory-of-operation.rst} | 10 +-
29 files changed, 3867 insertions(+), 26 deletions(-)
create mode 100644 Documentation/driver-api/cxl/allocation/dax.rst
create mode 100644 Documentation/driver-api/cxl/allocation/hugepages.rst
create mode 100644 Documentation/driver-api/cxl/allocation/page-allocator.rst
create mode 100644 Documentation/driver-api/cxl/allocation/reclaim.rst
create mode 100644 Documentation/driver-api/cxl/devices/device-types.rst
rename Documentation/driver-api/cxl/{ => linux}/access-coordinates.rst (84%)
create mode 100644 Documentation/driver-api/cxl/linux/cxl-driver.rst
create mode 100644 Documentation/driver-api/cxl/linux/dax-driver.rst
create mode 100644 Documentation/driver-api/cxl/linux/early-boot.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/single-device.rst
create mode 100644 Documentation/driver-api/cxl/linux/memory-hotplug.rst
create mode 100644 Documentation/driver-api/cxl/linux/overview.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/cedt.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/dsdt.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/hmat.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/slit.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/srat.rst
create mode 100644 Documentation/driver-api/cxl/platform/bios-and-efi.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configs.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
rename Documentation/driver-api/cxl/{memory-devices.rst => theory-of-operation.rst} (98%)
--
2.49.0
* [PATCH v3 01/17] cxl: update documentation structure in prep for new docs
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 22:46 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 02/17] cxl: docs - access-coordinates doc fixups Gregory Price
` (16 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Restructure the cxl folder to make adding docs per-page cleaner.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 16 +++++++++++++---
.../cxl/{ => linux}/access-coordinates.rst | 0
...emory-devices.rst => theory-of-operation.rst} | 10 +++++-----
3 files changed, 18 insertions(+), 8 deletions(-)
rename Documentation/driver-api/cxl/{ => linux}/access-coordinates.rst (100%)
rename Documentation/driver-api/cxl/{memory-devices.rst => theory-of-operation.rst} (98%)
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 965ba90e8fb7..fe1594dc6778 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -4,12 +4,22 @@
Compute Express Link
====================
+CXL device configuration has a complex handoff between platform (Hardware,
+BIOS, EFI), OS (early boot, core kernel, driver), and user policy decisions
+that have impacts on each other. The docs here break up configuration steps.
+
+.. toctree::
+ :maxdepth: 2
+ :caption: Overview
+
+ theory-of-operation
+ maturity-map
+
.. toctree::
:maxdepth: 1
+ :caption: Linux Kernel Configuration
- memory-devices
- access-coordinates
+ linux/access-coordinates
- maturity-map
.. only:: subproject and html
diff --git a/Documentation/driver-api/cxl/access-coordinates.rst b/Documentation/driver-api/cxl/linux/access-coordinates.rst
similarity index 100%
rename from Documentation/driver-api/cxl/access-coordinates.rst
rename to Documentation/driver-api/cxl/linux/access-coordinates.rst
diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/theory-of-operation.rst
similarity index 98%
rename from Documentation/driver-api/cxl/memory-devices.rst
rename to Documentation/driver-api/cxl/theory-of-operation.rst
index d732c42526df..32739e253453 100644
--- a/Documentation/driver-api/cxl/memory-devices.rst
+++ b/Documentation/driver-api/cxl/theory-of-operation.rst
@@ -1,9 +1,9 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
-===================================
-Compute Express Link Memory Devices
-===================================
+===============================================
+Compute Express Link Driver Theory of Operation
+===============================================
A Compute Express Link Memory Device is a CXL component that implements the
CXL.mem protocol. It contains some amount of volatile memory, persistent memory,
@@ -14,8 +14,8 @@ that optionally define a device's contribution to an interleaved address
range across multiple devices underneath a host-bridge or interleaved
across host-bridges.
-CXL Bus: Theory of Operation
-============================
+The CXL Bus
+===========
Similar to how a RAID driver takes disk objects and assembles them into a new
logical device, the CXL subsystem is tasked to take PCIe and ACPI objects and
assemble them into a CXL.mem decode topology. The need for runtime configuration
--
2.49.0
* [PATCH v3 02/17] cxl: docs - access-coordinates doc fixups
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
2025-05-12 16:21 ` [PATCH v3 01/17] cxl: update documentation structure in prep for new docs Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 22:47 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference Gregory Price
` (15 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet, Randy Dunlap, Bagas Sanjaya
Place the hierarchy diagram in access-coordinates.rst in a code block.
Fix a few grammar issues.
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../cxl/linux/access-coordinates.rst | 30 +++++++++----------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/Documentation/driver-api/cxl/linux/access-coordinates.rst b/Documentation/driver-api/cxl/linux/access-coordinates.rst
index b07950ea30c9..e408ecbc4038 100644
--- a/Documentation/driver-api/cxl/linux/access-coordinates.rst
+++ b/Documentation/driver-api/cxl/linux/access-coordinates.rst
@@ -26,20 +26,20 @@ There can be multiple switches under an RP. There can be multiple RPs under
a CXL Host Bridge (HB). There can be multiple HBs under a CXL Fixed Memory
Window Structure (CFMWS).
-An example hierarchy:
+An example hierarchy::
-> CFMWS 0
-> |
-> _________|_________
-> | |
-> ACPI0017-0 ACPI0017-1
-> GP0/HB0/ACPI0016-0 GP1/HB1/ACPI0016-1
-> | | | |
-> RP0 RP1 RP2 RP3
-> | | | |
-> SW 0 SW 1 SW 2 SW 3
-> | | | | | | | |
-> EP0 EP1 EP2 EP3 EP4 EP5 EP6 EP7
+ CFMWS 0
+ |
+ _________|_________
+ | |
+ ACPI0017-0 ACPI0017-1
+ GP0/HB0/ACPI0016-0 GP1/HB1/ACPI0016-1
+ | | | |
+ RP0 RP1 RP2 RP3
+ | | | |
+ SW 0 SW 1 SW 2 SW 3
+ | | | | | | | |
+ EP0 EP1 EP2 EP3 EP4 EP5 EP6 EP7
Computation for the example hierarchy:
@@ -82,8 +82,8 @@ this point all the bandwidths are aggregated per each host bridge, which is
also the index for the resulting xarray.
The next step is to take the min() of the per host bridge bandwidth and the
-bandwidth from the Generic Port (GP). The bandwidths for the GP is retrieved
-via ACPI tables SRAT/HMAT. The min bandwidth are aggregated under the same
+bandwidth from the Generic Port (GP). The bandwidths for the GP are retrieved
+via ACPI tables SRAT/HMAT. The minimum bandwidths are aggregated under the same
ACPI0017 device to form a new xarray.
Finally, the cxl_region_update_bandwidth() is called and the aggregated
--
2.49.0
* [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
2025-05-12 16:21 ` [PATCH v3 01/17] cxl: update documentation structure in prep for new docs Gregory Price
2025-05-12 16:21 ` [PATCH v3 02/17] cxl: docs - access-coordinates doc fixups Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 23:08 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 04/17] cxl: docs/platform/bios-and-efi documentation Gregory Price
` (14 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add a simple device primer sufficient to understand the theory
of operation documentation.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../driver-api/cxl/devices/device-types.rst | 165 ++++++++++++++++++
Documentation/driver-api/cxl/index.rst | 6 +
2 files changed, 171 insertions(+)
create mode 100644 Documentation/driver-api/cxl/devices/device-types.rst
diff --git a/Documentation/driver-api/cxl/devices/device-types.rst b/Documentation/driver-api/cxl/devices/device-types.rst
new file mode 100644
index 000000000000..c70564cf0be3
--- /dev/null
+++ b/Documentation/driver-api/cxl/devices/device-types.rst
@@ -0,0 +1,165 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Devices and Protocols
+=====================
+
+The type of CXL device (Memory, Accelerator, etc) dictates many configuration steps. This section
+covers some basic background on device types and on-device resources used by the platform and OS
+which impact configuration.
+
+Protocols
+=========
+
+There are three core protocols to CXL. For the purpose of this documentation,
+we will only discuss very high level definitions as the specific hardware
+details are largely abstracted away from Linux. See the CXL specification
+for more details.
+
+CXL.io
+------
+The basic interaction protocol, similar to PCIe configuration mechanisms.
+Typically used for initialization, configuration, and I/O access for anything
+other than memory (CXL.mem) or cache (CXL.cache) operations.
+
+The Linux CXL driver exposes access to .io functionality via the various sysfs
+interfaces and /dev/cxl/ devices (which exposes direct access to device
+mailboxes).
+
+CXL.cache
+---------
+The mechanism by which a device may coherently access and cache host memory.
+
+Largely transparent to Linux once configured.
+
+CXL.mem
+-------
+The mechanism by which the CPU may coherently access and cache device memory.
+
+Largely transparent to Linux once configured.
+
+
+Device Types
+============
+
+Type-1
+------
+
+A Type-1 CXL device:
+
+* Supports cxl.io and cxl.cache protocols
+* Implements a fully coherent cache
+* Allows Device-to-Host coherence and Host-to-Device snoops.
+* Does NOT have host-managed device memory (HDM)
+
+A typical example of a type-1 device is a Smart NIC - which may want to
+directly operate on host-memory (DMA) to store incoming packets. These
+devices largely rely on CPU-attached memory.
+
+Type-2
+------
+
+A Type-2 CXL Device:
+
+* Supports cxl.io, cxl.cache, and cxl.mem protocols
+* Optionally implements coherent cache and Host-Managed Device Memory
+* Is typically an accelerator device w/ high bandwidth memory.
+
+The primary difference between a type-1 and type-2 device is the presence
+of host-managed device memory, which allows the device to operate on a
+local memory bank - while the CPU still has coherent DMA to the same memory.
+
+This allows things like GPUs to expose their memory via DAX devices or file
+descriptors, allowing drivers and programs direct access to device memory
+rather than using block-transfer semantics.
+
+Type-3
+------
+
+A Type-3 CXL Device:
+
+* Supports cxl.io and cxl.mem
+* Implements Host-Managed Device Memory
+* May provide either Volatile or Persistent memory capacity (or both).
+
+A basic example of a type-3 device is a simple memory expander, whose
+local memory capacity is exposed to the CPU for access directly via
+basic coherent DMA.
+
+Switch
+------
+
+A CXL switch is a device capable of routing any CXL (and by extension, PCIe)
+protocol between upstream, downstream, or peer devices. Many devices, such
+as Multi-Logical Devices, imply the presence of switching in some manner.
+
+Logical Devices and Heads
+-------------------------
+
+A CXL device may present one or more "Logical Devices" to one or more hosts
+(via physical "Heads").
+
+A Single-Logical Device (SLD) is a device which presents a single device to
+one or more heads.
+
+A Multi-Logical Device (MLD) is a device which may present multiple devices
+to one or more heads.
+
+A Single-Headed Device exposes only a single physical connection.
+
+A Multi-Headed Device exposes multiple physical connections.
+
+MHSLD
+~~~~~
+A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical
+device to multiple heads which may be connected to one or more discrete
+hosts. An example of this would be a simple memory-pool which may be
+statically configured (prior to boot) to expose portions of its memory
+to Linux via the CEDT ACPI table.
+
+MHMLD
+~~~~~
+A Multi-Headed Multi-Logical Device (MHMLD) exposes multiple logical
+devices to multiple heads which may be connected to one or more discrete
+hosts. An example of this would be a Dynamic Capacity Device, which
+may be configured at runtime to expose portions of its memory to Linux.
+
+Example Devices
+===============
+
+Memory Expander
+---------------
+The simplest form of Type-3 device is a memory expander. A memory expander
+exposes Host-Managed Device Memory (HDM) to Linux. This memory may be
+Volatile or Non-Volatile (Persistent).
+
+Memory Expanders will typically be considered a form of Single-Headed,
+Single-Logical Device - as their form factor will typically be an add-in-card
+(AIC) or some other similar form-factor.
+
+The Linux CXL driver provides support for static or dynamic configuration of
+basic memory expanders. The platform may program decoders prior to OS init
+(e.g. auto-decoders), or the user may program the fabric if the platform
+defers these operations to the OS.
+
+Multiple Memory Expanders may be added to an external chassis and exposed to
+a host via a head attached to a CXL switch. This is a "memory pool", and
+would be considered an MHSLD or MHMLD depending on the management capabilities
+provided by the switch platform.
+
+As of v6.14, Linux does not provide a formalized interface to manage non-DCD
+MHSLD or MHMLD devices.
+
+Dynamic Capacity Device (DCD)
+-----------------------------
+
+A Dynamic Capacity Device is a Type-3 device which provides dynamic management
+of memory capacity. The basic premise of a DCD is to provide an allocator-like
+interface for physical memory capacity to a "Fabric Manager" (an external,
+privileged host with privileges to change configurations for other hosts).
+
+A DCD manages "Memory Extents", which may be volatile or persistent. Extents
+may also be exclusive to a single host or shared across multiple hosts.
+
+As of v6.14, Linux does not provide a formalized interface to manage DCD
+devices, however there is active work on LKML targeting a future release.
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index fe1594dc6778..a2d1c5b18a8a 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -15,6 +15,12 @@ that have impacts on each other. The docs here break up configurations steps.
theory-of-operation
maturity-map
+.. toctree::
+ :maxdepth: 2
+ :caption: Device Reference
+
+ devices/device-types
+
.. toctree::
:maxdepth: 1
:caption: Linux Kernel Configuration
--
2.49.0
* [PATCH v3 04/17] cxl: docs/platform/bios-and-efi documentation
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (2 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 23:31 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 05/17] cxl: docs/platform/acpi reference documentation Gregory Price
` (13 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add some docs on CXL configurations done in BIOS/EFI that affect
Linux configuration - information vendors may care to consider.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
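Reviewer note (below the cut line, so not part of the commit): the
memory-hole arithmetic in bios-and-efi.rst can be sanity-checked with a
short sketch. This is illustrative Python, not kernel code. The name
usable_span and the rounding rule are assumptions that approximate the
"uniform size and alignment" requirement described in the document; the
region addresses are the ones from the memory map in the doc.

```python
GIB = 1 << 30


def usable_span(base, size, block_size):
    """Return the (start, end) block-aligned span inside [base, base + size).

    Assumption: only whole, naturally aligned memory blocks can be
    hotplugged, so the start rounds up and the end rounds down.
    Returns None when no whole block fits.
    """
    start = (base + block_size - 1) // block_size * block_size
    end = (base + size) // block_size * block_size
    if end <= start:
        return None
    return (start, end)


# The two CXL chunks on either side of the memory hole.
regions = [(0x100000000, 0xC0000000), (0x200000000, 0x40000000)]

for block in (2 * GIB, 1 * GIB):
    for base, size in regions:
        span = usable_span(base, size, block)
        text = "stranded" if span is None else f"{span[0]:#x}-{span[1]:#x}"
        print(f"block={block // GIB}GB base={base:#x}: {text}")
```

With 2GB blocks only 0x100000000-0x180000000 is usable and the rest is
stranded; with 1GB blocks both chunks map in full, which matches the two
cases described in the Memory Holes section.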
Documentation/driver-api/cxl/index.rst | 6 +
.../driver-api/cxl/platform/bios-and-efi.rst | 262 ++++++++++++++++++
2 files changed, 268 insertions(+)
create mode 100644 Documentation/driver-api/cxl/platform/bios-and-efi.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index a2d1c5b18a8a..ffa0462ad950 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -21,6 +21,12 @@ that have impacts on each other. The docs here break up configurations steps.
devices/device-types
+.. toctree::
+ :maxdepth: 2
+ :caption: Platform Configuration
+
+ platform/bios-and-efi
+
.. toctree::
:maxdepth: 1
:caption: Linux Kernel Configuration
diff --git a/Documentation/driver-api/cxl/platform/bios-and-efi.rst b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
new file mode 100644
index 000000000000..552a83992bcc
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
@@ -0,0 +1,262 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+BIOS/EFI Configuration
+======================
+
+BIOS and EFI are largely responsible for configuring static information about
+devices (or potential future devices) such that Linux can build the appropriate
+logical representations of these devices.
+
+At a high level, this is what occurs during this phase of configuration.
+
+* The bootloader starts the BIOS/EFI.
+
+* BIOS/EFI does early device probe to determine static configuration
+
+* BIOS/EFI creates ACPI Tables that describe static config for the OS
+
+* BIOS/EFI creates the system memory map (EFI Memory Map, E820, etc)
+
+* BIOS/EFI calls :code:`start_kernel` and begins the Linux Early Boot process.
+
+Much of what this section is concerned with is ACPI Table production and
+static memory map configuration. More detail on these tables can be found
+under Platform Configuration -> ACPI Table Reference.
+
+.. note::
+ Platform Vendors should read carefully, as this section has recommendations
+ on physical memory region size and alignment, memory holes, HDM interleave,
+ and what Linux expects of HDM decoders trying to work with these features.
+
+UEFI Settings
+=============
+If your platform supports it, the :code:`uefisettings` command can be used to
+read/write EFI settings. Changes will be reflected on the next reboot. Kexec
+is not a sufficient reboot.
+
+One notable configuration here is the EFI_MEMORY_SP (Specific Purpose) bit.
+When this is enabled, this bit tells Linux to defer management of a memory
+region to a driver (in this case, the CXL driver). Otherwise, the memory is
+treated as "normal memory", and is exposed to the page allocator during
+:code:`__init`.
+
+uefisettings examples
+---------------------
+
+:code:`uefisettings identify` ::
+
+ uefisettings identify
+
+ bios_vendor: xxx
+ bios_version: xxx
+ bios_release: xxx
+ bios_date: xxx
+ product_name: xxx
+ product_family: xxx
+ product_version: xxx
+
+On some AMD platforms, the :code:`EFI_MEMORY_SP` bit is set via the :code:`CXL
+Memory Attribute` field. This may be called something else on your platform.
+
+:code:`uefisettings get "CXL Memory Attribute"` ::
+
+ selector: xxx
+ ...
+ question: Question {
+ name: "CXL Memory Attribute",
+ answer: "Enabled",
+ ...
+ }
+
+Physical Memory Map
+===================
+
+Physical Address Region Alignment
+---------------------------------
+
+As of Linux v6.14, the hotplug memory system requires memory regions to be
+uniform in size and alignment. While the CXL specification allows for memory
+regions as small as 256MB, the supported memory block size and alignment for
+hotplugged memory is architecture-defined.
+
+A Linux memory block may be as small as 128MB and increase in powers of two.
+
+* On ARM, the default block size and alignment is either 128MB or 256MB.
+
+* On x86, the default block size is 256MB, and increases to 2GB as the
+ capacity of the system increases up to 64GB.
+
+For best support across versions, platform vendors should place CXL memory at
+a 2GB aligned base address, and regions should be 2GB aligned. This also helps
+prevent the creation of thousands of memory devices (one per block).
+
+Memory Holes
+------------
+
+Holes in the memory map are tricky. Consider a 4GB device located at base
+address 0x100000000, but with the following memory map ::
+
+ ---------------------
+ | 0x100000000 |
+ | CXL |
+ | 0x1BFFFFFFF |
+ ---------------------
+ | 0x1C0000000 |
+ | MEMORY HOLE |
+ | 0x1FFFFFFFF |
+ ---------------------
+ | 0x200000000 |
+ | CXL CONT. |
+ | 0x23FFFFFFF |
+ ---------------------
+
+There are two issues to consider:
+
+* decoder programming, and
+* memory block alignment.
+
+If your architecture requires 2GB uniform size and aligned memory blocks, the
+only capacity Linux is capable of mapping (as of v6.14) would be the capacity
+from `0x100000000-0x180000000`. The remaining capacity will be stranded, as
+it is not of 2GB aligned length.
+
+Assuming your architecture and memory configuration allows 1GB memory blocks,
+this memory map is supported and this should be presented as multiple CFMWS
+in the CEDT that describe each side of the memory hole separately - along with
+matching decoders.
+
+Multiple decoders can (and should) be used to manage such a memory hole (see
+below), but each chunk of a memory hole should be aligned to a reasonable block
+size (larger alignment is always better). If you intend to have memory holes
+in the memory map, expect to use one decoder per contiguous chunk of host
+physical memory.
+
+As of v6.14, Linux does not provide support for memory hotplug of multiple
+physical memory regions separated by a memory hole described by a single
+HDM decoder.
+
+
+Decoder Programming
+===================
+If BIOS/EFI intends to program the decoders to be statically configured,
+there are a few things to consider to avoid major pitfalls that will
+prevent Linux compatibility. Some of these recommendations are not
+required "per the specification", but Linux makes no guarantees of support
+otherwise.
+
+
+Translation Point
+-----------------
+Per the specification, the only decoders which **TRANSLATE** Host Physical
+Address (HPA) to Device Physical Address (DPA) are the **Endpoint Decoders**.
+All other decoders in the fabric are intended to route accesses without
+translating the addresses.
+
+This is heavily implied by the specification, see: ::
+
+ CXL Specification 3.1
+ 8.2.4.20: CXL HDM Decoder Capability Structure
+ - Implementation Note: CXL Host Bridge and Upstream Switch Port Decoder Flow
+ - Implementation Note: Device Decoder Logic
+
+Given this, Linux makes a strong assumption that decoders between CPU and
+endpoint will all be programmed with address ranges that are subsets of
+their parent decoder.
+
+Due to some ambiguity in how Architecture, ACPI, PCI, and CXL specifications
+"hand off" responsibility between domains, some early adopting platforms
+attempted to do translation at the originating memory controller or host
+bridge. This configuration requires a platform specific extension to the
+driver and is not officially endorsed - despite being supported.
+
+It is *highly recommended* **NOT** to do this; otherwise, you are on your own
+to implement driver support for your platform.
+
+Interleave and Configuration Flexibility
+----------------------------------------
+If providing cross-host-bridge interleave, a CFMWS entry in the CEDT must be
+presented with target host-bridges for the interleaved device sets (there may
+be multiple behind each host bridge).
+
+If providing intra-host-bridge interleaving, only 1 CFMWS entry in the CEDT is
+required for that host bridge - if it covers the entire capacity of the devices
+behind the host bridge.
+
+If intending to provide users flexibility in programming decoders beyond the
+root, you may want to provide multiple CFMWS entries in the CEDT intended for
+different purposes. For example, you may want to consider adding:
+
+1) A CFMWS entry to cover all interleavable host bridges.
+2) A CFMWS entry to cover all devices on a single host bridge.
+3) A CFMWS entry to cover each device.
+
+A platform may choose to add all of these, or change the mode based on a BIOS
+setting. For each CFMWS entry, Linux expects descriptions of the described
+memory regions in the SRAT to determine the number of NUMA nodes it should
+reserve during early boot / init.
+
+As of v6.14, Linux will create a NUMA node for each CEDT CFMWS entry, even if
+a matching SRAT entry does not exist; however, this is not guaranteed in the
+future and such a configuration should be avoided.
+
+Memory Holes
+------------
+If your platform includes memory holes interspersed between your CXL memory, it
+is recommended to utilize multiple decoders to cover these regions of memory,
+rather than try to program the decoders to accept the entire range and expect
+Linux to manage the overlap.
+
+For example, consider the Memory Hole described above ::
+
+ ---------------------
+ | 0x100000000 |
+ | CXL |
+ | 0x1BFFFFFFF |
+ ---------------------
+ | 0x1C0000000 |
+ | MEMORY HOLE |
+ | 0x1FFFFFFFF |
+ ---------------------
+ | 0x200000000 |
+ | CXL CONT. |
+ | 0x23FFFFFFF |
+ ---------------------
+
+Assuming this is provided by a single device attached directly to a host bridge,
+Linux would expect the following decoder programming ::
+
+ ----------------------- -----------------------
+ | root-decoder-0 | | root-decoder-1 |
+ | base: 0x100000000 | | base: 0x200000000 |
+ | size: 0xC0000000 | | size: 0x40000000 |
+ ----------------------- -----------------------
+ | |
+ ----------------------- -----------------------
+ | HB-decoder-0 | | HB-decoder-1 |
+ | base: 0x100000000 | | base: 0x200000000 |
+ | size: 0xC0000000 | | size: 0x40000000 |
+ ----------------------- -----------------------
+ | |
+ ----------------------- -----------------------
+ | ep-decoder-0 | | ep-decoder-1 |
+ | base: 0x100000000 | | base: 0x200000000 |
+ | size: 0xC0000000 | | size: 0x40000000 |
+ ----------------------- -----------------------
+
+This requires a CEDT with two CFMWS entries describing the above root decoders.
+
+Linux makes no guarantee of support for strange memory hole situations.
+
+Multi-Media Devices
+-------------------
+The CFMWS field of the CEDT has special restriction bits which describe whether
+the described memory region allows volatile or persistent memory (or both). If
+the platform intends to support either:
+
+1) A device with multiple media types, or
+2) Using a persistent memory device as normal memory
+
+the platform may wish to create multiple CEDT CFMWS entries to describe the same
+memory, with the intent of allowing the end user flexibility in how that memory
+is configured. Linux does not presently have strong requirements in this area.
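The decoder split shown in the Memory Holes section can be reproduced with a little arithmetic. The following is a minimal sketch, not kernel code: the helper name `split_around_holes` is hypothetical, and it assumes inclusive address bounds and holes that lie inside the CXL range.

```python
def split_around_holes(base, end, holes):
    """Return (base, size) windows covering [base, end] minus the holes.

    All bounds are inclusive physical addresses; holes is a list of
    (hole_start, hole_end) pairs assumed to lie inside [base, end].
    """
    windows = []
    cursor = base
    for hole_start, hole_end in sorted(holes):
        if hole_start > cursor:
            # Memory between the cursor and the hole gets its own window.
            windows.append((cursor, hole_start - cursor))
        cursor = max(cursor, hole_end + 1)
    if cursor <= end:
        windows.append((cursor, end - cursor + 1))
    return windows


# Layout taken from the Memory Holes diagram.
windows = split_around_holes(
    0x100000000, 0x23FFFFFFF, [(0x1C0000000, 0x1FFFFFFFF)]
)
for win_base, win_size in windows:
    print(f"base: {win_base:#x}  size: {win_size:#x}")
```

Each resulting window maps to one root decoder (and one CFMWS entry) in the diagram, which is why two decoders are needed for a single hole.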
--
2.49.0
* [PATCH v3 05/17] cxl: docs/platform/acpi reference documentation
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (3 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 04/17] cxl: docs/platform/bios-and-efi documentation Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 23:49 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 06/17] cxl: docs/platform/example-configs documentation Gregory Price
` (12 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add basic ACPI table information needed to understand the CXL
driver probe process.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../driver-api/cxl/platform/acpi.rst | 76 +++++++++++++++++++
.../driver-api/cxl/platform/acpi/cedt.rst | 62 +++++++++++++++
.../driver-api/cxl/platform/acpi/dsdt.rst | 28 +++++++
.../driver-api/cxl/platform/acpi/hmat.rst | 32 ++++++++
.../driver-api/cxl/platform/acpi/slit.rst | 21 +++++
.../driver-api/cxl/platform/acpi/srat.rst | 44 +++++++++++
7 files changed, 264 insertions(+)
create mode 100644 Documentation/driver-api/cxl/platform/acpi.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/cedt.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/dsdt.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/hmat.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/slit.rst
create mode 100644 Documentation/driver-api/cxl/platform/acpi/srat.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index ffa0462ad950..336322dc35a0 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -26,6 +26,7 @@ that have impacts on each other. The docs here break up configurations steps.
:caption: Platform Configuration
platform/bios-and-efi
+ platform/acpi
.. toctree::
:maxdepth: 1
diff --git a/Documentation/driver-api/cxl/platform/acpi.rst b/Documentation/driver-api/cxl/platform/acpi.rst
new file mode 100644
index 000000000000..ee7e6bd4c43d
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/acpi.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+ACPI Tables
+===========
+
+ACPI is the "Advanced Configuration and Power Interface", which is a standard
+that defines how platforms and OS manage power and configure computer hardware.
+For the purpose of this theory of operation, when referring to "ACPI" we will
+usually refer to "ACPI Tables" - which are the way a platform (BIOS/EFI)
+communicates static configuration information to the operating system.
+
+The following ACPI tables contain *static* configuration and performance data
+about CXL devices.
+
+.. toctree::
+ :maxdepth: 1
+
+ acpi/cedt.rst
+ acpi/srat.rst
+ acpi/hmat.rst
+ acpi/slit.rst
+ acpi/dsdt.rst
+
+The SRAT table may also contain generic port/initiator content that is intended
+to describe the generic port, but not information about the rest of the path to
+the endpoint.
+
+Linux uses these tables to configure kernel resources for statically configured
+(by BIOS/EFI) CXL devices, such as:
+
+- NUMA nodes
+- Memory Tiers
+- NUMA Abstract Distances
+- SystemRAM Memory Regions
+- Weighted Interleave Node Weights
+
+ACPI Debugging
+==============
+
+The :code:`acpidump -b` command dumps the ACPI tables into binary format.
+
+The :code:`iasl -d` command disassembles the files into human readable format.
+
+Example :code:`acpidump -b && iasl -d cedt.dat` ::
+
+ [000h 0000 4] Signature : "CEDT" [CXL Early Discovery Table]
+
+Common Issues
+-------------
+Most failures described here result in a failure of the driver to surface
+memory as a DAX device and/or kmem.
+
+* CEDT CFMWS targets list UIDs do not match CEDT CHBS UIDs.
+* CEDT CFMWS targets list UIDs do not match DSDT CXL Host Bridge UIDs.
+* CEDT CFMWS Restriction Bits are not correct.
+* CEDT CFMWS Memory regions are poorly aligned.
+* CEDT CFMWS Memory regions span a platform memory hole.
+* CEDT CHBS UIDs do not match DSDT CXL Host Bridge UIDs.
+* CEDT CHBS Specification version is incorrect.
+* SRAT is missing regions described in CEDT CFMWS.
+
+ * Result: failure to create a NUMA node for the region, or
+ region is placed in wrong node.
+
+* HMAT is missing data for regions described in CEDT CFMWS.
+
+ * Result: NUMA node being placed in the wrong memory tier.
+
+* SLIT has bad data.
+
+ * Result: Lots of performance mechanisms in the kernel will be very unhappy.
+
+All of these issues will appear to users as if the driver is failing to
+support CXL - when in reality they are all the failure of a platform to
+configure the ACPI tables correctly.
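Several of the common issues listed in acpi.rst are UID mismatches, which are easy to check once the values have been read out of the disassembled tables. The sketch below is illustrative only: `check_uids` is a hypothetical helper, and the UID lists are typed in by hand rather than parsed from `iasl` output.

```python
def check_uids(cfmws_targets, chbs_uids, dsdt_uids):
    """Return a list of human-readable UID mismatches (empty if consistent)."""
    problems = []
    chbs, dsdt = set(chbs_uids), set(dsdt_uids)
    for uid in sorted(set(cfmws_targets) - chbs):
        problems.append(f"CFMWS target {uid:#x} has no CHBS entry")
    for uid in sorted(set(cfmws_targets) - dsdt):
        problems.append(f"CFMWS target {uid:#x} has no DSDT host bridge")
    for uid in sorted(chbs ^ dsdt):
        problems.append(f"UID {uid:#x} is not in both CHBS and DSDT")
    return problems


# Hypothetical values: a consistent platform, then one with a wrong DSDT _UID.
good = check_uids([0x7, 0x6], [0x7, 0x6], [0x7, 0x6])
bad = check_uids([0x7, 0x6], [0x7, 0x6], [0x7, 0x5])
for line in bad:
    print(line)
```

An empty result only means the UIDs agree; it says nothing about alignment, restriction bits, or the other issues in the list.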
diff --git a/Documentation/driver-api/cxl/platform/acpi/cedt.rst b/Documentation/driver-api/cxl/platform/acpi/cedt.rst
new file mode 100644
index 000000000000..1d9c9d3592dc
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/acpi/cedt.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+CEDT - CXL Early Discovery Table
+================================
+
+The CXL Early Discovery Table is generated by BIOS to describe the CXL memory
+regions configured at boot by the BIOS.
+
+CHBS
+====
+The CXL Host Bridge Structure describes CXL host bridges. Other than describing
+device register information, it reports the specific host bridge UID for this
+host bridge. These host bridge IDs will be referenced in other tables.
+
+Example ::
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000007 <- Host bridge _UID
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010370400000
+ Register length : 0000000000010000
+
+CFMWS
+=====
+The CXL Fixed Memory Window structure describes a memory region associated
+with one or more CXL host bridges (as described by the CHBS). It additionally
+describes any inter-host-bridge interleave configuration that may have been
+programmed by BIOS.
+
+Example ::
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 000000C050000000 <- Memory Region
+ Window size : 0000003CA0000000
+ Interleave Members (2^n) : 01 <- Interleave configuration
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007 <- Host Bridge _UID
+ Next Target : 00000006 <- Host Bridge _UID
+
+The restriction field dictates what this SPA range may be used for (memory type,
+volatile vs persistent, etc). One or more bits may be set. ::
+
+ Bit[0]: CXL Type 2 Memory
+ Bit[1]: CXL Type 3 Memory
+ Bit[2]: Volatile Memory
+ Bit[3]: Persistent Memory
+ Bit[4]: Fixed Config (HPA cannot be re-used)
+
+INTRA-host-bridge interleave (multiple devices on one host bridge) is NOT
+reported in this structure, and is solely defined via CXL device decoder
+programming (host bridge and endpoint decoders).
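To read a Restrictions value such as the `0006` in the CFMWS example, test each bit against the list given in cedt.rst. This is a minimal sketch; the bit names are copied from that list, while `decode_restrictions` is a hypothetical helper name.

```python
# Bit positions as listed in the CFMWS restriction field description.
RESTRICTION_BITS = {
    0: "CXL Type 2 Memory",
    1: "CXL Type 3 Memory",
    2: "Volatile Memory",
    3: "Persistent Memory",
    4: "Fixed Config",
}


def decode_restrictions(value):
    """Return the names of all restriction bits set in value."""
    return [name for bit, name in RESTRICTION_BITS.items() if value & (1 << bit)]


decoded = decode_restrictions(0x0006)
print(decoded)
```

For `0x0006` bits 1 and 2 are set, so the window allows Type 3, volatile memory only.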
diff --git a/Documentation/driver-api/cxl/platform/acpi/dsdt.rst b/Documentation/driver-api/cxl/platform/acpi/dsdt.rst
new file mode 100644
index 000000000000..b4583b01d67d
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/acpi/dsdt.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+DSDT - Differentiated System Description Table
+==============================================
+
+This table describes what peripherals a machine has.
+
+This table's UIDs for CXL devices - specifically host bridges, must be
+consistent with the contents of the CEDT, otherwise the CXL driver will
+fail to probe correctly.
+
+Example Compute Express Link Host Bridge ::
+
+ Scope (_SB)
+ {
+ Device (S0D0)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ Name (_CID, Package (0x02) // _CID: Compatible ID
+ {
+ EisaId ("PNP0A08") /* PCI Express Bus */,
+ EisaId ("PNP0A03") /* PCI Bus */
+ })
+ ...
+ Name (_UID, 0x05) // _UID: Unique ID
+ ...
+ }
diff --git a/Documentation/driver-api/cxl/platform/acpi/hmat.rst b/Documentation/driver-api/cxl/platform/acpi/hmat.rst
new file mode 100644
index 000000000000..095a26f02a37
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/acpi/hmat.rst
@@ -0,0 +1,32 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+HMAT - Heterogeneous Memory Attribute Table
+===========================================
+
+The Heterogeneous Memory Attributes Table contains information such as cache
+attributes and bandwidth and latency details for memory proximity domains.
+For the purpose of this document, we will only discuss the SLLBI entry.
+
+SLLBI
+=====
+The System Locality Latency and Bandwidth Information records latency and
+bandwidth information for proximity domains.
+
+This table is used by Linux to configure interleave weights and memory tiers.
+
+Example (Heavily truncated for brevity) ::
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 00 <- Latency
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Entry : 0080 <- DRAM LTC
+ Entry : 0100 <- CXL LTC
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 03 <- Bandwidth
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Entry : 1200 <- DRAM BW
+ Entry : 0200 <- CXL BW
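To get a feel for how the bandwidth entries in the hmat.rst example relate to interleave weights, the values can be reduced to their smallest integer ratio. This is an illustration only and is not the kernel's weight calculation; `bandwidth_ratio` is a hypothetical helper, and it assumes all entries share the same base unit (which the truncated example does not show).

```python
from functools import reduce
from math import gcd


def bandwidth_ratio(bandwidths):
    """Reduce per-node bandwidth values to their smallest integer ratio."""
    divisor = reduce(gcd, bandwidths.values())
    return {node: bw // divisor for node, bw in bandwidths.items()}


# Bandwidth entries from the example: node 0 is DRAM, node 1 is CXL.
weights = bandwidth_ratio({0: 0x1200, 1: 0x0200})
print(weights)
```

With the example values, DRAM reports nine times the bandwidth of the CXL node, so a bandwidth-proportional interleave would favour node 0 by that ratio.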
diff --git a/Documentation/driver-api/cxl/platform/acpi/slit.rst b/Documentation/driver-api/cxl/platform/acpi/slit.rst
new file mode 100644
index 000000000000..a56768e8fe41
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/acpi/slit.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+SLIT - System Locality Information Table
+========================================
+
+The system locality information table provides "abstract distances" between
+accessor and memory nodes. Nodes without initiators (CPUs) are infinitely (FF)
+distant from all other nodes.
+
+The abstract distance described in this table does not describe any real
+latency or bandwidth information.
+
+Example ::
+
+ Signature : "SLIT" [System Locality Information Table]
+ Localities : 0000000000000004
+ Locality 0 : 10 20 20 30
+ Locality 1 : 20 10 30 20
+ Locality 2 : FF FF 0A FF
+ Locality 3 : FF FF FF 0A
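The SLIT rows in slit.rst can be scanned for nodes that report FF to every other node, which per the description are the nodes without initiators. A minimal sketch, assuming the rows have already been copied out of the `iasl` output as strings; `memory_only_nodes` is a hypothetical helper.

```python
def memory_only_nodes(slit_rows):
    """Return indices of nodes whose distance to every other node is 0xFF."""
    matrix = [[int(tok, 16) for tok in row.split()] for row in slit_rows]
    result = []
    for node, row in enumerate(matrix):
        # Ignore the diagonal entry (distance from a node to itself).
        others = [dist for peer, dist in enumerate(row) if peer != node]
        if others and all(dist == 0xFF for dist in others):
            result.append(node)
    return result


# Rows from the SLIT example.
rows = ["10 20 20 30", "20 10 30 20", "FF FF 0A FF", "FF FF FF 0A"]
cpuless = memory_only_nodes(rows)
print(cpuless)
```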
diff --git a/Documentation/driver-api/cxl/platform/acpi/srat.rst b/Documentation/driver-api/cxl/platform/acpi/srat.rst
new file mode 100644
index 000000000000..56d7bbb18c3b
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/acpi/srat.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+SRAT - Static Resource Affinity Table
+=====================================
+
+The System/Static Resource Affinity Table describes resource (CPU, Memory)
+affinity to "Proximity Domains". This table is technically optional, but for
+performance information (see "HMAT") to be enumerated by Linux it must be
+present.
+
+There is a careful dance between the CEDT and SRAT tables and how NUMA nodes are
+created. If things don't look quite the way you expect - check the SRAT Memory
+Affinity entries and CEDT CFMWS to determine what your platform actually
+supports in terms of flexible topologies.
+
+The SRAT may statically assign portions of a CFMWS SPA range to specific
+proximity domains. See Linux NUMA node creation for more information about how
+this presents in the NUMA topology.
+
+Proximity Domain
+================
+A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a 1-to-1
+mapping is not guaranteed. There are scenarios where "Proximity Domain 4" may
+map to "NUMA Node 3", for example. (See "NUMA Node Creation")
+
+Memory Affinity
+===============
+Generally speaking, if a host does any amount of CXL fabric (decoder)
+programming in BIOS - an SRAT entry for that memory needs to be present.
+
+Example ::
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000001 <- NUMA Node 1
+ Reserved1 : 0000
+ Base Address : 000000C050000000 <- Physical Memory Region
+ Address Length : 0000003CA0000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
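One way to spot a CFMWS window that the SRAT does not describe is to compare address ranges directly. The sketch below is a simplification: it treats a window as described only when a single Memory Affinity entry contains it, even though the SRAT may split a window across several proximity domains. `uncovered_windows` is a hypothetical helper and the ranges are entered by hand.

```python
def uncovered_windows(cfmws_windows, srat_ranges):
    """Return CFMWS (base, size) windows not contained in any one SRAT range."""
    missing = []
    for base, size in cfmws_windows:
        end = base + size
        covered = any(
            s_base <= base and end <= s_base + s_size
            for s_base, s_size in srat_ranges
        )
        if not covered:
            missing.append((base, size))
    return missing


# CFMWS window and SRAT Memory Affinity range from the examples.
cfmws = [(0xC050000000, 0x3CA0000000)]
srat = [(0xC050000000, 0x3CA0000000)]
described = uncovered_windows(cfmws, srat)
undescribed = uncovered_windows(cfmws, [])
print(described, undescribed)
```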
--
2.49.0
* [PATCH v3 06/17] cxl: docs/platform/example-configs documentation
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (4 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 05/17] cxl: docs/platform/acpi reference documentation Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-13 0:05 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 07/17] cxl: docs/linux - overview Gregory Price
` (11 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add example ACPI Table configurations for different sample platforms.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../cxl/platform/example-configs.rst | 13 +
.../example-configurations/flexible.rst | 296 ++++++++++++++++++
.../example-configurations/hb-interleave.rst | 107 +++++++
.../multi-dev-per-hb.rst | 90 ++++++
.../example-configurations/one-dev-per-hb.rst | 136 ++++++++
6 files changed, 643 insertions(+)
create mode 100644 Documentation/driver-api/cxl/platform/example-configs.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 336322dc35a0..6a5fb7e00c52 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -27,6 +27,7 @@ that have impacts on each other. The docs here break up configurations steps.
platform/bios-and-efi
platform/acpi
+ platform/example-configs
.. toctree::
:maxdepth: 1
diff --git a/Documentation/driver-api/cxl/platform/example-configs.rst b/Documentation/driver-api/cxl/platform/example-configs.rst
new file mode 100644
index 000000000000..90a10d7473c6
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/example-configs.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Example Platform Configurations
+###############################
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Contents
+
+ example-configurations/one-dev-per-hb.rst
+ example-configurations/multi-dev-per-hb.rst
+ example-configurations/hb-interleave.rst
+ example-configurations/flexible.rst
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst b/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
new file mode 100644
index 000000000000..e39daba65fa0
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
@@ -0,0 +1,296 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Flexible Presentation
+=====================
+This system has a single socket with two CXL host bridges. Each host bridge
+has two CXL memory expanders with 4GB of memory each (16GB total).
+
+On this system, the platform designer wanted to provide the user flexibility
+to configure the memory devices in various interleave or NUMA node
+configurations. So they provided every combination.
+
+Things to note:
+
+* Cross-Bridge interleave is described in one CFMWS that covers all capacity.
+* One CFMWS is also described per-host bridge.
+* One CFMWS is also described per-device.
+* This SRAT describes one node for each of the above CFMWS.
+* The HMAT describes performance for each node in the SRAT.
+
+CEDT ::
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000007
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010370400000
+ Register length : 0000000000010000
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000006
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010380800000
+ Register length : 0000000000010000
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000001000000000
+ Window size : 0000000400000000
+ Interleave Members (2^n) : 01
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+ Second Target : 00000006
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000002000000000
+ Window size : 0000000200000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000002200000000
+ Window size : 0000000200000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000006
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000003000000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000003100000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000003200000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000006
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000003300000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000006
+
+SRAT ::
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000001
+ Reserved1 : 0000
+ Base Address : 0000001000000000
+ Address Length : 0000000400000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000002
+ Reserved1 : 0000
+ Base Address : 0000002000000000
+ Address Length : 0000000200000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000003
+ Reserved1 : 0000
+ Base Address : 0000002200000000
+ Address Length : 0000000200000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000004
+ Reserved1 : 0000
+ Base Address : 0000003000000000
+ Address Length : 0000000100000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000005
+ Reserved1 : 0000
+ Base Address : 0000003100000000
+ Address Length : 0000000100000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000006
+ Reserved1 : 0000
+ Base Address : 0000003200000000
+ Address Length : 0000000100000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000007
+ Reserved1 : 0000
+ Base Address : 0000003300000000
+ Address Length : 0000000100000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+HMAT ::
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 00 [Latency]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Target Proximity Domain List : 00000002
+ Target Proximity Domain List : 00000003
+ Target Proximity Domain List : 00000004
+ Target Proximity Domain List : 00000005
+ Target Proximity Domain List : 00000006
+ Target Proximity Domain List : 00000007
+ Entry : 0080
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 03 [Bandwidth]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Target Proximity Domain List : 00000002
+ Target Proximity Domain List : 00000003
+ Target Proximity Domain List : 00000004
+ Target Proximity Domain List : 00000005
+ Target Proximity Domain List : 00000006
+ Target Proximity Domain List : 00000007
+ Entry : 1200
+ Entry : 0400
+ Entry : 0200
+ Entry : 0200
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+ Entry : 0100
+
+SLIT ::
+
+ Signature : "SLIT" [System Locality Information Table]
+ Localities : 0000000000000008
+ Locality 0 : 10 20 20 20 20 20 20 20
+ Locality 1 : FF 0A FF FF FF FF FF FF
+ Locality 2 : FF FF 0A FF FF FF FF FF
+ Locality 3 : FF FF FF 0A FF FF FF FF
+ Locality 4 : FF FF FF FF 0A FF FF FF
+ Locality 5 : FF FF FF FF FF 0A FF FF
+ Locality 6 : FF FF FF FF FF FF 0A FF
+ Locality 7 : FF FF FF FF FF FF FF 0A
+
+DSDT ::
+
+ Scope (_SB)
+ {
+ Device (S0D0)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x07) // _UID: Unique ID
+ }
+ ...
+ Device (S0D5)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x06) // _UID: Unique ID
+ }
+ }
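Because the flexible configuration presents the same devices through several windows, it is worth confirming that the windows do not overlap in host physical address space and that their sizes match the intended groupings. A minimal sketch using the CFMWS values from this example; `find_overlaps` is a hypothetical helper.

```python
GIB = 1 << 30


def find_overlaps(windows):
    """Return adjacent (sorted by base) window pairs whose ranges intersect.

    If any two windows overlap, at least one adjacent pair overlaps, so an
    empty result means the whole set is overlap-free.
    """
    ordered = sorted(windows)
    overlaps = []
    for (a_base, a_size), (b_base, b_size) in zip(ordered, ordered[1:]):
        if a_base + a_size > b_base:
            overlaps.append(((a_base, a_size), (b_base, b_size)))
    return overlaps


# (Window base address, Window size) for each CFMWS in the CEDT listing.
windows = [
    (0x1000000000, 0x400000000),  # cross-bridge interleave, all devices
    (0x2000000000, 0x200000000),  # host bridge 7
    (0x2200000000, 0x200000000),  # host bridge 6
    (0x3000000000, 0x100000000),  # single device
    (0x3100000000, 0x100000000),  # single device
    (0x3200000000, 0x100000000),  # single device
    (0x3300000000, 0x100000000),  # single device
]
sizes_gib = [size // GIB for _, size in windows]
print(find_overlaps(windows), sizes_gib)
```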
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst b/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
new file mode 100644
index 000000000000..ce07e6162f26
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Cross-Host-Bridge Interleave
+============================
+This system has a single socket with two CXL host bridges. Each host bridge
+has a single CXL memory expander with 4GB of memory.
+
+Things to note:
+
+* Cross-Bridge interleave is described.
+* The expanders are described by a single CFMWS.
+* This SRAT describes one node for both host bridges.
+* The HMAT describes a single node's performance.
+
+CEDT ::
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000007
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010370400000
+ Register length : 0000000000010000
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000006
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010380800000
+ Register length : 0000000000010000
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000001000000000
+ Window size : 0000000200000000
+ Interleave Members (2^n) : 01
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+ Second Target : 00000006
+
+SRAT ::
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000001
+ Reserved1 : 0000
+ Base Address : 0000001000000000
+ Address Length : 0000000200000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+HMAT ::
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 00 [Latency]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Entry : 0080
+ Entry : 0100
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 03 [Bandwidth]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Entry : 1200
+ Entry : 0400
+
+SLIT ::
+
+ Signature : "SLIT" [System Locality Information Table]
+ Localities : 0000000000000002
+ Locality 0 : 10 20
+ Locality 1 : FF 0A
+
+DSDT ::
+
+ Scope (_SB)
+ {
+ Device (S0D0)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x07) // _UID: Unique ID
+ }
+ ...
+ Device (S0D5)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x06) // _UID: Unique ID
+ }
+ }
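The cross-host-bridge interleave described by the CFMWS in this example can be modelled as a simple modulo over the target list. This is a simplified sketch under stated assumptions, not the driver's decode logic: it assumes Interleave Arithmetic `00` means standard modulo, that the interleave ways are 2^n as labelled in the dump, and that a Granularity of 0 encodes 256 bytes (doubling for each increment). `interleave_target` is a hypothetical helper.

```python
def interleave_target(hpa, window_base, targets, ways_exp, granularity_exp):
    """Return the host bridge _UID that a host physical address decodes to."""
    ways = 1 << ways_exp                  # assumed: ways = 2^n
    granularity = 256 << granularity_exp  # assumed: 0 encodes 256 bytes
    if len(targets) != ways:
        raise ValueError("target list length must equal interleave ways")
    offset = hpa - window_base
    return targets[(offset // granularity) % ways]


# Values from the CFMWS: base 0x1000000000, 2 ways, targets 7 then 6.
base = 0x1000000000
first = interleave_target(base, base, [0x7, 0x6], 1, 0)
second = interleave_target(base + 0x100, base, [0x7, 0x6], 1, 0)
third = interleave_target(base + 0x200, base, [0x7, 0x6], 1, 0)
print(first, second, third)
```

Under these assumptions consecutive 256-byte chunks alternate between host bridge 7 and host bridge 6, which is why a single SRAT entry and a single node describe both expanders.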
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst b/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
new file mode 100644
index 000000000000..6adf7c639490
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Multiple Devices per Host Bridge
+================================
+
+In this example system we will have a single socket and one CXL host bridge.
+There are two CXL memory expanders with 4GB each attached to the host bridge.
+
+Things to note:
+
+* Intra-Bridge interleave is not described here.
+* The expanders are described by a single CEDT/CFMWS.
+* This CEDT/SRAT describes one node for both devices.
+* There is only one proximity domain in the HMAT for both devices.
+
+CEDT ::
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000007
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010370400000
+ Register length : 0000000000010000
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000001000000000
+ Window size : 0000000200000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+
+SRAT ::
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000001
+ Reserved1 : 0000
+ Base Address : 0000001000000000
+ Address Length : 0000000200000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+HMAT ::
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 00 [Latency]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Entry : 0080
+ Entry : 0100
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 03 [Bandwidth]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Entry : 1200
+ Entry : 0200
+
+SLIT ::
+
+ Signature : "SLIT" [System Locality Information Table]
+ Localities : 0000000000000002
+ Locality 0 : 10 20
+ Locality 1 : FF 0A
+
+DSDT ::
+
+ Scope (_SB)
+ {
+ Device (S0D0)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x07) // _UID: Unique ID
+ }
+ ...
+ }
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst b/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
new file mode 100644
index 000000000000..b89ba3cab98f
--- /dev/null
+++ b/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
@@ -0,0 +1,136 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+One Device per Host Bridge
+==========================
+
+This system has a single socket with two CXL host bridges. Each host bridge
+has a single CXL memory expander with 4GB of memory.
+
+Things to note:
+
+* Cross-Bridge interleave is not being used.
+* The expanders are in two separate but adjacent memory regions.
+* This CEDT/SRAT describes one node per device.
+* The expanders have the same performance and will be in the same memory tier.
+
+CEDT ::
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000007
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010370400000
+ Register length : 0000000000010000
+
+ Subtable Type : 00 [CXL Host Bridge Structure]
+ Reserved : 00
+ Length : 0020
+ Associated host bridge : 00000006
+ Specification version : 00000001
+ Reserved : 00000000
+ Register base : 0000010380800000
+ Register length : 0000000000010000
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000001000000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000007
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Reserved : 00
+ Length : 002C
+ Reserved : 00000000
+ Window base address : 0000001100000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ Reserved : 0000
+ Granularity : 00000000
+ Restrictions : 0006
+ QtgId : 0001
+ First Target : 00000006
+
+SRAT ::
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000001
+ Reserved1 : 0000
+ Base Address : 0000001000000000
+ Address Length : 0000000100000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+ Subtable Type : 01 [Memory Affinity]
+ Length : 28
+ Proximity Domain : 00000002
+ Reserved1 : 0000
+ Base Address : 0000001100000000
+ Address Length : 0000000100000000
+ Reserved2 : 00000000
+ Flags (decoded below) : 0000000B
+ Enabled : 1
+ Hot Pluggable : 1
+ Non-Volatile : 0
+
+HMAT ::
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 00 [Latency]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Target Proximity Domain List : 00000002
+ Entry : 0080
+ Entry : 0100
+ Entry : 0100
+
+ Structure Type : 0001 [SLLBI]
+ Data Type : 03 [Bandwidth]
+ Target Proximity Domain List : 00000000
+ Target Proximity Domain List : 00000001
+ Target Proximity Domain List : 00000002
+ Entry : 1200
+ Entry : 0200
+ Entry : 0200
+
+SLIT ::
+
+ Signature : "SLIT" [System Locality Information Table]
+ Localities : 0000000000000003
+ Locality 0 : 10 20 20
+ Locality 1 : FF 0A FF
+ Locality 2 : FF FF 0A
+
+DSDT ::
+
+ Scope (_SB)
+ {
+ Device (S0D0)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x07) // _UID: Unique ID
+ }
+ ...
+ Device (S0D5)
+ {
+ Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
+ ...
+ Name (_UID, 0x06) // _UID: Unique ID
+ }
+ }
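+
+As a quick consistency check on the tables above (a worked sketch using only
+the values shown, not additional platform requirements), each CFMWS window
+matches one SRAT Memory Affinity range, and the two windows are contiguous:
+
+.. math::
+
+   \mathtt{0x1000000000} + \mathtt{0x100000000} = \mathtt{0x1100000000}
+
+Each window is :math:`\mathtt{0x100000000} = 2^{32}` bytes (4GB), so the first
+window ends exactly where the second begins.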
--
2.49.0
* [PATCH v3 07/17] cxl: docs/linux - overview
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (5 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 06/17] cxl: docs/platform/example-configs documentation Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-13 0:09 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 08/17] cxl: docs/linux - early boot configuration Gregory Price
` (10 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add type-3 device configuration overview that explains the probe
process for a type-3 device from early-boot through memory-hotplug.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 3 +-
.../driver-api/cxl/linux/overview.rst | 103 ++++++++++++++++++
2 files changed, 105 insertions(+), 1 deletion(-)
create mode 100644 Documentation/driver-api/cxl/linux/overview.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 6a5fb7e00c52..bc2228c77c32 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -30,9 +30,10 @@ that have impacts on each other. The docs here break up configurations steps.
platform/example-configs
.. toctree::
- :maxdepth: 1
+ :maxdepth: 2
:caption: Linux Kernel Configuration
+ linux/overview
linux/access-coordinates
diff --git a/Documentation/driver-api/cxl/linux/overview.rst b/Documentation/driver-api/cxl/linux/overview.rst
new file mode 100644
index 000000000000..648beb2c8c83
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/overview.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+Overview
+========
+
+This section presents the configuration process of a CXL Type-3 memory device,
+and how it is ultimately exposed to users as either a :code:`DAX` device or
+normal memory pages via the kernel's page allocator.
+
+Portions marked with a bullet are points at which certain kernel objects
+are generated.
+
+1) Early Boot
+
+ a) BIOS, Build, and Boot Parameters
+
+ i) EFI_MEMORY_SP
+ ii) CONFIG_EFI_SOFT_RESERVE
+ iii) CONFIG_MHP_DEFAULT_ONLINE_TYPE
+ iv) nosoftreserve
+
+ b) Memory Map Creation
+
+ i) EFI Memory Map / E820 Consulted for Soft-Reserved
+
+ * CXL Memory is set aside to be handled by the CXL driver
+
+ * Soft-Reserved IO Resource created for CFMWS entry
+
+ c) NUMA Node Creation
+
+ * Nodes created from ACPI CEDT CFMWS and SRAT Proximity domains (PXM)
+
+ d) Memory Tier Creation
+
+ * A default memory_tier is created with all nodes.
+
+ e) Contiguous Memory Allocation
+
+ * Any requested CMA is allocated from Online nodes
+
+ f) Init Finishes, Drivers start probing
+
+2) ACPI and PCI Drivers
+
+ a) Detects PCI device is CXL, marking it for probe by CXL driver
+
+3) CXL Driver Operation
+
+ a) Base device creation
+
+ * root, port, and memdev devices created
+ * CEDT CFMWS IO Resource creation
+
+ b) Decoder creation
+
+ * root, switch, and endpoint decoders created
+
+ c) Logical device creation
+
+ * memory_region and endpoint devices created
+
+ d) Devices are associated with each other
+
+ * If auto-decoder (BIOS-programmed decoders), driver validates
+ configurations, builds associations, and locks configs at probe time.
+
+ * If user-configured, validation and associations are built at
+ decoder-commit time.
+
+ e) Regions surfaced as DAX region
+
+ * dax_region created
+
+ * DAX device created via DAX driver
+
+4) DAX Driver Operation
+
+ a) DAX driver surfaces DAX region as one of two dax device modes
+
+ * kmem - dax device is converted to hotplug memory blocks
+
+ * DAX kmem IO Resource creation
+
+ * hmem - dax device is left as daxdev to be accessed as a file.
+
+ * If hmem, journey ends here.
+
+ b) DAX kmem surfaces memory region to Memory Hotplug to add to page
+ allocator as "driver managed memory"
+
+5) Memory Hotplug
+
+ a) mhp component surfaces a dax device memory region as multiple memory
+ blocks to the page allocator
+
+ * blocks appear in :code:`/sys/bus/memory/devices` and linked to a NUMA node
+
+ b) blocks are onlined into the requested zone (NORMAL or MOVABLE)
+
+ * Memory is marked "Driver Managed" to prevent kexec from using it as a region
+ for kernel updates
--
2.49.0
* [PATCH v3 08/17] cxl: docs/linux - early boot configuration
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (6 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 07/17] cxl: docs/linux - overview Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-13 17:56 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 09/17] cxl: docs/linux - add cxl-driver theory of operation Gregory Price
` (9 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Document __init time configurations that affect CXL driver probe
process and memory region configuration.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../driver-api/cxl/linux/early-boot.rst | 131 ++++++++++++++++++
2 files changed, 132 insertions(+)
create mode 100644 Documentation/driver-api/cxl/linux/early-boot.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index bc2228c77c32..d2eefe575604 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -34,6 +34,7 @@ that have impacts on each other. The docs here break up configurations steps.
:caption: Linux Kernel Configuration
linux/overview
+ linux/early-boot
linux/access-coordinates
diff --git a/Documentation/driver-api/cxl/linux/early-boot.rst b/Documentation/driver-api/cxl/linux/early-boot.rst
new file mode 100644
index 000000000000..8c1c497bc772
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/early-boot.rst
@@ -0,0 +1,131 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Linux Init (Early Boot)
+=======================
+
+Linux configuration is split into two major steps: Early-Boot and everything else.
+
+During early boot, Linux sets up immutable resources (such as NUMA nodes), while
+later operations include things like driver probe and memory hotplug. Linux may
+read EFI and ACPI information throughout this process to configure logical
+representations of the devices.
+
+During the Linux Early Boot stage (functions in the kernel that have the __init
+decorator), the system takes the resources created by EFI/BIOS (ACPI tables)
+and turns them into resources that the kernel can consume.
+
+
+BIOS, Build and Boot Options
+============================
+
+There are 4 BIOS, build, and boot options that need to be considered, as they
+dictate how memory will be managed by Linux during early boot.
+
+* EFI_MEMORY_SP
+
+ * BIOS/EFI Option that dictates whether memory is SystemRAM or
+ Specific Purpose. Specific Purpose memory will be deferred to
+ drivers to manage - and not immediately exposed as system RAM.
+
+* CONFIG_EFI_SOFT_RESERVE
+
+ * Linux Build config option that dictates whether the kernel supports
+ Specific Purpose memory.
+
+* CONFIG_MHP_DEFAULT_ONLINE_TYPE
+
+ * Linux Build config that dictates whether and how Specific Purpose memory
+ converted to a dax device should be managed (left as DAX or onlined as
+ SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).
+
+* nosoftreserve
+
+ * Linux kernel boot option that dictates whether Soft Reserve should be
+ supported. Similar to CONFIG_EFI_SOFT_RESERVE.
+
+Memory Map Creation
+===================
+
+While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory
+is supported and detected, it will set this region aside as
+:code:`SOFT_RESERVED`.
+
+If :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or
+:code:`efi=nosoftreserve` is set, Linux will default a CXL device memory region to
+SystemRAM. This will expose the memory to the kernel page allocator in
+:code:`ZONE_NORMAL`, making it available for use for most allocations (including
+:code:`struct page` and page tables).
+
+If `Specific Purpose` is set and supported, :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*`
+dictates whether the memory is onlined by default (:code:`_OFFLINE` or
+:code:`_ONLINE_*`), and if online which zone to online this memory to by default
+(:code:`_NORMAL` or :code:`_MOVABLE`).
+
+If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most
+kernel allocations (such as :code:`struct page` or page tables). This may
+significantly impact performance depending on the memory capacity of the system.
+
+
+NUMA Node Reservation
+=====================
+
+Linux refers to the proximity domains (:code:`PXM`) defined in the SRAT to
+create NUMA nodes in :code:`acpi_numa_init`. Typically, there is a 1:1 relation
+between :code:`PXM` and NUMA node IDs.
+
+SRAT is the only ACPI defined way of defining Proximity Domains. Linux chooses
+to, at most, map those 1:1 with NUMA nodes. CEDT adds a description of SPA
+ranges which Linux may wish to map to one or more NUMA nodes.
+
+If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM`
+is created (as of v6.15). In the future, Linux may reject CFMWS not described
+by SRAT due to the ambiguity of proximity domain association.
+
+It is important to note that NUMA node creation cannot be done at runtime. All
+possible NUMA nodes are identified at :code:`__init` time, more specifically
+during :code:`mm_init`. The CEDT and SRAT must contain sufficient :code:`PXM`
+data for Linux to identify NUMA nodes and their associated memory regions.
+
+The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.
+
+See the Example Platform Configurations section for more information.
+
+Memory Tiers Creation
+=====================
+Memory tiers are a collection of NUMA nodes grouped by performance characteristics.
+During :code:`__init`, Linux initializes the system with a default memory tier that
+contains all nodes marked :code:`N_MEMORY`.
+
+:code:`memory_tier_init` is called at boot for all nodes with memory online by
+default. :code:`memory_tier_late_init` is called during late-init for nodes set up
+during driver configuration.
+
+Nodes are only marked :code:`N_MEMORY` if they have *online* memory.
+
+Tier membership can be inspected in ::
+
+ /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
+ 0-1
+
+If nodes with clearly different performance are grouped together, check the HMAT
+and CDAT information for the CXL nodes. All nodes default to the DRAM tier,
+unless HMAT/CDAT information is reported to the memory_tier component via
+`access_coordinates`.
+
+Contiguous Memory Allocation
+============================
+The contiguous memory allocator (CMA) enables reservation of contiguous memory
+regions on NUMA nodes during early boot. However, CMA cannot reserve memory
+on NUMA nodes that are not online during early boot. ::
+
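+ void __init hugetlb_cma_reserve(int order) {
+     int nid;
+     /* abridged sketch: only the online-node check is shown */
+     for (nid = 0; nid < MAX_NUMNODES; nid++) {
+         if (!node_online(nid))
+             continue; /* do not allow reservations */
+     }
+ }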
+
+This means if users intend to defer management of CXL memory to the driver, CMA
+cannot be used to guarantee huge page allocations. If enabling CXL memory as
+SystemRAM in `ZONE_NORMAL` during early boot, CMA reservations per-node can be
+made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line
+parameters.
--
2.49.0
* [PATCH v3 09/17] cxl: docs/linux - add cxl-driver theory of operation
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (7 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 08/17] cxl: docs/linux - early boot configuration Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 16:21 ` [PATCH v3 10/17] cxl: docs/linux/cxl-driver - add example configurations Gregory Price
` (8 subsequent siblings)
17 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add docs for the CXL driver that explain the base devices,
decoder types, region types, mailbox interfaces, and decoder
programming.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../driver-api/cxl/linux/cxl-driver.rst | 522 ++++++++++++++++++
2 files changed, 523 insertions(+)
create mode 100644 Documentation/driver-api/cxl/linux/cxl-driver.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index d2eefe575604..df3c7763c79a 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -35,6 +35,7 @@ that have impacts on each other. The docs here break up configurations steps.
linux/overview
linux/early-boot
+ linux/cxl-driver
linux/access-coordinates
diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
new file mode 100644
index 000000000000..9a9e8ecee578
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -0,0 +1,522 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+CXL Driver Operation
+====================
+
+The devices described in this section are present in ::
+
+ /sys/bus/cxl/devices/
+ /dev/cxl/
+
+The :code:`cxl-cli` tool, maintained as part of the NDCTL project, may
+be used to script interactions with these devices.
+
+Drivers
+=======
+The CXL driver is split into a number of drivers.
+
+* cxl_core - fundamental init interface and core object creation
+* cxl_port - initializes root and provides port enumeration interface.
+* cxl_acpi - initializes root decoders and interacts with ACPI data.
+* cxl_p/mem - initializes memory devices
+* cxl_pci - uses cxl_port to enumerate the actual fabric hierarchy.
+
+Driver Devices
+==============
+Here is an example from a single-socket system with 4 host bridges. Two host
+bridges have a single memory device attached, and the devices are interleaved
+into a single memory region. The memory region has been converted to dax. ::
+
+ # ls /sys/bus/cxl/devices/
+ dax_region0 decoder3.0 decoder6.0 mem0 port3
+ decoder0.0 decoder4.0 decoder6.1 mem1 port4
+ decoder1.0 decoder5.0 endpoint5 port1 region0
+ decoder2.0 decoder5.1 endpoint6 port2 root0
+
+For this section we'll explore the devices present in this configuration, but
+we'll explore more configurations in-depth in example configurations below.
+
+Base Devices
+------------
+Most devices in a CXL fabric are a `port` of some kind (because each
+device mostly routes requests from one device to the next, rather than
+providing a manageable service).
+
+Root
+~~~~
+The `CXL Root` is a logical object created by the `cxl_acpi` driver during
+:code:`cxl_acpi_probe` - if the :code:`ACPI0017` `Compute Express Link
+Root Object` Device Class is found.
+
+The Root contains links to:
+
+* `Host Bridge Ports` defined by ACPI CEDT CHBS.
+
+* `Root Decoders` defined by ACPI CEDT CFMWS.
+
+::
+
+ # ls /sys/bus/cxl/devices/root0
+ decoder0.0 dport0 dport5 port2 subsystem
+ decoders_committed dport1 modalias port3 uevent
+ devtype dport4 port1 port4 uport
+
+ # cat /sys/bus/cxl/devices/root0/devtype
+ cxl_port
+
+ # cat port1/devtype
+ cxl_port
+
+ # cat decoder0.0/devtype
+ cxl_decoder_root
+
+The root is the first `logical port` in the CXL fabric, as presented by the Linux
+CXL driver. The `CXL root` is a special type of `switch port`, in that it
+only has downstream port connections.
+
+Port
+~~~~
+A `port` object is better described as a `switch port`. It may represent a
+host bridge to the root or an actual switch port on a switch. A `switch port`
+contains one or more decoders used to route memory requests to downstream ports,
+which may be connected to another `switch port` or an `endpoint port`.
+
+::
+
+ # ls /sys/bus/cxl/devices/port1
+ decoder1.0 dport0 driver parent_dport uport
+ decoders_committed dport113 endpoint5 subsystem
+ devtype dport2 modalias uevent
+
+ # cat devtype
+ cxl_port
+
+ # cat decoder1.0/devtype
+ cxl_decoder_switch
+
+ # cat endpoint5/devtype
+ cxl_port
+
+CXL `Host Bridges` in the fabric are probed during :code:`cxl_acpi_probe` at
+the time the `CXL Root` is probed. This allows for the immediate logical
+connection between the root and host bridge.
+
+* The root has a downstream port connection to a host bridge
+
+* The host bridge has an upstream port connection to the root.
+
+* The host bridge has one or more downstream port connections to switch
+ or endpoint ports.
+
+A `Host Bridge` is a special type of CXL `switch port`. It is explicitly
+defined in the ACPI specification via the `ACPI0016` ID. `Host Bridge` ports
+will be probed at `acpi_probe` time, while similar ports on an actual switch
+will be probed later. Otherwise, switch and host bridge ports look very
+similar - they both contain switch decoders which route accesses between
+upstream and downstream ports.
+
+Endpoint
+~~~~~~~~
+An `endpoint` is a terminal port in the fabric. This is a `logical device`,
+and may be one of many `logical devices` presented by a memory device. It
+is still considered a type of `port` in the fabric.
+
+An `endpoint` contains `endpoint decoders` available for use and the
+*Coherent Device Attribute Table* (CDAT) used to describe the capabilities
+of the device. ::
+
+ # ls /sys/bus/cxl/devices/endpoint5
+ CDAT decoders_committed modalias uevent
+ decoder5.0 devtype parent_dport uport
+ decoder5.1 driver subsystem
+
+ # cat /sys/bus/cxl/devices/endpoint5/devtype
+ cxl_port
+
+ # cat /sys/bus/cxl/devices/endpoint5/decoder5.0/devtype
+ cxl_decoder_endpoint
+
+
+Memory Device (memdev)
+~~~~~~~~~~~~~~~~~~~~~~
+A `memdev` is probed and added by the `cxl_pci` driver in :code:`cxl_pci_probe`
+and is managed by the `cxl_mem` driver. It primarily provides the `IOCTL`
+interface to a memory device, via :code:`/dev/cxl/memN`, and exposes various
+device configuration data. ::
+
+ # ls /sys/bus/cxl/devices/mem0
+ dev firmware_version payload_max security uevent
+ driver label_storage_size pmem serial
+ firmware numa_node ram subsystem
+
+
+Decoders
+--------
+A `Decoder` is short for a CXL Host-Managed Device Memory (HDM) Decoder. It is
+a device that routes accesses through the CXL fabric to an endpoint, and at
+the endpoint translates a `Host Physical Address` to a `Device Physical Address`.
+
+The CXL 3.1 specification heavily implies that only endpoint decoders should
+engage in translation of `Host Physical Address` to `Device Physical Address`.
+::
+
+ 8.2.4.20 CXL HDM Decoder Capability Structure
+
+ IMPLEMENTATION NOTE
+ CXL Host Bridge and Upstream Switch Port Decode Flow
+
+ IMPLEMENTATION NOTE
+ Device Decode Logic
+
+These notes imply that there are two logical groups of decoders.
+
+* Routing Decoder - a decoder which routes accesses but does not translate
+ addresses from HPA to DPA.
+
+* Translating Decoder - a decoder which translates accesses from HPA to DPA
+ for an endpoint to service.
+
+The CXL drivers distinguish 3 decoder types: root, switch, and endpoint. Only
+endpoint decoders are Translating Decoders, all others are Routing Decoders.
+
+.. note:: PLATFORM VENDORS BE AWARE
+
+ Linux makes a strong assumption that endpoint decoders are the only decoder
+ in the fabric that actively translates HPA to DPA. Linux assumes routing
+ decoders pass the HPA unchanged to the next decoder in the fabric.
+
+ It is therefore assumed that any given decoder in the fabric will have an
+ address range that is a subset of its upstream port decoder. Any deviation
+ from this scheme is undefined per the specification. Linux prioritizes
+ spec-defined / architectural behavior.
+
+Decoders may have one or more `Downstream Targets` if configured to interleave
+memory accesses. This will be presented in sysfs via the :code:`target_list`
+parameter.
+
+Root Decoder
+~~~~~~~~~~~~
+A `Root Decoder` is a logical construct of the physical address and interleave
+configurations present in the ACPI CEDT CFMWS. Linux presents this information
+as a decoder present in the `CXL Root`. We consider this a `Root Decoder`,
+though technically it exists on the boundary of the CXL specification and
+platform-specific CXL root implementations.
+
+Linux considers these logical decoders a type of `Routing Decoder`, and the
+first decoder in the CXL fabric to receive a memory access from the platform's
+memory controllers.
+
+`Root Decoders` are created during :code:`cxl_acpi_probe`. One root decoder
+is created per CFMWS entry in the ACPI CEDT.
+
+The :code:`target_list` parameter is filled by the CFMWS target fields. Targets
+of a root decoder are `Host Bridges`, which means interleave done at the root
+decoder level is an `Inter-Host-Bridge Interleave`.
+
+Only root decoders are capable of `Inter-Host-Bridge Interleave`.
+
+Such interleaves must be configured by the platform and described in the ACPI
+CEDT CFMWS, as the target CXL host bridge UIDs in the CFMWS must match the CXL
+host bridge UIDs in the ACPI CEDT CHBS and ACPI DSDT.
+
+Interleave settings in a root decoder describe how to interleave accesses among
+the *immediate downstream targets*, not the entire interleave set.
+
+The memory range described in the root decoder is used to
+
+1) Create a memory region (:code:`region0` in this example), and
+
+2) Associate the region with an IO Memory Resource (:code:`kernel/resource.c`)
+
+::
+
+ # ls /sys/bus/cxl/devices/decoder0.0/
+ cap_pmem devtype region0
+ cap_ram interleave_granularity size
+ cap_type2 interleave_ways start
+ cap_type3 locked subsystem
+ create_ram_region modalias target_list
+ delete_region qos_class uevent
+
+ # cat /sys/bus/cxl/devices/decoder0.0/region0/resource
+ 0xc050000000
+
+The IO Memory Resource is created during early boot when the CFMWS region is
+identified in the EFI Memory Map or E820 table (on x86).
+
+Root decoders are defined as a separate devtype, but are also a type
+of `Switch Decoder` due to having downstream targets. ::
+
+ # cat /sys/bus/cxl/devices/decoder0.0/devtype
+ cxl_decoder_root
+
+Switch Decoder
+~~~~~~~~~~~~~~
+Any non-root, routing decoder is considered a `Switch Decoder`, and will
+present with the type :code:`cxl_decoder_switch`. Both `Host Bridge` and `CXL
+Switch` (device) decoders are of type :code:`cxl_decoder_switch`. ::
+
+ # ls /sys/bus/cxl/devices/decoder1.0/
+ devtype locked size target_list
+ interleave_granularity modalias start target_type
+ interleave_ways region subsystem uevent
+
+ # cat /sys/bus/cxl/devices/decoder1.0/devtype
+ cxl_decoder_switch
+
+ # cat /sys/bus/cxl/devices/decoder1.0/region
+ region0
+
+A `Switch Decoder` has associations between a region defined by a root
+decoder and downstream target ports. Interleaving done within a switch decoder
+is a multi-downstream-port interleave (or `Intra-Host-Bridge Interleave` for
+host bridges).
+
+Interleave settings in a switch decoder describe how to interleave accesses
+among the *immediate downstream targets*, not the entire interleave set.
+
+Switch decoders are created during :code:`cxl_switch_port_probe` in the
+:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
+registers.
+
+Switch decoder programming is validated during probe if the platform programs
+them during boot (See `Auto Decoders` below), or on commit if programmed at
+runtime (See `Runtime Programming` below).
+
+
+Endpoint Decoder
+~~~~~~~~~~~~~~~~
+Any decoder attached to a *terminal* point in the CXL fabric (`An Endpoint`) is
+considered an `Endpoint Decoder`. Endpoint decoders are of type
+:code:`cxl_decoder_endpoint`. ::
+
+ # ls /sys/bus/cxl/devices/decoder5.0
+ devtype locked start
+ dpa_resource modalias subsystem
+ dpa_size mode target_type
+ interleave_granularity region uevent
+ interleave_ways size
+
+ # cat /sys/bus/cxl/devices/decoder5.0/devtype
+ cxl_decoder_endpoint
+
+ # cat /sys/bus/cxl/devices/decoder5.0/region
+ region0
+
+An `Endpoint Decoder` has an association with a region defined by a root
+decoder and describes the device-local resource associated with this region.
+
+Unlike root and switch decoders, endpoint decoders translate `Host Physical` to
+`Device Physical` address ranges. The interleave settings on an endpoint
+therefore describe the entire *interleave set*.
+
+`Device Physical Address` regions must be committed in-order. For example, the
+DPA region starting at 0x80000000 cannot be committed before the DPA region
+starting at 0x0.
+
+As of Linux v6.15, Linux does not support *imbalanced* interleave setups: all
+endpoints in an interleave set are expected to have the same interleave
+settings (granularity and ways must be the same).
+
+Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
+:code:`cxl_port` driver, and are created based on a PCI device's DVSEC registers.
+
+Regions
+-------
+
+Memory Region
+~~~~~~~~~~~~~
+A `Memory Region` is a logical construct that connects a set of CXL ports in
+the fabric to an IO Memory Resource. It is ultimately used to expose the memory
+on these devices to the DAX subsystem via a `DAX Region`.
+
+An example RAM region: ::
+
+ # ls /sys/bus/cxl/devices/region0/
+ access0 devtype modalias subsystem uuid
+ access1 driver mode target0
+ commit interleave_granularity resource target1
+ dax_region0 interleave_ways size uevent
+
+A memory region can be constructed during endpoint probe, if decoders were
+programmed by BIOS/EFI (see `Auto Decoders`), or by creating a region manually
+via a `Root Decoder`'s :code:`create_ram_region` or :code:`create_pmem_region`
+interfaces.
+
+The interleave settings in a `Memory Region` describe the configuration of the
+`Interleave Set` - and are what can be expected to be seen in the endpoint
+interleave settings.
+
+
+DAX Region
+~~~~~~~~~~
+A `DAX Region` is used to convert a CXL `Memory Region` to a DAX device. A
+DAX device may then be accessed directly via a file descriptor interface, or
+converted to System RAM via the DAX kmem driver. See the DAX driver section
+for more details. ::
+
+ # ls /sys/bus/cxl/devices/dax_region0/
+ dax0.0 devtype modalias uevent
+ dax_region driver subsystem
+
+
+Mailbox Interfaces
+------------------
+A mailbox command interface for each device is exposed in ::
+
+ /dev/cxl/mem0
+ /dev/cxl/mem1
+
+These mailboxes may receive any specification-defined command. Raw commands
+(custom commands) can only be sent to these interfaces if the build config
+:code:`CXL_MEM_RAW_COMMANDS` is set. This is considered a debug and/or
+development interface, not an officially supported mechanism for creation
+of vendor-specific commands (see the `fwctl` subsystem for that).
+
+Decoder Programming
+===================
+
+Runtime Programming
+-------------------
+During probe, the only decoders *required* to be programmed are `Root Decoders`.
+In reality, `Root Decoders` are a logical construct to describe the memory
+region and interleave configuration at the host bridge level - as described
+in the ACPI CEDT CFMWS.
+
+All other `Switch` and `Endpoint` decoders may be programmed by the user
+at runtime - if the platform supports such configurations.
+
+This interaction is what creates a `Software Defined Memory` environment.
+
+See the :code:`cxl-cli` documentation for more information about how to
+configure CXL decoders at runtime.
+
+Auto Decoders
+-------------
+Auto Decoders are decoders programmed by BIOS/EFI at boot time, and are
+almost always locked (cannot be changed). This is done by a platform
+which may have a static configuration - or certain quirks which may prevent
+dynamic runtime changes to the decoders (such as requiring additional
+controller programming within the CPU complex outside the scope of CXL).
+
+Auto Decoders are probed automatically as long as the devices and memory
+regions they are associated with probe without issue. When probing Auto
+Decoders, the driver's primary responsibility is to ensure the fabric is
+sane - as if validating runtime programmed regions and decoders.
+
+If Linux cannot validate auto-decoder configuration, the memory will not
+be surfaced as a DAX device - and therefore not be exposed to the page
+allocator - effectively stranding it.
+
+Interleave
+----------
+
+The Linux CXL driver supports `Cross-Link First` interleave. This dictates
+how interleave is programmed at each decoder step, as the driver validates
+the relationships between a decoder and its parent.
+
+For example, in a `Cross-Link First` interleave setup with 16 endpoints
+attached to 4 host bridges, Linux expects the following ways/granularity
+across the root, host bridge, and endpoints respectively. ::
+
+ ways granularity
+ root 4 256
+ host bridge 4 1024
+ endpoint 16 256
+
+At the root, a given access will be routed to the
+:code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, it will
+be routed to the :code:`((HPA / 1024) % 4)th` target endpoint. Each endpoint
+will translate the access based on the entire 16 device interleave set.
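+
+As a worked example of the routing arithmetic above (a sketch, assuming integer
+division and an arbitrarily chosen, purely illustrative HPA of 0x1500), the
+target selected at the root is:
+
+.. math::
+
+   \lfloor \mathtt{0x1500} / 256 \rfloor \bmod 4 = 21 \bmod 4 = 1
+
+and the target selected within that host bridge is:
+
+.. math::
+
+   \lfloor \mathtt{0x1500} / 1024 \rfloor \bmod 4 = 5 \bmod 4 = 1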
+
+Unbalanced interleave sets are not supported - decoders at a similar point
+in the hierarchy (e.g. all host bridge decoders) must have the same ways and
+granularity configuration.
+
+At Root
+~~~~~~~
+Root decoder interleave is defined by the ACPI CEDT CFMWS. The CEDT
+may actually define multiple CFMWS configurations to describe the same
+physical capacity - with the intent to allow users to decide at runtime
+whether to online memory as interleaved or non-interleaved. ::
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Window base address : 0000000100000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ First Target : 00000007
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Window base address : 0000000200000000
+ Window size : 0000000100000000
+ Interleave Members (2^n) : 00
+ Interleave Arithmetic : 00
+ First Target : 00000006
+
+ Subtable Type : 01 [CXL Fixed Memory Window Structure]
+ Window base address : 0000000300000000
+ Window size : 0000000200000000
+ Interleave Members (2^n) : 01
+ Interleave Arithmetic : 00
+ First Target : 00000007
+ Next Target : 00000006
+
+In this example, the CFMWS defines two discrete non-interleaved 4GB regions,
+one for each host bridge, and one interleaved 8GB region that targets both. This
+would result in 3 root decoders presenting in the root. ::
+
+ # ls /sys/bus/cxl/devices/root0
+ decoder0.0 decoder0.1 decoder0.2
+
+ # cat /sys/bus/cxl/devices/decoder0.0/{target_list,start,size}
+ 7
+ 0x100000000
+ 0x100000000
+
+ # cat /sys/bus/cxl/devices/decoder0.1/{target_list,start,size}
+ 6
+ 0x200000000
+ 0x100000000
+
+ # cat /sys/bus/cxl/devices/decoder0.2/{target_list,start,size}
+ 7,6
+ 0x300000000
+ 0x200000000
+
+These decoders are not runtime programmable. They are used to generate a
+`Memory Region` to bring this memory online with runtime programmed settings
+at the `Switch` and `Endpoint` decoders.
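+
+The window sizes and interleave ways quoted above follow directly from the
+hexadecimal CFMWS fields. The sketch below decodes the three example windows.
+The dictionary layout and the :code:`decode_window` helper are hypothetical,
+and only the power-of-two encoding implied by the "(2^n)" label in the dump
+is assumed.
+

```python
# Illustrative only: values copied from the example CFMWS dump.
GIB = 1 << 30

WINDOWS = [
    {"base": "0000000100000000", "size": "0000000100000000", "members": "00"},
    {"base": "0000000200000000", "size": "0000000100000000", "members": "00"},
    {"base": "0000000300000000", "size": "0000000200000000", "members": "01"},
]


def decode_window(window: dict) -> dict:
    """Convert the hex strings of one window into base, size (GiB) and ways."""
    return {
        "base": int(window["base"], 16),
        "size_gib": int(window["size"], 16) // GIB,
        # Assumes the power-of-two encoding implied by the "(2^n)" label.
        "ways": 2 ** int(window["members"], 16),
    }


if __name__ == "__main__":
    for index, window in enumerate(WINDOWS):
        print(f"decoder0.{index}", decode_window(window))
```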
+
+At Host Bridge or Switch
+~~~~~~~~~~~~~~~~~~~~~~~~
+`Host Bridge` and `Switch` decoders are programmable via the following fields:
+
+- :code:`start` - the HPA region associated with the memory region
+- :code:`size` - the size of the region
+- :code:`target_list` - the list of downstream ports
+- :code:`interleave_ways` - the number of downstream ports to interleave across
+- :code:`interleave_granularity` - the granularity to interleave at.
+
+Linux expects the :code:`interleave_granularity` of switch decoders to be
+derived from their upstream port connections. In `Cross-Link First` interleave
+configurations, the :code:`interleave_granularity` of a decoder is equal to
+:code:`parent_interleave_granularity * parent_interleave_ways`.
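+
+A minimal sketch of that derivation, assuming a `Cross-Link First`
+configuration. The function name :code:`child_granularity` is hypothetical.
+With a 4-way root decoder at 256 bytes, it yields the 1024 byte host bridge
+granularity used in the table earlier in this section.
+

```python
# Illustrative only: derive a decoder's granularity from its parent decoder.


def child_granularity(parent_granularity: int, parent_ways: int) -> int:
    """Granularity expected at a decoder directly below the parent decoder."""
    return parent_granularity * parent_ways


if __name__ == "__main__":
    # Root decoder: 4 ways at 256 bytes -> host bridge decoders use 1024 bytes.
    print(child_granularity(256, 4))
    # Root decoder: 2 ways at 256 bytes -> host bridge decoders use 512 bytes.
    print(child_granularity(256, 2))
```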
+
+At Endpoint
+~~~~~~~~~~~
+`Endpoint Decoders` are programmed similarly to Host Bridge and Switch decoders,
+with the exception that the ways and granularity are defined by the interleave
+set (e.g. the interleave settings defined by the associated `Memory Region`).
+
+- :code:`start` - the HPA region associated with the memory region
+- :code:`size` - the size of the region
+- :code:`interleave_ways` - the number of endpoints in the interleave set
+- :code:`interleave_granularity` - the granularity to interleave at.
+
+These settings are used by endpoint decoders to *Translate* memory requests
+from HPA to DPA. This is why they must be aware of the entire interleave set.
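+
+The translation can be pictured with a simplified model. The sketch below
+assumes modulo interleave arithmetic, a power-of-two number of ways, and
+offsets measured from the start of the region and of the device's DPA range.
+The :code:`hpa_to_dpa` helper is a hypothetical name and is not the driver's
+implementation.
+

```python
# Illustrative only: simplified modulo-interleave model of HPA -> DPA.


def hpa_to_dpa(hpa_offset: int, ways: int, granularity: int) -> tuple[int, int]:
    """Return (position in the interleave set, DPA offset on that endpoint)."""
    # Which endpoint in the interleave set owns this chunk.
    position = (hpa_offset // granularity) % ways
    # How many full stripes (one chunk per endpoint) precede this offset.
    stripe = hpa_offset // (granularity * ways)
    # Each full stripe consumes one chunk of DPA space on every endpoint.
    dpa_offset = stripe * granularity + hpa_offset % granularity
    return position, dpa_offset


if __name__ == "__main__":
    # Two endpoints interleaved at 256 bytes.
    for offset in (0, 256, 512, 773):
        print(offset, hpa_to_dpa(offset, ways=2, granularity=256))
```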
+
+Linux does not support unbalanced interleave configurations. As a result, all
+endpoints in an interleave set must have the same ways and granularity.
--
2.49.0
* [PATCH v3 10/17] cxl: docs/linux/cxl-driver - add example configurations
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add 4 example configurations:
- single device
- cross-host-bridge interleave
- intra-host-bridge interleave
- multi-level interleave
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../driver-api/cxl/linux/cxl-driver.rst | 10 +
.../example-configurations/hb-interleave.rst | 314 ++++++++++++++
.../intra-hb-interleave.rst | 291 +++++++++++++
.../multi-interleave.rst | 401 ++++++++++++++++++
.../example-configurations/single-device.rst | 246 +++++++++++
5 files changed, 1262 insertions(+)
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst
create mode 100644 Documentation/driver-api/cxl/linux/example-configurations/single-device.rst
diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
index 9a9e8ecee578..486baf8551aa 100644
--- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -520,3 +520,13 @@ from HPA to DPA. This is why they must be aware of the entire interleave set.
Linux does not support unbalanced interleave configurations. As a result, all
endpoints in an interleave set must have the same ways and granularity.
+
+Example Configurations
+======================
+.. toctree::
+ :maxdepth: 1
+
+ example-configurations/single-device.rst
+ example-configurations/hb-interleave.rst
+ example-configurations/intra-hb-interleave.rst
+ example-configurations/multi-interleave.rst
diff --git a/Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst b/Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst
new file mode 100644
index 000000000000..f071490763a2
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst
@@ -0,0 +1,314 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Inter-Host-Bridge Interleave
+============================
+This cxl-cli configuration dump shows the following host configuration:
+
+* A single socket system with one CXL root
+* CXL Root has Four (4) CXL Host Bridges
+* Two CXL Host Bridges each have a single CXL Memory Expander attached
+* The CXL root is configured to interleave across the two host bridges.
+
+This output is generated by :code:`cxl list -v` and describes the relationships
+between objects exposed in :code:`/sys/bus/cxl/devices/`.
+
+::
+
+ [
+ {
+ "bus":"root0",
+ "provider":"ACPI.CXL",
+ "nr_dports":4,
+ "dports":[
+ {
+ "dport":"pci0000:00",
+ "alias":"ACPI0016:01",
+ "id":0
+ },
+ {
+ "dport":"pci0000:a8",
+ "alias":"ACPI0016:02",
+ "id":4
+ },
+ {
+ "dport":"pci0000:2a",
+ "alias":"ACPI0016:03",
+ "id":1
+ },
+ {
+ "dport":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "id":5
+ }
+ ],
+
+This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL
+Host Bridges. The `Root` can be considered the singular upstream port attached
+to the platform's memory controller - which routes memory requests to it.
+
+The `ports:root0` section shows how each of these downstream ports is configured.
+Ports that are not configured (ids 0 and 1) omit their endpoints and decoders.
+
+::
+
+ "ports:root0":[
+ {
+ "port":"port1",
+ "host":"pci0000:d2",
+ "depth":1,
+ "nr_dports":3,
+ "dports":[
+ {
+ "dport":"0000:d2:01.1",
+ "alias":"device:02",
+ "id":0
+ },
+ {
+ "dport":"0000:d2:01.3",
+ "alias":"device:05",
+ "id":2
+ },
+ {
+ "dport":"0000:d2:07.1",
+ "alias":"device:0d",
+ "id":113
+ }
+ ],
+
+This chunk shows the available downstream ports associated with the CXL Host
+Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream
+ports: :code:`dport0`, :code:`dport2`, and :code:`dport113`.
+
+::
+
+ "endpoints:port1":[
+ {
+ "endpoint":"endpoint5",
+ "host":"mem0",
+ "parent_dport":"0000:d2:01.1",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem0",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:d3:00.0"
+ },
+ "decoders:endpoint5":[
+ {
+ "decoder":"decoder5.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ }
+ ],
+
+This chunk shows the endpoints attached to the host bridge :code:`port1`.
+
+:code:`endpoint5` contains a single configured decoder :code:`decoder5.0`
+which has the same interleave configuration as :code:`region0` (shown later).
+
+Next we have the decoders belonging to the host bridge:
+
+::
+
+ "decoders:port1":[
+ {
+ "decoder":"decoder1.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":1,
+ "region":"region0",
+ "nr_targets":1,
+ "targets":[
+ {
+ "target":"0000:d2:01.1",
+ "alias":"device:02",
+ "position":0,
+ "id":0
+ }
+ ]
+ }
+ ]
+ },
+
+Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`), whose only
+target is :code:`dport0` - which is attached to :code:`endpoint5`.
+
+The following chunk shows a similar configuration for Host Bridge :code:`port3`,
+the second host bridge with a memory device attached.
+
+::
+
+ {
+ "port":"port3",
+ "host":"pci0000:a8",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:a8:01.1",
+ "alias":"device:c3",
+ "id":0
+ }
+ ],
+ "endpoints:port3":[
+ {
+ "endpoint":"endpoint6",
+ "host":"mem1",
+ "parent_dport":"0000:a8:01.1",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem1",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:a9:00.0"
+ },
+ "decoders:endpoint6":[
+ {
+ "decoder":"decoder6.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ }
+ ],
+ "decoders:port3":[
+ {
+ "decoder":"decoder3.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":1,
+ "region":"region0",
+ "nr_targets":1,
+ "targets":[
+ {
+ "target":"0000:a8:01.1",
+ "alias":"device:c3",
+ "position":0,
+ "id":0
+ }
+ ]
+ }
+ ]
+ },
+
+
+The next chunk shows the two CXL host bridges without attached endpoints.
+
+::
+
+ {
+ "port":"port2",
+ "host":"pci0000:00",
+ "depth":1,
+ "nr_dports":2,
+ "dports":[
+ {
+ "dport":"0000:00:01.3",
+ "alias":"device:55",
+ "id":2
+ },
+ {
+ "dport":"0000:00:07.1",
+ "alias":"device:5d",
+ "id":113
+ }
+ ]
+ },
+ {
+ "port":"port4",
+ "host":"pci0000:2a",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:2a:01.1",
+ "alias":"device:d0",
+ "id":0
+ }
+ ]
+ }
+ ],
+
+Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder
+applies the interleave across the downstream ports :code:`port1` and
+:code:`port3` - with a granularity of 256 bytes.
+
+This information is generated by the CXL driver reading the ACPI CEDT CFMWS.
+
+::
+
+ "decoders:root0":[
+ {
+ "decoder":"decoder0.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "max_available_extent":0,
+ "volatile_capable":true,
+ "nr_targets":2,
+ "targets":[
+ {
+ "target":"pci0000:a8",
+ "alias":"ACPI0016:02",
+ "position":1,
+ "id":4
+ },
+ {
+ "target":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "position":0,
+ "id":5
+ }
+ ],
+
+Finally we have the `Memory Region` associated with the `Root Decoder`
+:code:`decoder0.0`. This region describes the overall interleave configuration
+of the interleave set.
+
+::
+
+ "regions:decoder0.0":[
+ {
+ "region":"region0",
+ "resource":825975898112,
+ "size":274877906944,
+ "type":"ram",
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "decode_state":"commit",
+ "mappings":[
+ {
+ "position":1,
+ "memdev":"mem1",
+ "decoder":"decoder6.0"
+ },
+ {
+ "position":0,
+ "memdev":"mem0",
+ "decoder":"decoder5.0"
+ }
+ ]
+ }
+ ]
+ }
+ ]
+ }
+ ]
diff --git a/Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst b/Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst
new file mode 100644
index 000000000000..077dfaf8458d
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst
@@ -0,0 +1,291 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Intra-Host-Bridge Interleave
+============================
+This cxl-cli configuration dump shows the following host configuration:
+
+* A single socket system with one CXL root
+* CXL Root has Four (4) CXL Host Bridges
+* One (1) CXL Host Bridge has two CXL Memory Expanders attached
+* The Host bridge decoder is programmed to interleave across the expanders.
+
+This output is generated by :code:`cxl list -v` and describes the relationships
+between objects exposed in :code:`/sys/bus/cxl/devices/`.
+
+::
+
+ [
+ {
+ "bus":"root0",
+ "provider":"ACPI.CXL",
+ "nr_dports":4,
+ "dports":[
+ {
+ "dport":"pci0000:00",
+ "alias":"ACPI0016:01",
+ "id":0
+ },
+ {
+ "dport":"pci0000:a8",
+ "alias":"ACPI0016:02",
+ "id":4
+ },
+ {
+ "dport":"pci0000:2a",
+ "alias":"ACPI0016:03",
+ "id":1
+ },
+ {
+ "dport":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "id":5
+ }
+ ],
+
+This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL
+Host Bridges. The `Root` can be considered the singular upstream port attached
+to the platform's memory controller - which routes memory requests to it.
+
+The `ports:root0` section shows how each of these downstream ports is configured.
+Ports that are not configured (ids 0, 1, and 4) omit their endpoints and decoders.
+
+::
+
+ "ports:root0":[
+ {
+ "port":"port1",
+ "host":"pci0000:d2",
+ "depth":1,
+ "nr_dports":3,
+ "dports":[
+ {
+ "dport":"0000:d2:01.1",
+ "alias":"device:02",
+ "id":0
+ },
+ {
+ "dport":"0000:d2:01.3",
+ "alias":"device:05",
+ "id":2
+ },
+ {
+ "dport":"0000:d2:07.1",
+ "alias":"device:0d",
+ "id":113
+ }
+ ],
+
+This chunk shows the available downstream ports associated with the CXL Host
+Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream
+ports: :code:`dport0`, :code:`dport2`, and :code:`dport113`.
+
+::
+
+ "endpoints:port1":[
+ {
+ "endpoint":"endpoint5",
+ "host":"mem0",
+ "parent_dport":"0000:d2:01.1",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem0",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:d3:00.0"
+ },
+ "decoders:endpoint5":[
+ {
+ "decoder":"decoder5.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ },
+ {
+ "endpoint":"endpoint6",
+ "host":"mem1",
+ "parent_dport":"0000:d2:01.3",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem1",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:a9:00.0"
+ },
+ "decoders:endpoint6":[
+ {
+ "decoder":"decoder6.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ }
+ ],
+
+This chunk shows the endpoints attached to the host bridge :code:`port1`.
+
+:code:`endpoint5` contains a single configured decoder :code:`decoder5.0`
+which has the same interleave configuration as the memory region it belongs to
+(shown later).
+
+Next we have the decoders belonging to the host bridge:
+
+::
+
+ "decoders:port1":[
+ {
+ "decoder":"decoder1.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "region":"region0",
+ "nr_targets":2,
+ "targets":[
+ {
+ "target":"0000:d2:01.1",
+ "alias":"device:02",
+ "position":0,
+ "id":0
+ },
+ {
+ "target":"0000:d2:01.3",
+ "alias":"device:05",
+ "position":1,
+ "id":2
+ }
+ ]
+ }
+ ]
+ },
+
+Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`) with two
+targets: :code:`dport0` and :code:`dport2` - which are attached to
+:code:`endpoint5` and :code:`endpoint6` respectively.
+
+The host bridge decoder interleaves these devices at a 256 byte granularity.
+
+The next chunk shows the three CXL host bridges without attached endpoints.
+
+::
+
+ {
+ "port":"port2",
+ "host":"pci0000:00",
+ "depth":1,
+ "nr_dports":2,
+ "dports":[
+ {
+ "dport":"0000:00:01.3",
+ "alias":"device:55",
+ "id":2
+ },
+ {
+ "dport":"0000:00:07.1",
+ "alias":"device:5d",
+ "id":113
+ }
+ ]
+ },
+ {
+ "port":"port3",
+ "host":"pci0000:a8",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:a8:01.1",
+ "alias":"device:c3",
+ "id":0
+ }
+ ]
+ },
+ {
+ "port":"port4",
+ "host":"pci0000:2a",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:2a:01.1",
+ "alias":"device:d0",
+ "id":0
+ }
+ ]
+ }
+ ],
+
+Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder
+does not interleave (:code:`interleave_ways` is :code:`1`); the 256 byte
+interleave is applied by the host bridge decoder :code:`decoder1.0` instead.
+
+This information is generated by the CXL driver reading the ACPI CEDT CFMWS.
+
+::
+
+ "decoders:root0":[
+ {
+ "decoder":"decoder0.0",
+ "resource":825975898112,
+ "size":274877906944,
+ "interleave_ways":1,
+ "max_available_extent":0,
+ "volatile_capable":true,
+ "nr_targets":1,
+ "targets":[
+ {
+ "target":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "position":0,
+ "id":5
+ }
+ ],
+
+Finally we have the `Memory Region` associated with the `Root Decoder`
+:code:`decoder0.0`. This region describes the overall interleave configuration
+of the interleave set.
+
+::
+
+ "regions:decoder0.0":[
+ {
+ "region":"region0",
+ "resource":825975898112,
+ "size":274877906944,
+ "type":"ram",
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "decode_state":"commit",
+ "mappings":[
+ {
+ "position":1,
+ "memdev":"mem1",
+ "decoder":"decoder6.0"
+ },
+ {
+ "position":0,
+ "memdev":"mem0",
+ "decoder":"decoder5.0"
+ }
+ ]
+ }
+ ]
+ }
+ ]
+ }
+ ]
diff --git a/Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst b/Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst
new file mode 100644
index 000000000000..008f9053c630
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst
@@ -0,0 +1,401 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Multi-Level Interleave
+======================
+This cxl-cli configuration dump shows the following host configuration:
+
+* A single socket system with one CXL root
+* CXL Root has Four (4) CXL Host Bridges
+* Two CXL Host Bridges have two CXL Memory Expanders attached each.
+* The CXL root is configured to interleave across the two host bridges.
+* Each host bridge with expanders interleaves across two endpoints.
+
+This output is generated by :code:`cxl list -v` and describes the relationships
+between objects exposed in :code:`/sys/bus/cxl/devices/`.
+
+::
+
+ [
+ {
+ "bus":"root0",
+ "provider":"ACPI.CXL",
+ "nr_dports":4,
+ "dports":[
+ {
+ "dport":"pci0000:00",
+ "alias":"ACPI0016:01",
+ "id":0
+ },
+ {
+ "dport":"pci0000:a8",
+ "alias":"ACPI0016:02",
+ "id":4
+ },
+ {
+ "dport":"pci0000:2a",
+ "alias":"ACPI0016:03",
+ "id":1
+ },
+ {
+ "dport":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "id":5
+ }
+ ],
+
+This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL
+Host Bridges. The `Root` can be considered the singular upstream port attached
+to the platform's memory controller - which routes memory requests to it.
+
+The `ports:root0` section shows how each of these downstream ports is configured.
+Ports that are not configured (ids 0 and 1) omit their endpoints and decoders.
+
+::
+
+ "ports:root0":[
+ {
+ "port":"port1",
+ "host":"pci0000:d2",
+ "depth":1,
+ "nr_dports":3,
+ "dports":[
+ {
+ "dport":"0000:d2:01.1",
+ "alias":"device:02",
+ "id":0
+ },
+ {
+ "dport":"0000:d2:01.3",
+ "alias":"device:05",
+ "id":2
+ },
+ {
+ "dport":"0000:d2:07.1",
+ "alias":"device:0d",
+ "id":113
+ }
+ ],
+
+This chunk shows the available downstream ports associated with the CXL Host
+Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream
+ports: :code:`dport0`, :code:`dport2`, and :code:`dport113`.
+
+::
+
+ "endpoints:port1":[
+ {
+ "endpoint":"endpoint5",
+ "host":"mem0",
+ "parent_dport":"0000:d2:01.1",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem0",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:d3:00.0"
+ },
+ "decoders:endpoint5":[
+ {
+ "decoder":"decoder5.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":4,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ },
+ {
+ "endpoint":"endpoint6",
+ "host":"mem1",
+ "parent_dport":"0000:d2:01.3",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem1",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:d3:00.0"
+ },
+ "decoders:endpoint6":[
+ {
+ "decoder":"decoder6.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":4,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ }
+ ],
+
+This chunk shows the endpoints attached to the host bridge :code:`port1`.
+
+:code:`endpoint5` contains a single configured decoder :code:`decoder5.0`
+which has the same interleave configuration as :code:`region0` (shown later).
+
+:code:`endpoint6` contains a single configured decoder :code:`decoder6.0`
+which has the same interleave configuration as :code:`region0` (shown later).
+
+Next we have the decoders belonging to the host bridge:
+
+::
+
+ "decoders:port1":[
+ {
+ "decoder":"decoder1.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":2,
+ "interleave_granularity":512,
+ "region":"region0",
+ "nr_targets":2,
+ "targets":[
+ {
+ "target":"0000:d2:01.1",
+ "alias":"device:02",
+ "position":0,
+ "id":0
+ },
+ {
+ "target":"0000:d2:01.3",
+ "alias":"device:05",
+ "position":2,
+ "id":2
+ }
+ ]
+ }
+ ]
+ },
+
+Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`), whose
+targets are :code:`dport0` and :code:`dport2` - which are attached to
+:code:`endpoint5` and :code:`endpoint6` respectively.
+
+The following chunk shows a similar configuration for Host Bridge :code:`port3`,
+the second host bridge with memory devices attached.
+
+::
+
+ {
+ "port":"port3",
+ "host":"pci0000:a8",
+ "depth":1,
+ "nr_dports":2,
+ "dports":[
+ {
+ "dport":"0000:a8:01.1",
+ "alias":"device:c3",
+ "id":0
+ },
+ {
+ "dport":"0000:a8:01.3",
+ "alias":"device:c5",
+ "id":0
+ }
+ ],
+ "endpoints:port3":[
+ {
+ "endpoint":"endpoint7",
+ "host":"mem2",
+ "parent_dport":"0000:a8:01.1",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem2",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:a9:00.0"
+ },
+ "decoders:endpoint7":[
+ {
+ "decoder":"decoder7.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":4,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ },
+ {
+ "endpoint":"endpoint8",
+ "host":"mem3",
+ "parent_dport":"0000:a8:01.3",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem3",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:a9:00.0"
+ },
+ "decoders:endpoint8":[
+ {
+ "decoder":"decoder8.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":4,
+ "interleave_granularity":256,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ }
+ ],
+ "decoders:port3":[
+ {
+ "decoder":"decoder3.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":2,
+ "interleave_granularity":512,
+ "region":"region0",
+ "nr_targets":2,
+ "targets":[
+ {
+ "target":"0000:a8:01.1",
+ "alias":"device:c3",
+ "position":1,
+ "id":0
+ },
+ {
+ "target":"0000:a8:01.3",
+ "alias":"device:c5",
+ "position":3,
+ "id":0
+ }
+ ]
+ }
+ ]
+ },
+
+
+The next chunk shows the two CXL host bridges without attached endpoints.
+
+::
+
+ {
+ "port":"port2",
+ "host":"pci0000:00",
+ "depth":1,
+ "nr_dports":2,
+ "dports":[
+ {
+ "dport":"0000:00:01.3",
+ "alias":"device:55",
+ "id":2
+ },
+ {
+ "dport":"0000:00:07.1",
+ "alias":"device:5d",
+ "id":113
+ }
+ ]
+ },
+ {
+ "port":"port4",
+ "host":"pci0000:2a",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:2a:01.1",
+ "alias":"device:d0",
+ "id":0
+ }
+ ]
+ }
+ ],
+
+Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder
+applies the interleave across the downstream ports :code:`port1` and
+:code:`port3` - with a granularity of 256 bytes.
+
+This information is generated by the CXL driver reading the ACPI CEDT CFMWS.
+
+::
+
+ "decoders:root0":[
+ {
+ "decoder":"decoder0.0",
+ "resource":825975898112,
+ "size":549755813888,
+ "interleave_ways":2,
+ "interleave_granularity":256,
+ "max_available_extent":0,
+ "volatile_capable":true,
+ "nr_targets":2,
+ "targets":[
+ {
+ "target":"pci0000:a8",
+ "alias":"ACPI0016:02",
+ "position":1,
+ "id":4
+ },
+ {
+ "target":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "position":0,
+ "id":5
+ }
+ ],
+
+Finally we have the `Memory Region` associated with the `Root Decoder`
+:code:`decoder0.0`. This region describes the overall interleave configuration
+of the interleave set. So we see there are a total of :code:`4` interleave
+targets across 4 endpoint decoders.
+
+::
+
+ "regions:decoder0.0":[
+ {
+ "region":"region0",
+ "resource":825975898112,
+ "size":549755813888,
+ "type":"ram",
+ "interleave_ways":4,
+ "interleave_granularity":256,
+ "decode_state":"commit",
+ "mappings":[
+ {
+ "position":3,
+ "memdev":"mem3",
+ "decoder":"decoder8.0"
+ },
+ {
+ "position":2,
+ "memdev":"mem1",
+ "decoder":"decoder6.0"
+ },
+ {
+ "position":1,
+ "memdev":"mem2",
+ "decoder":"decoder7.0"
+ },
+ {
+ "position":0,
+ "memdev":"mem0",
+ "decoder":"decoder5.0"
+ }
+ ]
+ }
+ ]
+ }
+ ]
+ }
+ ]
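+
+One way to read the mappings above is that an endpoint's region position
+combines its host bridge's position in the root decoder with its order among
+that host bridge decoder's targets. The sketch below reproduces the positions
+from the dump under that assumption. The :code:`HOST_BRIDGES` layout and the
+:code:`region_positions` helper are hypothetical names for this example only.
+

```python
# Illustrative only: reproduce the region0 mapping positions from the dump.
ROOT_WAYS = 2  # interleave_ways of decoder0.0 in the dump

# host bridge -> (position in the root decoder, memdevs in target order)
HOST_BRIDGES = {
    "port1": (0, ["mem0", "mem1"]),
    "port3": (1, ["mem2", "mem3"]),
}


def region_positions(host_bridges: dict, root_ways: int) -> dict:
    """Map each memdev to its position in the overall interleave set."""
    positions = {}
    for root_position, memdevs in host_bridges.values():
        for local_index, memdev in enumerate(memdevs):
            # The root alternates between host bridges first, so consecutive
            # endpoints behind one host bridge are root_ways positions apart.
            positions[memdev] = root_position + root_ways * local_index
    return positions


if __name__ == "__main__":
    print(region_positions(HOST_BRIDGES, ROOT_WAYS))
```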
diff --git a/Documentation/driver-api/cxl/linux/example-configurations/single-device.rst b/Documentation/driver-api/cxl/linux/example-configurations/single-device.rst
new file mode 100644
index 000000000000..5fd38eb0aaf4
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/example-configurations/single-device.rst
@@ -0,0 +1,246 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Single Device
+=============
+This cxl-cli configuration dump shows the following host configuration:
+
+* A single socket system with one CXL root
+* CXL Root has Four (4) CXL Host Bridges
+* One CXL Host Bridge has a single CXL Memory Expander attached
+* No interleave is present.
+
+This output is generated by :code:`cxl list -v` and describes the relationships
+between objects exposed in :code:`/sys/bus/cxl/devices/`.
+
+::
+
+ [
+ {
+ "bus":"root0",
+ "provider":"ACPI.CXL",
+ "nr_dports":4,
+ "dports":[
+ {
+ "dport":"pci0000:00",
+ "alias":"ACPI0016:01",
+ "id":0
+ },
+ {
+ "dport":"pci0000:a8",
+ "alias":"ACPI0016:02",
+ "id":4
+ },
+ {
+ "dport":"pci0000:2a",
+ "alias":"ACPI0016:03",
+ "id":1
+ },
+ {
+ "dport":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "id":5
+ }
+ ],
+
+This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL
+Host Bridges. The `Root` can be considered the singular upstream port attached
+to the platform's memory controller - which routes memory requests to it.
+
+The `ports:root0` section shows how each of these downstream ports is configured.
+Ports that are not configured (ids 0, 1, and 4) omit their endpoints and decoders.
+
+::
+
+ "ports:root0":[
+ {
+ "port":"port1",
+ "host":"pci0000:d2",
+ "depth":1,
+ "nr_dports":3,
+ "dports":[
+ {
+ "dport":"0000:d2:01.1",
+ "alias":"device:02",
+ "id":0
+ },
+ {
+ "dport":"0000:d2:01.3",
+ "alias":"device:05",
+ "id":2
+ },
+ {
+ "dport":"0000:d2:07.1",
+ "alias":"device:0d",
+ "id":113
+ }
+ ],
+
+This chunk shows the available downstream ports associated with the CXL Host
+Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream
+ports: :code:`dport0`, :code:`dport2`, and :code:`dport113`.
+
+::
+
+ "endpoints:port1":[
+ {
+ "endpoint":"endpoint5",
+ "host":"mem0",
+ "parent_dport":"0000:d2:01.1",
+ "depth":2,
+ "memdev":{
+ "memdev":"mem0",
+ "ram_size":137438953472,
+ "serial":0,
+ "numa_node":0,
+ "host":"0000:d3:00.0"
+ },
+ "decoders:endpoint5":[
+ {
+ "decoder":"decoder5.0",
+ "resource":825975898112,
+ "size":137438953472,
+ "interleave_ways":1,
+ "region":"region0",
+ "dpa_resource":0,
+ "dpa_size":137438953472,
+ "mode":"ram"
+ }
+ ]
+ }
+ ],
+
+This chunk shows the endpoints attached to the host bridge :code:`port1`.
+
+:code:`endpoint5` contains a single configured decoder :code:`decoder5.0`
+which has the same interleave configuration as :code:`region0` (shown later).
+
+Next we have the decoders belonging to the host bridge:
+
+::
+
+ "decoders:port1":[
+ {
+ "decoder":"decoder1.0",
+ "resource":825975898112,
+ "size":137438953472,
+ "interleave_ways":1,
+ "region":"region0",
+ "nr_targets":1,
+ "targets":[
+ {
+ "target":"0000:d2:01.1",
+ "alias":"device:02",
+ "position":0,
+ "id":0
+ }
+ ]
+ }
+ ]
+ },
+
+Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`), whose only
+target is :code:`dport0` - which is attached to :code:`endpoint5`.
+
+The next chunk shows the three CXL host bridges without attached endpoints.
+
+::
+
+ {
+ "port":"port2",
+ "host":"pci0000:00",
+ "depth":1,
+ "nr_dports":2,
+ "dports":[
+ {
+ "dport":"0000:00:01.3",
+ "alias":"device:55",
+ "id":2
+ },
+ {
+ "dport":"0000:00:07.1",
+ "alias":"device:5d",
+ "id":113
+ }
+ ]
+ },
+ {
+ "port":"port3",
+ "host":"pci0000:a8",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:a8:01.1",
+ "alias":"device:c3",
+ "id":0
+ }
+ ]
+ },
+ {
+ "port":"port4",
+ "host":"pci0000:2a",
+ "depth":1,
+ "nr_dports":1,
+ "dports":[
+ {
+ "dport":"0000:2a:01.1",
+ "alias":"device:d0",
+ "id":0
+ }
+ ]
+ }
+ ],
+
+Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder
+is a pass-through decoder because :code:`interleave_ways` is set to :code:`1`.
+
+This information is generated by the CXL driver reading the ACPI CEDT CFMWS.
+
+::
+
+ "decoders:root0":[
+ {
+ "decoder":"decoder0.0",
+ "resource":825975898112,
+ "size":137438953472,
+ "interleave_ways":1,
+ "max_available_extent":0,
+ "volatile_capable":true,
+ "nr_targets":1,
+ "targets":[
+ {
+ "target":"pci0000:d2",
+ "alias":"ACPI0016:00",
+ "position":0,
+ "id":5
+ }
+ ],
+
+Finally we have the `Memory Region` associated with the `Root Decoder`
+:code:`decoder0.0`. This region describes the discrete region associated
+with the lone device.
+
+::
+
+ "regions:decoder0.0":[
+ {
+ "region":"region0",
+ "resource":825975898112,
+ "size":137438953472,
+ "type":"ram",
+ "interleave_ways":1,
+ "decode_state":"commit",
+ "mappings":[
+ {
+ "position":0,
+ "memdev":"mem0",
+ "decoder":"decoder5.0"
+ }
+ ]
+ }
+ ]
+ }
+ ]
+ }
+ ]
--
2.49.0
* [PATCH v3 11/17] cxl: docs/linux/dax-driver documentation
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add documentation on how the CXL driver interacts with the DAX driver.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../driver-api/cxl/linux/cxl-driver.rst | 115 ++++++++++++++++--
.../driver-api/cxl/linux/dax-driver.rst | 43 +++++++
3 files changed, 149 insertions(+), 10 deletions(-)
create mode 100644 Documentation/driver-api/cxl/linux/dax-driver.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index df3c7763c79a..f2127968ea78 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -36,6 +36,7 @@ that have impacts on each other. The docs here break up configurations steps.
linux/overview
linux/early-boot
linux/cxl-driver
+ linux/dax-driver
linux/access-coordinates
diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
index 486baf8551aa..cf6b397abdb1 100644
--- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -34,6 +34,32 @@ into a single memory region. The memory region has been converted to dax. ::
decoder1.0 decoder5.0 endpoint5 port1 region0
decoder2.0 decoder5.1 endpoint6 port2 root0
+
+.. kernel-render:: DOT
+ :alt: Digraph of CXL fabric describing host-bridge interleaving
+ :caption: Digraph of CXL fabric with a host-bridge interleave memory region
+
+ digraph foo {
+ "root0" -> "port1";
+ "root0" -> "port3";
+ "root0" -> "decoder0.0";
+ "port1" -> "endpoint5";
+ "port3" -> "endpoint6";
+ "port1" -> "decoder1.0";
+ "port3" -> "decoder3.0";
+ "endpoint5" -> "decoder5.0";
+ "endpoint6" -> "decoder6.0";
+ "decoder0.0" -> "region0";
+ "decoder0.0" -> "decoder1.0";
+ "decoder0.0" -> "decoder3.0";
+ "decoder1.0" -> "decoder5.0";
+ "decoder3.0" -> "decoder6.0";
+ "decoder5.0" -> "region0";
+ "decoder6.0" -> "region0";
+ "region0" -> "dax_region0";
+ "dax_region0" -> "dax0.0";
+ }
+
For this section we'll explore the devices present in this configuration, but
we'll explore more configurations in-depth in example configurations below.
@@ -41,7 +67,7 @@ Base Devices
------------
Most devices in a CXL fabric are a `port` of some kind (because each
device mostly routes request from one device to the next, rather than
-provide a manageable service).
+provide a direct service).
Root
~~~~
@@ -53,6 +79,8 @@ The Root contains links to:
* `Host Bridge Ports` defined by ACPI CEDT CHBS.
+* `Downstream Ports` typically connected to `Host Bridge Ports`.
+
* `Root Decoders` defined by ACPI CEDT CFMWS.
::
@@ -150,6 +178,27 @@ device configuration data. ::
driver label_storage_size pmem serial
firmware numa_node ram subsystem
+A Memory Device is a discrete base object that is not a port. While the
+physical device it belongs to may also host an `endpoint`, the relationship
+between an `endpoint` and a `memdev` is not captured in sysfs.
+
+Port Relationships
+~~~~~~~~~~~~~~~~~~
+In our example described above, there are four host bridges attached to the
+root, and two of the host bridges have one endpoint attached.
+
+.. kernel-render:: DOT
+ :alt: Digraph of port relationships between the root, host bridges, and endpoints
+ :caption: Digraph of CXL root, host bridge ports, and endpoints
+
+ digraph foo {
+ "root0" -> "port1";
+ "root0" -> "port2";
+ "root0" -> "port3";
+ "root0" -> "port4";
+ "port1" -> "endpoint5";
+ "port3" -> "endpoint6";
+ }
Decoders
--------
@@ -322,6 +371,29 @@ settings (granularity and ways must be the same).
Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
:code:`cxl_port` driver, and is created based on a PCI device's DVSEC registers.
+Decoder Relationships
+~~~~~~~~~~~~~~~~~~~~~
+In our example described above, there is one root decoder which routes memory
+accesses over two host bridges. Each host bridge has a decoder which routes
+accesses to its single endpoint target. Each endpoint has a decoder which
+translates HPA to DPA and services the memory request.
+
+The driver validates relationships between ports by decoder programming, so
+we can think of decoders being related in a similarly hierarchical fashion to
+ports.
+
+.. kernel-render:: DOT
+ :alt: Digraph of hierarchical relationship between root, switch, and endpoint decoders.
+ :caption: Digraph of CXL root, switch, and endpoint decoders.
+
+ digraph foo {
+ "root0" -> "decoder0.0";
+ "decoder0.0" -> "decoder1.0";
+ "decoder0.0" -> "decoder3.0";
+ "decoder1.0" -> "decoder5.0";
+ "decoder3.0" -> "decoder6.0";
+ }
+
Regions
-------
@@ -348,6 +420,17 @@ The interleave settings in a `Memory Region` describe the configuration of the
`Interleave Set` - and are what can be expected to be seen in the endpoint
interleave settings.
+.. kernel-render:: DOT
+ :alt: Digraph of CXL memory region relationships between root and endpoint decoders.
+ :caption: Regions are created based on root decoder configurations. Endpoint decoders
+ must be programmed with the same interleave settings as the region.
+
+ digraph foo {
+ "root0" -> "decoder0.0";
+ "decoder0.0" -> "region0";
+ "region0" -> "decoder5.0";
+ "region0" -> "decoder6.0";
+ }
DAX Region
~~~~~~~~~~
@@ -360,7 +443,6 @@ for more details. ::
dax0.0 devtype modalias uevent
dax_region driver subsystem
-
Mailbox Interfaces
------------------
A mailbox command interface for each device is exposed in ::
@@ -418,17 +500,30 @@ the relationships between a decoder and it's parent.
For example, in a `Cross-Link First` interleave setup with 16 endpoints
attached to 4 host bridges, linux expects the following ways/granularity
-across the root, host bridge, and endpoints respectively. ::
+across the root, host bridge, and endpoints respectively.
+
+.. flat-table:: 4x4 cross-link first interleave settings
+
+ * - decoder
+ - ways
+ - granularity
- ways granularity
- root 4 256
- host bridge 4 1024
- endpoint 16 256
+ * - root
+ - 4
+ - 256
+
+ * - host bridge
+ - 4
+ - 1024
+
+ * - endpoint
+ - 16
+ - 256
At the root, every a given access will be routed to the
:code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, every
-:code:`((HPA / 1024) % 4)th` target endpoint. Each endpoint will translate
-the access based on the entire 16 device interleave set.
+:code:`((HPA / 1024) % 4)th` target endpoint. Each endpoint translates based
+on the entire 16 device interleave set.
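
To make the routing arithmetic concrete, here is a minimal C sketch of the two
modulo selections for the 4x4 cross-link first settings listed above. The helper
names are hypothetical, and the sketch assumes the HPA is expressed as an offset
from the start of the region; it illustrates the formulas in the text rather
than the driver's actual decode path.

```c
#include <stdint.h>

/* Interleave settings from the 4x4 cross-link first example. */
#define ROOT_WAYS        4u
#define ROOT_GRANULARITY 256u
#define HB_WAYS          4u
#define HB_GRANULARITY   1024u

/* Index of the host bridge selected at the root: (HPA / 256) % 4. */
static inline unsigned int root_target(uint64_t hpa)
{
    return (unsigned int)((hpa / ROOT_GRANULARITY) % ROOT_WAYS);
}

/* Index of the endpoint selected within a host bridge: (HPA / 1024) % 4. */
static inline unsigned int hb_target(uint64_t hpa)
{
    return (unsigned int)((hpa / HB_GRANULARITY) % HB_WAYS);
}
```

Under these assumptions, consecutive 256-byte chunks rotate across the four
host bridges, and the endpoint index within a host bridge advances once every
1024 bytes.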
Unbalanced interleave sets are not supported - decoders at a similar point
in the hierarchy (e.g. all host bridge decoders) must have the same ways and
@@ -467,7 +562,7 @@ In this example, the CFMWS defines two discrete non-interleaved 4GB regions
for each host bridge, and one interleaved 8GB region that targets both. This
would result in 3 root decoders presenting in the root. ::
- # ls /sys/bus/cxl/devices/root0
+ # ls /sys/bus/cxl/devices/root0/decoder*
decoder0.0 decoder0.1 decoder0.2
# cat /sys/bus/cxl/devices/decoder0.0/target_list start size
diff --git a/Documentation/driver-api/cxl/linux/dax-driver.rst b/Documentation/driver-api/cxl/linux/dax-driver.rst
new file mode 100644
index 000000000000..10d953a2167b
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/dax-driver.rst
@@ -0,0 +1,43 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+DAX Driver Operation
+====================
+The `Direct Access Device` driver was originally designed to provide a
+memory-like access mechanism to memory-like block-devices. It was
+extended to support CXL Memory Devices, which provide user-configured
+memory devices.
+
+The CXL subsystem depends on the DAX subsystem to either:
+
+- Generate a file-like interface to userland via :code:`/dev/daxN.Y`, or
+- Engage the memory-hotplug interface to add CXL memory to the page allocator.
+
+The DAX subsystem exposes this ability through the `cxl_dax_region` driver.
+A `dax_region` provides the translation between a CXL `memory_region` and
+a `DAX Device`.
+
+DAX Device
+==========
+A `DAX Device` is a file-like interface exposed in :code:`/dev/daxN.Y`. A
+memory region exposed as a DAX device can be accessed by userland software
+via the :code:`mmap()` system call. The result is direct mappings to the
+CXL capacity in the task's page tables.
+
+Users wishing to manually handle allocation of CXL memory should use this
+interface.
+
+kmem conversion
+===============
+The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
+memory blocks` managed by :code:`mm/memory_hotplug.c`. This capacity
+will be exposed to the kernel page allocator in the user-selected memory
+zone.
+
+The :code:`memmap_on_memory` setting (both global and DAX device local)
+dictates where the kernel will allocate the :code:`struct folio` descriptors
+for this memory. If :code:`memmap_on_memory` is set, memory hotplug will set
+aside a portion of the memory block capacity to allocate folios. If unset,
+the memory is allocated via a normal :code:`GFP_KERNEL` allocation, and as a
+result will most likely land on the local NUMA node of the CPU executing the
+hotplug operation.
--
2.49.0
* [PATCH v3 12/17] cxl: docs/linux/memory-hotplug
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (10 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 11/17] cxl: docs/linux/dax-driver documentation Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 16:21 ` [PATCH v3 13/17] cxl: docs/allocation/dax Gregory Price
` (5 subsequent siblings)
17 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add documentation on how the CXL driver surfaces memory through the
DAX driver and memory-hotplug.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../driver-api/cxl/linux/memory-hotplug.rst | 78 +++++++++++++++++++
2 files changed, 79 insertions(+)
create mode 100644 Documentation/driver-api/cxl/linux/memory-hotplug.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index f2127968ea78..35c5b0c6f95e 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -37,6 +37,7 @@ that have impacts on each other. The docs here break up configurations steps.
linux/early-boot
linux/cxl-driver
linux/dax-driver
+ linux/memory-hotplug
linux/access-coordinates
diff --git a/Documentation/driver-api/cxl/linux/memory-hotplug.rst b/Documentation/driver-api/cxl/linux/memory-hotplug.rst
new file mode 100644
index 000000000000..af368c2bc9cf
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/memory-hotplug.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Memory Hotplug
+==============
+The final phase of surfacing CXL memory to the kernel page allocator is for
+the `DAX` driver to surface a `Driver Managed` memory region via the
+memory-hotplug component.
+
+There are four major configurations to consider:
+
+1) Default Online Behavior (on/off and zone)
+2) Hotplug Memory Block size
+3) Memory Map Resource location
+4) Driver-Managed Memory Designation
+
+Default Online Behavior
+=======================
+The default-online behavior of hotplug memory is dictated by the following,
+in order of precedence:
+
+- :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE` Build Configuration
+- :code:`memhp_default_state` Boot parameter
+- :code:`/sys/devices/system/memory/auto_online_blocks` value
+
+These dictate whether hotplugged memory blocks arrive in one of three states:
+
+1) Offline
+2) Online in :code:`ZONE_NORMAL`
+3) Online in :code:`ZONE_MOVABLE`
+
+:code:`ZONE_NORMAL` implies this capacity may be used for almost any allocation,
+while :code:`ZONE_MOVABLE` implies this capacity should only be used for
+migratable allocations.
+
+:code:`ZONE_MOVABLE` attempts to retain the hotplug-ability of a memory block
+so that the entire region may be hot-unplugged at a later time. Any capacity
+onlined into :code:`ZONE_NORMAL` should be considered permanently attached to
+the page allocator.
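
As an illustration of the three arrival states, the following C sketch
classifies the string read from
:code:`/sys/devices/system/memory/auto_online_blocks`. The enum and function
names are hypothetical. The sketch assumes the value is one of `offline`,
`online`, `online_kernel`, or `online_movable`, and treats `online_kernel` as
the :code:`ZONE_NORMAL` case described above.

```c
#include <string.h>

/* Hypothetical labels for the states a hotplugged block can arrive in. */
enum online_state {
    STATE_OFFLINE,        /* "offline": block is added but not onlined */
    STATE_ONLINE_AUTO,    /* "online": onlined, zone chosen by the kernel */
    STATE_ONLINE_NORMAL,  /* "online_kernel": onlined into a kernel zone */
    STATE_ONLINE_MOVABLE, /* "online_movable": onlined into ZONE_MOVABLE */
    STATE_UNKNOWN
};

/* Compare the first len bytes of value against the whole of want. */
static inline int token_is(const char *value, size_t len, const char *want)
{
    return strlen(want) == len && strncmp(value, want, len) == 0;
}

/* Classify a sysfs value, ignoring the trailing newline sysfs appends. */
static inline enum online_state parse_auto_online(const char *value)
{
    size_t len = strcspn(value, "\n");

    if (token_is(value, len, "offline"))
        return STATE_OFFLINE;
    if (token_is(value, len, "online"))
        return STATE_ONLINE_AUTO;
    if (token_is(value, len, "online_kernel"))
        return STATE_ONLINE_NORMAL;
    if (token_is(value, len, "online_movable"))
        return STATE_ONLINE_MOVABLE;
    return STATE_UNKNOWN;
}
```

A caller would read the sysfs file into a buffer and pass it to
:code:`parse_auto_online()` before deciding whether blocks still need to be
onlined by hand.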
+
+Hotplug Memory Block Size
+=========================
+By default, on most architectures, the Hotplug Memory Block Size is either
+128MB or 256MB. On x86, the block size increases up to 2GB as total memory
+capacity exceeds 64GB. As of v6.15, Linux does not take into account the
+size and alignment of the ACPI CEDT CFMWS regions (see Early Boot docs) when
+deciding the Hotplug Memory Block Size.
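
Because the block size is chosen without regard to the CFMWS layout, it can be
useful to check how much of a window is actually coverable by whole memory
blocks. The C sketch below is a rough model, assuming that only capacity
spanning complete, block-aligned memory blocks can be onlined and that the
block size is a power of two (the running value is reported in hex by
:code:`/sys/devices/system/memory/block_size_bytes`). The function name is
hypothetical.

```c
#include <stdint.h>

/*
 * Bytes of [base, base + size) that fall on whole, aligned memory blocks.
 * block_size must be a power of two.
 */
static inline uint64_t hotpluggable_bytes(uint64_t base, uint64_t size,
                                          uint64_t block_size)
{
    uint64_t mask = block_size - 1;
    uint64_t start = (base + mask) & ~mask;   /* round base up */
    uint64_t end = (base + size) & ~mask;     /* round end down */

    return end > start ? end - start : 0;
}
```

For example, a 4GB window that starts 256MB past a 2GB boundary keeps all of
its capacity with 256MB blocks, but only 2GB of it with 2GB blocks under this
model.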
+
+Memory Map
+==========
+The location of the :code:`struct folio` allocations that represent the
+hotplugged memory capacity is dictated by the following system settings:
+
+- :code:`/sys/module/memory_hotplug/parameters/memmap_on_memory`
+- :code:`/sys/bus/dax/devices/daxN.Y/memmap_on_memory`
+
+If both of these parameters are set to true, :code:`struct folio` for this
+capacity will be carved out of the memory block being onlined. This has
+performance implications if the memory is particularly high-latency and
+its :code:`struct folio` becomes hotly contended.
+
+If either parameter is set to false, :code:`struct folio` for this capacity
+will be allocated from the local node of the processor running the hotplug
+procedure. This capacity will be allocated from :code:`ZONE_NORMAL` on
+that node, as it is a :code:`GFP_KERNEL` allocation.
+
+Systems with extremely large amounts of :code:`ZONE_MOVABLE` memory (e.g.
+CXL memory pools) must ensure that there is sufficient local
+:code:`ZONE_NORMAL` capacity to host the memory map for the hotplugged capacity.
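
To get a feel for how much :code:`ZONE_NORMAL` capacity the memory map can
consume, the C sketch below estimates the descriptor overhead for a given
amount of hotplugged capacity. The 4KB page size and 64-byte descriptor size
are assumptions (typical for x86_64, but configuration dependent), and the
function name is hypothetical.

```c
#include <stdint.h>

/* Assumed values: 4KB base pages and 64 bytes of descriptor per page. */
#define ASSUMED_PAGE_SIZE 4096ULL
#define ASSUMED_DESC_SIZE 64ULL

/* Estimated bytes of memory map needed to describe capacity_bytes. */
static inline uint64_t memmap_bytes(uint64_t capacity_bytes)
{
    return (capacity_bytes / ASSUMED_PAGE_SIZE) * ASSUMED_DESC_SIZE;
}
```

Under these assumptions, 1TB of CXL capacity needs roughly 16GB of memory map,
which must come either from the hotplugged blocks themselves or from local
:code:`ZONE_NORMAL` memory.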
+
+Driver Managed Memory
+=====================
+The DAX driver surfaces this memory to memory-hotplug as "Driver Managed". This
+is not a configurable setting, but it's important to note that driver managed
+memory is explicitly excluded from use during kexec. This is required to ensure
+any reset or out-of-band operations that the CXL device may be subject to during
+a functional system-reboot (such as a reset-on-probe) will not cause portions of
+the kexec kernel to be overwritten.
--
2.49.0
* [PATCH v3 13/17] cxl: docs/allocation/dax
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (11 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 12/17] cxl: docs/linux/memory-hotplug Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 16:21 ` [PATCH v3 14/17] cxl: docs/allocation/page-allocator Gregory Price
` (4 subsequent siblings)
17 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Small example of accessing CXL memory capacity via DAX device
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../driver-api/cxl/allocation/dax.rst | 60 +++++++++++++++++++
Documentation/driver-api/cxl/index.rst | 5 ++
2 files changed, 65 insertions(+)
create mode 100644 Documentation/driver-api/cxl/allocation/dax.rst
diff --git a/Documentation/driver-api/cxl/allocation/dax.rst b/Documentation/driver-api/cxl/allocation/dax.rst
new file mode 100644
index 000000000000..c6f7a5da832f
--- /dev/null
+++ b/Documentation/driver-api/cxl/allocation/dax.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+DAX Devices
+===========
+CXL capacity exposed as a DAX device can be accessed directly via mmap.
+Users may wish to use this interface mechanism to write their own userland
+CXL allocator, or to manage shared or persistent memory regions across multiple
+hosts.
+
+If the capacity is shared across hosts or persistent, appropriate flushing
+mechanisms must be employed unless the region supports Snoop Back-Invalidate.
+
+Note that mappings must be aligned (size and base) to the dax device's base
+alignment, which is typically 2MB - but may be configured larger.
+
+::
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <stdint.h>
+ #include <sys/mman.h>
+ #include <fcntl.h>
+ #include <unistd.h>
+
+ #define DEVICE_PATH "/dev/dax0.0" // Replace DAX device path
+ #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) // 4GB
+
+ int main(void) {
+ int fd;
+ void* mapped_addr;
+
+ /* Open the DAX device */
+ fd = open(DEVICE_PATH, O_RDWR);
+ if (fd < 0) {
+ perror("open");
+ return -1;
+ }
+
+ /* Map the device into memory */
+ mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+ if (mapped_addr == MAP_FAILED) {
+ perror("mmap");
+ close(fd);
+ return -1;
+ }
+
+ printf("Mapped address: %p\n", mapped_addr);
+
+ /* You can now access the device through the mapped address */
+ uint64_t* ptr = (uint64_t*)mapped_addr;
+ *ptr = 0x1234567890abcdef; // Write a value to the device
+ /* Cast so the arguments match the %p and %llx conversions */
+ printf("Value at address %p: 0x%016llx\n", (void*)ptr,
+ (unsigned long long)*ptr);
+
+ /* Clean up */
+ munmap(mapped_addr, DEVICE_SIZE);
+ close(fd);
+ return 0;
+ }
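
The alignment requirement noted above can be handled with a few small helpers.
This is a minimal C sketch, assuming the alignment is a power of two and
defaulting to the typical 2MB value; the macro and function names are
hypothetical, and a real program would use the alignment actually configured
for its device.

```c
#include <stdint.h>

/* Assumed default: 2MB, the typical DAX device alignment. */
#define ASSUMED_DAX_ALIGN (2ULL * 1024 * 1024)

/* Round value down to a multiple of align (align must be a power of two). */
static inline uint64_t align_down(uint64_t value, uint64_t align)
{
    return value & ~(align - 1);
}

/* Round value up to a multiple of align (align must be a power of two). */
static inline uint64_t align_up(uint64_t value, uint64_t align)
{
    return (value + align - 1) & ~(align - 1);
}

/* Non-zero if value is a multiple of align. */
static inline int is_aligned(uint64_t value, uint64_t align)
{
    return (value & (align - 1)) == 0;
}
```

Both the offset and the length passed to :code:`mmap()` would be checked with
:code:`is_aligned()` (or rounded with the other helpers) before mapping.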
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 35c5b0c6f95e..6e7497f4811a 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -40,5 +40,10 @@ that have impacts on each other. The docs here break up configurations steps.
linux/memory-hotplug
linux/access-coordinates
+.. toctree::
+ :maxdepth: 2
+ :caption: Memory Allocation
+
+ allocation/dax
.. only:: subproject and html
--
2.49.0
* [PATCH v3 14/17] cxl: docs/allocation/page-allocator
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (12 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 13/17] cxl: docs/allocation/dax Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 16:34 ` Matthew Wilcox
2025-05-12 16:21 ` [PATCH v3 15/17] cxl: docs/allocation/reclaim Gregory Price
` (3 subsequent siblings)
17 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Document some interesting interactions that occur when exposing CXL
memory capacity to the page allocator.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../cxl/allocation/page-allocator.rst | 85 +++++++++++++++++++
Documentation/driver-api/cxl/index.rst | 1 +
2 files changed, 86 insertions(+)
create mode 100644 Documentation/driver-api/cxl/allocation/page-allocator.rst
diff --git a/Documentation/driver-api/cxl/allocation/page-allocator.rst b/Documentation/driver-api/cxl/allocation/page-allocator.rst
new file mode 100644
index 000000000000..7b8fe1b8d5bb
--- /dev/null
+++ b/Documentation/driver-api/cxl/allocation/page-allocator.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+The Page Allocator
+==================
+
+The kernel page allocator services all general page allocation requests, such
+as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
+allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
+placed in.
+
+This section mostly focuses on how these configurations affect the page
+allocator (as of Linux v6.15) rather than the overall page allocator behavior.
+
+NUMA nodes and mempolicy
+========================
+Unless a task explicitly registers a mempolicy, the default memory policy
+of the Linux kernel is to allocate memory from the `local NUMA node` first,
+and fall back to other nodes only if the local node is pressured.
+
+Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
+with the CXL memory being non-local. Technically, however, it is possible
+for a compute node to have no local DRAM, and for CXL memory to be the
+`local` capacity for that compute node.
+
+
+Memory Zones
+============
+CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.
+
+As of v6.15, the page allocator attempts to allocate from the highest
+available and compatible ZONE for an allocation from the local node first.
+
+An example of a `zone incompatibility` is attempting to service an allocation
+marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
+typically not migratable, and as a result can only be serviced from
+:code:`ZONE_NORMAL` or lower.
+
+To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
+:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
+will fall back to allocating from :code:`ZONE_NORMAL`.
+
+
+Zone and Node Quirks
+====================
+Let's consider a configuration where the local DRAM capacity is largely onlined
+into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
+CXL capacity has the opposite configuration - all onlined in
+:code:`ZONE_MOVABLE`.
+
+Under the default allocation policy, the page allocator will completely skip
+:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
+Linux v6.15, the page allocator does (approximately) the following: ::
+
+ for (each zone in local_node):
+
+ for (each node in fallback_order):
+
+ attempt_allocation(gfp_flags);
+
+Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
+functionally unreachable for direct allocation. As a result, the only way
+for CXL capacity to be used is via `demotion` in the reclaim path.
+
+This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
+capacity - when that capacity is depleted, the page allocator will actually
+prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.
+
+We may wish to invert this priority in future Linux versions.
+
+If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
+when the DRAM nodes are depleted. See the reclaim section for more details.
+
+
+CGroups and CPUSets
+===================
+Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
+in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
+containers to limit the accessibility of certain NUMA nodes for tasks in that
+container. Users may wish to utilize this in multi-tenant systems where some
+tasks prefer not to use slower memory.
+
+In the reclaim section we'll discuss some limitations of this interface to
+prevent demotions of shared data to CXL memory (if demotions are enabled).
+
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 6e7497f4811a..7acab7e7df96 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -45,5 +45,6 @@ that have impacts on each other. The docs here break up configurations steps.
:caption: Memory Allocation
allocation/dax
+ allocation/page-allocator
.. only:: subproject and html
--
2.49.0
* [PATCH v3 15/17] cxl: docs/allocation/reclaim
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (13 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 14/17] cxl: docs/allocation/page-allocator Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 16:21 ` [PATCH v3 16/17] cxl: docs/allocation/hugepages Gregory Price
` (2 subsequent siblings)
17 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Document a bit about how reclaim interacts with various CXL
configurations.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../driver-api/cxl/allocation/reclaim.rst | 51 +++++++++++++++++++
Documentation/driver-api/cxl/index.rst | 1 +
2 files changed, 52 insertions(+)
create mode 100644 Documentation/driver-api/cxl/allocation/reclaim.rst
diff --git a/Documentation/driver-api/cxl/allocation/reclaim.rst b/Documentation/driver-api/cxl/allocation/reclaim.rst
new file mode 100644
index 000000000000..f40f1cae391a
--- /dev/null
+++ b/Documentation/driver-api/cxl/allocation/reclaim.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+Reclaim
+=======
+Another way CXL memory can be utilized *indirectly* is via the reclaim system
+in :code:`mm/vmscan.c`. Reclaim is engaged when memory capacity on the system
+becomes pressured based on global and cgroup-local `watermark` settings.
+
+In this section we won't discuss the `watermark` configurations, just how CXL
+memory can be consumed by various pieces of the reclaim system.
+
+Demotion
+========
+By default, the reclaim system will prefer swap (or zswap) when reclaiming
+memory. Enabling :code:`/sys/kernel/mm/numa/demotion_enabled` will cause vmscan
+to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity
+is available.
+
+Demotion engages the :code:`mm/memory-tiers.c` component to determine the
+next demotion node. The next demotion node is based on the :code:`HMAT`
+or :code:`CDAT` performance data.
+
+cpusets.mems_allowed quirk
+--------------------------
+In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
+when migrating pages. As a result, if demotion is enabled, vmscan cannot
+guarantee isolation of a container's memory from nodes not set in mems_allowed.
+
+In Linux v6.XX and up, demotion does attempt to respect
+:code:`cpusets.mems_allowed`; however, certain classes of shared memory
+originally instantiated by another cgroup (such as common libraries - e.g.
+libc) may still be demoted. As a result, the mems_allowed interface still
+cannot provide perfect isolation from the remote nodes.
+
+ZSwap and Node Preference
+=========================
+In Linux v6.15 and below, ZSwap allocates memory from the local node of the
+processor for the new pages being compressed. Since pages being compressed
+are typically cold, the result is that a cold page is promoted, only to
+be demoted again later as it ages off the LRU.
+
+In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
+as the allocation target for the compression page. This helps prevent
+thrashing.
+
+Demotion with ZSwap
+===================
+When enabling both Demotion and ZSwap, you create a situation where ZSwap
+will prefer the slowest form of CXL memory by default until that tier of
+memory is exhausted.
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 7acab7e7df96..d3ab928d4d7c 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -46,5 +46,6 @@ that have impacts on each other. The docs here break up configurations steps.
allocation/dax
allocation/page-allocator
+ allocation/reclaim
.. only:: subproject and html
--
2.49.0
* [PATCH v3 16/17] cxl: docs/allocation/hugepages
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (14 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 15/17] cxl: docs/allocation/reclaim Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-12 16:21 ` [PATCH v3 17/17] cxl: docs - add self-referencing cross-links Gregory Price
2025-05-13 20:38 ` [PATCH v3 00/17] CXL Boot to Bash Documentation Dave Jiang
17 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
Add docs on how CXL capacity interacts with CMA and HugeTLB allocation
interfaces.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../driver-api/cxl/allocation/hugepages.rst | 32 +++++++++++++++++++
Documentation/driver-api/cxl/index.rst | 1 +
2 files changed, 33 insertions(+)
create mode 100644 Documentation/driver-api/cxl/allocation/hugepages.rst
diff --git a/Documentation/driver-api/cxl/allocation/hugepages.rst b/Documentation/driver-api/cxl/allocation/hugepages.rst
new file mode 100644
index 000000000000..1023c6922829
--- /dev/null
+++ b/Documentation/driver-api/cxl/allocation/hugepages.rst
@@ -0,0 +1,32 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Huge Pages
+==========
+
+Contiguous Memory Allocator
+===========================
+CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
+as the NUMA node hosting that capacity will be `Online` at the time CMA
+carves out contiguous capacity.
+
+CXL Memory deferred to the CXL Driver for configuration cannot have its
+capacity allocated by CMA - as the NUMA node hosting the capacity is `Offline`
+at :code:`__init` time - when CMA carves out contiguous capacity.
+
+HugeTLB
+=======
+Different huge page sizes allow different memory configurations.
+
+2MB Huge Pages
+--------------
+All CXL capacity regardless of configuration time or memory zone is eligible
+for use as 2MB huge pages.
+
+1GB Huge Pages
+--------------
+CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page
+allocation.
+
+CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic
+Page allocation.
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index d3ab928d4d7c..366faf851fc7 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -47,5 +47,6 @@ that have impacts on each other. The docs here break up configurations steps.
allocation/dax
allocation/page-allocator
allocation/reclaim
+ allocation/hugepages
.. only:: subproject and html
--
2.49.0
* [PATCH v3 17/17] cxl: docs - add self-referencing cross-links
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (15 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 16/17] cxl: docs/allocation/hugepages Gregory Price
@ 2025-05-12 16:21 ` Gregory Price
2025-05-13 20:38 ` [PATCH v3 00/17] CXL Boot to Bash Documentation Dave Jiang
17 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:21 UTC (permalink / raw)
To: linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet, Bagas Sanjaya
Add some crosslinks between pages in the CXL docs - mostly to the
ACPI tables.
Suggested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../driver-api/cxl/devices/device-types.rst | 2 +-
.../cxl/linux/access-coordinates.rst | 7 ++--
.../driver-api/cxl/linux/cxl-driver.rst | 35 ++++++++++---------
.../driver-api/cxl/linux/early-boot.rst | 32 ++++++++++-------
.../driver-api/cxl/platform/bios-and-efi.rst | 12 +++----
.../example-configurations/flexible.rst | 10 +++---
.../example-configurations/hb-interleave.rst | 10 +++---
.../multi-dev-per-hb.rst | 10 +++---
.../example-configurations/one-dev-per-hb.rst | 10 +++---
9 files changed, 69 insertions(+), 59 deletions(-)
diff --git a/Documentation/driver-api/cxl/devices/device-types.rst b/Documentation/driver-api/cxl/devices/device-types.rst
index c70564cf0be3..f5e4330c1cfe 100644
--- a/Documentation/driver-api/cxl/devices/device-types.rst
+++ b/Documentation/driver-api/cxl/devices/device-types.rst
@@ -115,7 +115,7 @@ A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical
device to multiple heads which may be connected to one or more discrete
hosts. An example of this would be a simple memory-pool which may be
statically configured (prior to boot) to expose portions of its memory
-to Linux via the CEDT ACPI table.
+to Linux via :doc:`CEDT <../platform/acpi/cedt>`.
MHMLD
~~~~~
diff --git a/Documentation/driver-api/cxl/linux/access-coordinates.rst b/Documentation/driver-api/cxl/linux/access-coordinates.rst
index e408ecbc4038..71024fa0f561 100644
--- a/Documentation/driver-api/cxl/linux/access-coordinates.rst
+++ b/Documentation/driver-api/cxl/linux/access-coordinates.rst
@@ -24,7 +24,7 @@ asymmetry in properties does not happen and all paths to EPs are equal.
There can be multiple switches under an RP. There can be multiple RPs under
a CXL Host Bridge (HB). There can be multiple HBs under a CXL Fixed Memory
-Window Structure (CFMWS).
+Window Structure (CFMWS) in the :doc:`CEDT <../platform/acpi/cedt>`.
An example hierarchy::
@@ -83,8 +83,9 @@ also the index for the resulting xarray.
The next step is to take the min() of the per host bridge bandwidth and the
bandwidth from the Generic Port (GP). The bandwidths for the GP are retrieved
-via ACPI tables SRAT/HMAT. The minimum bandwidth are aggregated under the same
-ACPI0017 device to form a new xarray.
+via ACPI tables (:doc:`SRAT <../platform/acpi/srat>` and
+:doc:`HMAT <../platform/acpi/hmat>`). The minimum bandwidth are aggregated
+under the same ACPI0017 device to form a new xarray.
Finally, the cxl_region_update_bandwidth() is called and the aggregated
bandwidth from all the members of the last xarray is updated for the
diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
index cf6b397abdb1..9759e90c3cf1 100644
--- a/Documentation/driver-api/cxl/linux/cxl-driver.rst
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -77,11 +77,11 @@ Root Object` Device Class is found.
The Root contains links to:
-* `Host Bridge Ports` defined by ACPI CEDT CHBS.
+* `Host Bridge Ports` defined by CHBS in the :doc:`CEDT<../platform/acpi/cedt>`
* `Downstream Ports` typically connected to `Host Bridge Ports`.
-* `Root Decoders` defined by ACPI CEDT CFMWS.
+* `Root Decoders` defined by CFMWS in the :doc:`CEDT<../platform/acpi/cedt>`
::
@@ -150,9 +150,8 @@ An `endpoint` is a terminal port in the fabric. This is a `logical device`,
and may be one of many `logical devices` presented by a memory device. It
is still considered a type of `port` in the fabric.
-An `endpoint` contains `endpoint decoders` available for use and the
-*Coherent Device Attribute Table* (CDAT) used to describe the capabilities
-of the device. ::
+An `endpoint` contains `endpoint decoders` and the device's Coherent Device
+Attribute Table (which describes the device's capabilities). ::
# ls /sys/bus/cxl/devices/endpoint5
CDAT decoders_committed modalias uevent
@@ -247,17 +246,18 @@ parameter.
Root Decoder
~~~~~~~~~~~~
A `Root Decoder` is logical construct of the physical address and interleave
-configurations present in the ACPI CEDT CFMWS. Linux presents this information
-as a decoder present in the `CXL Root`. We consider this a `Root Decoder`,
-though technically it exists on the boundary of the CXL specification and
-platform-specific CXL root implementations.
+configurations present in the CFMWS field of the :doc:`CEDT
+<../platform/acpi/cedt>`.
+Linux presents this information as a decoder present in the `CXL Root`. We
+consider this a `Root Decoder`, though technically it exists on the boundary
+of the CXL specification and platform-specific CXL root implementations.
Linux considers these logical decoders a type of `Routing Decoder`, and is the
first decoder in the CXL fabric to receive a memory access from the platform's
memory controllers.
`Root Decoders` are created during :code:`cxl_acpi_probe`. One root decoder
-is created per CFMWS entry in the ACPI CEDT.
+is created per CFMWS entry in the :doc:`CEDT <../platform/acpi/cedt>`.
The :code:`target_list` parameter is filled by the CFMWS target fields. Targets
of a root decoder are `Host Bridges`, which means interleave done at the root
@@ -267,9 +267,11 @@ Only root decoders are capable of `Inter-Host-Bridge Interleave`.
Such interleaves must be configured by the platform and described in the ACPI
CEDT CFMWS, as the target CXL host bridge UIDs in the CFMWS must match the CXL
-host bridge UIDs in the ACPI CEDT CHBS and ACPI DSDT.
+host bridge UIDs in the CHBS field of the :doc:`CEDT
+<../platform/acpi/cedt>` and the UID field of CXL Host Bridges defined in
+the :doc:`DSDT <../platform/acpi/dsdt>`.
-Interleave settings in a rootdecoder describe how to interleave accesses among
+Interleave settings in a root decoder describe how to interleave accesses among
the *immediate downstream targets*, not the entire interleave set.
The memory range described in the root decoder is used to
@@ -531,10 +533,11 @@ granularity configuration.
At Root
~~~~~~~
-Root decoder interleave is defined by the ACPI CEDT CFMWS. The CEDT
-may actually define multiple CFMWS configurations to describe the same
-physical capacity - with the intent to allow users to decide at runtime
-whether to online memory as interleaved or non-interleaved. ::
+Root decoder interleave is defined by the CFMWS field of the :doc:`CEDT
+<../platform/acpi/cedt>`. The CEDT may actually define multiple CFMWS
+configurations to describe the same physical capacity, with the intent to allow
+users to decide at runtime whether to online memory as interleaved or
+non-interleaved. ::
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Window base address : 0000000100000000
diff --git a/Documentation/driver-api/cxl/linux/early-boot.rst b/Documentation/driver-api/cxl/linux/early-boot.rst
index 8c1c497bc772..a7fc6fc85fbe 100644
--- a/Documentation/driver-api/cxl/linux/early-boot.rst
+++ b/Documentation/driver-api/cxl/linux/early-boot.rst
@@ -12,8 +12,9 @@ read EFI and ACPI information throughout this process to configure logical
representations of the devices.
During Linux Early Boot stage (functions in the kernel that have the __init
-decorator), the system takes the resources created by EFI/BIOS (ACPI tables)
-and turns them into resources that the kernel can consume.
+decorator), the system takes the resources created by EFI/BIOS
+(:doc:`ACPI tables <../platform/acpi>`) and turns them into resources that the
+kernel can consume.
BIOS, Build and Boot Options
@@ -70,13 +71,14 @@ significant impact performance depending on the memory capacity of the system.
NUMA Node Reservation
=====================
-Linux refers to the proximity domains (:code:`PXM`) defined in the SRAT to
-create NUMA nodes in :code:`acpi_numa_init`. Typically, there is a 1:1 relation
-between :code:`PXM` and NUMA node IDs.
+Linux refers to the proximity domains (:code:`PXM`) defined in the :doc:`SRAT
+<../platform/acpi/srat>` to create NUMA nodes in :code:`acpi_numa_init`.
+Typically, there is a 1:1 relation between :code:`PXM` and NUMA node IDs.
-SRAT is the only ACPI defined way of defining Proximity Domains. Linux chooses
-to, at most, map those 1:1 with NUMA nodes. CEDT adds a description of SPA
-ranges which Linux may wish to map to one or more NUMA nodes.
+The SRAT is the only ACPI defined way of defining Proximity Domains. Linux
+chooses to, at most, map those 1:1 with NUMA nodes.
+:doc:`CEDT <../platform/acpi/cedt>` adds a description of SPA ranges which
+Linux may map to one or more NUMA nodes.
If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM`
is created (as of v6.15). In the future, Linux may reject CFMWS not described
@@ -89,7 +91,8 @@ data for Linux to identify NUMA nodes their associated memory regions.
The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.
-See the Example Platform Configurations section for more information.
+See :doc:`Example Platform Configurations <../platform/example-configs>`
+for more info.
Memory Tiers Creation
=====================
@@ -108,10 +111,13 @@ Tier membership can be inspected in ::
/sys/devices/virtual/memory_tiering/memory_tierN/nodelist
0-1
-If nodes are grouped which have clear difference in performance, check the HMAT
-and CDAT information for the CXL nodes. All nodes default to the DRAM tier,
-unless HMAT/CDAT information is reported to the memory_tier component via
-`access_coordinates`.
+If nodes with a clear difference in performance are grouped together, check the
+:doc:`HMAT <../platform/acpi/hmat>` and CDAT information for the CXL nodes. All
+nodes default to the DRAM tier, unless HMAT/CDAT information is reported to the
+memory_tier component via `access_coordinates`.
+
+For more, see :doc:`CXL access coordinates documentation
+<../linux/access-coordinates>`.
Contiguous Memory Allocation
============================
diff --git a/Documentation/driver-api/cxl/platform/bios-and-efi.rst b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
index 552a83992bcc..645322632cc9 100644
--- a/Documentation/driver-api/cxl/platform/bios-and-efi.rst
+++ b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
@@ -22,7 +22,7 @@ At a high level, this is what occurs during this phase of configuration.
Much of what this section is concerned with is ACPI Table production and
static memory map configuration. More detail on these tables can be found
-under Platform Configuration -> ACPI Table Reference.
+at :doc:`ACPI Tables <acpi>`.
.. note::
Platform Vendors should read carefully, as this sections has recommendations
@@ -175,9 +175,9 @@ to implement driver support for your platform.
Interleave and Configuration Flexibility
----------------------------------------
-If providing cross-host-bridge interleave, a CFMWS entry in the CEDT must be
-presented with target host-bridges for the interleaved device sets (there may
-be multiple behind each host bridge).
+If providing cross-host-bridge interleave, a CFMWS entry in the :doc:`CEDT
+<acpi/cedt>` must be presented with target host-bridges for the interleaved
+device sets (there may be multiple behind each host bridge).
If providing intra-host-bridge interleaving, only 1 CFMWS entry in the CEDT is
required for that host bridge - if it covers the entire capacity of the devices
@@ -193,8 +193,8 @@ different purposes. For example, you may want to consider adding:
A platform may choose to add all of these, or change the mode based on a BIOS
setting. For each CFMWS entry, Linux expects descriptions of the described
-memory regions in the SRAT to determine the number of NUMA nodes it should
-reserve during early boot / init.
+memory regions in the :doc:`SRAT <acpi/srat>` to determine the number of
+NUMA nodes it should reserve during early boot / init.
As of v6.14, Linux will create a NUMA node for each CEDT CFMWS entry, even if
a matching SRAT entry does not exist; however, this is not guaranteed in the
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst b/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
index e39daba65fa0..dab704b6fcc2 100644
--- a/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
+++ b/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
@@ -18,7 +18,7 @@ Things to note:
* This SRAT describes one node for each of the above CFMWS.
* The HMAT describes performance for each node in the SRAT.
-CEDT ::
+:doc:`CEDT <../acpi/cedt>`::
Subtable Type : 00 [CXL Host Bridge Structure]
Reserved : 00
@@ -137,7 +137,7 @@ CEDT ::
QtgId : 0001
First Target : 00000006
-SRAT ::
+:doc:`SRAT <../acpi/srat>`::
Subtable Type : 01 [Memory Affinity]
Length : 28
@@ -223,7 +223,7 @@ SRAT ::
Hot Pluggable : 1
Non-Volatile : 0
-HMAT ::
+:doc:`HMAT <../acpi/hmat>`::
Structure Type : 0001 [SLLBI]
Data Type : 00 [Latency]
@@ -263,7 +263,7 @@ HMAT ::
Entry : 0100
Entry : 0100
-SLIT ::
+:doc:`SLIT <../acpi/slit>`::
Signature : "SLIT" [System Locality Information Table]
Localities : 0000000000000003
@@ -276,7 +276,7 @@ SLIT ::
Locality 6 : FF FF FF FF FF FF 0A FF
Locality 7 : FF FF FF FF FF FF FF 0A
-DSDT ::
+:doc:`DSDT <../acpi/dsdt>`::
Scope (_SB)
{
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst b/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
index ce07e6162f26..c474dcf09fb0 100644
--- a/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
+++ b/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
@@ -13,7 +13,7 @@ Things to note:
* This SRAT describes one node for both host bridges.
* The HMAT describes a single node's performance.
-CEDT ::
+:doc:`CEDT <../acpi/cedt>`::
Subtable Type : 00 [CXL Host Bridge Structure]
Reserved : 00
@@ -48,7 +48,7 @@ CEDT ::
First Target : 00000007
Second Target : 00000006
-SRAT ::
+:doc:`SRAT <../acpi/srat>`::
Subtable Type : 01 [Memory Affinity]
Length : 28
@@ -62,7 +62,7 @@ SRAT ::
Hot Pluggable : 1
Non-Volatile : 0
-HMAT ::
+:doc:`HMAT <../acpi/hmat>`::
Structure Type : 0001 [SLLBI]
Data Type : 00 [Latency]
@@ -80,14 +80,14 @@ HMAT ::
Entry : 1200
Entry : 0400
-SLIT ::
+:doc:`SLIT <../acpi/slit>`::
Signature : "SLIT" [System Locality Information Table]
Localities : 0000000000000003
Locality 0 : 10 20
Locality 1 : FF 0A
-DSDT ::
+:doc:`DSDT <../acpi/dsdt>`::
Scope (_SB)
{
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst b/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
index 6adf7c639490..a7854a79dbbd 100644
--- a/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
+++ b/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
@@ -14,7 +14,7 @@ Things to note:
* This CEDT/SRAT describes one node for both devices.
* There is only one proximity domain the HMAT for both devices.
-CEDT ::
+:doc:`CEDT <../acpi/cedt>`::
Subtable Type : 00 [CXL Host Bridge Structure]
Reserved : 00
@@ -39,7 +39,7 @@ CEDT ::
QtgId : 0001
First Target : 00000007
-SRAT ::
+:doc:`SRAT <../acpi/srat>`::
Subtable Type : 01 [Memory Affinity]
Length : 28
@@ -53,7 +53,7 @@ SRAT ::
Hot Pluggable : 1
Non-Volatile : 0
-HMAT ::
+:doc:`HMAT <../acpi/hmat>`::
Structure Type : 0001 [SLLBI]
Data Type : 00 [Latency]
@@ -69,14 +69,14 @@ HMAT ::
Entry : 1200
Entry : 0200
-SLIT ::
+:doc:`SLIT <../acpi/slit>`::
Signature : "SLIT" [System Locality Information Table]
Localities : 0000000000000003
Locality 0 : 10 20
Locality 1 : FF 0A
-DSDT ::
+:doc:`DSDT <../acpi/dsdt>`::
Scope (_SB)
{
diff --git a/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst b/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
index b89ba3cab98f..aebda0eb3e17 100644
--- a/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
+++ b/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
@@ -14,7 +14,7 @@ Things to note:
* This CEDT/SRAT describes one node per device
* The expanders have the same performance and will be in the same memory tier.
-CEDT ::
+:doc:`CEDT <../acpi/cedt>`::
Subtable Type : 00 [CXL Host Bridge Structure]
Reserved : 00
@@ -62,7 +62,7 @@ CEDT ::
QtgId : 0001
First Target : 00000006
-SRAT ::
+:doc:`SRAT <../acpi/srat>`::
Subtable Type : 01 [Memory Affinity]
Length : 28
@@ -88,7 +88,7 @@ SRAT ::
Hot Pluggable : 1
Non-Volatile : 0
-HMAT ::
+:doc:`HMAT <../acpi/hmat>`::
Structure Type : 0001 [SLLBI]
Data Type : 00 [Latency]
@@ -108,7 +108,7 @@ HMAT ::
Entry : 0200
Entry : 0200
-SLIT ::
+:doc:`SLIT <../acpi/slit>`::
Signature : "SLIT" [System Locality Information Table]
Localities : 0000000000000003
@@ -116,7 +116,7 @@ SLIT ::
Locality 1 : FF 0A FF
Locality 2 : FF FF 0A
-DSDT ::
+:doc:`DSDT <../acpi/dsdt>`::
Scope (_SB)
{
--
2.49.0
* Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
2025-05-12 16:21 ` [PATCH v3 14/17] cxl: docs/allocation/page-allocator Gregory Price
@ 2025-05-12 16:34 ` Matthew Wilcox
2025-05-12 16:38 ` Gregory Price
0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-05-12 16:34 UTC (permalink / raw)
To: Gregory Price
Cc: linux-cxl, linux-doc, linux-kernel, kernel-team, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, corbet
On Mon, May 12, 2025 at 12:21:31PM -0400, Gregory Price wrote:
> Document some interesting interactions that occur when exposing CXL
> memory capacity to page allocator.
We should not do this. Asking the page allocator for memory (eg for
slab) should never return memory on CXL. There need to be special
interfaces for clients that know they can tolerate the added latency.
NAK this concept, and NAK this specific document. I have no comment on
the previous documents.
* Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
2025-05-12 16:34 ` Matthew Wilcox
@ 2025-05-12 16:38 ` Gregory Price
2025-05-12 17:52 ` Matthew Wilcox
0 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 16:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-cxl, linux-doc, linux-kernel, kernel-team, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, corbet
On Mon, May 12, 2025 at 05:34:56PM +0100, Matthew Wilcox wrote:
> On Mon, May 12, 2025 at 12:21:31PM -0400, Gregory Price wrote:
> > Document some interesting interactions that occur when exposing CXL
> > memory capacity to page allocator.
>
> We should not do this. Asking the page allocator for memory (eg for
> slab) should never return memory on CXL. There need to be special
> interfaces for clients that know they can tolerate the added latency.
>
> NAK this concept, and NAK this specific document. I have no comment on
> the previous documents.
This describes what presently exists, so I'm not sure what value a
NAK has here.
Feel free to submit patches that delete the existing code if you want
it removed from the documentation.
~Gregory
* Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
2025-05-12 16:38 ` Gregory Price
@ 2025-05-12 17:52 ` Matthew Wilcox
2025-05-12 18:09 ` Gregory Price
0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-05-12 17:52 UTC (permalink / raw)
To: Gregory Price
Cc: linux-cxl, linux-doc, linux-kernel, kernel-team, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, corbet
On Mon, May 12, 2025 at 12:38:47PM -0400, Gregory Price wrote:
> On Mon, May 12, 2025 at 05:34:56PM +0100, Matthew Wilcox wrote:
> > On Mon, May 12, 2025 at 12:21:31PM -0400, Gregory Price wrote:
> > > Document some interesting interactions that occur when exposing CXL
> > > memory capacity to page allocator.
> >
> > We should not do this. Asking the page allocator for memory (eg for
> > slab) should never return memory on CXL. There need to be special
> > interfaces for clients that know they can tolerate the added latency.
> >
> > NAK this concept, and NAK this specific document. I have no comment on
> > the previous documents.
>
> This describes what presently exists, so I'm not sure what value a
> NAK has here.
>
> Feel free to submit patches that delete the existing code if you want
> it removed from the documentation.
Who sneaked that in when?
* Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
2025-05-12 17:52 ` Matthew Wilcox
@ 2025-05-12 18:09 ` Gregory Price
2025-05-13 2:39 ` dan.j.williams
0 siblings, 1 reply; 33+ messages in thread
From: Gregory Price @ 2025-05-12 18:09 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-cxl, linux-doc, linux-kernel, kernel-team, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, corbet
On Mon, May 12, 2025 at 06:52:31PM +0100, Matthew Wilcox wrote:
> >
> > Feel free to submit patches that delete the existing code if you want
> > it removed from the documentation.
>
> Who sneaked that in when?
The ACPI and EFI folks when they allowed for CXL memory to be marked
EFI_CONVENTIONAL_MEMORY - which means Linux can't actually differentiate
between DRAM and CXL during __init and brings it online in the page
allocator as SystemRAM in ZONE_NORMAL (attached to the NUMA node that
maps to the Proximity Domain in the SRAT).
Not sure there's anything you can do about that.
And for DAX:
09d09e04d2 (cxl/dax: Create dax devices for CXL RAM regions)
Which allows for EFI_MEMORY_SP / Soft Reserved CXL regions to be brought
up as DAX devices (which can be bound to SystemRAM via DAX kmem).
Wasn't much sneaking going on here - DAX kmem has been around and hacked
on since 2019, and probably some years before that.
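As a rough way to see the two paths above on a running system, the
sketch below sums resource sizes by name from /proc/iomem-style text.
It is only an illustration: the SAMPLE listing is made up rather than
captured from a real machine, the helper name bytes_by_name is
hypothetical, and the exact nesting of entries will differ by platform.

```python
# Illustrative only: SAMPLE is a made-up /proc/iomem-style listing, not
# output captured from a real machine.
SAMPLE = """\
00100000-7fffffff : System RAM
100000000-17fffffff : Soft Reserved
  100000000-17fffffff : dax0.0
    100000000-17fffffff : System RAM (kmem)
"""


def bytes_by_name(iomem_text):
    """Sum the size of every resource range, keyed by resource name."""
    totals = {}
    for line in iomem_text.splitlines():
        if " : " not in line:
            continue  # skip blank or malformed lines
        span, name = line.strip().split(" : ", 1)
        start, end = (int(part, 16) for part in span.split("-"))
        # Ranges in this format are inclusive, hence the +1.
        totals[name] = totals.get(name, 0) + (end - start + 1)
    return totals


if __name__ == "__main__":
    for name, size in bytes_by_name(SAMPLE).items():
        print(f"{name}: {size >> 20} MiB")
```

On a real system the text would come from /proc/iomem, read as root
(unprivileged reads report zeroed addresses). Capacity that went
through DAX kmem shows up as "System RAM (kmem)", while capacity marked
EFI_CONVENTIONAL_MEMORY is reported as plain "System RAM" alongside
ordinary DRAM.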
~Gregory
* Re: [PATCH v3 01/17] cxl: update documentation structure in prep for new docs
2025-05-12 16:21 ` [PATCH v3 01/17] cxl: update documentation structure in prep for new docs Gregory Price
@ 2025-05-12 22:46 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-12 22:46 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Restructure the cxl folder to make adding docs per-page cleaner.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Documentation/driver-api/cxl/index.rst | 16 +++++++++++++---
> .../cxl/{ => linux}/access-coordinates.rst | 0
> ...emory-devices.rst => theory-of-operation.rst} | 10 +++++-----
> 3 files changed, 18 insertions(+), 8 deletions(-)
> rename Documentation/driver-api/cxl/{ => linux}/access-coordinates.rst (100%)
> rename Documentation/driver-api/cxl/{memory-devices.rst => theory-of-operation.rst} (98%)
>
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index 965ba90e8fb7..fe1594dc6778 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -4,12 +4,22 @@
> Compute Express Link
> ====================
>
> +CXL device configuration has a complex handoff between platform (Hardware,
> +BIOS, EFI), OS (early boot, core kernel, driver), and user policy decisions
> +that have impacts on each other. The docs here break up configuration steps.
> +
> +.. toctree::
> + :maxdepth: 2
> + :caption: Overview
> +
> + theory-of-operation
> + maturity-map
> +
> .. toctree::
> :maxdepth: 1
> + :caption: Linux Kernel Configuration
>
> - memory-devices
> - access-coordinates
> + linux/access-coordinates
>
> - maturity-map
>
> .. only:: subproject and html
> diff --git a/Documentation/driver-api/cxl/access-coordinates.rst b/Documentation/driver-api/cxl/linux/access-coordinates.rst
> similarity index 100%
> rename from Documentation/driver-api/cxl/access-coordinates.rst
> rename to Documentation/driver-api/cxl/linux/access-coordinates.rst
> diff --git a/Documentation/driver-api/cxl/memory-devices.rst b/Documentation/driver-api/cxl/theory-of-operation.rst
> similarity index 98%
> rename from Documentation/driver-api/cxl/memory-devices.rst
> rename to Documentation/driver-api/cxl/theory-of-operation.rst
> index d732c42526df..32739e253453 100644
> --- a/Documentation/driver-api/cxl/memory-devices.rst
> +++ b/Documentation/driver-api/cxl/theory-of-operation.rst
> @@ -1,9 +1,9 @@
> .. SPDX-License-Identifier: GPL-2.0
> .. include:: <isonum.txt>
>
> -===================================
> -Compute Express Link Memory Devices
> -===================================
> +===============================================
> +Compute Express Link Driver Theory of Operation
> +===============================================
>
> A Compute Express Link Memory Device is a CXL component that implements the
> CXL.mem protocol. It contains some amount of volatile memory, persistent memory,
> @@ -14,8 +14,8 @@ that optionally define a device's contribution to an interleaved address
> range across multiple devices underneath a host-bridge or interleaved
> across host-bridges.
>
> -CXL Bus: Theory of Operation
> -============================
> +The CXL Bus
> +===========
> Similar to how a RAID driver takes disk objects and assembles them into a new
> logical device, the CXL subsystem is tasked to take PCIe and ACPI objects and
> assemble them into a CXL.mem decode topology. The need for runtime configuration
* Re: [PATCH v3 02/17] cxl: docs - access-coordinates doc fixups
2025-05-12 16:21 ` [PATCH v3 02/17] cxl: docs - access-coordinates doc fixups Gregory Price
@ 2025-05-12 22:47 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-12 22:47 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet, Randy Dunlap, Bagas Sanjaya
On 5/12/25 9:21 AM, Gregory Price wrote:
> Place the hierarchy diagram in access-coordinates.rst in a code block.
>
> Fix a few grammar issues.
>
> Suggested-by: Randy Dunlap <rdunlap@infradead.org>
> Suggested-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> .../cxl/linux/access-coordinates.rst | 30 +++++++++----------
> 1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/driver-api/cxl/linux/access-coordinates.rst b/Documentation/driver-api/cxl/linux/access-coordinates.rst
> index b07950ea30c9..e408ecbc4038 100644
> --- a/Documentation/driver-api/cxl/linux/access-coordinates.rst
> +++ b/Documentation/driver-api/cxl/linux/access-coordinates.rst
> @@ -26,20 +26,20 @@ There can be multiple switches under an RP. There can be multiple RPs under
> a CXL Host Bridge (HB). There can be multiple HBs under a CXL Fixed Memory
> Window Structure (CFMWS).
>
> -An example hierarchy:
> +An example hierarchy::
>
> -> CFMWS 0
> -> |
> -> _________|_________
> -> | |
> -> ACPI0017-0 ACPI0017-1
> -> GP0/HB0/ACPI0016-0 GP1/HB1/ACPI0016-1
> -> | | | |
> -> RP0 RP1 RP2 RP3
> -> | | | |
> -> SW 0 SW 1 SW 2 SW 3
> -> | | | | | | | |
> -> EP0 EP1 EP2 EP3 EP4 EP5 EP6 EP7
> + CFMWS 0
> + |
> + _________|_________
> + | |
> + ACPI0017-0 ACPI0017-1
> + GP0/HB0/ACPI0016-0 GP1/HB1/ACPI0016-1
> + | | | |
> + RP0 RP1 RP2 RP3
> + | | | |
> + SW 0 SW 1 SW 2 SW 3
> + | | | | | | | |
> + EP0 EP1 EP2 EP3 EP4 EP5 EP6 EP7
>
> Computation for the example hierarchy:
>
> @@ -82,8 +82,8 @@ this point all the bandwidths are aggregated per each host bridge, which is
> also the index for the resulting xarray.
>
> The next step is to take the min() of the per host bridge bandwidth and the
> -bandwidth from the Generic Port (GP). The bandwidths for the GP is retrieved
> -via ACPI tables SRAT/HMAT. The min bandwidth are aggregated under the same
> +bandwidth from the Generic Port (GP). The bandwidths for the GP are retrieved
> +via ACPI tables SRAT/HMAT. The minimum bandwidth are aggregated under the same
> ACPI0017 device to form a new xarray.
>
> Finally, the cxl_region_update_bandwidth() is called and the aggregated
* Re: [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference
2025-05-12 16:21 ` [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference Gregory Price
@ 2025-05-12 23:08 ` Dave Jiang
2025-05-12 23:22 ` Gregory Price
0 siblings, 1 reply; 33+ messages in thread
From: Dave Jiang @ 2025-05-12 23:08 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Add a simple device primer sufficient to understand the theory
> of operation documentation.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> .../driver-api/cxl/devices/device-types.rst | 165 ++++++++++++++++++
> Documentation/driver-api/cxl/index.rst | 6 +
> 2 files changed, 171 insertions(+)
> create mode 100644 Documentation/driver-api/cxl/devices/device-types.rst
>
> diff --git a/Documentation/driver-api/cxl/devices/device-types.rst b/Documentation/driver-api/cxl/devices/device-types.rst
> new file mode 100644
> index 000000000000..c70564cf0be3
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/devices/device-types.rst
> @@ -0,0 +1,165 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Devices and Protocols
> +=====================
> +
> +The type of CXL device (Memory, Accelerator, etc) dictates many configuration steps. This section
> +covers some basic background on device types and on-device resources used by the platform and OS
> +which impact configuration.
> +
> +Protocols
> +=========
> +
> +There are three core protocols to CXL. For the purpose of this documentation,
> +we will only discuss very high level definitions as the specific hardware
> +details are largely abstracted away from Linux. See the CXL specification
> +for more details.
> +
> +CXL.io
> +------
> +The basic interaction protocol, similar to PCIe configuration mechanisms.
> +Typically used for initialization, configuration, and I/O access for anything
> +other than memory (CXL.mem) or cache (CXL.cache) operations.
> +
> +The Linux CXL driver exposes access to .io functionality via the various sysfs
> +interfaces and /dev/cxl/ devices (which exposes direct access to device
> +mailboxes).
> +
> +CXL.cache
> +---------
> +The mechanism by which a device may coherently access and cache host memory.
> +
> +Largely transparent to Linux once configured.
> +
> +CXL.mem
> +---------
> +The mechanism by which the CPU may coherently access and cache device memory.
> +
> +Largely transparent to Linux once configured.
> +
> +
> +Device Types
> +============
> +
> +Type-1
> +------
> +
> +A Type-1 CXL device:
> +
> +* Supports cxl.io and cxl.cache protocols
> +* Implements a fully coherent cache
> +* Allows Device-to-Host coherence and Host-to-Device snoops.
> +* Does NOT have host-managed device memory (HDM)
> +
> +A typical example of a type-1 device is a Smart NIC - which may want to
> +directly operate on host-memory (DMA) to store incoming packets. These
> +devices largely rely on CPU-attached memory.
> +
> +Type-2
> +------
> +
> +A Type-2 CXL Device:
> +
> +* Supports cxl.io, cxl.cache, and cxl.mem protocols
> +* Optionally implements coherent cache and Host-Managed Device Memory
> +* Is typically an accelerator device w/ high bandwidth memory.
> +
> +The primary difference between a type-1 and type-2 device is the presence
> +of host-managed device memory, which allows the device to operate on a
> +local memory bank - while the CPU still has coherent DMA to the same memory.
> +
> +This allows things like GPUs to expose their memory via DAX devices or file
> +descriptors, allowing drivers and programs direct access to device memory
> +rather than using block-transfer semantics.
> +
> +Type-3
> +------
> +
> +A Type-3 CXL Device:
> +
> +* Supports cxl.io and cxl.mem
> +* Implements Host-Managed Device Memory
> +* May provide either Volatile or Persistent memory capacity (or both).
> +
> +A basic example of a type-3 device is a simple memory expander, whose
> +local memory capacity is exposed to the CPU for access directly via
> +basic coherent DMA.
> +
> +Switch
> +------
> +
> +A CXL switch is a device capable of routing any CXL (and by extension, PCIe)
> +protocol between upstream, downstream, or peer devices. Many devices, such
> +as Multi-Logical Devices, imply the presence of switching in some manner.
> +
> +Logical Devices and Heads
> +-------------------------
> +
> +A CXL device may present one or more "Logical Devices" to one or more hosts
> +(via physical "Heads").
> +
> +A Single-Logical Device (SLD) is a device which presents a single device to
> +one or more heads.
> +
> +A Multi-Logical Device (MLD) is a device which may present multiple devices
> +to one or more heads.
> +
> +A Single-Headed Device exposes only a single physical connection.
> +
> +A Multi-Headed Device exposes multiple physical connections.
> +
> +MHSLD
> +~~~~~
> +A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical
> +device to multiple heads which may be connected to one or more discrete
> +hosts. An example of this would be a simple memory-pool which may be
> +statically configured (prior to boot) to expose portions of its memory
> +to Linux via the CEDT ACPI table.
> +
> +MHMLD
> +~~~~~
> +A Multi-Headed Multi-Logical Device (MHMLD) exposes multiple logical
> +devices to multiple heads which may be connected to one or more discrete
> +hosts. An example of this would be a Dynamic Capacity Device which
> +may be configured at runtime to expose portions of its memory to Linux.
> +
> +Example Devices
> +===============
> +
> +Memory Expander
> +---------------
> +The simplest form of Type-3 device is a memory expander. A memory expander
> +exposes Host-Managed Device Memory (HDM) to Linux. This memory may be
> +Volatile or Non-Volatile (Persistent).
> +
> +Memory Expanders will typically be considered a form of Single-Headed,
> +Single-Logical Device - as their form factor will typically be an add-in-card
> +(AIC) or some other similar form-factor.
> +
> +The Linux CXL driver provides support for static or dynamic configuration of
> +basic memory expanders. The platform may program decoders prior to OS init
> +(e.g. auto-decoders), or the user may program the fabric if the platform
> +defers these operations to the OS.
> +
> +Multiple Memory Expanders may be added to an external chassis and exposed to
> +a host via a head attached to a CXL switch. This is a "memory pool", and
> +would be considered an MHSLD or MHMLD depending on the management capabilities
> +provided by the switch platform.
> +
> +As of v6.14, Linux does not provide a formalized interface to manage non-DCD
> +MHSLD or MHMLD devices.
> +
> +Dynamic Capacity Device (DCD)
> +-----------------------------
> +
> +A Dynamic Capacity Device is a Type-3 device which provides dynamic management
> +of memory capacity. The basic premise of a DCD is to provide an allocator-like
> +interface for physical memory capacity to a "Fabric Manager" (an external,
> +privileged host with privileges to change configurations for other hosts).
> +
> +A DCD manages "Memory Extents", which may be volatile or persistent. Extents
> +may also be exclusive to a single host or shared across multiple hosts.
> +
> +As of v6.14, Linux does not provide a formalized interface to manage DCD
> +devices, however there is active work on LKML targeting future release.
I wonder instead of referring to a kernel version, maybe refer to the CXL maturity map on support status.
DJ
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index fe1594dc6778..a2d1c5b18a8a 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -15,6 +15,12 @@ that have impacts on each other. The docs here break up configurations steps.
> theory-of-operation
> maturity-map
>
> +.. toctree::
> + :maxdepth: 2
> + :caption: Device Reference
> +
> + devices/device-types
> +
> .. toctree::
> :maxdepth: 1
> :caption: Linux Kernel Configuration
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference
2025-05-12 23:08 ` Dave Jiang
@ 2025-05-12 23:22 ` Gregory Price
0 siblings, 0 replies; 33+ messages in thread
From: Gregory Price @ 2025-05-12 23:22 UTC (permalink / raw)
To: Dave Jiang
Cc: linux-cxl, linux-doc, linux-kernel, kernel-team, dave,
jonathan.cameron, alison.schofield, vishal.l.verma, ira.weiny,
dan.j.williams, corbet
On Mon, May 12, 2025 at 04:08:24PM -0700, Dave Jiang wrote:
> > +As of v6.14, Linux does not provide a formalized interface to manage DCD
> > +devices, however there is active work on LKML targeting future release.
>
> I wonder instead of referring to a kernel version, maybe refer to the CXL maturity map on support status.
>
> DJ
>
This is a good idea for cxl-specific stuff. There was another patch or
two with wording like this. Might be worth collecting these all and
just updating the wording in a follow up patch.
~Gregory
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 04/17] cxl: docs/platform/bios-and-efi documentation
2025-05-12 16:21 ` [PATCH v3 04/17] cxl: docs/platform/bios-and-efi documentation Gregory Price
@ 2025-05-12 23:31 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-12 23:31 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Add some docs on CXL configurations done in bios/efi that affect
> linux configuration - information vendors may care to consider.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Documentation/driver-api/cxl/index.rst | 6 +
> .../driver-api/cxl/platform/bios-and-efi.rst | 262 ++++++++++++++++++
> 2 files changed, 268 insertions(+)
> create mode 100644 Documentation/driver-api/cxl/platform/bios-and-efi.rst
>
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index a2d1c5b18a8a..ffa0462ad950 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -21,6 +21,12 @@ that have impacts on each other. The docs here break up configurations steps.
>
> devices/device-types
>
> +.. toctree::
> + :maxdepth: 2
> + :caption: Platform Configuration
> +
> + platform/bios-and-efi
> +
> .. toctree::
> :maxdepth: 1
> :caption: Linux Kernel Configuration
> diff --git a/Documentation/driver-api/cxl/platform/bios-and-efi.rst b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
> new file mode 100644
> index 000000000000..552a83992bcc
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/bios-and-efi.rst
> @@ -0,0 +1,262 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +======================
> +BIOS/EFI Configuration
> +======================
> +
> +BIOS and EFI are largely responsible for configuring static information about
> +devices (or potential future devices) such that Linux can build the appropriate
> +logical representations of these devices.
> +
> +At a high level, this is what occurs during this phase of configuration.
> +
> +* The bootloader starts the BIOS/EFI.
> +
> +* BIOS/EFI do early device probe to determine static configuration
> +
> +* BIOS/EFI creates ACPI Tables that describe static config for the OS
> +
> +* BIOS/EFI create the system memory map (EFI Memory Map, E820, etc)
> +
> +* BIOS/EFI calls :code:`start_kernel` and begins the Linux Early Boot process.
> +
> +Much of what this section is concerned with is ACPI Table production and
> +static memory map configuration. More detail on these tables can be found
> +under Platform Configuration -> ACPI Table Reference.
> +
> +.. note::
> + Platform Vendors should read carefully, as this section has recommendations
> + on physical memory region size and alignment, memory holes, HDM interleave,
> + and what Linux expects of HDM decoders trying to work with these features.
> +
> +UEFI Settings
> +=============
> +If your platform supports it, the :code:`uefisettings` command can be used to
> +read/write EFI settings. Changes will be reflected on the next reboot. Kexec
> +is not a sufficient reboot.
> +
> +One notable configuration here is the EFI_MEMORY_SP (Specific Purpose) bit.
> +When this is enabled, this bit tells Linux to defer management of a memory
> +region to a driver (in this case, the CXL driver). Otherwise, the memory is
> +treated as "normal memory", and is exposed to the page allocator during
> +:code:`__init`.
> +
> +uefisettings examples
> +---------------------
> +
> +:code:`uefisettings identify` ::
> +
> + uefisettings identify
> +
> + bios_vendor: xxx
> + bios_version: xxx
> + bios_release: xxx
> + bios_date: xxx
> + product_name: xxx
> + product_family: xxx
> + product_version: xxx
> +
> +On some AMD platforms, the :code:`EFI_MEMORY_SP` bit is set via the :code:`CXL
> +Memory Attribute` field. This may be called something else on your platform.
> +
> +:code:`uefisettings get "CXL Memory Attribute"` ::
> +
> + selector: xxx
> + ...
> + question: Question {
> + name: "CXL Memory Attribute",
> + answer: "Enabled",
> + ...
> + }
> +
> +Physical Memory Map
> +===================
> +
> +Physical Address Region Alignment
> +---------------------------------
> +
> +As of Linux v6.14, the hotplug memory system requires memory regions to be
> +uniform in size and alignment. While the CXL specification allows for memory
> +regions as small as 256MB, the supported memory block size and alignment for
> +hotplugged memory is architecture-defined.
> +
> +Linux memory blocks may be as small as 128MB and increase in powers of two.
> +
> +* On ARM, the default block size and alignment is either 128MB or 256MB.
> +
> +* On x86, the default block size is 256MB, and increases to 2GB as the
> + capacity of the system increases up to 64GB.
> +
> +For best support across versions, platform vendors should place CXL memory at
> +a 2GB aligned base address, and regions should be 2GB aligned. This also helps
> +prevent the creation of thousands of memory devices (one per block).
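For illustration, the alignment recommendation above can be checked with a few lines of Python. This is only a sketch: the helper names and the 1TB capacity used below are assumptions, not anything taken from the kernel.

```python
GiB = 1 << 30
MiB = 1 << 20


def is_aligned(base: int, size: int, align: int = 2 * GiB) -> bool:
    """Return True if both the base address and the size are multiples of align."""
    return base % align == 0 and size % align == 0


def memory_block_count(size: int, block_size: int) -> int:
    """Number of whole memory blocks (one memory device each) covering size."""
    return size // block_size


# Hypothetical 16GB region at a 64GB base: satisfies the 2GB recommendation.
print(is_aligned(0x1000000000, 0x400000000))
# A 3GB region is not a multiple of 2GB, so it fails the check.
print(is_aligned(0x100000000, 0xC0000000))
# Block count for a hypothetical 1TB of capacity at 128MB vs 2GB block sizes.
print(memory_block_count(1 << 40, 128 * MiB), memory_block_count(1 << 40, 2 * GiB))
```

With 128MB blocks the hypothetical 1TB of capacity needs 8192 memory devices, versus 512 with 2GB blocks, which is the effect the paragraph describes.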
> +
> +Memory Holes
> +------------
> +
> +Holes in the memory map are tricky. Consider a 4GB device located at base
> +address 0x100000000, but with the following memory map ::
> +
> + ---------------------
> + | 0x100000000 |
> + | CXL |
> + | 0x1BFFFFFFF |
> + ---------------------
> + | 0x1C0000000 |
> + | MEMORY HOLE |
> + | 0x1FFFFFFFF |
> + ---------------------
> + | 0x200000000 |
> + | CXL CONT. |
> + | 0x23FFFFFFF |
> + ---------------------
> +
> +There are two issues to consider:
> +
> +* decoder programming, and
> +* memory block alignment.
> +
> +If your architecture requires 2GB uniform size and aligned memory blocks, the
> +only capacity Linux is capable of mapping (as of v6.14) would be the capacity
> +from `0x100000000-0x180000000`. The remaining capacity will be stranded, as
> +it is not of 2GB aligned length.
> +
> +Assuming your architecture and memory configuration allows 1GB memory blocks,
> +this memory map is supported and this should be presented as multiple CFMWS
> +in the CEDT that describe each side of the memory hole separately - along with
> +matching decoders.
> +
> +Multiple decoders can (and should) be used to manage such a memory hole (see
> +below), but each chunk of a memory hole should be aligned to a reasonable block
> +size (larger alignment is always better). If you intend to have memory holes
> +in the memory map, expect to use one decoder per contiguous chunk of host
> +physical memory.
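The stranded-capacity arithmetic for the memory map above can be reproduced with a small Python sketch. The function name is hypothetical, and it models only the "uniform size and alignment" rule described here, not the actual hotplug code.

```python
GiB = 1 << 30

# The two CXL ranges from the memory map above (inclusive end addresses).
REGIONS = [(0x100000000, 0x1BFFFFFFF), (0x200000000, 0x23FFFFFFF)]


def usable_blocks(start: int, end_inclusive: int, block_size: int):
    """Return the (start, end) pairs of whole, aligned blocks inside a range."""
    first = -(-start // block_size) * block_size             # round start up
    last = ((end_inclusive + 1) // block_size) * block_size  # round end down
    return [(addr, addr + block_size) for addr in range(first, last, block_size)]


for block_size in (2 * GiB, 1 * GiB):
    blocks = [b for start, end in REGIONS for b in usable_blocks(start, end, block_size)]
    usable = len(blocks) * block_size
    print(f"{block_size // GiB}GB blocks: {usable // GiB}GB usable")
    for lo, hi in blocks:
        print(f"  {lo:#x}-{hi:#x}")
```

With 2GB blocks only `0x100000000-0x180000000` survives (2GB usable, 2GB stranded); with 1GB blocks all 4GB on both sides of the hole can be mapped.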
> +
> +As of v6.14, Linux does provide support for memory hotplug of multiple
> +physical memory regions separated by a memory hole described by a single
> +HDM decoder.
> +
> +
> +Decoder Programming
> +===================
> +If BIOS/EFI intends to program the decoders to be statically configured,
> +there are a few things to consider to avoid major pitfalls that will
> +prevent Linux compatibility. Some of these recommendations are not
> +required "per the specification", but Linux makes no guarantees of support
> +otherwise.
> +
> +
> +Translation Point
> +-----------------
> +Per the specification, the only decoders which **TRANSLATE** Host Physical
> +Address (HPA) to Device Physical Address (DPA) are the **Endpoint Decoders**.
> +All other decoders in the fabric are intended to route accesses without
> +translating the addresses.
> +
> +This is heavily implied by the specification, see: ::
> +
> + CXL Specification 3.1
> + 8.2.4.20: CXL HDM Decoder Capability Structure
> + - Implementation Note: CXL Host Bridge and Upstream Switch Port Decoder Flow
> + - Implementation Note: Device Decoder Logic
> +
> +Given this, Linux makes a strong assumption that decoders between CPU and
> +endpoint will all be programmed with address ranges that are subsets of
> +their parent decoder.
> +
> +Due to some ambiguity in how Architecture, ACPI, PCI, and CXL specifications
> +"hand off" responsibility between domains, some early adopting platforms
> +attempted to do translation at the originating memory controller or host
> +bridge. This configuration requires a platform specific extension to the
> +driver and is not officially endorsed - despite being supported.
> +
> +It is *highly recommended* **NOT** to do this; otherwise, you are on your own
> +to implement driver support for your platform.
> +
> +Interleave and Configuration Flexibility
> +----------------------------------------
> +If providing cross-host-bridge interleave, a CFMWS entry in the CEDT must be
> +presented with target host-bridges for the interleaved device sets (there may
> +be multiple behind each host bridge).
> +
> +If providing intra-host-bridge interleaving, only 1 CFMWS entry in the CEDT is
> +required for that host bridge - if it covers the entire capacity of the devices
> +behind the host bridge.
> +
> +If intending to provide users flexibility in programming decoders beyond the
> +root, you may want to provide multiple CFMWS entries in the CEDT intended for
> +different purposes. For example, you may want to consider adding:
> +
> +1) A CFMWS entry to cover all interleavable host bridges.
> +2) A CFMWS entry to cover all devices on a single host bridge.
> +3) A CFMWS entry to cover each device.
> +
> +A platform may choose to add all of these, or change the mode based on a BIOS
> +setting. For each CFMWS entry, Linux expects descriptions of the described
> +memory regions in the SRAT to determine the number of NUMA nodes it should
> +reserve during early boot / init.
> +
> +As of v6.14, Linux will create a NUMA node for each CEDT CFMWS entry, even if
> +a matching SRAT entry does not exist; however, this is not guaranteed in the
> +future and such a configuration should be avoided.
> +
> +Memory Holes
> +------------
> +If your platform includes memory holes intersparsed between your CXL memory, it
s/intersparsed/interspersed/
DJ
> +is recommended to utilize multiple decoders to cover these regions of memory,
> +rather than try to program the decoders to accept the entire range and expect
> +Linux to manage the overlap.
> +
> +For example, consider the Memory Hole described above ::
> +
> + ---------------------
> + | 0x100000000 |
> + | CXL |
> + | 0x1BFFFFFFF |
> + ---------------------
> + | 0x1C0000000 |
> + | MEMORY HOLE |
> + | 0x1FFFFFFFF |
> + ---------------------
> + | 0x200000000 |
> + | CXL CONT. |
> + | 0x23FFFFFFF |
> + ---------------------
> +
> +Assuming this is provided by a single device attached directly to a host bridge,
> +Linux would expect the following decoder programming ::
> +
> + ----------------------- -----------------------
> + | root-decoder-0 | | root-decoder-1 |
> + | base: 0x100000000 | | base: 0x200000000 |
> + | size: 0xC0000000 | | size: 0x40000000 |
> + ----------------------- -----------------------
> + | |
> + ----------------------- -----------------------
> + | HB-decoder-0 | | HB-decoder-1 |
> + | base: 0x100000000 | | base: 0x200000000 |
> + | size: 0xC0000000 | | size: 0x40000000 |
> + ----------------------- -----------------------
> + | |
> + ----------------------- -----------------------
> + | ep-decoder-0 | | ep-decoder-1 |
> + | base: 0x100000000 | | base: 0x200000000 |
> + | size: 0xC0000000 | | size: 0x40000000 |
> + ----------------------- -----------------------
> +
> +This requires a CEDT with two CFMWS entries describing the above root decoders.
> +
> +Linux makes no guarantee of support for strange memory hole situations.
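As a sanity check of the decoder programming shown above, the following Python sketch verifies two properties from the text: each decoder's range is a subset of its parent's, and no decoder covers the hole. The `Decoder` class and helper names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoder:
    name: str
    base: int
    size: int

    @property
    def end(self) -> int:
        """Exclusive end address of the decoded range."""
        return self.base + self.size


def is_subset(child: Decoder, parent: Decoder) -> bool:
    """True if the child's range lies entirely within the parent's range."""
    return parent.base <= child.base and child.end <= parent.end


def chain_is_valid(chain: list[Decoder]) -> bool:
    """Check a root -> host bridge -> endpoint chain, parent before child."""
    return all(is_subset(child, parent) for parent, child in zip(chain, chain[1:]))


def overlaps(decoder: Decoder, start: int, end: int) -> bool:
    """True if the decoder's range intersects [start, end)."""
    return decoder.base < end and start < decoder.end


HOLE = (0x1C0000000, 0x200000000)  # the memory hole, as [start, end)

chain0 = [Decoder(n, 0x100000000, 0xC0000000)
          for n in ("root-decoder-0", "HB-decoder-0", "ep-decoder-0")]
chain1 = [Decoder(n, 0x200000000, 0x40000000)
          for n in ("root-decoder-1", "HB-decoder-1", "ep-decoder-1")]

for chain in (chain0, chain1):
    print(chain[0].name, chain_is_valid(chain),
          any(overlaps(d, *HOLE) for d in chain))
```

A single decoder programmed to span `0x100000000-0x23FFFFFFF` would pass the subset check but fail the hole check, which is the configuration the text recommends against.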
> +
> +Multi-Media Devices
> +-------------------
> +The CFMWS field of the CEDT has special restriction bits which describe whether
> +the described memory region allows volatile or persistent memory (or both). If
> +the platform intends to support either:
> +
> +1) A device with multiple media types, or
> +2) Using a persistent memory device as normal memory
> +
> +then the platform may wish to create multiple CEDT CFMWS entries to describe
> +the same memory, with the intent of allowing the end user flexibility in how
> +that memory is configured. Linux does not presently have strong requirements
> +in this area.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 05/17] cxl: docs/platform/acpi reference documentation
2025-05-12 16:21 ` [PATCH v3 05/17] cxl: docs/platform/acpi reference documentation Gregory Price
@ 2025-05-12 23:49 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-12 23:49 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Add basic ACPI table information needed to understand the CXL
> driver probe process.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Documentation/driver-api/cxl/index.rst | 1 +
> .../driver-api/cxl/platform/acpi.rst | 76 +++++++++++++++++++
> .../driver-api/cxl/platform/acpi/cedt.rst | 62 +++++++++++++++
> .../driver-api/cxl/platform/acpi/dsdt.rst | 28 +++++++
> .../driver-api/cxl/platform/acpi/hmat.rst | 32 ++++++++
> .../driver-api/cxl/platform/acpi/slit.rst | 21 +++++
> .../driver-api/cxl/platform/acpi/srat.rst | 44 +++++++++++
> 7 files changed, 264 insertions(+)
> create mode 100644 Documentation/driver-api/cxl/platform/acpi.rst
> create mode 100644 Documentation/driver-api/cxl/platform/acpi/cedt.rst
> create mode 100644 Documentation/driver-api/cxl/platform/acpi/dsdt.rst
> create mode 100644 Documentation/driver-api/cxl/platform/acpi/hmat.rst
> create mode 100644 Documentation/driver-api/cxl/platform/acpi/slit.rst
> create mode 100644 Documentation/driver-api/cxl/platform/acpi/srat.rst
>
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index ffa0462ad950..336322dc35a0 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -26,6 +26,7 @@ that have impacts on each other. The docs here break up configurations steps.
> :caption: Platform Configuration
>
> platform/bios-and-efi
> + platform/acpi
>
> .. toctree::
> :maxdepth: 1
> diff --git a/Documentation/driver-api/cxl/platform/acpi.rst b/Documentation/driver-api/cxl/platform/acpi.rst
> new file mode 100644
> index 000000000000..ee7e6bd4c43d
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/acpi.rst
> @@ -0,0 +1,76 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===========
> +ACPI Tables
> +===========
> +
> +ACPI is the "Advanced Configuration and Power Interface", which is a standard
> +that defines how platforms and OS manage power and configure computer hardware.
> +For the purpose of this theory of operation, when referring to "ACPI" we will
> +usually refer to "ACPI Tables" - which are the way a platform (BIOS/EFI)
> +communicates static configuration information to the operating system.
> +
> +The following ACPI tables contain *static* configuration and performance data
> +about CXL devices.
> +
> +.. toctree::
> + :maxdepth: 1
> +
> + acpi/cedt.rst
> + acpi/srat.rst
> + acpi/hmat.rst
> + acpi/slit.rst
> + acpi/dsdt.rst
> +
> +The SRAT table may also contain generic port/initiator content that is intended
> +to describe the generic port, but not information about the rest of the path to
> +the endpoint.
> +
> +Linux uses these tables to configure kernel resources for statically configured
> +(by BIOS/EFI) CXL devices, such as:
> +
> +- NUMA nodes
> +- Memory Tiers
> +- NUMA Abstract Distances
> +- SystemRAM Memory Regions
> +- Weighted Interleave Node Weights
> +
> +ACPI Debugging
> +==============
> +
> +The :code:`acpidump -b` command dumps the ACPI tables into binary format.
> +
> +The :code:`iasl -d` command disassembles the files into human readable format.
> +
> +Example :code:`acpidump -b && iasl -d cedt.dat` ::
> +
> + [000h 0000 4] Signature : "CEDT" [CXL Early Discovery Table]
> +
> +Common Issues
> +-------------
> +Most failures described here result in a failure of the driver to surface
> +memory as a DAX device and/or kmem.
> +
> +* CEDT CFMWS targets list UIDs do not match CEDT CHBS UIDs.
> +* CEDT CFMWS targets list UIDs do not match DSDT CXL Host Bridge UIDs.
> +* CEDT CFMWS Restriction Bits are not correct.
> +* CEDT CFMWS Memory regions are poorly aligned.
> +* CEDT CFMWS Memory regions span a platform memory hole.
> +* CEDT CHBS UIDs do not match DSDT CXL Host Bridge UIDs.
> +* CEDT CHBS Specification version is incorrect.
> +* SRAT is missing regions described in CEDT CFMWS.
> +
> + * Result: failure to create a NUMA node for the region, or
> + region is placed in wrong node.
> +
> +* HMAT is missing data for regions described in CEDT CFMWS.
> +
> + * Result: NUMA node being placed in the wrong memory tier.
> +
> +* SLIT has bad data.
> +
> + * Result: Lots of performance mechanisms in the kernel will be very unhappy.
> +
> +All of these issues will appear to users as if the driver is failing to
> +support CXL - when in reality they are all the failure of a platform to
> +configure the ACPI tables correctly.
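The UID consistency items in the list above lend themselves to a mechanical check once the tables have been disassembled. A minimal Python sketch, assuming the UIDs have already been extracted by hand from the `iasl` output (the function and its inputs are hypothetical, and no table parsing is attempted):

```python
def check_uids(cfmws_targets, chbs_uids, dsdt_uids):
    """Return a list of human-readable UID mismatches between the tables.

    cfmws_targets: list of target lists, one per CFMWS entry
    chbs_uids:     set of UIDs reported by CEDT CHBS entries
    dsdt_uids:     set of _UID values of ACPI0016 devices in the DSDT
    """
    problems = []
    for index, targets in enumerate(cfmws_targets):
        for uid in targets:
            if uid not in chbs_uids:
                problems.append(f"CFMWS[{index}] target {uid} has no CHBS entry")
            if uid not in dsdt_uids:
                problems.append(f"CFMWS[{index}] target {uid} has no DSDT host bridge")
    for uid in sorted(set(chbs_uids) - set(dsdt_uids)):
        problems.append(f"CHBS UID {uid} has no DSDT host bridge")
    return problems


# Consistent tables: nothing reported.
print(check_uids([[7, 6]], {7, 6}, {7, 6}))
# DSDT exposes UID 5 instead of 6: two mismatches reported.
for problem in check_uids([[7, 6]], {7, 6}, {7, 5}):
    print(problem)
```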
> diff --git a/Documentation/driver-api/cxl/platform/acpi/cedt.rst b/Documentation/driver-api/cxl/platform/acpi/cedt.rst
> new file mode 100644
> index 000000000000..1d9c9d3592dc
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/acpi/cedt.rst
> @@ -0,0 +1,62 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +================================
> +CEDT - CXL Early Discovery Table
> +================================
> +
> +The CXL Early Discovery Table is generated by BIOS to describe the CXL memory
> +regions configured at boot by the BIOS.
> +
> +CHBS
> +====
> +The CXL Host Bridge Structure describes CXL host bridges. Other than describing
> +device register information, it reports the specific host bridge UID for this
> +host bridge. These host bridge IDs will be referenced in other tables.
> +
> +Example ::
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000007 <- Host bridge _UID
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010370400000
> + Register length : 0000000000010000
> +
> +CFMWS
> +=====
> +The CXL Fixed Memory Window structure describes a memory region associated
> +with one or more CXL host bridges (as described by the CHBS). It additionally
> +describes any inter-host-bridge interleave configuration that may have been
> +programmed by BIOS.
> +
> +Example ::
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 000000C050000000 <- Memory Region
> + Window size : 0000003CA0000000
> + Interleave Members (2^n) : 01 <- Interleave configuration
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007 <- Host Bridge _UID
> + Next Target : 00000006 <- Host Bridge _UID
> +
> +The restriction field dictates what this SPA range may be used for (memory type,
> +volatile vs persistent, etc). One or more bits may be set. ::
> +
> + Bit[0]: CXL Type 2 Memory
> + Bit[1]: CXL Type 3 Memory
> + Bit[2]: Volatile Memory
> + Bit[3]: Persistent Memory
> + Bit[4]: Fixed Config (HPA cannot be re-used)
> +
> +INTRA-host-bridge interleave (multiple devices on one host bridge) is NOT
> +reported in this structure, and is solely defined via CXL device decoder
> +programming (host bridge and endpoint decoders).
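To make the restriction bits concrete, here is a short Python sketch that decodes the `Restrictions : 0006` value from the CFMWS example using the bit assignments listed above. The function name is hypothetical.

```python
# Bit assignments as listed in the restriction field description above.
RESTRICTION_BITS = {
    0: "CXL Type 2 Memory",
    1: "CXL Type 3 Memory",
    2: "Volatile Memory",
    3: "Persistent Memory",
    4: "Fixed Config (HPA cannot be re-used)",
}


def decode_restrictions(value: int) -> list[str]:
    """Return the names of all restriction bits set in value."""
    return [name for bit, name in sorted(RESTRICTION_BITS.items())
            if value & (1 << bit)]


print(decode_restrictions(0x0006))
```

For the example value this yields Type 3 and Volatile, i.e. a window intended for volatile memory expanders.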
> diff --git a/Documentation/driver-api/cxl/platform/acpi/dsdt.rst b/Documentation/driver-api/cxl/platform/acpi/dsdt.rst
> new file mode 100644
> index 000000000000..b4583b01d67d
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/acpi/dsdt.rst
> @@ -0,0 +1,28 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============================================
> +DSDT - Differentiated System Description Table
> +==============================================
> +
> +This table describes what peripherals a machine has.
> +
> +This table's UIDs for CXL devices (specifically host bridges) must be
> +consistent with the contents of the CEDT, otherwise the CXL driver will
> +fail to probe correctly.
> +
> +Example Compute Express Link Host Bridge ::
> +
> + Scope (_SB)
> + {
> + Device (S0D0)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + Name (_CID, Package (0x02) // _CID: Compatible ID
> + {
> + EisaId ("PNP0A08") /* PCI Express Bus */,
> + EisaId ("PNP0A03") /* PCI Bus */
> + })
> + ...
> + Name (_UID, 0x05) // _UID: Unique ID
> + ...
> + }
> diff --git a/Documentation/driver-api/cxl/platform/acpi/hmat.rst b/Documentation/driver-api/cxl/platform/acpi/hmat.rst
> new file mode 100644
> index 000000000000..095a26f02a37
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/acpi/hmat.rst
> @@ -0,0 +1,32 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===========================================
> +HMAT - Heterogeneous Memory Attribute Table
> +===========================================
> +
> +The Heterogeneous Memory Attributes Table contains information such as cache
> +attributes and bandwidth and latency details for memory proximity domains.
> +For the purpose of this document, we will only discuss the SLLBI entry.
> +
> +SLLBI
> +=====
> +The System Locality Latency and Bandwidth Information records latency and
> +bandwidth information for proximity domains.
> +
> +This table is used by Linux to configure interleave weights and memory tiers.
> +
> +Example (Heavily truncated for brevity) ::
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 00 <- Latency
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Entry : 0080 <- DRAM LTC
> + Entry : 0100 <- CXL LTC
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 03 <- Bandwidth
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Entry : 1200 <- DRAM BW
> + Entry : 0200 <- CXL BW
> diff --git a/Documentation/driver-api/cxl/platform/acpi/slit.rst b/Documentation/driver-api/cxl/platform/acpi/slit.rst
> new file mode 100644
> index 000000000000..a56768e8fe41
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/acpi/slit.rst
> @@ -0,0 +1,21 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========================================
> +SLIT - System Locality Information Table
> +========================================
> +
> +The system locality information table provides "abstract distances" between
> +accessor and memory nodes. Nodes without initiators (CPUs) are an infinite (FF)
> +distance away from all other nodes.
> +
> +The abstract distance described in this table does not describe any real
> +latency or bandwidth information.
> +
> +Example ::
> +
> + Signature : "SLIT" [System Locality Information Table]
> + Localities : 0000000000000004
> + Locality 0 : 10 20 20 30
> + Locality 1 : 20 10 30 20
> + Locality 2 : FF FF 0A FF
> + Locality 3 : FF FF FF 0A
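The example matrix can be read mechanically. The Python sketch below assumes the entries are hexadecimal, as they appear in the dump, and picks out the localities whose distance to every other locality is FF. The function names are hypothetical.

```python
# Rows copied from the SLIT example above; values are assumed to be hex.
SLIT_ROWS = [
    "10 20 20 30",
    "20 10 30 20",
    "FF FF 0A FF",
    "FF FF FF 0A",
]


def parse_slit(rows):
    """Convert rows of hex strings into a matrix of integer distances."""
    return [[int(value, 16) for value in row.split()] for row in rows]


def unreachable_localities(matrix, unreachable=0xFF):
    """Localities whose distance to every *other* locality is 'unreachable'."""
    return [i for i, row in enumerate(matrix)
            if all(d == unreachable for j, d in enumerate(row) if j != i)]


matrix = parse_slit(SLIT_ROWS)
print(unreachable_localities(matrix))
```

In this example localities 2 and 3 are the ones reported, matching the description of nodes without initiators.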
> diff --git a/Documentation/driver-api/cxl/platform/acpi/srat.rst b/Documentation/driver-api/cxl/platform/acpi/srat.rst
> new file mode 100644
> index 000000000000..56d7bbb18c3b
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/acpi/srat.rst
> @@ -0,0 +1,44 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================
> +SRAT - Static Resource Affinity Table
> +=====================================
> +
> +The System/Static Resource Affinity Table describes resource (CPU, Memory)
> +affinity to "Proximity Domains". This table is technically optional, but for
> +performance information (see "HMAT") to be enumerated by Linux it must be
> +present.
> +
> +There is a careful dance between the CEDT and SRAT tables and how NUMA nodes are
> +created. If things don't look quite the way you expect - check the SRAT Memory
> +Affinity entries and CEDT CFMWS to determine what your platform actually
> +supports in terms of flexible topologies.
> +
> +The SRAT may statically assign portions of a CFMWS SPA range to specific
> +proximity domains. See Linux NUMA creation for more information about how
> +this presents in the NUMA topology.
> +
> +Proximity Domain
> +================
> +A proximity domain is ROUGHLY equivalent to "NUMA Node" - though a 1-to-1
> +mapping is not guaranteed. There are scenarios where "Proximity Domain 4" may
> +map to "NUMA Node 3", for example. (See "NUMA Node Creation")
> +
> +Memory Affinity
> +===============
> +Generally speaking, if a host does any amount of CXL fabric (decoder)
> +programming in BIOS - an SRAT entry for that memory needs to be present.
> +
> +Example ::
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000001 <- NUMA Node 1
> + Reserved1 : 0000
> + Base Address : 000000C050000000 <- Physical Memory Region
> + Address Length : 0000003CA0000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
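One way to catch the "SRAT is missing regions described in CEDT CFMWS" failure is to check that every CFMWS window is fully covered by SRAT Memory Affinity ranges. A minimal Python sketch, assuming the base/length pairs have been copied by hand from the disassembled tables (the function name is hypothetical):

```python
def window_is_covered(base: int, size: int, srat_ranges) -> bool:
    """True if [base, base + size) is fully covered by the given SRAT ranges.

    srat_ranges is an iterable of (base, length) pairs taken from SRAT
    Memory Affinity entries.
    """
    cursor = base
    end = base + size
    for r_base, r_size in sorted(srat_ranges):
        if r_base > cursor:
            break  # gap before the next range: part of the window is uncovered
        cursor = max(cursor, r_base + r_size)
        if cursor >= end:
            return True
    return cursor >= end


# The CFMWS window and SRAT entry from the examples share base and length.
print(window_is_covered(0xC050000000, 0x3CA0000000,
                        [(0xC050000000, 0x3CA0000000)]))
# With no SRAT entry for the window, the check fails.
print(window_is_covered(0xC050000000, 0x3CA0000000, []))
```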
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 06/17] cxl: docs/platform/example-configs documentation
2025-05-12 16:21 ` [PATCH v3 06/17] cxl: docs/platform/example-configs documentation Gregory Price
@ 2025-05-13 0:05 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-13 0:05 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Add example ACPI Table configurations for different sample platforms.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Documentation/driver-api/cxl/index.rst | 1 +
> .../cxl/platform/example-configs.rst | 13 +
> .../example-configurations/flexible.rst | 296 ++++++++++++++++++
> .../example-configurations/hb-interleave.rst | 107 +++++++
> .../multi-dev-per-hb.rst | 90 ++++++
> .../example-configurations/one-dev-per-hb.rst | 136 ++++++++
> 6 files changed, 643 insertions(+)
> create mode 100644 Documentation/driver-api/cxl/platform/example-configs.rst
> create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
> create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
> create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
> create mode 100644 Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
>
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index 336322dc35a0..6a5fb7e00c52 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -27,6 +27,7 @@ that have impacts on each other. The docs here break up configurations steps.
>
> platform/bios-and-efi
> platform/acpi
> + platform/example-configs
>
> .. toctree::
> :maxdepth: 1
> diff --git a/Documentation/driver-api/cxl/platform/example-configs.rst b/Documentation/driver-api/cxl/platform/example-configs.rst
> new file mode 100644
> index 000000000000..90a10d7473c6
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/example-configs.rst
> @@ -0,0 +1,13 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Example Platform Configurations
> +###############################
> +
> +.. toctree::
> + :maxdepth: 1
> + :caption: Contents
> +
> + example-configurations/one-dev-per-hb.rst
> + example-configurations/multi-dev-per-hb.rst
> + example-configurations/hb-interleave.rst
> + example-configurations/flexible.rst
> diff --git a/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst b/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
> new file mode 100644
> index 000000000000..e39daba65fa0
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/example-configurations/flexible.rst
> @@ -0,0 +1,296 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Flexible Presentation
> +=====================
> +This system has a single socket with two CXL host bridges. Each host bridge
> +has two CXL memory expanders with 4GB of memory each (16GB total).
> +
> +On this system, the platform designer wanted to provide the user flexibility
> +to configure the memory devices in various interleave or NUMA node
> +configurations. So they provided every combination.
> +
> +Things to note:
> +
> +* Cross-Bridge interleave is described in one CFMWS that covers all capacity.
> +* One CFMWS is also described per-host bridge.
> +* One CFMWS is also described per-device.
> +* This SRAT describes one node for each of the above CFMWS.
> +* The HMAT describes performance for each node in the SRAT.
> +
> +CEDT ::
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000007
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010370400000
> + Register length : 0000000000010000
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000006
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010380800000
> + Register length : 0000000000010000
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000001000000000
> + Window size : 0000000400000000
> + Interleave Members (2^n) : 01
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> + Second Target : 00000006
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000002000000000
> + Window size : 0000000200000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000002200000000
> + Window size : 0000000200000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000006
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000003000000000
> + Window size : 0000000100000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000003100000000
> + Window size : 0000000100000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000003200000000
> + Window size : 0000000100000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000006
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000003300000000
> + Window size : 0000000100000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000006
> +
> +SRAT ::
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000001
> + Reserved1 : 0000
> + Base Address : 0000001000000000
> + Address Length : 0000000400000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000002
> + Reserved1 : 0000
> + Base Address : 0000002000000000
> + Address Length : 0000000200000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000003
> + Reserved1 : 0000
> + Base Address : 0000002200000000
> + Address Length : 0000000200000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000004
> + Reserved1 : 0000
> + Base Address : 0000003000000000
> + Address Length : 0000000100000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000005
> + Reserved1 : 0000
> + Base Address : 0000003100000000
> + Address Length : 0000000100000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000006
> + Reserved1 : 0000
> + Base Address : 0000003200000000
> + Address Length : 0000000100000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000007
> + Reserved1 : 0000
> + Base Address : 0000003300000000
> + Address Length : 0000000100000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> +HMAT ::
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 00 [Latency]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Target Proximity Domain List : 00000002
> + Target Proximity Domain List : 00000003
> + Target Proximity Domain List : 00000004
> + Target Proximity Domain List : 00000005
> + Target Proximity Domain List : 00000006
> + Target Proximity Domain List : 00000007
> + Entry : 0080
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 03 [Bandwidth]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Target Proximity Domain List : 00000002
> + Target Proximity Domain List : 00000003
> + Target Proximity Domain List : 00000004
> + Target Proximity Domain List : 00000005
> + Target Proximity Domain List : 00000006
> + Target Proximity Domain List : 00000007
> + Entry : 1200
> + Entry : 0400
> + Entry : 0200
> + Entry : 0200
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> + Entry : 0100
> +
> +SLIT ::
> +
> + Signature : "SLIT" [System Locality Information Table]
> + Localities : 0000000000000008
> + Locality 0 : 10 20 20 20 20 20 20 20
> + Locality 1 : FF 0A FF FF FF FF FF FF
> + Locality 2 : FF FF 0A FF FF FF FF FF
> + Locality 3 : FF FF FF 0A FF FF FF FF
> + Locality 4 : FF FF FF FF 0A FF FF FF
> + Locality 5 : FF FF FF FF FF 0A FF FF
> + Locality 6 : FF FF FF FF FF FF 0A FF
> + Locality 7 : FF FF FF FF FF FF FF 0A
> +
> +DSDT ::
> +
> + Scope (_SB)
> + {
> + Device (S0D0)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x07) // _UID: Unique ID
> + }
> + ...
> + Device (S0D5)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x06) // _UID: Unique ID
> + }
> + }
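The window sizes in the flexible CEDT above can be cross-checked with a little arithmetic. A minimal sketch in Python, assuming the hex values are copied by hand from the dump; it does not parse a real ACPI table, and the `Window` tuple and helper names are made up for illustration:

```python
from collections import namedtuple

GIB = 1 << 30

# Hypothetical container for the CFMWS fields used below.
Window = namedtuple("Window", "base size ways_log2 targets")

# Values copied from the "Flexible Presentation" CEDT dump.
WINDOWS = [
    Window(0x1000000000, 0x400000000, 1, (7, 6)),  # cross-bridge interleave
    Window(0x2000000000, 0x200000000, 0, (7,)),    # all of host bridge 7
    Window(0x2200000000, 0x200000000, 0, (6,)),    # all of host bridge 6
    Window(0x3000000000, 0x100000000, 0, (7,)),    # one device, bridge 7
    Window(0x3100000000, 0x100000000, 0, (7,)),    # one device, bridge 7
    Window(0x3200000000, 0x100000000, 0, (6,)),    # one device, bridge 6
    Window(0x3300000000, 0x100000000, 0, (6,)),    # one device, bridge 6
]


def size_gib(window):
    """Window size in GiB."""
    return window.size // GIB


def ways(window):
    """Interleave ways, reading the field as 2^n as the dump labels it."""
    return 1 << window.ways_log2


def overlaps(a, b):
    """True if two windows share any host physical address."""
    return a.base < b.base + b.size and b.base < a.base + a.size
```

With these values the windows come out as 16, 8, 8, 4, 4, 4 and 4 GiB, each window has as many targets as interleave ways, and no two windows overlap. The 2^n reading only holds for the power-of-two encodings used in this example.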
> diff --git a/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst b/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
> new file mode 100644
> index 000000000000..ce07e6162f26
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/example-configurations/hb-interleave.rst
> @@ -0,0 +1,107 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +============================
> +Cross-Host-Bridge Interleave
> +============================
> +This system has a single socket with two CXL host bridges. Each host bridge
> +has a single CXL memory expander with 4GB of memory.
> +
> +Things to note:
> +
> +* Cross-Bridge interleave is described.
> +* The expanders are described by a single CFMWS.
> +* This SRAT describes one node for both host bridges.
> +* The HMAT describes a single node's performance.
> +
> +CEDT ::
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000007
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010370400000
> + Register length : 0000000000010000
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000006
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010380800000
> + Register length : 0000000000010000
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000001000000000
> + Window size : 0000000200000000
> + Interleave Members (2^n) : 01
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> + Second Target : 00000006
> +
> +SRAT ::
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000001
> + Reserved1 : 0000
> + Base Address : 0000001000000000
> + Address Length : 0000000200000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> +HMAT ::
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 00 [Latency]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Target Proximity Domain List : 00000002
> + Entry : 0080
> + Entry : 0100
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 03 [Bandwidth]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Target Proximity Domain List : 00000002
> + Entry : 1200
> + Entry : 0400
> +
> +SLIT ::
> +
> + Signature : "SLIT" [System Locality Information Table]
> + Localities : 0000000000000002
> + Locality 0 : 10 20
> + Locality 1 : FF 0A
> +
> +DSDT ::
> +
> + Scope (_SB)
> + {
> + Device (S0D0)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x07) // _UID: Unique ID
> + }
> + ...
> + Device (S0D5)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x06) // _UID: Unique ID
> + }
> + }
> diff --git a/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst b/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
> new file mode 100644
> index 000000000000..6adf7c639490
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/example-configurations/multi-dev-per-hb.rst
> @@ -0,0 +1,90 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +================================
> +Multiple Devices per Host Bridge
> +================================
> +
> +In this example system we will have a single socket and one CXL host bridge.
> +Two CXL memory expanders with 4GB each are attached to the host bridge.
> +
> +Things to note:
> +
> +* Intra-Bridge interleave is not described here.
> +* The expanders are described by a single CEDT/CFMWS.
> +* This CEDT/SRAT describes one node for both devices.
> +* There is only one proximity domain in the HMAT for both devices.
> +
> +CEDT ::
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000007
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010370400000
> + Register length : 0000000000010000
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000001000000000
> + Window size : 0000000200000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> +
> +SRAT ::
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000001
> + Reserved1 : 0000
> + Base Address : 0000001000000000
> + Address Length : 0000000200000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> +HMAT ::
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 00 [Latency]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Entry : 0080
> + Entry : 0100
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 03 [Bandwidth]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Entry : 1200
> + Entry : 0200
> +
> +SLIT ::
> +
> + Signature : "SLIT" [System Locality Information Table]
> + Localities : 0000000000000002
> + Locality 0 : 10 20
> + Locality 1 : FF 0A
> +
> +DSDT ::
> +
> + Scope (_SB)
> + {
> + Device (S0D0)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x07) // _UID: Unique ID
> + }
> + ...
> + }
> diff --git a/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst b/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
> new file mode 100644
> index 000000000000..b89ba3cab98f
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/platform/example-configurations/one-dev-per-hb.rst
> @@ -0,0 +1,136 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +One Device per Host Bridge
> +==========================
> +
> +This system has a single socket with two CXL host bridges. Each host bridge
> +has a single CXL memory expander with 4GB of memory.
> +
> +Things to note:
> +
> +* Cross-Bridge interleave is not being used.
> +* The expanders are in two separate but adjacent memory regions.
> +* This CEDT/SRAT describes one node per device.
> +* The expanders have the same performance and will be in the same memory tier.
> +
> +CEDT ::
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000007
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010370400000
> + Register length : 0000000000010000
> +
> + Subtable Type : 00 [CXL Host Bridge Structure]
> + Reserved : 00
> + Length : 0020
> + Associated host bridge : 00000006
> + Specification version : 00000001
> + Reserved : 00000000
> + Register base : 0000010380800000
> + Register length : 0000000000010000
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000001000000000
> + Window size : 0000000100000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000007
> +
> + Subtable Type : 01 [CXL Fixed Memory Window Structure]
> + Reserved : 00
> + Length : 002C
> + Reserved : 00000000
> + Window base address : 0000001100000000
> + Window size : 0000000100000000
> + Interleave Members (2^n) : 00
> + Interleave Arithmetic : 00
> + Reserved : 0000
> + Granularity : 00000000
> + Restrictions : 0006
> + QtgId : 0001
> + First Target : 00000006
> +
> +SRAT ::
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000001
> + Reserved1 : 0000
> + Base Address : 0000001000000000
> + Address Length : 0000000100000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> + Subtable Type : 01 [Memory Affinity]
> + Length : 28
> + Proximity Domain : 00000002
> + Reserved1 : 0000
> + Base Address : 0000001100000000
> + Address Length : 0000000100000000
> + Reserved2 : 00000000
> + Flags (decoded below) : 0000000B
> + Enabled : 1
> + Hot Pluggable : 1
> + Non-Volatile : 0
> +
> +HMAT ::
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 00 [Latency]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Target Proximity Domain List : 00000002
> + Entry : 0080
> + Entry : 0100
> + Entry : 0100
> +
> + Structure Type : 0001 [SLLBI]
> + Data Type : 03 [Bandwidth]
> + Target Proximity Domain List : 00000000
> + Target Proximity Domain List : 00000001
> + Target Proximity Domain List : 00000002
> + Entry : 1200
> + Entry : 0200
> + Entry : 0200
> +
> +SLIT ::
> +
> + Signature : "SLIT" [System Locality Information Table]
> + Localities : 0000000000000003
> + Locality 0 : 10 20 20
> + Locality 1 : FF 0A FF
> + Locality 2 : FF FF 0A
> +
> +DSDT ::
> +
> + Scope (_SB)
> + {
> + Device (S0D0)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x07) // _UID: Unique ID
> + }
> + ...
> + Device (S0D5)
> + {
> + Name (_HID, "ACPI0016" /* Compute Express Link Host Bridge */) // _HID: Hardware ID
> + ...
> + Name (_UID, 0x06) // _UID: Unique ID
> + }
> + }
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 07/17] cxl: docs/linux - overview
2025-05-12 16:21 ` [PATCH v3 07/17] cxl: docs/linux - overview Gregory Price
@ 2025-05-13 0:09 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-13 0:09 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Add type-3 device configuration overview that explains the probe
> process for a type-3 device from early-boot through memory-hotplug.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Documentation/driver-api/cxl/index.rst | 3 +-
> .../driver-api/cxl/linux/overview.rst | 103 ++++++++++++++++++
> 2 files changed, 105 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/driver-api/cxl/linux/overview.rst
>
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index 6a5fb7e00c52..bc2228c77c32 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -30,9 +30,10 @@ that have impacts on each other. The docs here break up configurations steps.
> platform/example-configs
>
> .. toctree::
> - :maxdepth: 1
> + :maxdepth: 2
> :caption: Linux Kernel Configuration
>
> + linux/overview
> linux/access-coordinates
>
>
> diff --git a/Documentation/driver-api/cxl/linux/overview.rst b/Documentation/driver-api/cxl/linux/overview.rst
> new file mode 100644
> index 000000000000..648beb2c8c83
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/linux/overview.rst
> @@ -0,0 +1,103 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========
> +Overview
> +========
> +
> +This section presents the configuration process of a CXL Type-3 memory device,
> +and how it is ultimately exposed to users as either a :code:`DAX` device or
> +normal memory pages via the kernel's page allocator.
> +
> +Portions marked with a bullet are points at which certain kernel objects
> +are generated.
> +
> +1) Early Boot
> +
> + a) BIOS, Build, and Boot Parameters
> +
> + i) EFI_MEMORY_SP
> + ii) CONFIG_EFI_SOFT_RESERVE
> + iii) CONFIG_MHP_DEFAULT_ONLINE_TYPE
> + iv) nosoftreserve
> +
> + b) Memory Map Creation
> +
> + i) EFI Memory Map / E820 Consulted for Soft-Reserved
> +
> + * CXL Memory is set aside to be handled by the CXL driver
> +
> + * Soft-Reserved IO Resource created for CFMWS entry
> +
> + c) NUMA Node Creation
> +
> + * Nodes created from ACPI CEDT CFMWS and SRAT Proximity domains (PXM)
> +
> + d) Memory Tier Creation
> +
> + * A default memory_tier is created with all nodes.
> +
> + e) Contiguous Memory Allocation
> +
> + * Any requested CMA is allocated from Online nodes
> +
> + f) Init Finishes, Drivers start probing
> +
> +2) ACPI and PCI Drivers
> +
> + a) Detects PCI device is CXL, marking it for probe by CXL driver
> +
> +3) CXL Driver Operation
> +
> + a) Base device creation
> +
> + * root, port, and memdev devices created
> + * CEDT CFMWS IO Resource creation
> +
> + b) Decoder creation
> +
> + * root, switch, and endpoint decoders created
> +
> + c) Logical device creation
> +
> + * memory_region and endpoint devices created
> +
> + d) Devices are associated with each other
> +
> + * If auto-decoder (BIOS-programmed decoders), driver validates
> + configurations, builds associations, and locks configs at probe time.
> +
> + * If user-configured, validation and associations are built at
> + decoder-commit time.
> +
> + e) Regions surfaced as DAX region
> +
> + * dax_region created
> +
> + * DAX device created via DAX driver
> +
> +4) DAX Driver Operation
> +
> + a) DAX driver surfaces DAX region as one of two dax device modes
> +
> + * kmem - dax device is converted to hotplug memory blocks
> +
> + * DAX kmem IO Resource creation
> +
> + * hmem - dax device is left as daxdev to be accessed as a file.
> +
> + * If hmem, journey ends here.
> +
> + b) DAX kmem surfaces memory region to Memory Hotplug to add to page
> + allocator as "driver managed memory"
> +
> +5) Memory Hotplug
> +
> + a) mhp component surfaces a dax device memory region as multiple memory
> + blocks to the page allocator
> +
> + * blocks appear in :code:`/sys/bus/memory/devices` and linked to a NUMA node
> +
> + b) blocks are onlined into the requested zone (NORMAL or MOVABLE)
> +
> + * Memory is marked "Driver Managed" to prevent kexec from using it as a
> + region for kernel updates
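The end state of step 5 can be inspected from userspace. A minimal sketch in Python, assuming each `memoryN` directory under `/sys/bus/memory/devices` carries a `state` file; the function name and the `root` parameter are made up for illustration, the latter so the sketch can be pointed at a test directory:

```python
import os


def memory_block_states(root="/sys/bus/memory/devices"):
    """Map each memory block name (e.g. 'memory32') to its state string."""
    states = {}
    for name in sorted(os.listdir(root)):
        state_file = os.path.join(root, name, "state")
        # Skip anything that is not a memory block with a state attribute.
        if name.startswith("memory") and os.path.isfile(state_file):
            with open(state_file) as f:
                states[name] = f.read().strip()
    return states
```

Calling `memory_block_states()` before and after onlining a dax device in kmem mode should show the new blocks appear; which zone they land in is not visible from `state` alone.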
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 14/17] cxl: docs/allocation/page-allocator
2025-05-12 18:09 ` Gregory Price
@ 2025-05-13 2:39 ` dan.j.williams
0 siblings, 0 replies; 33+ messages in thread
From: dan.j.williams @ 2025-05-13 2:39 UTC (permalink / raw)
To: Gregory Price, Matthew Wilcox
Cc: linux-cxl, linux-doc, linux-kernel, kernel-team, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
ira.weiny, dan.j.williams, corbet
Gregory Price wrote:
> On Mon, May 12, 2025 at 06:52:31PM +0100, Matthew Wilcox wrote:
> > >
> > > Feel free to submit patches that deletes the existing code if you want
> > > it removed from the documentation.
> >
> > Who sneaked that in when?
>
> The ACPI and EFI folks when they allowed for CXL memory to be marked
> EFI_CONVENTIONAL_MEMORY - which means Linux can't actually differentiate
> between DRAM and CXL during __init and brings it online in the page
> allocator as SystemRAM in ZONE_NORMAL (attached to the NUMA node that
> maps to the Proximity Domain in the SRAT).
>
> Not sure there's anything you can do about that.
>
> And for DAX:
>
> 09d09e04d2 (cxl/dax: Create dax devices for CXL RAM regions)
>
> Which allows for EFI_MEMORY_SP / Soft Reserved CXL regions to be brought
> up as DAX devices (which can be bound to SystemRAM via DAX kmem).
>
> Wasn't much sneaking going on here - DAX kmem has been around and hacked
> on since 2019, and probably some years before that.
Right.
These interfaces have been there for a long time and this documentation
is simply catching up with what is there today. I called for all of this
documentation to go upstream and have no problem defending it to Linus.
Appreciate all the work here Gregory!
Now, is device-dax and dax_kmem the long term solution for exposing
memory of this relative performance class? After LSF/MM this year I am
convinced the answer is "no". Specifically I want to see a solution that
meets what this astute LWN commenter recommended:
https://lwn.net/Articles/1017142/
We can delete documentation and infrastructure once we have the
replacement interface upstream and can start a deprecation process.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 08/17] cxl: docs/linux - early boot configuration
2025-05-12 16:21 ` [PATCH v3 08/17] cxl: docs/linux - early boot configuration Gregory Price
@ 2025-05-13 17:56 ` Dave Jiang
0 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-13 17:56 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet
On 5/12/25 9:21 AM, Gregory Price wrote:
> Document __init time configurations that affect CXL driver probe
> process and memory region configuration.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
> Documentation/driver-api/cxl/index.rst | 1 +
> .../driver-api/cxl/linux/early-boot.rst | 131 ++++++++++++++++++
> 2 files changed, 132 insertions(+)
> create mode 100644 Documentation/driver-api/cxl/linux/early-boot.rst
>
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index bc2228c77c32..d2eefe575604 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -34,6 +34,7 @@ that have impacts on each other. The docs here break up configurations steps.
> :caption: Linux Kernel Configuration
>
> linux/overview
> + linux/early-boot
> linux/access-coordinates
>
>
> diff --git a/Documentation/driver-api/cxl/linux/early-boot.rst b/Documentation/driver-api/cxl/linux/early-boot.rst
> new file mode 100644
> index 000000000000..8c1c497bc772
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/linux/early-boot.rst
> @@ -0,0 +1,131 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======================
> +Linux Init (Early Boot)
> +=======================
> +
> +Linux configuration is split into two major steps: Early-Boot and everything else.
> +
> +During early boot, Linux sets up immutable resources (such as numa nodes), while
> +later operations include things like driver probe and memory hotplug. Linux may
> +read EFI and ACPI information throughout this process to configure logical
> +representations of the devices.
> +
> +During the Linux Early Boot stage (functions in the kernel that have the __init
> +decorator), the system takes the resources created by EFI/BIOS (ACPI tables)
> +and turns them into resources that the kernel can consume.
> +
> +
> +BIOS, Build and Boot Options
> +============================
> +
> +There are 4 pre-boot options that need to be considered during kernel build
> +which dictate how memory will be managed by Linux during early boot.
> +
> +* EFI_MEMORY_SP
> +
> + * BIOS/EFI Option that dictates whether memory is SystemRAM or
> + Specific Purpose. Specific Purpose memory will be deferred to
> + drivers to manage - and not immediately exposed as system RAM.
> +
> +* CONFIG_EFI_SOFT_RESERVE
> +
> + * Linux Build config option that dictates whether the kernel supports
> + Specific Purpose memory.
> +
> +* CONFIG_MHP_DEFAULT_ONLINE_TYPE
> +
> + * Linux Build config that dictates whether and how Specific Purpose memory
> + converted to a dax device should be managed (left as DAX or onlined as
> + SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).
> +
> +* nosoftreserve
> +
> + * Linux kernel boot option that dictates whether Soft Reserve should be
> + supported. Similar to CONFIG_EFI_SOFT_RESERVE.
> +
> +Memory Map Creation
> +===================
> +
> +While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory
> +is supported and detected, it will set this region aside as
> +:code:`SOFT_RESERVED`.
> +
> +If :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or
> +:code:`nosoftreserve=y` - Linux will default a CXL device memory region to
> +SystemRAM. This will expose the memory to the kernel page allocator in
> +:code:`ZONE_NORMAL`, making it available for use for most allocations (including
> +:code:`struct page` and page tables).
> +
> +If `Specific Purpose` is set and supported, :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*`
> +dictates whether the memory is onlined by default (:code:`_OFFLINE` or
> +:code:`_ONLINE_*`), and if online which zone to online this memory to by default
> +(:code:`_NORMAL` or :code:`_MOVABLE`).
> +
> +If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most
> +kernel allocations (such as :code:`struct page` or page tables). This may
> +significantly impact performance depending on the memory capacity of the system.
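The three-way condition in the Memory Map Creation section can be written out as a small decision function. This is only a model of the prose, sketched in Python; the function name, parameter names and return strings are made up for illustration and do not correspond to kernel symbols:

```python
def early_boot_disposition(efi_memory_sp, config_efi_soft_reserve, nosoftreserve):
    """Model how early boot treats a CXL memory range, per the text above.

    efi_memory_sp           -- platform marked the range Specific Purpose
    config_efi_soft_reserve -- kernel built with CONFIG_EFI_SOFT_RESERVE
    nosoftreserve           -- 'nosoftreserve' given on the command line
    """
    if efi_memory_sp and config_efi_soft_reserve and not nosoftreserve:
        # Set aside; the CXL and DAX drivers decide what happens later.
        return "soft-reserved"
    # Any one of the three conditions failing falls back to SystemRAM.
    return "system-ram"
```

Only one of the eight input combinations yields a Soft Reserved range; every other combination exposes the memory to the page allocator as SystemRAM during early boot.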
> +
> +
> +NUMA Node Reservation
> +=====================
> +
> +Linux refers to the proximity domains (:code:`PXM`) defined in the SRAT to
> +create NUMA nodes in :code:`acpi_numa_init`. Typically, there is a 1:1 relation
> +between :code:`PXM` and NUMA node IDs.
> +
> +SRAT is the only ACPI defined way of defining Proximity Domains. Linux chooses
> +to, at most, map those 1:1 with NUMA nodes. CEDT adds a description of SPA
> +ranges which Linux may wish to map to one or more NUMA nodes.
> +
> +If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM`
> +is created (as of v6.15). In the future, Linux may reject CFMWS not described
> +by SRAT due to the ambiguity of proximity domain association.
> +
> +It is important to note that NUMA node creation cannot be done at runtime. All
> +possible NUMA nodes are identified at :code:`__init` time, more specifically
> +during :code:`mm_init`. The CEDT and SRAT must contain sufficient :code:`PXM`
> +data for Linux to identify NUMA nodes and their associated memory regions.
> +
> +The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.
> +
> +See the Example Platform Configurations section for more information.
> +
> +Memory Tiers Creation
> +=====================
> +Memory tiers are a collection of NUMA nodes grouped by performance characteristics.
> +During :code:`__init`, Linux initializes the system with a default memory tier that
> +contains all nodes marked :code:`N_MEMORY`.
> +
> +:code:`memory_tier_init` is called at boot for all nodes with memory online by
> +default. :code:`memory_tier_late_init` is called during late-init for nodes setup
> +during driver configuration.
> +
> +Nodes are only marked :code:`N_MEMORY` if they have *online* memory.
> +
> +Tier membership can be inspected in ::
> +
> + /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
> + 0-1
> +
> +If nodes with clearly different performance are grouped together, check the HMAT
> +and CDAT information for the CXL nodes. All nodes default to the DRAM tier,
> +unless HMAT/CDAT information is reported to the memory_tier component via
> +`access_coordinates`.
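The `nodelist` value shown above (`0-1`) is a comma-separated list of node IDs and ranges. A minimal sketch in Python for expanding it when checking tier membership from a script; `parse_nodelist` is a hypothetical helper, not an existing tool:

```python
def parse_nodelist(text):
    """Expand a list such as '0-1' or '0,2-3' into a list of node IDs."""
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty nodelist
        first, _, last = part.partition("-")
        # A bare number like '2' is treated as the range 2-2.
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes
```

For example, reading `memory_tierN/nodelist` and passing the contents through `parse_nodelist` gives the nodes in that tier, which can then be compared against the nodes the CXL regions were expected to land on.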
> +
> +Contiguous Memory Allocation
> +============================
> +The contiguous memory allocator (CMA) enables reservation of contiguous memory
> +regions on NUMA nodes during early boot. However, CMA cannot reserve memory
> +on NUMA nodes that are not online during early boot. ::
> +
> + void __init hugetlb_cma_reserve(int order) {
> + if (!node_online(nid))
> + /* do not allow reservations */
> + }
> +
> +This means if users intend to defer management of CXL memory to the driver, CMA
> +cannot be used to guarantee huge page allocations. If enabling CXL memory as
> +SystemRAM in `ZONE_NORMAL` during early boot, CMA reservations per-node can be
> +made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line
> +parameters.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v3 00/17] CXL Boot to Bash Documentation
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
` (16 preceding siblings ...)
2025-05-12 16:21 ` [PATCH v3 17/17] cxl: docs - add self-referencing cross-links Gregory Price
@ 2025-05-13 20:38 ` Dave Jiang
17 siblings, 0 replies; 33+ messages in thread
From: Dave Jiang @ 2025-05-13 20:38 UTC (permalink / raw)
To: Gregory Price, linux-cxl
Cc: linux-doc, linux-kernel, kernel-team, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams,
corbet, Joshua Hahn
On 5/12/25 9:21 AM, Gregory Price wrote:
> v3:
> - Cross-links (Bagas)
> - Grammar and spelling (Randy)
> - added fixups to access-coordinates (Bagas)
> - Drop TODO sections (use-case, memory-tiering, CDAT/UEFI, SRAT Genport)
> I unfortunately won't be able to come back around to this for
> a while, so I'd rather not let this rot.
Applied to cxl/next
Thread overview: 33+ messages
2025-05-12 16:21 [PATCH v3 00/17] CXL Boot to Bash Documentation Gregory Price
2025-05-12 16:21 ` [PATCH v3 01/17] cxl: update documentation structure in prep for new docs Gregory Price
2025-05-12 22:46 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 02/17] cxl: docs - access-coordinates doc fixups Gregory Price
2025-05-12 22:47 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 03/17] cxl: docs/devices - add cxl device and protocol reference Gregory Price
2025-05-12 23:08 ` Dave Jiang
2025-05-12 23:22 ` Gregory Price
2025-05-12 16:21 ` [PATCH v3 04/17] cxl: docs/platform/bios-and-efi documentation Gregory Price
2025-05-12 23:31 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 05/17] cxl: docs/platform/acpi reference documentation Gregory Price
2025-05-12 23:49 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 06/17] cxl: docs/platform/example-configs documentation Gregory Price
2025-05-13 0:05 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 07/17] cxl: docs/linux - overview Gregory Price
2025-05-13 0:09 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 08/17] cxl: docs/linux - early boot configuration Gregory Price
2025-05-13 17:56 ` Dave Jiang
2025-05-12 16:21 ` [PATCH v3 09/17] cxl: docs/linux - add cxl-driver theory of operation Gregory Price
2025-05-12 16:21 ` [PATCH v3 10/17] cxl: docs/linux/cxl-driver - add example configurations Gregory Price
2025-05-12 16:21 ` [PATCH v3 11/17] cxl: docs/linux/dax-driver documentation Gregory Price
2025-05-12 16:21 ` [PATCH v3 12/17] cxl: docs/linux/memory-hotplug Gregory Price
2025-05-12 16:21 ` [PATCH v3 13/17] cxl: docs/allocation/dax Gregory Price
2025-05-12 16:21 ` [PATCH v3 14/17] cxl: docs/allocation/page-allocator Gregory Price
2025-05-12 16:34 ` Matthew Wilcox
2025-05-12 16:38 ` Gregory Price
2025-05-12 17:52 ` Matthew Wilcox
2025-05-12 18:09 ` Gregory Price
2025-05-13 2:39 ` dan.j.williams
2025-05-12 16:21 ` [PATCH v3 15/17] cxl: docs/allocation/reclaim Gregory Price
2025-05-12 16:21 ` [PATCH v3 16/17] cxl: docs/allocation/hugepages Gregory Price
2025-05-12 16:21 ` [PATCH v3 17/17] cxl: docs - add self-referencing cross-links Gregory Price
2025-05-13 20:38 ` [PATCH v3 00/17] CXL Boot to Bash Documentation Dave Jiang