linux-cxl.vger.kernel.org archive mirror
* [PATCH 00/16] CXL.mem error isolation support
@ 2025-07-30 21:47 Ben Cheatham
From: Ben Cheatham @ 2025-07-30 21:47 UTC
  To: linux-cxl, linux-pci, linux-acpi; +Cc: Ben Cheatham

Overview
========
This series adds support for the CXL.mem Timeout & Isolation Capability
as defined by the CXL 3.2 spec (section 8.2.4.24), with some extras
(explained below). This is an optional capability implemented by
CXL-capable PCIe Root Ports to prevent the host system from resetting
when the CXL.mem link times out or goes down (e.g. a CXL memory device is
surprise removed or dies). Without this capability, the system is
expected to immediately reset or power off when either of these
conditions occurs.

When CXL.mem isolation is triggered, the CXL memory below the port is no
longer accessible. Writes to the memory from the host are expected to
silently drop, while a synchronous response is expected for reads. This
response is implementation specific, but an example response would be poisoned
data.
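
As a rough illustration of the semantics described above (not code from
this series), the toy model below treats a port as a small array of
memory plus an "isolated" flag. Every name in it (toy_port, toy_write,
toy_read, TOY_POISON) is hypothetical, and the all-ones poison value is
only a stand-in for whatever implementation-specific response the
hardware provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names come from the series. */
#define TOY_MEM_WORDS 4
/* Assumed stand-in for the implementation-specific read response. */
#define TOY_POISON UINT64_MAX

struct toy_port {
	bool isolated;               /* set once link down or timeout triggers isolation */
	uint64_t mem[TOY_MEM_WORDS]; /* CXL memory below the port */
};

/* Writes to an isolated port are silently dropped: the caller sees no error. */
static void toy_write(struct toy_port *port, unsigned int idx, uint64_t val)
{
	assert(idx < TOY_MEM_WORDS);
	if (port->isolated)
		return;
	port->mem[idx] = val;
}

/* Reads always complete synchronously; an isolated port answers with poison. */
static uint64_t toy_read(const struct toy_port *port, unsigned int idx)
{
	assert(idx < TOY_MEM_WORDS);
	if (port->isolated)
		return TOY_POISON;
	return port->mem[idx];
}
```

In this sketch, a write issued after the flag is set leaves the stored
value untouched and reports nothing, while a read returns TOY_POISON
instead of hanging or resetting the system.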

The specific features enabled by this series are:
 - Enabling CXL.mem isolation on link down conditions and transaction
   timeout
 - Setting up, enabling, and handling CXL isolation interrupts
 - Preventing onlining/enabling of isolated CXL memory
 - Sysfs attributes for system administrators to tune isolation
   capabilities

The Extras
==========
The last 3 commits provide support for an ECN [1] submitted by AMD that
allows platform firmware to modify how the OS enables and handles CXL
isolation. The ECN contents are expected to land in revision 4 of the
CXL spec. The link at [1] is only accessible to CXL SSWG members, but
I've done my best to explain the changes in the relevant commits.

The changes in these commits could probably be moved to earlier commits,
but I've opted to leave them tacked on the end just in case anyone has a
problem with their inclusion.

Intended Behavior
=================
Due to how CXL memory is currently handled by Linux, this feature isn't
all that useful for type 3 cards. The intended behavior for type 3 cards
is to panic when isolation is triggered, which defeats the purpose of
the feature.

The reason I'm sending this out anyway is twofold:
1) I've seen rumblings that CXL memory will be part of its own opt-in
allocator in the future and the memory may be safely removable at that
point.
2) CXL memory provided by a Type 2 card may be safely removable, though
it's left up to the type 2 endpoint driver to handle isolation recovery.

I've also not included a flow for isolation recovery. This is because
a) I don't have a system that supports it, and b) it's not applicable to
the type 3 driver.

Building the Set
================
This series is based on both Terry's port error handling patch set (v10)
and Dave's deferred downstream port probe set (v7). Terry's set was needed
since it introduces the uncorrectable CXL error = system panic paradigm, as
well as the routines for logging the AER info from the CXL subsystem.

I included Dave's set due to a timing issue I saw where the PCIe portdrv
code would run after the CXL ports that have the isolation capability
were probed. This caused the isolation set up to fail because the PCIe
portdrv provides the information to allocate the CXL isolation
interrupt. I tried deferring the probe, but the deferral caused the
cxl_mem driver to break because the port wasn't probed yet. I could have
introduced a scheme to get around this, but it seemed easier to just use
Dave's set to fix it.

The isolation support is gated behind the CXL core being built-in
because the CXL isolation PCIe service needs the mapping code in
cxl/core/regs.c. I realize a rework of the PCIe portdrv is planned that
will (hopefully) remove this requirement, so I've kept the code as
minimal as possible.

To build the set I applied Terry's set to the base commit below, Dave's
on top of that, then my patches.

Patch Breakdown
===============
Patches 3-5 & 12-13 will need eyes from PCIe folks.
Patch 14 needs an ACPI reviewer.

- Patches 1-2: Register mapping updates needed for isolation support
- Patches 3-5: CXL isolation service driver & MSI/-X vector allocation
- Patch 6: Enable CXL.mem isolation
- Patches 7-8: Set up and enable CXL isolation interrupts
- Patch 9: Prevent onlining of isolated memory
- Patch 10: Enable CXL.mem transaction timeout
- Patch 11: cxl_pci isolation handler
- Patches 12-13: CXL isolation sysfs attributes
- Patch 14: ECN changes to CXL _OSC method
- Patches 15-16: ECN additions

[1]:
Link: https://members.computeexpresslink.org/wg/software_systems/document/3118

Ben Cheatham (16):
  cxl/regs: Add cxl_unmap_component_regs()
  cxl/regs: Add CXL Isolation capability mapping
  PCI: PCIe portdrv: Add CXL Isolation service driver
  PCI: PCIe portdrv: Allocate CXL isolation MSI/-X vector
  PCI: PCIe portdrv: Add interface for getting CXL isolation IRQ
  cxl/core: Enable CXL.mem isolation
  cxl/core: Set up isolation interrupts
  cxl/core: Enable CXL isolation interrupts
  cxl/core: Prevent onlining CXL memory behind isolated ports
  cxl/core: Enable CXL.mem timeout
  cxl/pci: Add isolation handler
  PCI: PCIe portdrv: Add cxl_isolation sysfs attributes
  cxl/core, PCI: PCIe portdrv: Add CXL timeout range programming
  ACPI: Add CXL isolation _OSC fields
  cxl/core, cxl/acpi: Enable CXL isolation based on _OSC handshake
  cxl/core, cxl/acpi: Add CXL isolation notify handler

 drivers/acpi/pci_root.c          |   9 +
 drivers/cxl/Kconfig              |  14 ++
 drivers/cxl/acpi.c               |  75 +++++++
 drivers/cxl/core/core.h          |   2 +
 drivers/cxl/core/pci.c           | 138 ++++++++++++
 drivers/cxl/core/port.c          | 248 +++++++++++++++++++++
 drivers/cxl/core/region.c        |   3 +
 drivers/cxl/core/regs.c          |  85 +++++--
 drivers/cxl/cxl.h                |  35 +++
 drivers/cxl/cxlmem.h             |   4 +
 drivers/cxl/pci.c                |   9 +
 drivers/pci/pci-sysfs.c          |   3 +
 drivers/pci/pci.h                |   4 +
 drivers/pci/pcie/Makefile        |   1 +
 drivers/pci/pcie/cxl_isolation.c | 371 +++++++++++++++++++++++++++++++
 drivers/pci/pcie/portdrv.c       |  21 +-
 drivers/pci/pcie/portdrv.h       |  18 +-
 include/cxl/isolation.h          |  66 ++++++
 include/linux/acpi.h             |   3 +
 19 files changed, 1086 insertions(+), 23 deletions(-)
 create mode 100644 drivers/pci/pcie/cxl_isolation.c
 create mode 100644 include/cxl/isolation.h

base-commit: a403fe6c0b17f472e01246eb350f5eef105243ac
-- 
2.34.1



Thread overview: 24+ messages
2025-07-30 21:47 [PATCH 00/16] CXL.mem error isolation support Ben Cheatham
2025-07-30 21:47 ` [PATCH 01/16] cxl/regs: Add cxl_unmap_component_regs() Ben Cheatham
2025-07-30 21:47 ` [PATCH 02/16] cxl/regs: Add CXL Isolation capability mapping Ben Cheatham
2025-07-30 21:47 ` [PATCH 03/16] PCI: PCIe portdrv: Add CXL Isolation service driver Ben Cheatham
2025-07-30 21:47 ` [PATCH 04/16] PCI: PCIe portdrv: Allocate CXL isolation MSI/-X vector Ben Cheatham
2025-08-04 21:39   ` Bjorn Helgaas
2025-08-06 17:58     ` Cheatham, Benjamin
2025-07-30 21:47 ` [PATCH 05/16] PCI: PCIe portdrv: Add interface for getting CXL isolation IRQ Ben Cheatham
2025-07-31  5:59   ` Lukas Wunner
2025-07-31 13:13     ` Cheatham, Benjamin
2025-07-30 21:47 ` [PATCH 06/16] cxl/core: Enable CXL.mem isolation Ben Cheatham
2025-07-30 21:47 ` [PATCH 07/16] cxl/core: Set up isolation interrupts Ben Cheatham
2025-07-30 21:47 ` [PATCH 08/16] cxl/core: Enable CXL " Ben Cheatham
2025-07-30 21:47 ` [PATCH 09/16] cxl/core: Prevent onlining CXL memory behind isolated ports Ben Cheatham
2025-07-30 21:47 ` [PATCH 10/16] cxl/core: Enable CXL.mem timeout Ben Cheatham
2025-07-30 21:47 ` [PATCH 11/16] cxl/pci: Add isolation handler Ben Cheatham
2025-07-30 21:47 ` [PATCH 12/16] PCI: PCIe portdrv: Add cxl_isolation sysfs attributes Ben Cheatham
2025-07-30 21:47 ` [PATCH 13/16] cxl/core, PCI: PCIe portdrv: Add CXL timeout range programming Ben Cheatham
2025-08-04 21:39   ` Bjorn Helgaas
2025-08-06 17:58     ` Cheatham, Benjamin
2025-07-30 21:47 ` [PATCH 14/16] ACPI: Add CXL isolation _OSC fields Ben Cheatham
2025-08-22 19:19   ` Rafael J. Wysocki
2025-07-30 21:47 ` [PATCH 15/16] cxl/core, cxl/acpi: Enable CXL isolation based on _OSC handshake Ben Cheatham
2025-07-30 21:47 ` [PATCH 16/16] cxl/core, cxl/acpi: Add CXL isolation notify handler Ben Cheatham
