* RFC: CXL Isolation Support
From: Cheatham, Benjamin @ 2026-01-30 19:47 UTC
To: linux-cxl; +Cc: benjamin.cheatham
Quick Background:
CXL.mem isolation and timeout is a mechanism that allows the host to
continue operation in the event a CXL.mem link goes down or a CXL.mem
transaction times out (semi-analogous to PCIe DPC for CXL)[1]. After CXL.mem
isolation is triggered all CXL memory below the root port is inaccessible.
At this point writes to the memory are dropped and reads return synchronous
exceptions (platform specific, but probably poisoned data). The alternative
to this support (which is the case now) is the host system resets when a
CXL.mem link goes down or a CXL.mem transaction times out.
Why I'm Sending This:
I sent out a patch series a few months back that implemented CXL.mem
error isolation to this list [2]. It didn't really gain traction due
to not having a customer requesting it. We (AMD) have heard from some
customers that they are interested in this support, but aren't willing to
help out upstream. The main motivation behind using isolation we've heard
is that customers would like to use CXL but are worried about system
reliability since it's still a new technology.
My main goal here is to gauge whether we're wasting our time trying to push
this upstream. With that said, here's some info on the technical hurdles in
implementing this feature:
Technical Details
=================
1. CXL memory may be used for kernel allocations
Kernel allocations in CXL aren't a problem at the moment because if the CXL.mem
link goes down the hardware resets. When isolation is enabled, this isn't the case.
The kernel can keep chugging along until it eventually errors out trying to access
the now inaccessible memory (possibly causing data corruption until then).
In my v1 submission, I opted to just panic the system when isolation occurs and
any CXL driver couldn't handle the event. The handler for type 3 devices
(cxl_pci driver) did some RAS stuff and then panic'ed the system. I think an isolation
handler for CXL device drivers will probably be part of a final solution, but the
handler in that series was hamstrung by allowing CXL memory into the system ram pool.
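To make that concrete, here's a rough sketch of the kind of per-driver hook I have
in mind. All of these names are made up for illustration; this isn't the v1 interface:

#include <linux/device.h>

/*
 * Illustrative only -- a per-driver hook in the spirit of
 * pci_error_handlers that lets each CXL driver say whether it can
 * survive losing the memory behind the root port.
 */
enum cxl_isolation_result {
	CXL_ISOLATION_HANDLED,		/* driver quiesced its use of the memory */
	CXL_ISOLATION_NEED_PANIC,	/* memory may be in general use; give up */
};

struct cxl_isolation_handler {
	enum cxl_isolation_result (*isolation_occurred)(struct device *dev);
};

/* Roughly what the type 3 (cxl_pci) handler in the v1 series amounted to: */
static enum cxl_isolation_result cxl_pci_isolation(struct device *dev)
{
	/* record RAS state for post-mortem debugging ... */
	return CXL_ISOLATION_NEED_PANIC;	/* memory may be in the sysram pool */
}

The point being that a handler can only claim "handled" if it can account for every
user of the memory, which it can't once that memory is in the general sysram pool.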
Possible Solutions
------------------
Keeping CXL memory out of sysram is doable today; it just requires a combination
of udev rules, tooling, and kernel configuration options. The flow (afaik) is to:
1) Configure the kernel to not automatically online hotplugged memory
2) Add a udev rule to remap CXL-backed DAX devices from sysram mode to devdax mode
when added to the system
Gregory Price has submitted a set that changes the second part of this flow to instead
use sysfs [3]. With that, the need for udev rules is removed and users (or their tooling)
can set their CXL memory to devdax mode before it's added to the system. At that point,
all that needs to be done is restrict enabling isolation to CXL devices in devdax-mapped
regions and make sure the memory mode doesn't change (i.e. devdax -> sysram).
2. PCIe Portdrv Dependency
CXL isolation interrupts are delivered using an MSI/-X interrupt, with the specific
vector being in the MMIO-space isolation capability register (CXL 3.2 8.2.4.24.1). This is
a problem because the PCIe portdrv is in charge of setting up MSI/-X interrupts,
but to map the isolation vector it needs the CXL register mapping code. Since
the portdrv is only available as built-in to the kernel, using isolation would require
restricting at least the cxl_core module to built-in.
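To make the entanglement concrete, here's a sketch of what the setup path would
roughly look like. cxl_isolation_msg_num() is made up and stands in for code that
needs the register mapping in cxl/core/regs.c; pci_irq_vector()/devm_request_irq()
are the portdrv-owned side of the fence:

#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Hypothetical: stands in for code that needs the CXL register mapping
 * in cxl/core/regs.c to locate the isolation capability and read the
 * message number field out of it.
 */
int cxl_isolation_msg_num(struct pci_dev *rport);

static irqreturn_t cxl_isolation_irq(int irq, void *data)
{
	/* mark the root port isolated, kick off driver notification, ... */
	return IRQ_HANDLED;
}

static int cxl_isolation_setup_irq(struct pci_dev *rport)
{
	int msg_num, irq;

	msg_num = cxl_isolation_msg_num(rport);		/* needs cxl_core */
	if (msg_num < 0)
		return msg_num;

	/* ...but the vectors themselves are allocated by the PCIe portdrv */
	irq = pci_irq_vector(rport, msg_num);
	if (irq < 0)
		return irq;

	return devm_request_irq(&rport->dev, irq, cxl_isolation_irq,
				IRQF_SHARED, "cxl_isolation", rport);
}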
Possible Solutions
------------------
There's a couple of things we could do here. First is to restrict isolation to when
the CXL core is built-in (CXL_BUS=y && depends on PCIEPORTBUS). I'm not particularly
happy about this approach since it removes the modularity of the CXL driver(s), but I
won't gripe if that's what's settled on.
Another approach would be to move the CXL register mapping code in cxl/core/regs.c to a
library, or always make the file built-in when CXL_BUS is selected. This is more palatable
(imo) but splits the CXL code up in a potentially weird way.
Last one is to rework the PCIe port bus driver to allow for re-allocating MSI/-X interrupts.
Jonathan Cameron sent out a series where there was some discussion on this. This support
would be limited to MSI-X interrupts only due to the PCI maintainers not wanting to add
more support for MSI [4]. This wouldn't work for AMD platforms because we use MSI interrupts
for this support. There is still a way to make this work, however. AMD server platforms
use the same MSI vector for all PCIe interrupts, so we could introduce a quirk to use that
same vector as another PCIe interrupt for CXL isolation. That would require no register mapping
code in the PCIe portdrv code but would introduce a platform quirk instead. I doubt anyone would
be happy about introducing a quirk but I thought I'd throw it out as an option.
Thanks for reading,
Ben
Footnotes
=========
[1]: CXL 3.2 spec, section 12.3 "Isolation on CXL.cache and CXL.mem"
[2]: https://lore.kernel.org/linux-cxl/20250730214718.10679-1-Benjamin.Cheatham@amd.com/
[3]: https://lore.kernel.org/linux-cxl/20260129210442.3951412-1-gourry@gourry.net/
[4]: https://lore.kernel.org/linux-pci/87plpsbbe5.ffs@tglx/
* Re: RFC: CXL Isolation Support
From: Gregory Price @ 2026-01-30 21:30 UTC
To: Cheatham, Benjamin; +Cc: linux-cxl
On Fri, Jan 30, 2026 at 01:47:08PM -0600, Cheatham, Benjamin wrote:
> Technical Details
> =================
>
> 1. CXL memory may be used for kernel allocations
>
> Kernel allocations in CXL aren't a problem at the moment because if the CXL.mem
> link goes down the hardware resets. When isolation is enabled, this isn't the case.
> The kernel can keep chugging along until it eventually errors out trying to access
> the now inaccessible memory (possibly causing data corruption until then).
>
Two points:
1) isolating the normal native sysram usecase (N_MEMORY node) buys you
less reliability than you think. It only buys you kernel safety.
Keeping kernel memory out of CXL only helps the kernel report problems,
it doesn't necessarily guarantee any particular system state in
userland after that occurs.
It's entirely possible (and altogether likely) that a large swath of
userland will become inoperable if the entire link goes down and a bunch
of that memory was used.
So if the goal is to have the system fully recover, this would require
much more isolation policy than just keeping the kernel out of CXL - you
would need most of the baseline distro to set cpuset.mem policies for
core services that ensure the userspace environment doesn't completely
blow up.
But even then, you have bigger issues for shared file-backed VMAs.
e.g. imagine libc gets demoted and comes back poisoned: kabloooey
But, you can at least get data from the system that the link went
down and even have a chance to investigate before nicely blowing up.
Which is at least much more helpful.
2) if N_MEMORY_PRIVATE nodes become accepted, this is a different story
In this case, the nodes are *completely* isolated except for *explicit*
use cases. So a distro can in theory build a policy that dictates what
types of memory actually land on a particular device and how they can
get there.
The example would be using mempolicy to opt VMAs into private nodes,
while defaulting any given VMA to N_MEMORY nodes only. This would be a
more reliable way to enforce isolation of critical components.
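For example, something like this from userspace -- assuming mempolicy is allowed to
target the private node at all (which is part of what the set has to define), and
with the node id obviously made up:

#include <numaif.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch: opt a single mapping into a (hypothetical) private CXL node
 * while everything else defaults to N_MEMORY nodes.
 */
static void *alloc_on_private_node(size_t len, int private_node)
{
	unsigned long nodemask = 1UL << private_node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* Only this VMA is allowed to land on the private node. */
	if (mbind(p, len, MPOL_BIND, &nodemask, private_node + 2, 0)) {
		munmap(p, len);
		return NULL;
	}
	return p;
}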
---
From a reliability perspective, my opinion has always been that kernel
isolation is a way for the kernel to deliver higher quality debugging
signal when an issue does happen, because when you're trying to figure
out why a million machines with CXL blow up - that little bit of signal
can save you many weeks of trouble.
>
> Gregory Price has submitted a set that changes the second part of this flow to instead
> use sysfs [3]. With that, the need for udev rules is removed and users (or their tooling)
> can set their CXL memory to devdax mode before it's added to the system. At that point,
> all that needs to be done is restrict enabling isolation to CXL devices in devdax-mapped
> regions and make sure the memory mode doesn't change (i.e. devdax -> sysram).
>
Just a clarifying point - the sysfs entries are representations of
kernel objects.
You can still do what you describe with a UDEV rule, just now the
sysram_region provides a step in which you can set the ZONE_MOVABLE
rule *before* the dax_kmem driver is engaged.
UDEV still possible, but now cleaner.
detects(region0)
probe(sysram_region) <-- or cxl_devdax_region for device-dax instead
detects(sysram_region0)
write(online_type="online_movable")
probe(cxl_dev_kmem_Region)
tl;dr: Now you don't need a weird unbind step.
~Gregory
* Re: RFC: CXL Isolation Support
From: Jonathan Cameron @ 2026-02-02 15:59 UTC
To: Gregory Price; +Cc: Cheatham, Benjamin, linux-cxl
On Fri, 30 Jan 2026 16:30:51 -0500
Gregory Price <gourry@gourry.net> wrote:
> On Fri, Jan 30, 2026 at 01:47:08PM -0600, Cheatham, Benjamin wrote:
> > Technical Details
> > =================
> >
> > 1. CXL memory may be used for kernel allocations
> >
> > Kernel allocations in CXL aren't a problem at the moment because if the CXL.mem
> > link goes down the hardware resets. When isolation is enabled, this isn't the case.
> > The kernel can keep chugging along until it eventually errors out trying to access
> > the now inaccessible memory (possibly causing data corruption until then).
> >
>
> Two points:
>
> 1) isolating the normal native sysram usecase (N_MEMORY node) buys you
> less reliability than you think. It only buys you kernel safety.
>
> Keeping kernel memory out of CXL only helps the kernel report problems,
> it doesn't necessarily guarantee any particular system state in
> userland after that occurs.
>
> It's entirely possible (and altogether likely) that a large swath of
> userland will become inoperable if the entire link goes down and a bunch
> of that memory was used.
>
> So if the goal is to have the system fully recover, this would require
> much more isolation policy than just keeping the kernel out of CXL - you
> would need most of the baseline distro to set cpuset.mem policies for
> core services that ensure the userspace environment doesn't completely
> blow up.
>
> But even then, you have bigger issues for shared file-backed VMAs.
>
> e.g. imagine libc gets demoted and comes back poisoned: kabloooey
>
> But, you can at least get data from the system that the link went
> down and even have a chance to investigate before nicely blowing up.
>
> Which is at least much more helpful.
Let's pretend it's tagged (just for ease of thinking about it).
So shareable memory that happens not to be shared ever.
Application specific memory isolated to one application - using
famfs or similar.
Then it's safe enough, but maybe not that useful. It's a possible
path to get to a world in which type 3 mem can be isolated.
Jonathan
* Re: RFC: CXL Isolation Support
From: Gregory Price @ 2026-02-02 16:50 UTC
To: Jonathan Cameron; +Cc: Cheatham, Benjamin, linux-cxl
On Mon, Feb 02, 2026 at 03:59:05PM +0000, Jonathan Cameron wrote:
> On Fri, 30 Jan 2026 16:30:51 -0500
> Gregory Price <gourry@gourry.net> wrote:
>
> > But even then, you have bigger issues for shared file-backed VMAs.
> >
> > e.g. imagine libc gets demoted and comes back poisoned: kabloooey
> >
> > But, you can at least get data from the system that the link went
> > down and even have a chance to investigate before nicely blowing up.
> >
> > Which is at least much more helpful.
>
> Let's pretend it's tagged (just for ease of thinking about it).
> So shareable memory that happens not to be shared ever.
>
> Application specific memory isolated to one application - using
> famfs or similar.
>
> Then it's safe enough, but maybe not that useful. It's a possible
> path to get to a world in which type 3 mem can be isolated.
>
Absolutely. I agree any kind of explicit-use isolation makes it feasible to
implement large-scale recoverability.
But I suppose to clarify my concerns - I think isolation guarantees
require much more clarity on how involved BIOS can be.
Auto-regions online memory as nodes by default. If CXL memory is online
as a node (in the current kernel) - then isolation is broken, even in
ZONE_MOVABLE. No amount of desire for recoverability is feasible.
This obviously changes if the exposure is limited via some explicit
mechanism (FAMFS, N_MEMORY_PRIVATE, etc). But this is already true
of those paths - users of FAMFS will get SIGBUS'd instead of MCE'd on
poison, for example.
So if isolation is desired, then the default opinion should be that all
management of the CXL bus (endpoints, decoders, etc) should be deferred
to the driver and not programmed by the BIOS.
Auto-regions are basically incompatible with this feature.
(for more useless information - see: $REASONS)
~Gregory
--- $REASONS
This is partially informed by the fact that auto-regions are defined
as BIOS-programmed decoders - which the current driver auto-plugs
into the dax_kmem driver, and we're stuck with that unfortunate
backwards compatibility story for quite some time.
And the platforms which use auto-regions may not adhere to expected
programming patterns due to some subtle deviations from the spec.
So tearing complexes down and reprogramming them may not even be
feasible. ( cough Zen5 :[ )
But more generally, proposing additional features for auto-regions
creates an incentive to push ever-increasingly-complex policy down
into the BIOS, which will just lead to sadness and heartache.
* Re: RFC: CXL Isolation Support
From: Cheatham, Benjamin @ 2026-02-02 17:31 UTC
To: Gregory Price, Jonathan Cameron; +Cc: linux-cxl
On 2/2/2026 10:50 AM, Gregory Price wrote:
> On Mon, Feb 02, 2026 at 03:59:05PM +0000, Jonathan Cameron wrote:
>> On Fri, 30 Jan 2026 16:30:51 -0500
>> Gregory Price <gourry@gourry.net> wrote:
>>
>>> But even then, you have bigger issues for shared file-backed VMAs.
>>>
>>> e.g. imagine libc gets demoted and comes back poisoned: kabloooey
>>>
>>> But, you can at least get data from the system that the link went
>>> down and even have a chance to investigate before nicely blowing up.
>>>
>>> Which is at least much more helpful.
>>
>> Let's pretend it's tagged (just for ease of thinking about it).
>> So shareable memory that happens not to be shared ever.
>>
>> Application specific memory isolated to one application - using
>> famfs or similar.
>>
>> Then it's safe enough, but maybe not that useful. It's a possible
>> path to get to a world in which type 3 mem can be isolated.
>>
>
> Absolutely. I agree any kind of explicit-use isolation makes it feasible to
> implement large-scale recoverability.
>
> But I suppose to clarify my concerns - I think isolation guarantees
> require much more clarity on how involved BIOS can be.
>
> Auto-regions online memory as nodes by default. If CXL memory is online
> as a node (in the current kernel) - then isolation is broken, even in
> ZONE_MOVABLE. No amount of desire for recoverability is feasible.
>
> This obviously changes if the exposure is limited via some explicit
> mechanism (FAMFS, N_MEMORY_PRIVATE, etc). But this is already true
> of those paths - users of FAMFS will get SIGBUS'd instead of MCE'd on
> poison, for example.
I should add that, at least on AMD platforms, if the CXL link goes down
the system will immediately reset (AFAIK). So this use case would require
enabling isolation just so the hardware doesn't rug-pull you.
>
> So if isolation is desired, then the default opinion should be that all
> management of the CXL bus (endpoints, decoders, etc) should be deferred
> to the driver and not programmed by the BIOS.
I think we're slowly moving this way, but it isn't feasible for current
AMD platforms. However, I don't think this is too much of an issue for
our platforms since recovery isn't supported anyway.
>
> Auto-regions are basically incompatible with this feature.
> (for more useless information - see: $REASONS)
>
> ~Gregory
>
> --- $REASONS
>
> This is partially informed by the fact that auto-regions are defined
> as BIOS-programmed decoders - which the current driver auto-plugs
> into the dax_kmem driver, and we're stuck with that unfortunate
> backwards compatibility story for quite some time.
>
> And the platforms which use auto-regions may not adhere to expected
> programming patterns due to some subtle deviations from the spec.
>
> So tearing complexes down and reprogramming them may not even be
> feasible. ( cough Zen5 :[ )
>
> But more generally, proposing additional features for auto-regions
> creates an incentive to push ever-increasingly-complex policy down
> into the BIOS, which will just lead to sadness and heartache.
I agree :)
Thanks,
Ben
* Re: RFC: CXL Isolation Support
From: Cheatham, Benjamin @ 2026-02-02 17:30 UTC
To: Gregory Price; +Cc: linux-cxl
On 1/30/2026 3:30 PM, Gregory Price wrote:
> On Fri, Jan 30, 2026 at 01:47:08PM -0600, Cheatham, Benjamin wrote:
>> Technical Details
>> =================
>>
>> 1. CXL memory may be used for kernel allocations
>>
>> Kernel allocations in CXL aren't a problem at the moment because if the CXL.mem
>> link goes down the hardware resets. When isolation is enabled, this isn't the case.
>> The kernel can keep chugging along until it eventually errors out trying to access
>> the now inaccessible memory (possibly causing data corruption until then).
>>
>
> Two points:
>
> 1) isolating the normal native sysram usecase (N_MEMORY node) buys you
> less reliability than you think. It only buys you kernel safety.
>
> Keeping kernel memory out of CXL only helps the kernel report problems,
> it doesn't necessarily guarantee any particular system state in
> userland after that occurs.
>
> It's entirely possible (and altogether likely) that a large swath of
> userland will become inoperable if the entire link goes down and a bunch
> of that memory was used.
>
> So if the goal is to have the system fully recover, this would require
> much more isolation policy than just keeping the kernel out of CXL - you
> would need most of the baseline distro to set cpuset.mem policies for
> core services that ensure the userspace environment doesn't completely
> blow up.
>
> But even then, you have bigger issues for shared file-backed VMAs.
>
> e.g. imagine libc gets demoted and comes back poisoned: kabloooey
>
> But, you can at least get data from the system that the link went
> down and even have a chance to investigate before nicely blowing up.
>
> Which is at least much more helpful.
Recovery is the end goal, but not the immediate one. AMD platforms don't support
recovery from isolation at the moment so the goal is to just keep the system
from immediately resetting when a CXL link goes down. All that lets you
do is a) choose when you reset the system for a repair and b) get more debugging
info, which you already highlighted.
>
>
>
> 2) if N_MEMORY_PRIVATE nodes become accepted, this is a different story
>
> In this case, the nodes are *completely* isolated except for *explicit*
> use cases. So a distro can in theory build a policy that dictates what
> types of memory actually land on a particular device and how they can
> get there.
Yep, this would make supporting this *a lot* easier. It would go from a
heavy lift for the end user (configure memory & guard against critical allocations)
to something you just leave enabled unless you want to hotremove a card.
>
> The example would be using mempolicy to opt VMAs into private nodes,
> while defaulting any given VMA to N_MEMORY nodes only. This would be a
> more reliable way to enforce isolation of critical components.
>
> ---
>
> From a reliability perspective, my opinion has always been that kernel
> isolation is a way for the kernel to deliver higher quality debugging
> signal when an issue does happen, because when you're trying to figure
> out why a million machines with CXL blow up - that little bit of signal
> can save you many weeks of trouble.
>
>>
>> Gregory Price has submitted a set that changes the second part of this flow to instead
>> use sysfs [3]. With that, the need for udev rules is removed and users (or their tooling)
>> can set their CXL memory to devdax mode before it's added to the system. At that point,
>> all that needs to be done is restrict enabling isolation to CXL devices in devdax-mapped
>> regions and make sure the memory mode doesn't change (i.e. devdax -> sysram).
>>
>
> Just a clarifying point - the sysfs entries are representations of
> kernel objects.
>
> You can still do what you describe with a UDEV rule, just now the
> sysram_region provides a step in which you can set the ZONE_MOVABLE
> rule *before* the dax_kmem driver is engaged.
>
> UDEV still possible, but now cleaner.
>
> detects(region0)
> probe(sysram_region) <-- or cxl_devdax_region for device-dax instead
>
> detects(sysram_region0)
> write(online_type="online_movable")
> probe(cxl_dev_kmem_Region)
>
> tl;dr: Now you don't need a weird unbind step.
That's what I meant to say. It's possible today, but it becomes easier with your set.
Thanks,
Ben
>
> ~Gregory
* Re: RFC: CXL Isolation Support
From: Gregory Price @ 2026-02-02 19:52 UTC
To: Cheatham, Benjamin; +Cc: linux-cxl
On Mon, Feb 02, 2026 at 11:30:58AM -0600, Cheatham, Benjamin wrote:
>
>
> On 1/30/2026 3:30 PM, Gregory Price wrote:
> > But, you can at least get data from the system that the link went
> > down and even have a chance to investigate before nicely blowing up.
> >
> > Which is at least much more helpful.
>
> Recovery is the end goal, but not the immediate one. AMD platforms don't support
> recovery from isolation at the moment so the goal is to just keep the system
> from immediately resetting when a CXL link goes down. All that lets you
> do is a) choose when you reset the system for a repair and b) get more debugging
> info, which you already highlighted.
>
Ah, another issue for type-3 devices is where the `struct page` entries are
hosted. If you want true isolation, you cannot allow memmap_on_mem=true
for CXL devices; otherwise you can't even guarantee kernel safety.
This has implications for maximum capacity of the CXL device (must have
sufficient non-isolated capacity to host isolated capacity memory map).
This policy is not plumbed from CXL into DAX at the moment, so whatever
the default is for dax users is what gets used (should be =false, I think).
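Roughly where that decision lands in a dax_kmem-style onlining path (simplified
sketch, not the actual dax_kmem code):

#include <linux/memory_hotplug.h>
#include <linux/types.h>

static bool memmap_on_memory;	/* module param; its default is what matters here */

static int online_cxl_range(int nid, u64 start, u64 size)
{
	mhp_t flags = MHP_NONE;

	/*
	 * If the struct pages live in the hotplugged range itself, then
	 * isolating the range also isolates its own memmap -- so even
	 * "keep the kernel out of CXL" no longer holds.
	 */
	if (memmap_on_memory)
		flags |= MHP_MEMMAP_ON_MEMORY;

	return add_memory_driver_managed(nid, start, size,
					 "System RAM (kmem)", flags);
}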
~Gregory
* Re: RFC: CXL Isolation Support
From: Jonathan Cameron @ 2026-02-02 15:52 UTC
To: Cheatham, Benjamin; +Cc: linux-cxl
> Possible Solutions
> ------------------
> There's a couple of things we could do here. First is to restrict isolation to when
> the CXL core is built-in (CXL_BUS=y && depends on PCIEPORTBUS). I'm not particularly
> happy about this approach since it removes the modularity of the CXL driver(s), but I
> won't gripe if that's what's settled on.
It's the sort of constraint we can relax later if that becomes possible.
>
> Another approach would be to move the CXL register mapping code in cxl/core/regs.c to a
> library, or always make the file built-in when CXL_BUS is selected. This is more palatable
> (imo) but splits the CXL code up in a potentially weird way.
>
> Last one is to rework the PCIe port bus driver to allow for re-allocating MSI/-X interrupts.
> Jonathan Cameron sent out a series where there was some discussion on this. This support
> would be limited to MSI-X interrupts only due to the PCI maintainers not wanting to add
> more support for MSI [4].
> This wouldn't work for AMD platforms because we use MSI interrupts
> for this support. There is still a way to make this work, however. AMD server platforms
> use the same MSI vector for all PCIe interrupts, so we could introduce a quirk to use that
> same vector as another PCIe interrupt for CXL isolation. That would require no register mapping
> code in the PCIe portdrv code but would introduce a platform quirk instead. I doubt anyone would
> be happy about introducing a quirk but I thought I'd throw it out as an option.
That whole approach in [4] (which I have so far failed to follow up on) never took shared
interrupts into account. So it might well have been quirk territory even with that rework
done (with MSI as a possible, controversial follow-up to the MSI-X version).
Jonathan
>
> Thanks for reading,
> Ben
>
* Re: RFC: CXL Isolation Support
From: Vikram Sethi @ 2026-02-02 19:28 UTC
To: Cheatham, Benjamin, linux-cxl@vger.kernel.org, Natu, Mahesh,
Srirangan Madhavan
Hi Benjamin,
From: Cheatham, Benjamin <benjamin.cheatham@amd.com>
Sent: Friday, January 30, 2026 1:47 PM
To: linux-cxl@vger.kernel.org
Cc: benjamin.cheatham@amd.com
Subject: RFC: CXL Isolation Support
>Kernel allocations in CXL aren't a problem at the moment because if the CXL.mem
>link goes down the hardware resets. When isolation is enabled, this isn't the case.
>The kernel can keep chugging along until it eventually errors out trying to access
>the now inaccessible memory (possibly causing data corruption until then).
>In my v1 submission, I opted to just panic the system when isolation occurs and
>any CXL driver couldn't handle the event. The handler for type 3 devices
>(cxl_pci driver) did some RAS stuff and then panic'ed the system. I think an isolation
>handler for CXL device drivers will probably be part of a final solution, but the
>handler in that series was hamstrung by allowing CXL memory into the system ram pool.
The framework introduced here should also consider CXL.cache-only (type 1) devices
and type 2 devices. For the latter, the usage of the coherent device memory can be
controlled by the endpoint driver and can be limited to specific applications. A
kernel panic is not the right scope for CXL isolation at an RP that has only a
type 1 or type 2 device under it. error_detected callbacks to the endpoint driver,
followed by a CXL reset of the device, are appropriate there (that can come in a
follow-up series once Srirangan's CXL reset of type 2 devices series is merged [1]).
[1] https://lore.kernel.org/all/20260120222610.2227109-1-smadhavan@nvidia.com/
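For illustration, assuming isolation were delivered through error_detected-style
callbacks in the spirit of pci_error_handlers (the names below are placeholders,
not an existing interface), the accelerator driver side could look roughly like:

#include <linux/pci.h>

/*
 * Placeholder sketch: what a type 2 (accelerator) driver could do if
 * isolation were routed through error_detected-style callbacks. The
 * CXL reset step is a stand-in for the work referenced in [1].
 */
static pci_ers_result_t accel_error_detected(struct pci_dev *pdev,
					     pci_channel_state_t state)
{
	/* Stop submitting work and fence off the device-coherent memory. */
	if (state == pci_channel_io_frozen)
		return PCI_ERS_RESULT_NEED_RESET;
	return PCI_ERS_RESULT_CAN_RECOVER;
}

static pci_ers_result_t accel_slot_reset(struct pci_dev *pdev)
{
	/* A CXL reset of the endpoint would slot in here (see [1]). */
	return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers accel_err_handlers = {
	.error_detected	= accel_error_detected,
	.slot_reset	= accel_slot_reset,
};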
* Re: RFC: CXL Isolation Support
From: dan.j.williams @ 2026-02-02 20:20 UTC
To: Cheatham, Benjamin, linux-cxl; +Cc: benjamin.cheatham
Cheatham, Benjamin wrote:
> Quick Background:
> CXL.mem isolation and timeout is a mechanism that allows the host to
> continue operation in the event a CXL.mem link goes down or a CXL.mem
> transaction times out (semi-analogous to PCIe DPC for CXL)[1]. After CXL.mem
> isolation is triggered all CXL memory below the root port is inaccessible.
...and this is unrecoverable in the generic memory expansion case as
detailed previously [1].
[1]: http://lore.kernel.org/65cea1bc6ac0c_5e9bf294ed@dwillia2-xfh.jf.intel.com.notmuch
> At this point writes to the memory are dropped and reads return synchronous
> exceptions (platform specific, but probably poisoned data). The alternative
> to this support (which is the case now) is the host system resets when a
> CXL.mem link goes down or a CXL.mem transaction times out.
>
> Why I'm Sending This:
> I sent out a patch series a few months back that implemented CXL.mem
> error isolation to this list [2]. It didn't really gain traction due
> to not having a customer requesting it. We (AMD) have heard from some
> customers that they are interested in this support, but aren't willing to
> help out upstream.
Then they get the status quo until that "interest" matures into shared
requirements definition, clarification of assumptions, and consensus of
tradeoffs.
> The main motivation behind using isolation we've heard
> is that customers would like to use CXL but are worried about system
> reliability since it's still a new technology.
That does not appear prohibitive given CXL uptake to date. Isolation
does not improve reliability on its own. It replaces hangs with poison
that is fatal outside of constrained use cases.
Now, all of the push back to date has been with respect to the general
purpose memory expansion use case. The way forward from there is new
evidence that the expected mitigations to make isolation useful still
result in a usable feature. The evidence of *that* is the new use case
that Vikram proposed several months back in the CXL collaboration call,
CXL Accelerator error recovery.
In that case there is a chance that the accelerator error model meets the
requirements to make isolation useful. Guarantees like 1:1 host bridge
to endpoint direct-attach, non-interleaved CXL.mem, and limited risk of
core kernel dependencies on that CXL.mem.
I am interested in the isolation for CXL accelerator discussion. I am
not interested in muddling through isolation for the general memory
expander use case without engagement from deployment use cases.
* Re: RFC: CXL Isolation Support
From: Cheatham, Benjamin @ 2026-02-05 20:49 UTC
To: dan.j.williams, linux-cxl
On 2/2/2026 2:20 PM, dan.j.williams@intel.com wrote:
> Cheatham, Benjamin wrote:
>> Quick Background:
>> CXL.mem isolation and timeout is a mechanism that allows the host to
>> continue operation in the event a CXL.mem link goes down or a CXL.mem
>> transaction times out (semi-analogous to PCIe DPC for CXL)[1]. After CXL.mem
>> isolation is triggered all CXL memory below the root port is inaccessible.
>
> ...and this is unrecoverable in the generic memory expansion case as
> detailed previously [1].
>
> [1]: http://lore.kernel.org/65cea1bc6ac0c_5e9bf294ed@dwillia2-xfh.jf.intel.com.notmuch
>
>> At this point writes to the memory are dropped and reads return synchronous
>> exceptions (platform specific, but probably poisoned data). The alternative
>> to this support (which is the case now) is the host system resets when a
>> CXL.mem link goes down or a CXL.mem transaction times out.
>>
>> Why I'm Sending This:
>> I sent out a patch series a few months back that implemented CXL.mem
>> error isolation to this list [2]. It didn't really gain traction due
>> to not having a customer requesting it. We (AMD) have heard from some
>> customers that they are interested in this support, but aren't willing to
>> help out upstream.
>
> Then they get the status quo until that "interest" matures into shared
> requirements definition, clarification of assumptions, and consensus of
> tradeoffs.
Understood.
>
>> The main motivation behind using isolation we've heard
>> is that customers would like to use CXL but are worried about system
>> reliability since it's still a new technology.
>
> That does not appear prohibitive given CXL uptake to date. Isolation
> does not improve reliability on its own. It replaces hangs with poison
> that is fatal outside of constrained use cases.
>
> Now, all of the push back to date has been with respect to the general
> purpose memory expansion use case. The way forward from there is new
> evidence that the expected mitigations to make isolation useful still
> result in a usable feature. The evidence of *that* is the new use case
> that Vikram proposed several months back in the CXL collaboration call,
> CXL Accelerator error recovery.
>
> In that case there is a chance that the accelerator error model meets the
> requirements to make isolation useful. Guarantees like 1:1 host bridge
> to endpoint direct-attach, non-interleaved CXL.mem, and limited risk of
> core kernel dependencies on that CXL.mem.
That's reasonable.
>
> I am interested in the isolation for CXL accelerator discussion. I am
> not interested in muddying through isolation for the general memory
> expander use case without engagement from deployment use cases.
I can't remember any internal discussions about using isolation for only accelerators so
I'll need to check and see if that's something we're interested in.
As for the memory expander case: would something like the N_PRIVATE node set Gregory sent out [1]
be enough to change your mind on this? It doesn't provide the same guarantees as a type 2 set
would, but it does limit the usage of CXL memory to be more like type 2 memory.
Regardless, I (really) won't bring it up again until we have someone who wants to deploy this thing.
Thanks,
Ben
[1]: https://lore.kernel.org/linux-mm/20260108203755.1163107-1-gourry@gourry.net/
* Re: RFC: CXL Isolation Support
From: dan.j.williams @ 2026-02-05 21:52 UTC
To: Cheatham, Benjamin, dan.j.williams, linux-cxl
Cheatham, Benjamin wrote:
> On 2/2/2026 2:20 PM, dan.j.williams@intel.com wrote:
[..]
> > I am interested in the isolation for CXL accelerator discussion. I am
> > not interested in muddling through isolation for the general memory
> > expander use case without engagement from deployment use cases.
>
> I can't remember any internal discussions about using isolation for only accelerators so
> I'll need to check and see if that's something we're interested in.
>
> As for the memory expander case: would something like the N_PRIVATE
> node set Gregory sent out [1] be enough to change your mind on this?
> It doesn't provide the same guarantees as a type 2 set would, but it
> does limit the usage of CXL memory to be more like type 2 memory.
It is not clear to me that it can provide the guarantees needed to make
isolation useful. It is still fundamentally a core-mm services
capability. That effort is not well served if it needs to handle the
constraint of: "consider that at any moment this entire node starts
throwing poison, the system needs to stay alive through that event".
> Regardless, I (really) won't bring it up again until we have someone who wants to deploy this thing.
I am always open to reconsidering positions in the face of new evidence.
The workable aspect of a direct-attached-accelerator first approach is
that the accelerator driver has direct ownership of the error model.
Compare that to the core-mm which only considers a page failure model,
not a node failure model. The direct-attached-accelerator constraint is
enough to get some core enabling in the pipeline. Expansions from that
baseline can be evaluated later.
* Re: RFC: CXL Isolation Support
From: Gregory Price @ 2026-02-05 22:54 UTC
To: dan.j.williams; +Cc: Cheatham, Benjamin, linux-cxl
On Thu, Feb 05, 2026 at 01:52:32PM -0800, dan.j.williams@intel.com wrote:
> Cheatham, Benjamin wrote:
> > On 2/2/2026 2:20 PM, dan.j.williams@intel.com wrote:
> [..]
> > > I am interested in the isolation for CXL accelerator discussion. I am
> > > not interested in muddling through isolation for the general memory
> > > expander use case without engagement from deployment use cases.
> >
> > I can't remember any internal discussions about using isolation for only accelerators so
> > I'll need to check and see if that's something we're interested in.
> >
> > As for the memory expander case: would something like the N_PRIVATE
> > node set Gregory sent out [1] be enough to change your mind on this?
> > It doesn't provide the same guarantees as a type 2 set would, but it
> > does limit the usage of CXL memory to be more like type 2 memory.
>
> It is not clear to me that it can provide the guarantees needed to make
> isolation useful. It is still fundamentally a core-mm services
> capability. That effort is not well served if it needs to handle the
> constraint of: "consider that at any moment this entire node starts
> throwing poison, the system needs to stay alive through that event".
>
fwiw N_MEMORY_PRIVATE would only guarantee the kernel won't dish the
memory out to any random user - it doesn't say anything about what
the enlightened services *do* with that page.
The driver is ultimately responsible for ensuring page usage is fit for
reliability situations like a link going down.
e.g. It's on the driver and services to use MCE-safe access patterns
if they intend to access that page's data from a kernel context.
An enlightened service might still call, for example:
/* migrate N_MEMORY_PRIVATE to N_MEMORY */
migrate_pages(memory_node, private_pages, ...)
/* migrate N_MEMORY to N_MEMORY_PRIVATE */
migrate_pages(private_node, memory_pages, ...)
With a couple of helpful callbacks. (migrate should be MCE-safe since
it can take requests from userland, for example).
I think if the type-2 pattern looks like the ZONE_DEVICE coherent
pattern, then if ZONE_DEVICE provides any amount of isolation /
reliability benefit we would expect the same of private nodes.
Except with the bonus that you don't need to write an allocator or
migration code of course :]
~Gregory