RFC: Kernel CXL cache support (and IOMMU implications)

* RFC: Kernel CXL cache support (and IOMMU implications)
@ 2024-11-19 16:52 Alejandro Lucero Palau
  2024-11-20 22:33 ` Zhi Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Alejandro Lucero Palau @ 2024-11-19 16:52 UTC (permalink / raw)
  To: linux-cxl@vger.kernel.org, iommu

November, 2024

Title: CXL Cache support by the kernel
Author: Alejandro Lucero (alucerop@amd.com)

Version 0.1

Introduction
========

After LPC, where I presented the current status of the Type2 CXL.mem support
patchset and some ideas about supporting CXL.cache, it is time to dig deeper
into this second goal and to discuss the security/reliability aspects as well.

It is also important to try to describe how this is going to work and what the
kernel needs to know and enforce. Reading the CXL specs with a specific use
case in mind can easily lead to assumptions that differ from those of readers
with other use cases. To start with, it is necessary to differentiate two
"CXL cache" functionalities when a Type2 device is in place:

1) A Type2 device caching Host memory.

2) The Host caching HDM memory, that is, the memory inside the Type2 CXL
   device.

The first option is also what a Type1 device can do, and the kernel support
needs to manage all those Type1/2 devices per CXL Root Complex, knowing the
resource limitation, that is, the snoop cache size.

A snoop cache allows the host to track which memory is being used/cached by
those devices, enforcing cache coherency. The specs are not clear about some
important aspects regarding how the host can enforce the proper use of it by
devices, or even whether the snoop cache needs to do so. Pages 786 and 787 of
the CXL 3.1 spec describe how system software should deal with CXL cache
devices, but this is inside a Hot-plug section. I think we can assume the Host
firmware/BIOS will follow the same approach for enabling CXL cache, and the
kernel needs to look at those devices with CXL cache enabled by the BIOS in
order to properly handle the available space in the snoop cache.

It is also worth mentioning that the CXL.cache protocol can be used in the two
"CXL cache" functionalities listed above. However, the latest CXL spec implies
CXL.cache is only used for the first case. What the specs say regarding the
number of devices with a cache for host memory:

         - up to 16 Type1 and/or Type2 devices allowed per VH.

can be easily confused with the limitation of just one CXL Type2 device using
CXL.cache for enforcing coherency of its HDM. This limitation is overcome by
forcing the Type2 device to use HDM-DB, which relies on CXL.mem instead of
CXL.cache for HDM cache coherency.

While the Host is assumed to be able to access HDM in a Type2 device and to
keep data in the host CPU caches, it is the Type2 device's responsibility to
properly manage cache coherency of its HDM. There is nothing the kernel can
control here.

Therefore the interesting part, and what this document tries to cover, is
Host memory being cached by Type2 or Type1 devices. While the main goal is
to discuss how the kernel needs to handle this, and to describe how it should
work when CXL devices are used by the system/Host, some comments are made to
cover the virtualization case where those CXL devices can potentially be used
(device passthrough) by guests/VMs. I try to expose the current security
problems where the IOMMU is used for restricting what a guest-controlled
CXL.cache device can read/write in Host memory, which I think needs to be
clarified by hardware vendors.

Understanding the memory accesses from CXL devices
==================================

For the sake of presenting the case for kernel CXL.cache support, I'll try to
explain how it works (I should say "how I think it works") and the main points
to discuss regarding how to implement this support. So, do not take the
following explanation as the definitive answer or guide, and if you think there
are errors or too much generalization at some points, please help by fixing
them or adding further details. Also, consider some parts as just me thinking
out loud, which may help other people (or confuse them!).

The CXL.cache protocol allows devices to be part of the coherency ring 
of the
system.

Let's start with a Type2 device reading from a specific host memory address.
The final situation is 64 bytes (a cache line) from host memory copied to the
device cache, supposedly to be used by the device/accelerator. If the data
changes because some host CPU modifies it, the device will be signalled through
the coherency ring, so the device will know. The important point here is that
the device can be told because the Host knows the device has a copy, or the
only copy, of that data/memory. And that is thanks to the snoop cache
implemented by the CXL Root Complex.

A device caching host memory can also write to host memory through the cache
coherency ring. A device can not only read host memory and keep it, but also
modify it. The implications of writes versus reads are not important for the
goal of this document. Writes require the device to support more protocol
exchange cases, but regarding the snoop cache this is irrelevant.

Obvious questions arise about how this snoop cache is going to work.

First, the simple case of just one device caching Host memory. From the
specs, the device CXL.cache should not be enabled by the Host if the device
cache is bigger than the snoop cache. However, what precludes a device from
doing more memory accesses than what the snoop cache can cover? This can be
partly addressed with some allocation control for CXL.cache, which is discussed
in the next section. But a "rogue" device could still try things like this,
which, for the case of a single device using the snoop cache and without any
other concern about security, is probably fine:

         - With a Type2, the snoop cache will tell the device to release
           another line, meaning any modified line is sent back to the Host.
         - Any performance problem will only have an impact on the device
           itself.
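
To make the first bullet a bit more concrete, here is a toy userspace model
(not kernel code) of a fixed-capacity snoop cache that has to recall a line
from the device when it runs out of entries. Everything in it is an assumption
made for illustration: the names (struct snoop_filter, sf_track), the 4-entry
capacity and the round-robin victim selection. A real Root Complex may well
behave differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SF_ENTRIES 4    /* hypothetical snoop cache capacity, in lines */
#define CACHE_LINE 64

struct sf_entry {
	uint64_t line;  /* host physical address, cache-line aligned */
	bool valid;
};

struct snoop_filter {
	struct sf_entry e[SF_ENTRIES];
	size_t next_victim;     /* simple round-robin replacement (assumed) */
};

/*
 * Record that the device now holds the line containing @addr.
 * Returns true and stores the recalled line in @victim when a tracked line
 * had to be taken back from the device to make room; false otherwise.
 */
bool sf_track(struct snoop_filter *sf, uint64_t addr, uint64_t *victim)
{
	uint64_t line = addr & ~(uint64_t)(CACHE_LINE - 1);
	size_t i;

	for (i = 0; i < SF_ENTRIES; i++)
		if (sf->e[i].valid && sf->e[i].line == line)
			return false;   /* already tracked */

	for (i = 0; i < SF_ENTRIES; i++) {
		if (!sf->e[i].valid) {
			sf->e[i].line = line;
			sf->e[i].valid = true;
			return false;   /* free entry, nothing recalled */
		}
	}

	/* Full: recall one line from the device and reuse its entry. */
	i = sf->next_victim;
	*victim = sf->e[i].line;
	sf->e[i].line = line;
	sf->next_victim = (i + 1) % SF_ENTRIES;
	return true;
}
```

In this model a device touching more lines than SF_ENTRIES only causes its own
lines to be recalled, which matches the "only hurts itself" reasoning for the
single-device case.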

Then the case of multiple CXL devices caching Host memory in the same CXL Root
Complex and therefore sharing the same CXL snoop cache:

* How can the snoop cache track reads from different devices without one
   device monopolizing the full space?

         - enforcing snoop cache slices by software?
         - allowing specific/limited host ranges by the kernel?

AFAIK, there is no hardware control of any kind for avoiding this contention.
Note that with the proper checking by the BIOS and by the kernel (for hotplug,
or for devices not yet enabled during boot time), the total size of the device
caches allowed per CXL Root Complex should not be bigger than the snoop cache
size, and therefore there is theoretically no contention at all ... if the
devices do the right thing. From software, the only thing we can do is to
ensure the CXL.cache accesses from a device are within a range of the same size
as the enabled CXL.cache.
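
The enable-time check described above could look roughly like the following
userspace sketch. It is only a model of the accounting, under the assumption
that the kernel can learn both the snoop cache capacity and each device cache
size; struct snoop_budget and both helpers are hypothetical names, not an
existing kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per CXL Root Complex accounting; all sizes in bytes. */
struct snoop_budget {
	uint64_t capacity;      /* host memory the snoop cache can track */
	uint64_t committed;     /* sum of device caches already enabled */
};

/*
 * Commit @dev_cache_size if it still fits. Returns false when enabling
 * CXL.cache for the device would oversubscribe the snoop cache.
 */
bool snoop_budget_enable(struct snoop_budget *b, uint64_t dev_cache_size)
{
	/* committed never exceeds capacity, so this cannot underflow. */
	if (dev_cache_size > b->capacity - b->committed)
		return false;
	b->committed += dev_cache_size;
	return true;
}

/* Give the space back when CXL.cache is disabled (e.g. hot-unplug). */
void snoop_budget_disable(struct snoop_budget *b, uint64_t dev_cache_size)
{
	if (dev_cache_size > b->committed)
		dev_cache_size = b->committed;
	b->committed -= dev_cache_size;
}
```

Devices already enabled by the BIOS would be accounted with the same call at
boot, before any hotplugged device is considered.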

Therefore, some memory allocation API is required for dealing with the amount
of memory the snoop cache can track and the host memory a device can access.
The device needs the physical address to work with, and it is in this required
translation from virtual to physical addresses where we can enforce the
restriction. Of course, such an API already exists, although not with the
checking we need: the kernel DMA API.

(Secure) memory allocation and CXL.cache
===========================

DMAs allow devices to perform read/write operations to system memory without
any CPU intervention once the (meta)data about how to perform the DMA has been
given to the device. CXL.cache is more than DMA because the system memory
caches are implicitly involved, but from the point of view of the operating
system handling it, it is not too different. The important point here is that
there is no restriction on the DMA-able memory a device can use, but due to
the snoop cache limitations this needs to change for CXL: code aware of the
snoop cache state and of what a device requires needs to be consulted in order
to properly handle the available space.

Should we use the kernel DMA API for CXL.cache allocations? This API deals
with memory coherency, which is not needed for the CXL.cache case. However, it
is connected with the IOMMU functionality, which is required for CXL.cache if
the IOMMU is enabled.

I think the solution should be to implement a CXL.cache allocation API inside
the CXL core, dealing with the snoop cache available space, and to connect it
with the IOMMU kernel code when the IOMMU is enabled.
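
As a very rough idea of what such an allocation API would have to account
for, here is a userspace sketch. The names (struct cxlc_dev, cxlc_grant) and
the fixed-size range table are invented for this example; nothing like this
exists in the CXL core today, and the IOMMU programming step is only mentioned
in a comment.

```c
#include <stddef.h>
#include <stdint.h>

#define CXLC_MAX_RANGES 8       /* arbitrary limit for the sketch */

struct cxlc_range {
	uint64_t base;          /* host physical address */
	uint64_t len;           /* bytes */
};

/* Hypothetical per-device state kept by the CXL core. */
struct cxlc_dev {
	uint64_t cache_size;    /* device cache size enabled for CXL.cache */
	uint64_t mapped;        /* bytes currently granted to the device */
	struct cxlc_range r[CXLC_MAX_RANGES];
	size_t nr;
};

/*
 * Grant [base, base + len) to the device for CXL.cache use.
 * Returns 0 on success, -1 if the grant would exceed the enabled cache size
 * or the table is full. A real implementation would also have to program the
 * IOMMU here when one is enabled; that part is left out.
 */
int cxlc_grant(struct cxlc_dev *d, uint64_t base, uint64_t len)
{
	if (len == 0 || d->nr == CXLC_MAX_RANGES)
		return -1;
	if (len > d->cache_size - d->mapped)
		return -1;

	d->r[d->nr].base = base;
	d->r[d->nr].len = len;
	d->nr++;
	d->mapped += len;
	return 0;
}
```

The point of the sketch is only that the total granted memory is capped by the
CXL.cache size enabled for the device, which is the restriction software can
actually enforce.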

A security aspect behind DMAs is that a device has (usually) no restrictions
for memory access. This is true in a system with no IOMMU hardware, and
CXL.cache is no different in this case. With an IOMMU it is a different game
though.

First of all, the IOMMU will be in place for CXL.io, which implies legacy PCIe
TLP packets. A CXL.cache operation can not be handled by the IOMMU hardware,
and the spec states ATS is to be used beforehand, that is, the CXL device asks
the IOMMU hardware for the physical address to work with and keeps that
translation internally. The CXL spec specifies ATS extensions for CXL, and
the completion of an ATS request can tell the device that some addresses are
only to be used through CXL.io. This implies some sort of knowledge about CXL
is required by the IOMMU/ATS hardware, which depends on how the per-device
tables are programmed by the Host. However, AFAIK, this is not supported yet by
any vendor IOMMU support in the Linux kernel. Note the usual IOMMU
device/domain tables will/can be used for normal DMA transfers, so the IOMMU
configuration, both in the Host and by the HW, needs to know which parts of
the domain are for DMAs and which are for CXL.cache.
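
How I picture a well-behaved device using that information is sketched below.
This is a simplified model, not the ATS wire format: struct ats_result, the
cxl_io_only flag and pick_path() are names I made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum xfer_path { PATH_CXL_IO, PATH_CXL_CACHE };

/* Hypothetical, simplified view of an ATS translation result. */
struct ats_result {
	uint64_t hpa;           /* translated host physical address */
	bool cxl_io_only;       /* host says: use CXL.io for this range */
};

/*
 * Pick the protocol a well-behaved device would use for the translated
 * address. Nothing in here stops a misbehaving device from ignoring the
 * flag, which is exactly the security question raised in this document.
 */
enum xfer_path pick_path(const struct ats_result *t, bool cache_enabled)
{
	if (!cache_enabled || t->cxl_io_only)
		return PATH_CXL_IO;
	return PATH_CXL_CACHE;
}
```

The decision is taken on the device side, so the host has to trust the device
unless something else checks the resulting CXL.cache accesses.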

Assuming this support will be implemented at some point in the future, the
questions are: when? And how safe is it?

Can a device issue CXL.cache operations using arbitrary physical addresses? It
seems there are some cases where the hardware can take control of PCIe TLP
packets with the ATS bit on. For example, if there is a PCIe bridge in the
path, and that bridge uses a specific redirection table based on configured
per-device ATS ranges, any TLP with the ATS bit on will be redirected based on
such a table, with no redirection if there is no table entry. However, that
does not seem to be in place for PCIe Root Complex implementations. For
example, the AMD IOMMU documentation states ATS TLP packets are not handled at
all, implying the device is trusted; if more security is required, the IOMMU
hardware can check those ATS TLP packets as well, spoiling the ATS advantage.
Note this is PCIe, so CXL.io will likely keep the functionality, but CXL.cache
operations follow another path, with apparently no further control to enforce
that only addresses within the memory ranges allowed per device are used.
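
The kind of check that seems to be missing on the CXL.cache path is, in
essence, a range lookup on the already-translated address. A minimal userspace
sketch, with hypothetical names (struct allowed_range, access_allowed) and no
claim about how any hardware implements it:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct allowed_range {
	uint64_t base;  /* host physical address */
	uint64_t len;   /* bytes */
};

/*
 * Return true only if [addr, addr + len) lies entirely inside one of the
 * ranges the host granted to the device.
 */
bool access_allowed(const struct allowed_range *r, size_t nr,
		    uint64_t addr, uint64_t len)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (addr < r[i].base)
			continue;
		/* Written this way to avoid overflow in addr + len. */
		if (len <= r[i].len && addr - r[i].base <= r[i].len - len)
			return true;
	}
	return false;
}
```

Doing this for every access is what costs performance in the PCIe case, since
it gives back most of what ATS was meant to save.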

Because of this apparent lack of security for IOMMU and CXL.cache, a CXL
device should not be used by VMs, or by any other user space controlled driver,
with CXL.cache enabled. This seems a really serious limitation, so maybe I'm
missing something here.

Regarding virtualization, assuming the security problems do not exist or will
be solved: while CXL.mem can be supported with an ahead-of-time mapping by the
Host, with CXL.cache the mapping needs to be handled when the related driver
asks for specific memory to access, with the Host then configuring the
IOMMU/ATS tables. This implies the emulation needs a backend, which an
ahead-of-time mapping, as currently proposed for CXL.mem, can avoid.

Finally, if my concerns about the security of CXL.cache with IOMMU are
unfounded, this document should at least describe how this is solved, how the
security is enforced by the hardware, and whether the kernel needs to handle it
specifically (which I really think is the case, at least with IOMMU changes
managed by the CXL core).

Summary
======

The proposed tasks for supporting CXL.cache are:

         - CXL core handling per-device CXL.cache enabling based on CXL Root
           Complex snoop cache state.

         - CXL core implementing a CXL.cache host memory allocation API
           restricting the physical memory a device can access through
           CXL.cache.

         - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io
           requests.

         - Clarify CXL.cache and security with IOMMU.


* Re: RFC: Kernel CXL cache support (and IOMMU implications)
  2024-11-19 16:52 RFC: Kernel CXL cache support (and IOMMU implications) Alejandro Lucero Palau
@ 2024-11-20 22:33 ` Zhi Wang
  2024-12-13 14:15   ` Alejandro Lucero Palau
  0 siblings, 1 reply; 4+ messages in thread
From: Zhi Wang @ 2024-11-20 22:33 UTC (permalink / raw)
  To: Alejandro Lucero Palau; +Cc: linux-cxl@vger.kernel.org, iommu

On Tue, 19 Nov 2024 16:52:15 +0000
Alejandro Lucero Palau <alucerop@amd.com> wrote:

Thanks so much for the doc. I just quickly went through the doc and here
are my comments.

> November, 2024
> 
> Tittle: CXL Cache support by the kernel
> Author: Alejandro Lucero (alucerop@amd.com)
> 
> Version 0.1
> 
> Introduction
> ========
> 
> After the LPC where I presented the current status of the Type2 CXL.mem 
> support
> patchset, and some ideas about supporting CXL.cache, it is time to dig 
> deeper in
> this second goal, and discussing the security/reliability aspect as well.
> 
> It is also important to try to describe how this is going to work and 
> what the
> kernel needs to know and enforce. Reading the CXL specs when having in 
> mind some
> specific use case can easily lead to assuming certains aspects with a 
> different
> perspective from other readers/use cases. To start with, it is necessary to
> differentiate two "CXL cache" functionalities when a Type2 device is in 
> place:
> 
> 1) A Type2 device caching Host memory.
> 
> 2) The Host caching HDM memory, that is the memory inside the Type2 CXL 
> device.
> 
> The first option is also what a Type1 device can do, and the kernel support
> needs to manage all those Type1/2 per CXL Root Complex knowing the resources
> limitation, that is the snooping cache size.
> 
> A snoop cache allows the host to track which memory is being used/cached by
> those devices, enforcing the cache coherency. The specs are not clear 
> about some
> important aspects regarding how the host can enforce the proper use of 
> this by
> devices or even if the snoop cache needs to do so. At pages 786 and 787 
> of CXL
> specs 3.1, how the system software should deal with CXL cache devices is 
> given,
> but this is inside a Hot-plug section. I think we can assume the Host
> firmware/BIOS will follow same approach for enabling CXL cache, and the 
> kernel
> needs to look at those devices with CXL cache enabled by the BIOS for 
> properly
> handling the available space in the snoop cache.
> 
> It is also worth to mention the CXL.cache protocol can be used in the two
> "CXL cache" functionalities listed above. However, the last CXL spec implies
> CXL.cache only used for the first case. Some comments about what the 
> specs say
> 
> regarding number of devices with a cache for host memory:
> 
>          - up to 16 Type1 and/or Type2 devices allowed per VH.
> 
> can be easily confused with the limitations of just one CXL Type2 device 
> using
> CXL.cache for enforcing coherency of its HDM. This limitation is 
> overcome with
> forcing Type2 device using HDM-DB, which relies on CXL.mem instead of 
> CXL.cache
> for HDM cache coherency.
> 
> While the Host is assumed to be able to access HDM in a Type2 device, and
> keeping data in the host cpu caches, it is the Type2 device 
> responsibility to
> properly manage cache coherency of its HDM. There is nothing the kernel can
> control here.
> 
> Therefore the interesting part and what this documents tries to cover is the
> Host memory being cached by Type2 or Type1 devices. While the main goal is
> discussing how the kernel needs to handle this, and to describe how it 
> should
> work when CXL devices are used by the system/Host, some comments are made to
> cover the virtualization case where those CXL devices can potenetially 
> be used
> (device passthrough) by guests/VMs. I try to expose the current security
> problems where IOMMU is used for restricting what a guest controlled 
> CXL.cache
> device can read/write in Host memory what I think needs to be clarified by
> hardware vendors.
> 
> 
> Understanding the memory accesses from CXL devices
> ==================================
> 
> For the sake of presenting the case about kernel CXL.cache support, I'll 
> try to
> explain how it works (I should say "how I think it works") and the main 
> points
> to discuss regarding how to implement this support. So, do not take the next
> explanation as the definitive answer or guide, and if you think there 
> are errors
> or maybe too much generalization at some points, please help fixing or 
> adding
> further details. Also, consider some parts as just me thinking out loud, 
> what
> maybe help other people (or confuse them!).
> 
> The CXL.cache protocol allows devices to be part of the coherency ring 
> of the
> system.
> 
> Let's start with a Type2 device reading from a specific host memory 
> address. The
> final situation is 64bytes (cache line) from host memory copied to the 
> device
> cache, supposedly for being used by the device/accelerator. If the data 
> changes,
> because some host cpu modifies it, the device will be signalled by the 
> coherency
> ring, so the device will know. The important point here is the device can be
> told because the Host knows the device has a copy or the only copy of that
> data/memory. And that is thanks to the snoop cache implemented by the 
> CXL Root
> Complex.
> 
> A device caching host memory can be used as well for writes to host memory
> through the cache coherency ring. A device can not just read host memory and
> keep it, but it can modified it. The implications of writes versus reads 
> are not
> important for the goal of this document. It requires the device to 
> support more
> protocol exchange cases, but regarding the snoop cache, it is irrelevant.
> 
> There arise obvious questions about how this snoop cache is going to work.
> 
> First, with the simple case of just one device caching Host memory. From the
> specs, the device CXL.cache should not be enabled by the Host if the device
> cache is bigger than the snoop cache. However, what does preclude a 
> device to do
> more memory accesses than what the snoop cache can cover? This can be partly
> explained with some allocation control for CXL.cache what is discussed 
> in the
> next section. But a "rogue" device could try things like this, what for 
> the case
> of a single device using the snoop cache and without any other concern about
> security, is probably fine:
> 
>          - With a Type2, the snoop cache will tell the device to release 
> another
>            line, meaning any modified line to be sent back to the Host.
>          - Any performance problem will only have an impact on the 
> device itself.
> 
> Then the case of multiple CXL devices caching Host memory in the same 
> CXL Root
> Complex and therefore same CXL Snoop Cache:
> 
> * How can the snoop cache track reads from different devices without one 
> device
>    monopolizing the full space?
> 
>          - enforcing snoop cache slices by software?
>          - allowing specific/limited host ranges by the kernel?
> 

I would like to compare it with the approaches that solve the similar
problem for the CPU cache, since they might be similar in essence.

The CPU cache suffered from a similar problem: noisy and restless neighbors
keep poking the cache, which might cause a performance drop. Nowadays, it is
solved by a HW mechanism, cache allocation. For Intel, it is called Cache
Allocation Technology (CAT), which is a subset of Resource Director
Technology (RDT). It can also be used in the virtualization world.

Before SW got support from the HW, many research papers talked about solving
it via page coloring, e.g. allocating the VM memory with page color awareness
for different VMs. But I don't think those ideas eventually landed in
mainline.

Back to this problem, I think SW is probably going to rely on a HW mechanism
to solve it nicely and decently, the same as on the CPU side.

> AFAIK, there is not any kind of hardware control for avoiding this 
> contention.
> Note that with the proper checking by the BIOS and by the kernel (for 
> hotplug or
> those not enabled devices yet during boot time), the size of total 
> device caches
> allowed per CXL Root Complex should not be bigger than the snoop cache 
> size, and
> therefore theoretically no contention at all ... if the devices do the right
> thing. From software the only thing we can do is to ensure the CXL.cache
> accesses from a device are within a range with same size than the enabled
> CXL.cache.
> 

What would be the consequence if we violate this rule?

> Therefore, some memory allocation API is required for dealing with the 
> amount of
> memory the snoop cache can track, and the host memory a device can 
> access to.
> The device needs the physical address to work with, and it is in this 
> required
> translation from virtual to physical addresses where we can enforce the
> restriction. Of course, such an API does already exist, although not 
> with the
> checking we need: the kernel DMA API.
> 
> 
> (Secure) memory allocation  and CXL.cache
> ===========================
> 
> DMAs allow devices to perform read/write operations to system memory 
> without any
> cpu intervention after the (meta)data about how to perform the DMA is 
> given to
> the device. CXL.cache is more than DMA because the system memory caches are
> implicitly involved but for the sake of handling this by the operating 
> system,
> not too much different. The important point here is there is no restriction
> about the DMAble memory to be used by a device, but due to the snoop cache
> limitations, this needs to change for CXL: code aware of the snoop cache 
> state
> and what a device requires needs to be consulted for properly handling the
> available space.
> 

As I replied above, I think we probably need a HW mechanism to solve this
problem nicely and decently. (A shared cache is also a pre-condition for
side-channel attacks, even if here it is a snoop state cache.) With the HW
mechanism, allocating space in the snoop state cache might imply a glue layer
for snoop cache management that different CXL HB vendors plug into the CXL
core.

So when the CXL driver is initialized, the space in the snoop state cache is
allocated. With that solved, for restricting the device's access to memory
(creating/mapping an IOVA for the DMA memory), SW can still leverage the
current Linux IOMMU/DMA APIs.

> Should we use the kernel DMA API for CXL.cache allocations? This API 
> deals with
> memory coherency what is not needed for the CXL.cache case. However, it is
> connected with the IOMMU functionality what is required for CXL.cache if 
> it is
> enabled.
> 
> I think the solution should be to implement a CXL.cache allocation API 
> inside
> the CXL core dealing with the snoop cache available space, and to 
> connect with
> IOMMU kernel code when it is enabled.
> 
> A security aspect behind DMAs is a device has (usually) no restrictions for
> memory access. This is true in a system with no IOMMU hardware, and 
> CXL.cache
> is not different in this case. With IOMMU is a different game though.
> 
> First of all, IOMMU will be in place for CXL.io, what implies legacy TLP 
> PCIe
> packets. A CXL.cache operation can not be handled by the IOMMU hardware 
> and the
> spec states ATS to be used beforehand, that is, the CXL device asking 
> the IOMMU
> hardware about the physical address to work with, and keeping that 
> translation
> internally. The CXL spec specifies ATS service extensions for CXL, and 
> some ATS
> requests can tell the device some addresses only to be used through 
> CXL.io. This
> implies some sort of knowledge about CXL is required by the IOMMU/ATS 
> hardware
> which depends on how the per device tables are programmed by the Host. 
> However,
> AFAIK, this is not supported yet by any Linux kernel IOMMU vendor 
> support. Note
> the usual IOMMU device/domain tables will/can be used for normal DMA 
> transfers,
> so IOMMU configuration, both in the Host and by the HW, needs to know 
> which parts
> of the domain are for DMAs and which are for CXL.cache.
> 
> Assuming this support will be implemented at some point in the future, the
> questions are, when?, and, how safe is it?
> 
> Can a device issue CXL.cache operations using arbitrary physical 
> addresses? It
> seems there are some cases where the hardware can take control of PCIe TLP
> packets with the ATS bit on. For example, if there is a PCIe bridge in 
> the path,
> and with that bridge using a specific redirection table based on 
> configured ATS
> per device ranges, any TLP with the ATS bit on will be redirected based 
> on such
> a table, and implying no redirection if no table entry. However, that 
> does not
> seem to be in place for PCIe Root Complex implementations. For example, AMD
> IOMMU documentation states ATS TLP packets are not handled at all, implying
> trusting the device, and if more security is required, the IOMMU 
> hardware can

Are you referring to the ATS translated request here? I think ATS itself
was not designed with security in mind.

> check those TLP ATS packets as well, spoiling the ATS advantage. Note 

Yes, AMD IOMMU has secure ATS support, but as you said, it is pretty
straightforward: basically it just checks every translated request when
enabled.

> this is
> PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations
> follow another path with apparently no further control to enforce the right
> addresses within the allowed memory ranges per device are used.
> 
> Because this apparently lack of security for IOMMU and CXL.cache, this 
> implies a
> CXL device should not be used by VMs or any other user space controlled 
> driver
> with CXL.cache being enabled. This seems a really serious limitation, so 
> maybe
> I'm missing something here.
> 

I think at least for the CXL path, the IOMMU should have a mechanism similar
to secure ATS, and let the user choose whether to enable it or not.

In reality, many CSPs design the HW themselves and trust that their HW won't
do messy things; they may want to enable it only for 3rd party HW.

For the confidential computing world, secure ATS is mandatory, and a
performance drop is the price of security.

> Regarding virtualization, assuming the security problems do not exist or 
> will be
> solved, while CXL.mem can be supported with an ahead mapping by the 
> Host, with
> CXL.cache this needs to be handled when the related driver asks for specific
> memory to access, and then to configure the IOMMU/ATS tables by the 
> Host. This
> implies the emulation needs a backend, what an ahead mapping, as currently
> proposed for CXL.mem can avoid.
> 
> Finally, if my concerns about the security of CXL.cache with IOMMU are
> unfounded, at least this document should describe how is this solved and the
> security enforced by the hardware, and if the kernel requires to handle it
> specifically (what I really think is the case, at least with IOMMU changes
> managed by the CXL core).
> 
> 
> Summary
> ======
> 
> 
> Next the proposed tasks to perform for supporting CXL.cache:
> 
>          - CXL core handling per device CXL.cache enabling based on CXL Root
>            Complex snoop cache state.
> 
>          - CXL core implementing a CXL.cache host memory allocation 
> restricting
>            the physical memory a a device can access to through CXL.cache.
> 
>          - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io 
> requests.
> 
>          - Clarify CXL.cache and security with IOMMU.
> 
> 
> 



* Re: RFC: Kernel CXL cache support (and IOMMU implications)
  2024-11-20 22:33 ` Zhi Wang
@ 2024-12-13 14:15   ` Alejandro Lucero Palau
  2024-12-24 15:05     ` Jonathan Cameron
  0 siblings, 1 reply; 4+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-13 14:15 UTC (permalink / raw)
  To: Zhi Wang; +Cc: linux-cxl@vger.kernel.org, iommu


On 11/20/24 22:33, Zhi Wang wrote:
> On Tue, 19 Nov 2024 16:52:15 +0000
> Alejandro Lucero Palau <alucerop@amd.com> wrote:
>
> Thanks so much for the doc. I just quickly went through the doc and here
> are my comments.


Hi Zhi,


Thanks for your comments. I did not reply earlier because I was waiting for
more feedback, mainly from the IOMMU kernel guys. Maybe CXL support is
something most of them have neither contemplated nor are aware of as (maybe)
requiring special handling. I really think the IOMMU/DMA API will need some
change, but this document is for discussing it and maybe proving me wrong.


Let's hope replying to your comments keeps things moving somehow ...


>> November, 2024
>>
>> Tittle: CXL Cache support by the kernel
>> Author: Alejandro Lucero (alucerop@amd.com)
>>
>> Version 0.1
>>
>> Introduction
>> ========
>>
>> After the LPC where I presented the current status of the Type2 CXL.mem
>> support
>> patchset, and some ideas about supporting CXL.cache, it is time to dig
>> deeper in
>> this second goal, and discussing the security/reliability aspect as well.
>>
>> It is also important to try to describe how this is going to work and
>> what the
>> kernel needs to know and enforce. Reading the CXL specs when having in
>> mind some
>> specific use case can easily lead to assuming certains aspects with a
>> different
>> perspective from other readers/use cases. To start with, it is necessary to
>> differentiate two "CXL cache" functionalities when a Type2 device is in
>> place:
>>
>> 1) A Type2 device caching Host memory.
>>
>> 2) The Host caching HDM memory, that is the memory inside the Type2 CXL
>> device.
>>
>> The first option is also what a Type1 device can do, and the kernel support
>> needs to manage all those Type1/2 per CXL Root Complex knowing the resources
>> limitation, that is the snooping cache size.
>>
>> A snoop cache allows the host to track which memory is being used/cached by
>> those devices, enforcing the cache coherency. The specs are not clear
>> about some
>> important aspects regarding how the host can enforce the proper use of
>> this by
>> devices or even if the snoop cache needs to do so. At pages 786 and 787
>> of CXL
>> specs 3.1, how the system software should deal with CXL cache devices is
>> given,
>> but this is inside a Hot-plug section. I think we can assume the Host
>> firmware/BIOS will follow same approach for enabling CXL cache, and the
>> kernel
>> needs to look at those devices with CXL cache enabled by the BIOS for
>> properly
>> handling the available space in the snoop cache.
>>
>> It is also worth to mention the CXL.cache protocol can be used in the two
>> "CXL cache" functionalities listed above. However, the last CXL spec implies
>> CXL.cache only used for the first case. Some comments about what the
>> specs say
>>
>> regarding number of devices with a cache for host memory:
>>
>>           - up to 16 Type1 and/or Type2 devices allowed per VH.
>>
>> can be easily confused with the limitations of just one CXL Type2 device
>> using
>> CXL.cache for enforcing coherency of its HDM. This limitation is
>> overcome with
>> forcing Type2 device using HDM-DB, which relies on CXL.mem instead of
>> CXL.cache
>> for HDM cache coherency.
>>
>> While the Host is assumed to be able to access HDM in a Type2 device, and
>> keeping data in the host cpu caches, it is the Type2 device
>> responsibility to
>> properly manage cache coherency of its HDM. There is nothing the kernel can
>> control here.
>>
>> Therefore the interesting part and what this documents tries to cover is the
>> Host memory being cached by Type2 or Type1 devices. While the main goal is
>> discussing how the kernel needs to handle this, and to describe how it
>> should
>> work when CXL devices are used by the system/Host, some comments are made to
>> cover the virtualization case where those CXL devices can potenetially
>> be used
>> (device passthrough) by guests/VMs. I try to expose the current security
>> problems where IOMMU is used for restricting what a guest controlled
>> CXL.cache
>> device can read/write in Host memory what I think needs to be clarified by
>> hardware vendors.
>>
>>
>> Understanding the memory accesses from CXL devices
>> ==================================
>>
>> For the sake of presenting the case about kernel CXL.cache support, I'll
>> try to
>> explain how it works (I should say "how I think it works") and the main
>> points
>> to discuss regarding how to implement this support. So, do not take the next
>> explanation as the definitive answer or guide, and if you think there
>> are errors
>> or maybe too much generalization at some points, please help fixing or
>> adding
>> further details. Also, consider some parts as just me thinking out loud,
>> what
>> maybe help other people (or confuse them!).
>>
>> The CXL.cache protocol allows devices to be part of the coherency ring
>> of the
>> system.
>>
>> Let's start with a Type2 device reading from a specific host memory
>> address. The
>> final situation is 64bytes (cache line) from host memory copied to the
>> device
>> cache, supposedly for being used by the device/accelerator. If the data
>> changes,
>> because some host cpu modifies it, the device will be signalled by the
>> coherency
>> ring, so the device will know. The important point here is the device can be
>> told because the Host knows the device has a copy or the only copy of that
>> data/memory. And that is thanks to the snoop cache implemented by the
>> CXL Root
>> Complex.
>>
>> A device caching host memory can be used as well for writes to host memory
>> through the cache coherency ring. A device can not just read host memory and
>> keep it, but it can modified it. The implications of writes versus reads
>> are not
>> important for the goal of this document. It requires the device to
>> support more
>> protocol exchange cases, but regarding the snoop cache, it is irrelevant.
>>
>> There arise obvious questions about how this snoop cache is going to work.
>>
>> First, with the simple case of just one device caching Host memory. From the
>> specs, the device CXL.cache should not be enabled by the Host if the device
>> cache is bigger than the snoop cache. However, what does preclude a
>> device to do
>> more memory accesses than what the snoop cache can cover? This can be partly
>> explained with some allocation control for CXL.cache what is discussed
>> in the
>> next section. But a "rogue" device could try things like this, what for
>> the case
>> of a single device using the snoop cache and without any other concern about
>> security, is probably fine:
>>
>>           - With a Type2, the snoop cache will tell the device to release
>> another
>>             line, meaning any modified line to be sent back to the Host.
>>           - Any performance problem will only have an impact on the
>> device itself.
>>
>> Then the case of multiple CXL devices caching Host memory in the same
>> CXL Root
>> Complex and therefore same CXL Snoop Cache:
>>
>> * How can the snoop cache track reads from different devices without one
>> device
>>     monopolizing the full space?
>>
>>           - enforcing snoop cache slices by software?
>>           - allowing specific/limited host ranges by the kernel?
>>
> I would like to compare it with the approaches that solves the similar
> problem of the CPU cache since they might have similar essence.
>
> CPU cache suffered from the similar problems that noisy and
> restless neighborhood keep poking the cache that might cause performance
> drop. Nowadays, it is solved by the HW mechanism, cache allocation. For
> Intel, it is called cache allocation technology(CAT) which is a subset of
> Resource Director Technology(RDT). They can be also used in the
> virtualization world.
>
> Before SW gets the support from the HW, many research papers were talking
> about solving it via page color. E.g. allocate the VM memory with page
> color awareness for different VMs. But I don't think those ideas eventually
> land in the mainline.
>
> Back to this prob, I think probably SW is going to rely on a HW mechanism
> to solve this problem nicely and decently, the same as CPU side.


I agree with the need to rely on HW, which is what the following sentence (in
the original doc) tried to summarize.

But we need to know how this is done by HW to avoid some undesirable
situations if we just blindly configure CXL.cache for those devices
advertising it and apparently without any problems regarding the snoop
cache size.


>> AFAIK, there is not any kind of hardware control for avoiding this
>> contention.
>> Note that with the proper checking by the BIOS and by the kernel (for
>> hotplug or
>> those not enabled devices yet during boot time), the size of total
>> device caches
>> allowed per CXL Root Complex should not be bigger than the snoop cache
>> size, and
>> therefore theoretically no contention at all ... if the devices do the right
>> thing. From software the only thing we can do is to ensure the CXL.cache
>> accesses from a device are within a range with same size than the enabled
>> CXL.cache.
>>
> What would be the consequence if we violate this rule?


Contention, or just one device getting less snoop cache coverage, which
implies requests from the snoop cache to flush cached data before the
device can access more data in the host.

With a full snoop cache, a new access to uncached addresses will trigger
some action by the snoop cache. I'm assuming the device that issues the
new access will be the one told to first flush cached data to make
space, but I may be wrong.

But a rogue device that is the first to use CXL.cache could get most of
the snoop cache if no other control exists. Does the snoop cache have a
per-device control over the number of tracked cachelines allowed? If so,
who configures it? I would expect the kernel CXL core to be the one,
after doing other validation checks. If the CXL specs do not specify
how, we can expect different implementations, and the kernel will need
to implement a generic frontend layer with per-vendor backends.
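
As a rough illustration of that frontend/backend split, here is a minimal
sketch (Python for brevity, not kernel code). Every name in it is
hypothetical: no such interface exists in the kernel today, and whether host
hardware exposes a per-device quota at all is exactly the open question. The
frontend only does the admission check (sum of enabled device caches must fit
in the tracking capacity) and then hands the vendor-specific programming to
the backend.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol


class SnoopFilterBackend(Protocol):
    """Hypothetical per-vendor hook; an assumption, not an existing API."""

    def capacity_lines(self) -> int: ...

    def set_device_quota(self, device: str, lines: int) -> None: ...


@dataclass
class FakeVendorBackend:
    """Stand-in backend that only records what it is asked to program."""

    total_lines: int
    programmed: Dict[str, int] = field(default_factory=dict)

    def capacity_lines(self) -> int:
        return self.total_lines

    def set_device_quota(self, device: str, lines: int) -> None:
        self.programmed[device] = lines


@dataclass
class CacheFrontend:
    """Generic layer: admission check first, then vendor programming."""

    backend: SnoopFilterBackend
    enabled: Dict[str, int] = field(default_factory=dict)

    def enable_cache(self, device: str, device_cache_lines: int) -> bool:
        used = sum(self.enabled.values())
        if used + device_cache_lines > self.backend.capacity_lines():
            # Enabling this device would oversubscribe the tracking resources.
            return False
        self.backend.set_device_quota(device, device_cache_lines)
        self.enabled[device] = device_cache_lines
        return True


backend = FakeVendorBackend(total_lines=1024)
frontend = CacheFrontend(backend)
first_ok = frontend.enable_cache("accel0", 768)
second_ok = frontend.enable_cache("accel1", 512)  # 768 + 512 > 1024, refused
```

The second device is refused rather than squeezed in, which matches the
"total device caches must not exceed the snoop cache size" rule from the
original document; a real policy could of course be more flexible.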


>> Therefore, some memory allocation API is required for dealing with the
>> amount of
>> memory the snoop cache can track, and the host memory a device can
>> access to.
>> The device needs the physical address to work with, and it is in this
>> required
>> translation from virtual to physical addresses where we can enforce the
>> restriction. Of course, such an API does already exist, although not
>> with the
>> checking we need: the kernel DMA API.
>>
>>
>> (Secure) memory allocation  and CXL.cache
>> ===========================
>>
>> DMAs allow devices to perform read/write operations to system memory
>> without any
>> cpu intervention after the (meta)data about how to perform the DMA is
>> given to
>> the device. CXL.cache is more than DMA because the system memory caches are
>> implicitly involved but for the sake of handling this by the operating
>> system,
>> not too much different. The important point here is there is no restriction
>> about the DMAble memory to be used by a device, but due to the snoop cache
>> limitations, this needs to change for CXL: code aware of the snoop cache
>> state
>> and what a device requires needs to be consulted for properly handling the
>> available space.
>>
> As what I replied above, I think we probably need a HW mechanism to solve
> this problem nicely and decently. (Thinking sharing cache is
> also a pre-condition of side-channel attack, even here is a snoop state
> cahce.) With the HW mechanism, allocating the space of snoop state
> cache might imply a glue layer of snoop cache management for different
> CXL HB vendors to plug into the CXL core.


Just what I mentioned above :-)


Glad to have someone else seeing the problem.


>
> So when the CXL driver is initialized, the space of the snoop state cache
> is allocated. With that is solved, for restricting the device to access the
> memory (creating/mapping an IOVA for the DMA memory), SW can still leverage
> the current Linux IOMMU/DMA APIs.
>

This is my main concern. Note the DMA/IOMMU is likely needed for normal
device operations, and that will be through CXL.io. The same mapping
should then not be shared with CXL.cache, or it can be, but with
additional per-mapping flags and, obviously, API changes. The
implications are more important if the IOMMU is enabled, at least if we
take at face value what the specs say about some ATS/IOMMU mappings
being allowed only through CXL.io. Without IOMMU, it turns into the
problem of rogue devices monopolizing the snoop cache.


>> Should we use the kernel DMA API for CXL.cache allocations? This API
>> deals with
>> memory coherency what is not needed for the CXL.cache case. However, it is
>> connected with the IOMMU functionality what is required for CXL.cache if
>> it is
>> enabled.
>>
>> I think the solution should be to implement a CXL.cache allocation API
>> inside
>> the CXL core dealing with the snoop cache available space, and to
>> connect with
>> IOMMU kernel code when it is enabled.
>>
>> A security aspect behind DMAs is a device has (usually) no restrictions for
>> memory access. This is true in a system with no IOMMU hardware, and
>> CXL.cache
>> is not different in this case. With IOMMU is a different game though.
>>
>> First of all, IOMMU will be in place for CXL.io, what implies legacy TLP
>> PCIe
>> packets. A CXL.cache operation can not be handled by the IOMMU hardware
>> and the
>> spec states ATS to be used beforehand, that is, the CXL device asking
>> the IOMMU
>> hardware about the physical address to work with, and keeping that
>> translation
>> internally. The CXL spec specifies ATS service extensions for CXL, and
>> some ATS
>> requests can tell the device some addresses only to be used through
>> CXL.io. This
>> implies some sort of knowledge about CXL is required by the IOMMU/ATS
>> hardware
>> which depends on how the per device tables are programmed by the Host.
>> However,
>> AFAIK, this is not supported yet by any Linux kernel IOMMU vendor
>> support. Note
>> the usual IOMMU device/domain tables will/can be used for normal DMA
>> transfers,
>> so IOMMU configuration, both in the Host and by the HW, needs to know
>> which parts
>> of the domain are for DMAs and which are for CXL.cache.
>>
>> Assuming this support will be implemented at some point in the future, the
>> questions are, when?, and, how safe is it?
>>
>> Can a device issue CXL.cache operations using arbitrary physical
>> addresses? It
>> seems there are some cases where the hardware can take control of PCIe TLP
>> packets with the ATS bit on. For example, if there is a PCIe bridge in
>> the path,
>> and with that bridge using a specific redirection table based on
>> configured ATS
>> per device ranges, any TLP with the ATS bit on will be redirected based
>> on such
>> a table, and implying no redirection if no table entry. However, that
>> does not
>> seem to be in place for PCIe Root Complex implementations. For example, AMD
>> IOMMU documentation states ATS TLP packets are not handled at all, implying
>> trusting the device, and if more security is required, the IOMMU
>> hardware can
> Are you referring to the ATS translated request here? I think ATS itself
> doesn't consider the security in its mind.


It seems so, but I guess we agree that is not an option for VMs ...
Without IOMMU you cannot have DMAs from passthrough devices, and if
CXL.cache dodges the IOMMU checks (and no other security mechanism is in
place), CXL.cache should not be allowed in virtualization.


>> check those TLP ATS packets as well, spoiling the ATS advantage. Note
> Yes, AMD IOMMU has the secure ATS support, but as you said, it is pretty
> straight-forward, basically just check every translated request when
> enabled.
>
>> this is
>> PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations
>> follow another path with apparently no further control to enforce the right
>> addresses within the allowed memory ranges per device are used.
>>
>> Because this apparently lack of security for IOMMU and CXL.cache, this
>> implies a
>> CXL device should not be used by VMs or any other user space controlled
>> driver
>> with CXL.cache being enabled. This seems a really serious limitation, so
>> maybe
>> I'm missing something here.
>>
> I think at least for CXL path, IOMMU should have the similar mechanism like
> secure ATS, and let the user to choose if they want it to be enabled or
> not.
>
> In reality, many CSP design the HW by themselves and trust their HW won't
> do messy things, they may want to enable it only on the 3rd party HW.
>
> For confidential computing world, secure ATS is mandatory, and performance
> drop is the price of security.


We can let the user choose ... but in the virtualization world the
provider does not want the user to choose ... and if CXL.cache is like
DMAs without IOMMU, I would say there is a really good reason for that.


>> Regarding virtualization, assuming the security problems do not exist or
>> will be
>> solved, while CXL.mem can be supported with an ahead mapping by the
>> Host, with
>> CXL.cache this needs to be handled when the related driver asks for specific
>> memory to access, and then to configure the IOMMU/ATS tables by the
>> Host. This
>> implies the emulation needs a backend, what an ahead mapping, as currently
>> proposed for CXL.mem can avoid.
>>
>> Finally, if my concerns about the security of CXL.cache with IOMMU are
>> unfounded, at least this document should describe how is this solved and the
>> security enforced by the hardware, and if the kernel requires to handle it
>> specifically (what I really think is the case, at least with IOMMU changes
>> managed by the CXL core).
>>
>>
>> Summary
>> ======
>>
>>
>> Next the proposed tasks to perform for supporting CXL.cache:
>>
>>           - CXL core handling per device CXL.cache enabling based on CXL Root
>>             Complex snoop cache state.
>>
>>           - CXL core implementing a CXL.cache host memory allocation
>> restricting
>>             the physical memory a a device can access to through CXL.cache.
>>
>>           - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io
>> requests.
>>
>>           - Clarify CXL.cache and security with IOMMU.
>>
>>
>>


* Re: RFC: Kernel CXL cache support (and IOMMU implications)
  2024-12-13 14:15   ` Alejandro Lucero Palau
@ 2024-12-24 15:05     ` Jonathan Cameron
  0 siblings, 0 replies; 4+ messages in thread
From: Jonathan Cameron @ 2024-12-24 15:05 UTC (permalink / raw)
  To: Alejandro Lucero Palau
  Cc: Zhi Wang, linux-cxl@vger.kernel.org, iommu, Lukas Wunner

On Fri, 13 Dec 2024 14:15:18 +0000
Alejandro Lucero Palau <alucerop@amd.com> wrote:

> On 11/20/24 22:33, Zhi Wang wrote:
> > On Tue, 19 Nov 2024 16:52:15 +0000
> > Alejandro Lucero Palau <alucerop@amd.com> wrote:
> >
> > Thanks so much for the doc. I just quickly went through the doc and here
> > are my comments.  
> 
> 
> Hi Zhi,
> 
> 
> Thanks for your comments. I did not reply earlier waiting for more 
> feedback from, mainly, the IOMMU kernel guys. Maybe CXL support is 
> something most of them neither have contemplated nor aware of (maybe) 
> requiring special handling. I really think IOMMU/DMA API will need some 
> change, but this document is for discussing it and maybe proving me wrong.
> 
> 
> Let's hope replying to your comments keep things moving somehow ...

Sorry it took me so long to get to this! 

I replied as I read through it, so thoughts may not be totally coherent.
Key points:

1. If you are doing CXL.cache on a device then either your host should be
   checking that the device is not accessing something it shouldn't, or
   you should have done the work to ensure it is part of your trusted
   compute base (TCB).
2. Using restrictions on memory in the IOMMU page tables to avoid
   thrashing of coherency tracking resources / cache in the host is a
   non-starter.  That puts a bound on the number used, but at the cost of
   breaking many use cases.  Page table coverage != cachelines in device.
   a) Endpoint should be part of TCB, trusted not to use more than it is
      told.
   b) Host should not do rubbish QoS, so the burden should be mainly on the
      badly behaving device.
3. I'm not sure why (for what is discussed here) there is any problem with
   VM use cases. The same model used for ATS etc. for PCIe VFs should apply
   just fine here.

Anyhow, that's enough muddying the waters for today.

Jonathan


> 
> >> November, 2024
> >>
> >> Tittle: CXL Cache support by the kernel
> >> Author: Alejandro Lucero (alucerop@amd.com)
> >>
> >> Version 0.1
> >>
> >> Introduction
> >> ========
> >>
> >> After the LPC where I presented the current status of the Type2 CXL.mem
> >> support
> >> patchset, and some ideas about supporting CXL.cache, it is time to dig
> >> deeper in
> >> this second goal, and discussing the security/reliability aspect as well.
> >>
> >> It is also important to try to describe how this is going to work and
> >> what the
> >> kernel needs to know and enforce. Reading the CXL specs when having in
> >> mind some
> >> specific use case can easily lead to assuming certains aspects with a
> >> different
> >> perspective from other readers/use cases. To start with, it is necessary to
> >> differentiate two "CXL cache" functionalities when a Type2 device is in
> >> place:
> >>
> >> 1) A Type2 device caching Host memory.
> >>
> >> 2) The Host caching HDM memory, that is the memory inside the Type2 CXL
> >> device.
> >>
> >> The first option is also what a Type1 device can do, and the kernel support
> >> needs to manage all those Type1/2 per CXL Root Complex knowing the resources
> >> limitation, that is the snooping cache size.
> >>
> >> A snoop cache allows the host to track which memory is being used/cached by
> >> those devices, enforcing the cache coherency. The specs are not clear
> >> about some
> >> important aspects regarding how the host can enforce the proper use of
> >> this by
> >> devices or even if the snoop cache needs to do so. At pages 786 and 787
> >> of CXL
> >> specs 3.1, how the system software should deal with CXL cache devices is
> >> given,
> >> but this is inside a Hot-plug section. I think we can assume the Host
> >> firmware/BIOS will follow same approach for enabling CXL cache, and the
> >> kernel
> >> needs to look at those devices with CXL cache enabled by the BIOS for
> >> properly
> >> handling the available space in the snoop cache.
> >>
> >> It is also worth to mention the CXL.cache protocol can be used in the two
> >> "CXL cache" functionalities listed above. However, the last CXL spec implies
> >> CXL.cache only used for the first case. Some comments about what the
> >> specs say
> >>
> >> regarding number of devices with a cache for host memory:
> >>
> >>           - up to 16 Type1 and/or Type2 devices allowed per VH.
> >>
> >> can be easily confused with the limitations of just one CXL Type2 device
> >> using
> >> CXL.cache for enforcing coherency of its HDM. 

It has been a while since I read the relevant sections, but CXL 3.0 introduced
a cache ID that I thought was precisely to allow for multiple CXL.cache agents
per VCS (not the HDM-DB stuff that for this purpose is replacing bias
based coherency).  There may well not be any hosts that support that yet though,
and it doesn't really matter for the rest of this discussion.

> >> This limitation is
> >> overcome with
> >> forcing Type2 device using HDM-DB, which relies on CXL.mem instead of
> >> CXL.cache
> >> for HDM cache coherency.
> >>
> >> While the Host is assumed to be able to access HDM in a Type2 device, and
> >> keeping data in the host cpu caches, it is the Type2 device
> >> responsibility to
> >> properly manage cache coherency of its HDM. There is nothing the kernel can
> >> control here.
> >>
> >> Therefore the interesting part and what this documents tries to cover is the
> >> Host memory being cached by Type2 or Type1 devices. While the main goal is
> >> discussing how the kernel needs to handle this, and to describe how it
> >> should
> >> work when CXL devices are used by the system/Host, some comments are made to
> >> cover the virtualization case where those CXL devices can potenetially
> >> be used
> >> (device passthrough) by guests/VMs. I try to expose the current security
> >> problems where IOMMU is used for restricting what a guest controlled
> >> CXL.cache
> >> device can read/write in Host memory what I think needs to be clarified by
> >> hardware vendors.
> >>
> >>
> >> Understanding the memory accesses from CXL devices
> >> ==================================
> >>
> >> For the sake of presenting the case about kernel CXL.cache support, I'll
> >> try to
> >> explain how it works (I should say "how I think it works") and the main
> >> points
> >> to discuss regarding how to implement this support. So, do not take the next
> >> explanation as the definitive answer or guide, and if you think there
> >> are errors
> >> or maybe too much generalization at some points, please help fixing or
> >> adding
> >> further details. Also, consider some parts as just me thinking out loud,
> >> what
> >> maybe help other people (or confuse them!).
> >>
> >> The CXL.cache protocol allows devices to be part of the coherency ring

Probably avoid 'ring' in terminology. Just "coherency of the system" is fine
I think.

> >> of the
> >> system.
> >>
> >> Let's start with a Type2 device reading from a specific host memory
> >> address. The
> >> final situation is 64bytes (cache line) from host memory copied to the
> >> device
> >> cache, supposedly for being used by the device/accelerator. If the data
> >> changes,
> >> because some host cpu modifies it, the device will be signalled by the
> >> coherency
> >> ring, so the device will know. The important point here is the device can be
> >> told because the Host knows the device has a copy or the only copy of that
> >> data/memory. And that is thanks to the snoop cache implemented by the

Probably refer to "coherency tracking" rather than a snoop cache, which is
just one way of doing it, and don't specify where it is.  It could be in any
number of places depending on system design.

> >> CXL Root
> >> Complex.
> >>
> >> A device caching host memory can be used as well for writes to host memory
> >> through the cache coherency ring. A device can not just read host memory and
> >> keep it, but it can modified it. The implications of writes versus reads
> >> are not
> >> important for the goal of this document. It requires the device to
> >> support more
> >> protocol exchange cases, but regarding the snoop cache, it is irrelevant.
> >>
> >> There arise obvious questions about how this snoop cache is going to work.
> >>
> >> First, with the simple case of just one device caching Host memory. From the
> >> specs, the device CXL.cache should not be enabled by the Host if the device
> >> cache is bigger than the snoop cache. However, what does preclude a
> >> device to do
> >> more memory accesses than what the snoop cache can cover? This can be partly
> >> explained with some allocation control for CXL.cache what is discussed
> >> in the
> >> next section. But a "rogue" device could try things like this, what for
> >> the case
> >> of a single device using the snoop cache and without any other concern about
> >> security, is probably fine:

If you are letting a device into your host coherency and you haven't done
a bunch of stuff to ensure it is not rogue, you are on your own.  That stuff is the
domain of technologies such as attestation and more basic stuff like supply chain
management.

Having said that, the host can easily identify such a problem and refuse to do
anything that would cause it to lose track + issue an appropriate RAS event
(maybe including isolating the device).

> >>
> >>           - With a Type2, the snoop cache will tell the device to release
> >> another
> >>             line, meaning any modified line to be sent back to the Host.
> >>           - Any performance problem will only have an impact on the
> >> device itself.

Agreed, the whole sizing thing is a performance question, not so much a
correctness one, though a device might not make forward progress
if it can't get enough data to do what it wants to do.

> >>
> >> Then the case of multiple CXL devices caching Host memory in the same
> >> CXL Root
> >> Complex and therefore same CXL Snoop Cache:
> >>
> >> * How can the snoop cache track reads from different devices without one
> >> device
> >>     monopolizing the full space?
> >>
> >>           - enforcing snoop cache slices by software?
> >>           - allowing specific/limited host ranges by the kernel?

To me, that's a hardware problem.  Hardware that doesn't do the
QoS handling for this is broken. Sure we can quirk that if needed
but I'd do it in the first instance by declaring the hardware so
broken we only support one device doing CXL.cache for each set of
tracking resources.  Seems you say that later :)

> >>  
> > I would like to compare it with the approaches that solves the similar
> > problem of the CPU cache since they might have similar essence.
> >
> > CPU cache suffered from the similar problems that noisy and
> > restless neighborhood keep poking the cache that might cause performance
> > drop. Nowadays, it is solved by the HW mechanism, cache allocation. For
> > Intel, it is called cache allocation technology(CAT) which is a subset of
> > Resource Director Technology(RDT). They can be also used in the
> > virtualization world.

Fine in theory; in practice not used all that widely, but agreed, similar
solutions could be applied here. They are a pain to tune though, so I'd
expect to see better non-configurable solutions for QoS first and the
ability to tweak only in a few generations' time (could be wrong though!)

> >
> > Before SW gets the support from the HW, many research papers were talking
> > about solving it via page color. E.g. allocate the VM memory with page
> > color awareness for different VMs. But I don't think those ideas eventually
> > land in the mainline.
> >
> > Back to this prob, I think probably SW is going to rely on a HW mechanism
> > to solve this problem nicely and decently, the same as CPU side.  
> 
> 
> I agree with the need of relying on HW, what the following sentence (in 
> the original doc) tried to summarize.
> 
> But we need how this is done by HW for avoiding some undesirable 
> situations if we just blindly configure CXL.cache to those devices 
> advertising it and apparently without no problems regarding the snoop 
> cache size.

I'm not convinced we need to do anything in software.  This is no worse than
head of line blocking on PCIe (bandwidth to the host is often less than
what would be needed if all devices below some switches wanted to DMA to
host memory at the same time).  In theory we can tweak demand by
messing around with device specific stuff, or tweaking link controls,
but in practice does anyone do this in a general purpose system?
Don't think so. We rely on sane QoS handling via credit allocations
etc. and the switch doing something sensible.

Sure, a particular implementation might not do this, but that to
me is a quirk that we need to handle on a case by case basis.

> 
> 
> >> AFAIK, there is not any kind of hardware control for avoiding this
> >> contention.
> >> Note that with the proper checking by the BIOS and by the kernel (for
> >> hotplug or
> >> those not enabled devices yet during boot time), the size of total
> >> device caches
> >> allowed per CXL Root Complex should not be bigger than the snoop cache
> >> size, and
> >> therefore theoretically no contention at all ... if the devices do the right
> >> thing. From software the only thing we can do is to ensure the CXL.cache
> >> accesses from a device are within a range with same size than the enabled
> >> CXL.cache.
> >>  
> > What would be the consequence if we violate this rule?  
> 
> 
> Contention or just one device getting less snoop cache coverage implying 
> requests from the snoop cache for flushing cached data before trying to 
> access more data in the host.
> 
> With a full snoop cache, a new access to uncached addresses will trigger 
> some action by the snoop cache. I'm assuming it will be the device where 
> that new access comes from the one receiving orders for first flushing 
> cached data for making space, but I may be wrong.

That might be a design choice, but I don't see that being true in general.
Imagine a very simple QoS policy that aims to balance devices' use of tracking
entries.  In that case, if only one agent is sending requests for a bit, it
will rightly get all the entries.  When a new agent wakes up and sends requests,
the eviction will be of an entry for the original agent.  Lots of other policies
will lead to that.
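
To make that concrete, here is a toy model of such a balancing policy (a
sketch only, in Python; real hosts are free to do something entirely
different, and the class and method names are made up for illustration). It
assumes agents touch disjoint addresses and that, when the tracker is full,
the victim is the oldest entry of whichever agent currently holds the most
entries.

```python
from collections import Counter, OrderedDict


class Tracker:
    """Toy coherency tracker with a 'take from the heaviest user' policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        # cacheline address -> owning agent, oldest entry first
        self.entries = OrderedDict()

    def usage(self):
        """Number of tracked lines per agent."""
        return Counter(self.entries.values())

    def request(self, agent, addr):
        """Track addr for agent; return (victim_agent, victim_addr) or None."""
        if addr in self.entries:
            return None
        victim = None
        if len(self.entries) >= self.capacity:
            usage = self.usage()
            heaviest = max(usage, key=usage.get)
            # Evict the oldest line owned by the heaviest agent.
            victim_addr = next(
                a for a, owner in self.entries.items() if owner == heaviest
            )
            del self.entries[victim_addr]
            victim = (heaviest, victim_addr)
        self.entries[addr] = agent
        return victim


tracker = Tracker(capacity=4)
for addr in range(4):
    tracker.request("A", addr)  # A is alone, so it gets every entry

# B wakes up: each of its first requests evicts one of A's lines, not B's.
evictions = [tracker.request("B", 100), tracker.request("B", 101)]
```

So the agent paying for a new request is not necessarily the requester, which
is the point above.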

> 
> But a rogue device, and the first using CXL.cache, could get most of the 
> snoop cache if no other control existing. Has the snoop cache have this 
> control per device about amount of tracked cachelines allowed? If so, 
> who is configuring it properly? I would expect the kernel CXL core being 
> the one after doing other checks for validation. If the CXL specs do not 
> specify how, we can expect different implementations, and the kernel 
> will need to implement a generic frontend layer with per vendor backends.

It's a host implementation thing. Out of scope for CXL. I'd expect we might
see RDT / MPAM / whatever your favourite arch uses for resource controls for
this, but even those are often soft limits, so if only one agent is
doing anything it still gets the whole set of tracking entries.

As before though, I'm unconcerned about rogue devices as if such a device
is plugged in it can do a lot more damage than this! (your later thing
on lying about translated addresses).

> 
> 
> >> Therefore, some memory allocation API is required for dealing with the
> >> amount of
> >> memory the snoop cache can track, and the host memory a device can
> >> access to.
> >> The device needs the physical address to work with, and it is in this
> >> required
> >> translation from virtual to physical addresses where we can enforce the
> >> restriction. Of course, such an API does already exist, although not
> >> with the
> >> checking we need: the kernel DMA API.
> >>
> >>
> >> (Secure) memory allocation  and CXL.cache
> >> ===========================
> >>
> >> DMAs allow devices to perform read/write operations to system memory
> >> without any
> >> cpu intervention after the (meta)data about how to perform the DMA is
> >> given to
> >> the device. CXL.cache is more than DMA because the system memory caches are
> >> implicitly involved but for the sake of handling this by the operating
> >> system,
> >> not too much different. The important point here is there is no restriction
> >> about the DMAble memory to be used by a device, but due to the snoop cache
> >> limitations, this needs to change for CXL: code aware of the snoop cache
> >> state
> >> and what a device requires needs to be consulted for properly handling the
> >> available space.

I'm not sure how this is connected to DMA mappings etc.

What we map in the page tables is an upper bound on what might be cached by
the device, but it is not the only thing applying that upper bound.
The device is making guarantees not to cache more than a certain amount of memory
at a time - likely a tiny subset of what is mapped. Imagine a DB accelerator.
Those will typically have DMA mappings for many TB of data, but the query engines
will only be using a tiny amount of it at any time. They will evict the cachelines
when they are done with reading particular data (usually as part of a pointer chase).
Hence they will use only a few cachelines, but the page tables and indeed
address translation cache will cover much more.
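
A toy model of that distinction, as a sketch only: the mapped window is the
upper bound on what the device may touch, while the number of cachelines it
holds at once is bounded separately by the device. The struct, the limit of
4 lines and the round-robin eviction are all assumptions made for
illustration, not anything taken from the CXL spec or a real device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u
#define DEV_MAX_LINES  4u  /* hypothetical device guarantee: lines held at once */

/* Toy device: large mapped window, tiny set of lines held at any time. */
struct toy_dev {
	uint64_t map_base;             /* start of the mapped window */
	uint64_t map_len;              /* window length in bytes (upper bound) */
	uint64_t held[DEV_MAX_LINES];  /* cacheline addresses currently held */
	size_t nr_held;                /* lines currently held */
	size_t next_victim;            /* round-robin eviction index */
	size_t peak_held;              /* most lines ever held at once */
};

void toy_dev_init(struct toy_dev *d, uint64_t base, uint64_t len)
{
	d->map_base = base;
	d->map_len = len;
	d->nr_held = 0;
	d->next_victim = 0;
	d->peak_held = 0;
}

/* Cache the line containing addr. Returns false if addr is not mapped. */
bool toy_dev_read(struct toy_dev *d, uint64_t addr)
{
	uint64_t line = addr & ~(uint64_t)(CACHELINE_SIZE - 1);
	size_t i;

	if (addr < d->map_base || addr - d->map_base >= d->map_len)
		return false;

	for (i = 0; i < d->nr_held; i++)
		if (d->held[i] == line)
			return true;  /* already held, nothing new to track */

	if (d->nr_held < DEV_MAX_LINES) {
		d->held[d->nr_held++] = line;
	} else {
		/* Evict an old line before taking the new one. */
		d->held[d->next_victim] = line;
		d->next_victim = (d->next_victim + 1) % DEV_MAX_LINES;
	}
	if (d->nr_held > d->peak_held)
		d->peak_held = d->nr_held;
	assert(d->nr_held <= DEV_MAX_LINES);
	return true;
}
```

Walking a pointer chain across a 1 TB window in this model never holds more
than DEV_MAX_LINES lines, which is the quantity the snoop filter has to
track, not the size of the mapping.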


> >>  
> > As what I replied above, I think we probably need a HW mechanism to solve
> > this problem nicely and decently. (Thinking sharing cache is
> > also a pre-condition of side-channel attack, even here is a snoop state
> > cache.) With the HW mechanism, allocating the space of snoop state
> > cache might imply a glue layer of snoop cache management for different
> > CXL HB vendors to plug into the CXL core.  
> 
> 
> Just what I did mention above :-)
> 
> 
> Glad to have someone else seeing the problem.

I have no problem with an RDT type scheme but that is about performance
not correctness and is a fine tuning on top of telling the devices we
have restricted tracking capacity.

> 
> 
> >
> > So when the CXL driver is initialized, the space of the snoop state cache
> > is allocated. With that is solved, for restricting the device to access the
> > memory (creating/mapping an IOVA for the DMA memory), SW can still leverage
> > the current Linux IOMMU/DMA APIs.
> >  
> 
> This is my main concern. Note the DMA/IOMMU is likely needed for normal 
> device operations, and that will be through CXL.io. Same mapping should 
> then not be shared for CXL.cache, or it can, but with additional per 
> mapping flags and obviously API changes. The implications here are 
> obviously more important if IOMMU is enabled, at least if we take what 
> the specs  say about some ATS/IOMMU mapping only to be allowed by 
> CXL.io. Without IOMMU, it turns into the problem of rogue devices 
> monopolizing the snoop cache.

If the snoop cache implementation is that bad, go annoy the hardware
folk. + Rogue devices are not a thing in production systems, well they
are but constraining them is not a problem for this layer of the stack
(if FPGA or similar then checks for rogue accesses need to belong in the shell
around the bit you allow customers to program - if not cloud vendors
etc will not plug in your devices)


> 
> 
> >> Should we use the kernel DMA API for CXL.cache allocations? This API
> >> deals with
> >> memory coherency which is not needed for the CXL.cache case. However, it is
> >> connected with the IOMMU functionality which is required for CXL.cache if
> >> it is
> >> enabled.
> >>
> >> I think the solution should be to implement a CXL.cache allocation API
> >> inside
> >> the CXL core dealing with the snoop cache available space, and to
> >> connect with
> >> IOMMU kernel code when it is enabled.

Just to check, does anyone care about these devices with iommu disabled?
In my opinion that should just be blocked on day one.

> >>
> >> A security aspect behind DMAs is a device has (usually) no restrictions for
> >> memory access. This is true in a system with no IOMMU hardware, and
> >> CXL.cache
> >> is not different in this case. With IOMMU is a different game though.
> >>
> >> First of all, IOMMU will be in place for CXL.io, what implies legacy TLP
> >> PCIe
> >> packets. A CXL.cache operation can not be handled by the IOMMU hardware
> >> and the
> >> spec states ATS to be used beforehand, that is, the CXL device asking
> >> the IOMMU
> >> hardware about the physical address to work with, and keeping that
> >> translation
> >> internally. The CXL spec specifies ATS service extensions for CXL, and
> >> some ATS
> >> requests can tell the device some addresses only to be used through
> >> CXL.io. This
> >> implies some sort of knowledge about CXL is required by the IOMMU/ATS
> >> hardware
> >> which depends on how the per device tables are programmed by the Host.

I think that is not (only?) about security, it is about correctness as some hosts may
not support CXL.cache accesses to some regions of the host address map, e.g. peer
PCI device BARs for which we are doing UIO or similar.

> >> However,
> >> AFAIK, this is not supported yet by any Linux kernel IOMMU vendor
> >> support. Note
> >> the usual IOMMU device/domain tables will/can be used for normal DMA
> >> transfers,
> >> so IOMMU configuration, both in the Host and by the HW, needs to know
> >> which parts
> >> of the domain are for DMAs and which are for CXL.cache.

Potentially yes.  Though there are a bunch of ways to do that which don't
necessarily expose them to the OS, they may be characteristics of the underlying
HPA memory map.

> >>
> >> Assuming this support will be implemented at some point in the future, the
> >> questions are, when?, and, how safe is it?
> >>
> >> Can a device issue CXL.cache operations using arbitrary physical
> >> addresses?

Of course, same as a PCIe device doing DMA with translated addresses.
You rely on attestation and in some cases sanity checks in the host
(they are still subject to the Confidential compute type checks on physical
 address space permissions for example).

This is what all the fun of device security is about.  You must know
and trust devices before you let them do anything at all to host memory.

> >> It
> >> seems there are some cases where the hardware can take control of PCIe TLP
> >> packets with the ATS bit on. For example, if there is a PCIe bridge in
> >> the path,
> >> and with that bridge using a specific redirection table based on
> >> configured ATS
> >> per device ranges, any TLP with the ATS bit on will be redirected based
> >> on such
> >> a table, and implying no redirection if no table entry. However, that
> >> does not
> >> seem to be in place for PCIe Root Complex implementations. For example, AMD
> >> IOMMU documentation states ATS TLP packets are not handled at all, implying
> >> trusting the device, and if more security is required, the IOMMU
> >> hardware can  
> > Are you referring to the ATS translated request here? I think ATS itself
> > doesn't consider the security in its mind.  
> 
> 
> It seems so, but I guess we agree that is not an option for VMs ... 

Why not?  There is still quite a bit of infrastructure needed for all the
iommufd work to land, but it's getting close.  That absolutely allows
for ATS - up to the host to check it trusts the device before assigning
any part of it to the VM.  On at least some architectures the translation
is the full 2 stage one to host physical address.

> Without IOMMU you can not have DMAs from passthrough devices, and if 
> CXL.cache dodges the IOMMU checks (and no other security mechanism is in 
> place), CXL.cache should not be allowed in virtualization.

No difference to ATS for conventional PCIe. Sure you need to do your security
checks. That's standard stuff. See what Lukas has been working on for last few
years around CMA + the bits of relevant confidential compute that are actually
useful for non CoCo usecases.

> 
> 
> >> check those TLP ATS packets as well, spoiling the ATS advantage. Note  
> > Yes, AMD IOMMU has the secure ATS support, but as you said, it is pretty
> > straight-forward, basically just check every translated request when
> > enabled.

Interesting approach. I'd not noticed that before.

There are other solutions that keep enough tracking data in host to verify
translated requests are fine but if you've checked the device, none of that
should be needed.
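
For what "enough tracking data in host to verify translated requests" could
look like, here is a rough software model. Real implementations do this in
hardware and I am not describing any particular one: the per-device grant
table, its size and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GRANTS 8u

/* One range handed to the device by an earlier translation completion. */
struct ats_grant {
	uint64_t hpa;   /* start of the granted host physical range */
	uint64_t len;   /* length in bytes */
	bool writable;  /* whether write permission was granted */
};

/* Hypothetical per-device table kept on the host side. */
struct dev_grants {
	struct ats_grant g[MAX_GRANTS];
	size_t nr;
};

/* Record a range the device was told it may use. */
bool grant_record(struct dev_grants *t, uint64_t hpa, uint64_t len,
		  bool writable)
{
	if (t->nr == MAX_GRANTS || len == 0 || hpa + len < hpa)
		return false;
	t->g[t->nr++] = (struct ats_grant){ hpa, len, writable };
	assert(t->nr <= MAX_GRANTS);
	return true;
}

/* Check a translated request against what was actually granted. */
bool translated_req_ok(const struct dev_grants *t, uint64_t hpa,
		       uint64_t len, bool write)
{
	size_t i;

	if (len == 0 || hpa + len < hpa)
		return false;  /* empty or wrapping request */

	for (i = 0; i < t->nr; i++) {
		const struct ats_grant *g = &t->g[i];

		if (hpa >= g->hpa && hpa + len <= g->hpa + g->len &&
		    (!write || g->writable))
			return true;
	}
	return false;
}
```

A device that invents a physical address, or writes to a range it was only
granted for reading, fails the check. The cost is a lookup on every
translated request, which is the overhead a trusted device lets you skip.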

> >  
> >> this is
> >> PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations
> >> follow another path with apparently no further control to enforce the right
> >> addresses within the allowed memory ranges per device are used.
> >>
> >> Because of this apparent lack of security for IOMMU and CXL.cache, this
> >> implies a
> >> CXL device should not be used by VMs or any other user space controlled
> >> driver
> >> with CXL.cache being enabled. This seems a really serious limitation, so
> >> maybe
> >> I'm missing something here.
> >>  
> > I think at least for CXL path, IOMMU should have the similar mechanism like
> > secure ATS, and let the user to choose if they want it to be enabled or
> > not.
> >
> > In reality, many CSPs design the HW by themselves and trust their HW won't
> > do messy things, they may want to enable it only on the 3rd party HW.

Exactly.  They do a lot of paperwork and security audits.  These devices have
to be in the trusted computing base to operate.

> >
> > For confidential computing world, secure ATS is mandatory, and performance
> > drop is the price of security.  

That's a choice, but it's not the only one for confidential compute
and from my recollection of other CoCo solutions they don't all do
this.

> 
> 
> We can let the user to choose ... but in the virtualization world the 
> provider does not want the user to choose ... and if CXL.cache is like 
> DMAs without IOMMU, I would say this is a really good reason.

Just require appropriate attestation. That's a userspace policy (see
Lukas' series) + the one Dan has for confidential compute.

> 
> 
> >> Regarding virtualization, assuming the security problems do not exist or
> >> will be
> >> solved, while CXL.mem can be supported with an ahead mapping by the
> >> Host, with
> >> CXL.cache this needs to be handled when the related driver asks for specific
> >> memory to access, and then to configure the IOMMU/ATS tables by the
> >> Host. This
> >> implies the emulation needs a backend, what an ahead mapping, as currently
> >> proposed for CXL.mem can avoid.
> >>
> >> Finally, if my concerns about the security of CXL.cache with IOMMU are
> >> unfounded, at least this document should describe how is this solved and the
> >> security enforced by the hardware, and if the kernel requires to handle it
> >> specifically (what I really think is the case, at least with IOMMU changes
> >> managed by the CXL core).

It's an implementation specific question. If you are selling a device that
uses cxl.cache, expect to do a lot of paperwork and security audits + probably
reveal a lot of implementation details to your major customers.

We should expose policy control to host userspace and provide it all the relevant
information.  I'm not yet convinced this has anything to do with DMA mapping or
the IOMMU core code.  There might be some advantages in locking down accesses
via CXL.io vs CXL.cache for some IOMMU designs but my understanding is that is
not a universal thing at all.  If userspace says accept the device after
all the certs etc are checked then it is part of the trusted compute base and
there will not need to be anything in the IOMMU page tables that is unique
to this.

> >>
> >>
> >> Summary
> >> ======
> >>
> >>
> >> Next the proposed tasks to perform for supporting CXL.cache:
> >>
> >>           - CXL core handling per device CXL.cache enabling based on CXL Root
> >>             Complex snoop cache state.

Agreed.

> >>
> >>           - CXL core implementing a CXL.cache host memory allocation
> >> restricting
> >>             the physical memory a device can access through CXL.cache.

No on this one. It's a broken solution to a potential hardware problem.

> >>
> >>           - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io
> >> requests.

Not required in general. May be required for some host IOMMU architectures.
So we may need some hooks.

> >>
> >>           - Clarify CXL.cache and security with IOMMU.


Standard device security flows.  Policy for what counts as 'secure' may
be tighter, but it's no different to flows for PCIe devices in general.
Bring them into your TCB.

Jonathan

> >>
> >>
> >>  
> 
> 



end of thread, other threads:[~2024-12-24 15:05 UTC | newest]

Thread overview: 4+ messages
2024-11-19 16:52 RFC: Kernel CXL cache support (and IOMMU implications) Alejandro Lucero Palau
2024-11-20 22:33 ` Zhi Wang
2024-12-13 14:15   ` Alejandro Lucero Palau
2024-12-24 15:05     ` Jonathan Cameron