* RFC: Kernel CXL cache support (and IOMMU implications)
@ 2024-11-19 16:52 Alejandro Lucero Palau
  2024-11-20 22:33 ` Zhi Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Alejandro Lucero Palau @ 2024-11-19 16:52 UTC (permalink / raw)
  To: linux-cxl@vger.kernel.org, iommu

November, 2024

Title: CXL Cache support by the kernel
Author: Alejandro Lucero (alucerop@amd.com)

Version 0.1

Introduction
========

After the LPC where I presented the current status of the Type2 CXL.mem support
patchset, and some ideas about supporting CXL.cache, it is time to dig deeper
into this second goal, and to discuss the security/reliability aspect as well.

It is also important to try to describe how this is going to work and what the
kernel needs to know and enforce. Reading the CXL specs with a specific use
case in mind can easily lead to assumptions that readers with other use cases
do not share. To start with, it is necessary to differentiate two "CXL cache"
functionalities when a Type2 device is in place:

1) A Type2 device caching Host memory.

2) The Host caching HDM memory, that is, the memory inside the Type2 CXL
   device.

The first option is also what a Type1 device can do, and the kernel support
needs to manage all those Type1/2 devices per CXL Root Complex, taking into
account the resource limitation, that is, the snoop cache size.

A snoop cache allows the host to track which memory is being used/cached by
those devices, enforcing cache coherency. The specs are not clear about some
important aspects regarding how the host can enforce the proper use of this by
devices, or even whether the snoop cache needs to do so. Pages 786 and 787 of
the CXL 3.1 spec describe how system software should deal with CXL cache
devices, but this is inside a Hot-plug section. I think we can assume the Host
firmware/BIOS will follow the same approach for enabling CXL cache, and the
kernel needs to look at those devices with CXL cache enabled by the BIOS to
properly account for the available space in the snoop cache.

It is also worth mentioning that the CXL.cache protocol can be used in the two
"CXL cache" functionalities listed above. However, the latest CXL spec implies
CXL.cache is only used for the first case. Some comments about what the specs
say regarding the number of devices with a cache for host memory:

         - up to 16 Type1 and/or Type2 devices allowed per VH.

can be easily confused with the limitation of just one CXL Type2 device using
CXL.cache for enforcing coherency of its HDM. This limitation is overcome by
forcing Type2 devices to use HDM-DB, which relies on CXL.mem instead of
CXL.cache for HDM cache coherency.

While the Host is assumed to be able to access HDM in a Type2 device, keeping
data in the host CPU caches, it is the Type2 device's responsibility to
properly manage cache coherency of its HDM. There is nothing the kernel can
control here.

Therefore the interesting part, and what this document tries to cover, is Host
memory being cached by Type2 or Type1 devices. While the main goal is to
discuss how the kernel needs to handle this, and to describe how it should
work when CXL devices are used by the system/Host, some comments cover the
virtualization case where those CXL devices can potentially be used (device
passthrough) by guests/VMs. I try to expose the current security problems when
an IOMMU is used for restricting what a guest-controlled CXL.cache device can
read/write in Host memory, which I think need to be clarified by hardware
vendors.


Understanding the memory accesses from CXL devices
==================================

For the sake of presenting the case about kernel CXL.cache support, I'll try to
explain how it works (I should say "how I think it works") and the main points
to discuss regarding how to implement this support. So, do not take the next
explanation as the definitive answer or guide, and if you think there are
errors or maybe too much generalization at some points, please help by fixing
them or adding further details. Also, consider some parts as just me thinking
out loud, which may help other people (or confuse them!).

The CXL.cache protocol allows devices to be part of the coherency ring of the
system.

Let's start with a Type2 device reading from a specific host memory address.
The final situation is 64 bytes (a cache line) of host memory copied to the
device cache, supposedly for being used by the device/accelerator. If the data
changes because some host CPU modifies it, the device will be signalled
through the coherency ring, so the device will know. The important point here
is that the device can be told because the Host knows the device has a copy,
or the only copy, of that data/memory. And that is thanks to the snoop cache
implemented by the CXL Root Complex.

A device caching host memory can also write to host memory through the cache
coherency ring. A device can not only read host memory and keep it, but also
modify it. The implications of writes versus reads are not important for the
goal of this document. Writes require the device to support more protocol
exchange cases, but regarding the snoop cache, this is irrelevant.

Obvious questions arise about how this snoop cache is going to work.

First, the simple case of just one device caching Host memory. From the specs,
the device CXL.cache should not be enabled by the Host if the device cache is
bigger than the snoop cache. However, what precludes a device from doing more
memory accesses than the snoop cache can cover? This can be partly explained
by some allocation control for CXL.cache, which is discussed in the next
section. But a "rogue" device could try things like this, which, for the case
of a single device using the snoop cache and without any other security
concern, is probably fine:

         - With a Type2, the snoop cache will tell the device to release another
           line, meaning any modified line is sent back to the Host.
         - Any performance problem will only have an impact on the device
           itself.
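
To make the single-device case above more concrete, here is a toy model of a
snoop cache with a fixed number of entries. It is only a sketch of how I
picture it: the structure, the function name, the number of entries and the
round-robin victim selection are all assumptions of mine, and real hardware
will certainly differ.

```c
#include <assert.h>
#include <stdint.h>

#define SF_ENTRIES	4	/* hypothetical, tiny on purpose */
#define SF_LINE_SHIFT	6	/* 64-byte cache lines */

struct sf_entry {
	uint64_t line;		/* host address >> SF_LINE_SHIFT */
	int owner;		/* device caching the line */
	int valid;
};

struct snoop_filter {
	struct sf_entry e[SF_ENTRIES];
	unsigned int next_victim;	/* round-robin replacement (assumed) */
	unsigned int back_invalidates;	/* lines devices were told to release */
};

/*
 * Record that device @owner now caches the line holding @addr.
 * Returns 1 if another tracked line had to be released to make room,
 * 0 otherwise.
 */
static int sf_track(struct snoop_filter *sf, int owner, uint64_t addr)
{
	uint64_t line = addr >> SF_LINE_SHIFT;
	unsigned int i;

	/* Already tracked for this device: nothing to do. */
	for (i = 0; i < SF_ENTRIES; i++)
		if (sf->e[i].valid && sf->e[i].line == line &&
		    sf->e[i].owner == owner)
			return 0;

	/* Use a free entry if there is one. */
	for (i = 0; i < SF_ENTRIES; i++) {
		if (!sf->e[i].valid) {
			sf->e[i] = (struct sf_entry){ line, owner, 1 };
			return 0;
		}
	}

	/* Full: the holder of the victim line must release it first. */
	i = sf->next_victim;
	assert(i < SF_ENTRIES);
	sf->next_victim = (i + 1) % SF_ENTRIES;
	sf->back_invalidates++;
	sf->e[i] = (struct sf_entry){ line, owner, 1 };
	return 1;
}
```

With a single owner, touching more lines than SF_ENTRIES only makes that device
release its own lines. With two owners sharing the same structure, nothing in
this model stops one of them from pushing out every line held by the other.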

Then the case of multiple CXL devices caching Host memory in the same CXL Root
Complex, and therefore the same CXL snoop cache:

* How can the snoop cache track reads from different devices without one device
   monopolizing the full space?

         - enforcing snoop cache slices by software?
         - allowing specific/limited host ranges by the kernel?

AFAIK, there is no hardware control for avoiding this contention. Note that
with the proper checking by the BIOS and by the kernel (for hotplug, or for
devices not yet enabled at boot time), the total size of the device caches
allowed per CXL Root Complex should not be bigger than the snoop cache size,
and therefore there is theoretically no contention at all ... if the devices do
the right thing. From software, the only thing we can do is to ensure the
CXL.cache accesses from a device are within a range of the same size as the
enabled CXL.cache.
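
The enable-time check described above could look roughly like the following.
This is a sketch under my own assumptions: struct snoop_budget and both
functions are made up for illustration, nothing like them exists in the CXL
core today, and I assume both the snoop cache capacity and the device cache
size can be expressed in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per CXL Root Complex accounting. */
struct snoop_budget {
	uint64_t capacity;	/* bytes the snoop cache can track */
	uint64_t committed;	/* sum of device cache sizes already enabled */
};

/*
 * Enable CXL.cache for a device only if its cache still fits in what is
 * left of the snoop cache. Returns false if it would oversubscribe it.
 */
static bool snoop_budget_enable(struct snoop_budget *b, uint64_t dev_cache_size)
{
	assert(b->committed <= b->capacity);

	if (dev_cache_size > b->capacity - b->committed)
		return false;

	b->committed += dev_cache_size;
	return true;
}

/* Give the space back when CXL.cache is disabled or the device goes away. */
static void snoop_budget_disable(struct snoop_budget *b, uint64_t dev_cache_size)
{
	assert(dev_cache_size <= b->committed);
	b->committed -= dev_cache_size;
}
```

Devices already enabled by the BIOS would be accounted first at boot, and
hotplugged devices would go through the same check later on.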

Therefore, some memory allocation API is required for dealing with the amount
of memory the snoop cache can track, and the host memory a device can access.
The device needs the physical address to work with, and it is in this required
translation from virtual to physical addresses where we can enforce the
restriction. Of course, such an API already exists, although not with the
checking we need: the kernel DMA API.


(Secure) memory allocation and CXL.cache
========================================

DMA allows devices to perform read/write operations on system memory without
any CPU intervention once the (meta)data describing the transfer is given to
the device. CXL.cache is more than DMA because the system memory caches are
implicitly involved, but from the point of view of the operating system it is
not too different. The important point here is that there is no restriction on
the DMAble memory a device can use, but due to the snoop cache limitations
this needs to change for CXL: code aware of the snoop cache state and of what
a device requires needs to be consulted for properly handling the available
space.

Should we use the kernel DMA API for CXL.cache allocations? This API deals with
memory coherency, which is not needed for the CXL.cache case. However, it is
connected with the IOMMU functionality, which is required for CXL.cache if the
IOMMU is enabled.

I think the solution should be to implement a CXL.cache allocation API inside
the CXL core dealing with the snoop cache available space, and to connect it
with the IOMMU kernel code when the IOMMU is enabled.
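
As a sketch of what I mean by such an allocation API: every name below is
hypothetical (this is not existing CXL core code), the fixed-size range array
is only there to keep the example short, and the IOMMU hookup is left out.
The idea is just that a device is never granted more host memory for
CXL.cache than its enabled cache size.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define CXL_CACHE_LINE		64u
#define CXL_CACHE_MAX_RANGES	8	/* arbitrary limit for the sketch */

struct cxl_cache_range {
	uint64_t base;		/* host physical address */
	uint64_t len;		/* bytes */
};

/* Hypothetical per-device CXL.cache state kept by the CXL core. */
struct cxl_cache_dev {
	uint64_t quota;		/* enabled CXL.cache size in bytes */
	uint64_t used;		/* bytes already granted */
	unsigned int nr_ranges;
	struct cxl_cache_range ranges[CXL_CACHE_MAX_RANGES];
};

/*
 * Grant the device access through CXL.cache to [base, base + len).
 * Returns 0 on success or a negative errno value.
 */
static int cxl_cache_grant(struct cxl_cache_dev *d, uint64_t base, uint64_t len)
{
	assert(d->used <= d->quota);

	/* Only whole cache lines make sense here. */
	if (!len || base % CXL_CACHE_LINE || len % CXL_CACHE_LINE)
		return -EINVAL;
	if (base + len < base)			/* address wrap-around */
		return -EINVAL;
	if (d->nr_ranges == CXL_CACHE_MAX_RANGES)
		return -ENOMEM;
	if (len > d->quota - d->used)		/* beyond the enabled cache size */
		return -ENOSPC;

	d->ranges[d->nr_ranges++] = (struct cxl_cache_range){ base, len };
	d->used += len;
	return 0;
}
```

The granted ranges are also the information the IOMMU code would need when
programming the per-device tables.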

A security aspect behind DMA is that a device has (usually) no restrictions on
memory access. This is true in a system with no IOMMU hardware, and CXL.cache
is no different in this case. With an IOMMU it is a different game though.

First of all, the IOMMU will be in place for CXL.io, which implies legacy PCIe
TLPs. A CXL.cache operation can not be handled by the IOMMU hardware, and the
spec states ATS is to be used beforehand, that is, the CXL device asks the
IOMMU hardware for the physical address to work with and keeps that
translation internally. The CXL spec specifies ATS extensions for CXL, and the
replies to some ATS requests can tell the device that some addresses are only
to be used through CXL.io. This implies some sort of knowledge about CXL is
required by the IOMMU/ATS hardware, which depends on how the per-device tables
are programmed by the Host. However, AFAIK, this is not supported yet by any
vendor IOMMU driver in the Linux kernel. Note the usual IOMMU device/domain
tables will/can be used for normal DMA transfers, so the IOMMU configuration,
both in the Host and in the HW, needs to know which parts of the domain are
for DMA and which are for CXL.cache.

Assuming this support will be implemented at some point in the future, the
questions are: when, and how safe is it?

Can a device issue CXL.cache operations using arbitrary physical addresses? It
seems there are some cases where the hardware can take control of PCIe TLPs
with the ATS bit on. For example, if there is a PCIe bridge in the path, and
that bridge uses a specific redirection table based on configured per-device
ATS ranges, any TLP with the ATS bit on will be redirected based on such a
table, with no redirection if there is no table entry. However, that does not
seem to be in place for PCIe Root Complex implementations. For example, the
AMD IOMMU documentation states that ATS TLPs are not handled at all, implying
trust in the device, and that if more security is required the IOMMU hardware
can check those ATS TLPs as well, spoiling the ATS advantage. Note this is
PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations
follow another path, with apparently no further control to enforce that only
addresses within the memory ranges allowed per device are used.
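
To be explicit about the kind of control I am missing for CXL.cache, this is
the check I would expect somewhere in the path. It is only a model: the table
layout and the names are my own invention, and I am not claiming any existing
hardware or kernel code does this for CXL.cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device table of host ranges allowed for CXL.cache. */
struct allowed_range {
	uint16_t requester_id;	/* PCIe bus/device/function of the device */
	uint64_t base;		/* host physical address */
	uint64_t len;		/* bytes */
};

/*
 * Return true only if [addr, addr + len) lies fully inside one range
 * configured for @requester_id. No matching entry means no access.
 */
static bool cxl_cache_access_allowed(const struct allowed_range *tbl, size_t n,
				     uint16_t requester_id,
				     uint64_t addr, uint64_t len)
{
	size_t i;

	if (!len || addr + len < addr)	/* empty or wrapping request */
		return false;

	for (i = 0; i < n; i++) {
		const struct allowed_range *r = &tbl[i];

		assert(r->base + r->len >= r->base);
		if (r->requester_id == requester_id &&
		    addr >= r->base && addr + len <= r->base + r->len)
			return true;
	}
	return false;
}
```

Without something equivalent enforced outside the device, the ranges handed
out by the Host are just a convention the device is trusted to follow.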

Because of this apparent lack of security for IOMMU and CXL.cache, a CXL device
should not be used by VMs, or by any other user-space-controlled driver, with
CXL.cache enabled. This seems a really serious limitation, so maybe I'm missing
something here.

Regarding virtualization, assuming the security problems do not exist or will
be solved: while CXL.mem can be supported with an ahead-of-time mapping by the
Host, with CXL.cache this needs to be handled when the related driver asks for
specific memory to access, with the Host then configuring the IOMMU/ATS
tables. This implies the emulation needs a backend, which an ahead-of-time
mapping, as currently proposed for CXL.mem, can avoid.

Finally, if my concerns about the security of CXL.cache with IOMMU are
unfounded, this document should at least describe how this is solved and how
the security is enforced by the hardware, and whether the kernel needs to
handle it specifically (which I really think is the case, at least with IOMMU
changes managed by the CXL core).


Summary
======


Next, the proposed tasks for supporting CXL.cache:

         - CXL core handling per-device CXL.cache enabling based on the CXL Root
           Complex snoop cache state.

         - CXL core implementing a CXL.cache host memory allocation API
           restricting the physical memory a device can access through
           CXL.cache.

         - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io requests.

         - Clarify CXL.cache and security with IOMMU.


