Re: RFC: Kernel CXL cache support (and IOMMU implications)


From: Zhi Wang <zhiw@nvidia.com>
To: Alejandro Lucero Palau <alucerop@amd.com>
Cc: "linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	<iommu@lists.linux.dev>
Subject: Re: RFC: Kernel CXL cache support (and IOMMU implications)
Date: Thu, 21 Nov 2024 00:33:16 +0200
Message-ID: <20241121003316.00001cd3@nvidia.com>
In-Reply-To: <cc2525a6-0f6a-c1c8-83e1-6396661efc8a@amd.com>

On Tue, 19 Nov 2024 16:52:15 +0000
Alejandro Lucero Palau <alucerop@amd.com> wrote:

Thanks so much for the doc. I went through it quickly; my comments are
inline below.

> November, 2024
> 
> Title: CXL Cache support by the kernel
> Author: Alejandro Lucero (alucerop@amd.com)
> 
> Version 0.1
> 
> Introduction
> ========
> 
> After the LPC where I presented the current status of the Type2 CXL.mem
> support patchset, and some ideas about supporting CXL.cache, it is time to
> dig deeper into this second goal, and to discuss the security/reliability
> aspect as well.
> 
> It is also important to try to describe how this is going to work and what
> the kernel needs to know and enforce. Reading the CXL specs with a specific
> use case in mind can easily lead to assuming certain aspects from a
> different perspective than other readers/use cases. To start with, it is
> necessary to differentiate two "CXL cache" functionalities when a Type2
> device is in place:
> 
> 1) A Type2 device caching Host memory.
> 
> 2) The Host caching HDM memory, that is, the memory inside the Type2 CXL
>    device.
> 
> The first option is also what a Type1 device can do, and the kernel support
> needs to manage all those Type1/2 per CXL Root Complex knowing the resources
> limitation, that is the snooping cache size.
> 
> A snoop cache allows the host to track which memory is being used/cached by
> those devices, enforcing the cache coherency. The specs are not clear about
> some important aspects regarding how the host can enforce the proper use of
> this by devices, or even whether the snoop cache needs to do so. Pages 786
> and 787 of the CXL 3.1 spec describe how the system software should deal
> with CXL cache devices, but this is inside a Hot-plug section. I think we
> can assume the Host firmware/BIOS will follow the same approach for enabling
> CXL cache, and the kernel needs to look at those devices with CXL cache
> enabled by the BIOS for properly handling the available space in the snoop
> cache.
> 
> It is also worth mentioning that the CXL.cache protocol can be used in the
> two "CXL cache" functionalities listed above. However, the latest CXL spec
> implies CXL.cache is only used for the first case. Some comments about what
> the specs say regarding the number of devices with a cache for host memory:
>
>          - up to 16 Type1 and/or Type2 devices allowed per VH.
>
> can be easily confused with the limitations of just one CXL Type2 device
> using CXL.cache for enforcing coherency of its HDM. This limitation is
> overcome by forcing the Type2 device to use HDM-DB, which relies on CXL.mem
> instead of CXL.cache for HDM cache coherency.
> 
> While the Host is assumed to be able to access HDM in a Type2 device, and
> to keep data in the host cpu caches, it is the Type2 device's responsibility
> to properly manage cache coherency of its HDM. There is nothing the kernel
> can control here.
> 
> Therefore the interesting part, and what this document tries to cover, is
> the Host memory being cached by Type2 or Type1 devices. While the main goal
> is discussing how the kernel needs to handle this, and to describe how it
> should work when CXL devices are used by the system/Host, some comments are
> made to cover the virtualization case where those CXL devices can
> potentially be used (device passthrough) by guests/VMs. I try to expose the
> current security problems where the IOMMU is used for restricting what a
> guest-controlled CXL.cache device can read/write in Host memory, which I
> think needs to be clarified by hardware vendors.
> 
> 
> Understanding the memory accesses from CXL devices
> ==================================
> 
> For the sake of presenting the case about kernel CXL.cache support, I'll try
> to explain how it works (I should say "how I think it works") and the main
> points to discuss regarding how to implement this support. So, do not take
> the next explanation as the definitive answer or guide, and if you think
> there are errors or maybe too much generalization at some points, please
> help by fixing them or adding further details. Also, consider some parts as
> just me thinking out loud, which may help other people (or confuse them!).
> 
> The CXL.cache protocol allows devices to be part of the coherency ring of
> the system.
> 
> Let's start with a Type2 device reading from a specific host memory address.
> The final situation is 64 bytes (a cache line) from host memory copied to
> the device cache, supposedly for being used by the device/accelerator. If
> the data changes, because some host cpu modifies it, the device will be
> signalled by the coherency ring, so the device will know. The important
> point here is the device can be told because the Host knows the device has a
> copy or the only copy of that data/memory. And that is thanks to the snoop
> cache implemented by the CXL Root Complex.
> 
> A device caching host memory can be used as well for writes to host memory
> through the cache coherency ring. A device can not only read host memory and
> keep it, but it can also modify it. The implications of writes versus reads
> are not important for the goal of this document. Writes require the device
> to support more protocol exchange cases, but regarding the snoop cache, it
> is irrelevant.
> 
> There arise obvious questions about how this snoop cache is going to work.
> 
> First, with the simple case of just one device caching Host memory. From the
> specs, the device CXL.cache should not be enabled by the Host if the device
> cache is bigger than the snoop cache. However, what prevents a device from
> doing more memory accesses than what the snoop cache can cover? This can be
> partly explained with some allocation control for CXL.cache, which is
> discussed in the next section. But a "rogue" device could try things like
> this, which, for the case of a single device using the snoop cache and
> without any other concern about security, is probably fine:
> 
>          - With a Type2, the snoop cache will tell the device to release
>            another line, meaning any modified line to be sent back to the
>            Host.
>          - Any performance problem will only have an impact on the device
>            itself.
> 
> Then the case of multiple CXL devices caching Host memory in the same CXL
> Root Complex and therefore same CXL Snoop Cache:
>
> * How can the snoop cache track reads from different devices without one
>    device monopolizing the full space?
> 
>          - enforcing snoop cache slices by software?
>          - allowing specific/limited host ranges by the kernel?
> 

I would like to compare this with the approaches that solve a similar
problem for CPU caches, since the two are similar in essence.

CPU caches suffer from a similar problem: noisy, restless neighbors keep
thrashing the cache, which can cause a performance drop. Nowadays this is
solved by a HW mechanism, cache allocation. On Intel it is called Cache
Allocation Technology (CAT), which is part of Resource Director Technology
(RDT). It can also be used in virtualized environments.

Before SW got support from the HW, many research papers proposed solving
this via page coloring, e.g. allocating VM memory with page-color awareness
so that different VMs use different colors. But I don't think those ideas
ever landed in mainline.

Back to this problem: I think SW will probably have to rely on a HW
mechanism to solve it cleanly, the same as on the CPU side.
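
For readers not familiar with the page coloring idea mentioned above, here
is a minimal sketch of how a page's color can be derived for a physically
indexed, set-associative cache. The struct and function names are
hypothetical, and the geometry is an assumption made for illustration; it
does not describe any particular CPU or any existing kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache geometry; none of these names come from the kernel. */
struct cache_geom {
	size_t size_bytes;   /* total cache capacity */
	unsigned int ways;   /* associativity */
	size_t page_bytes;   /* page size used by the allocator */
};

/* Number of distinct page colors: how many pages fit in one cache way. */
static unsigned int nr_colors(const struct cache_geom *g)
{
	return (unsigned int)(g->size_bytes / g->ways / g->page_bytes);
}

/* Color of the page that contains physical address pa. */
static unsigned int page_color(const struct cache_geom *g, uint64_t pa)
{
	return (unsigned int)((pa / g->page_bytes) % nr_colors(g));
}

/*
 * A VM restricted to the color window [first, first + count) may only be
 * given pages whose color falls inside that window.
 */
static int color_allowed(const struct cache_geom *g, uint64_t pa,
			 unsigned int first, unsigned int count)
{
	unsigned int c = page_color(g, pa);

	return c >= first && c < first + count;
}
```

Two VMs given disjoint color windows never compete for the same cache sets,
which is the isolation those papers were after, at the cost of constraining
the page allocator.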

> AFAIK, there is not any kind of hardware control for avoiding this
> contention. Note that with the proper checking by the BIOS and by the kernel
> (for hotplug or for devices not yet enabled at boot time), the total size of
> the device caches allowed per CXL Root Complex should not be bigger than the
> snoop cache size, and therefore theoretically no contention at all ... if
> the devices do the right thing. From software the only thing we can do is to
> ensure the CXL.cache accesses from a device are within a range with the same
> size as the enabled CXL.cache.
> 

What would be the consequence if we violate this rule?
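
To make sure I read the rule correctly, here is how I would sketch the
admission check described in the quoted paragraph: CXL.cache is enabled for
a device only while the sum of the enabled device cache sizes stays within
what the snoop cache can track. All names are hypothetical and this is only
bookkeeping; it says nothing about what the HW does when a device exceeds
its share.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-Root-Complex bookkeeping; not an existing kernel struct. */
struct snoop_filter {
	uint64_t capacity_bytes;  /* what the snoop cache can track */
	uint64_t committed_bytes; /* sum of the enabled device cache sizes */
};

/* Enable CXL.cache for a device only if its cache still fits. */
static bool snoop_filter_admit(struct snoop_filter *sf, uint64_t dev_cache_bytes)
{
	/* committed_bytes never exceeds capacity_bytes, so this cannot wrap. */
	if (dev_cache_bytes > sf->capacity_bytes - sf->committed_bytes)
		return false;

	sf->committed_bytes += dev_cache_bytes;
	return true;
}

/* Give the space back when the device is removed or CXL.cache is disabled. */
static void snoop_filter_release(struct snoop_filter *sf, uint64_t dev_cache_bytes)
{
	if (dev_cache_bytes > sf->committed_bytes)
		dev_cache_bytes = sf->committed_bytes;

	sf->committed_bytes -= dev_cache_bytes;
}
```

With a 1 MiB snoop cache, two devices with 512 KiB caches are admitted and a
third device of any size is refused until one of them is released.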

> Therefore, some memory allocation API is required for dealing with the
> amount of memory the snoop cache can track, and the host memory a device can
> access. The device needs the physical address to work with, and it is in
> this required translation from virtual to physical addresses where we can
> enforce the restriction. Of course, such an API does already exist, although
> not with the checking we need: the kernel DMA API.
> 
> 
> (Secure) memory allocation  and CXL.cache
> ===========================
> 
> DMAs allow devices to perform read/write operations to system memory without
> any cpu intervention after the (meta)data about how to perform the DMA is
> given to the device. CXL.cache is more than DMA because the system memory
> caches are implicitly involved, but for the sake of handling this by the
> operating system, it is not too different. The important point here is there
> is no restriction about the DMAble memory to be used by a device, but due to
> the snoop cache limitations, this needs to change for CXL: code aware of the
> snoop cache state and what a device requires needs to be consulted for
> properly handling the available space.
> 

As I replied above, I think we probably need a HW mechanism to solve this
problem cleanly. (Note that a shared cache is also a precondition for
side-channel attacks, even when it is only a snoop state cache.) With such
a HW mechanism, allocating space in the snoop state cache might imply a
glue layer for snoop cache management that the different CXL HB vendors
plug into the CXL core.

So when the CXL driver is initialized, the space in the snoop state cache
is allocated. With that solved, SW can still leverage the current Linux
IOMMU/DMA APIs to restrict which memory the device can access
(creating/mapping an IOVA for the DMA memory).
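
As a rough illustration of the glue layer I have in mind, the sketch below
uses a table of vendor hooks that the CXL core calls at driver init. Every
name here (snoop_cache_ops, cxl_hb, cxl_snoop_reserve, the toy_* backend) is
hypothetical; nothing like this exists in the CXL core today, and a real
interface would depend on what the HB vendors actually expose.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor hooks for partitioning the snoop state cache. */
struct snoop_cache_ops {
	uint64_t (*available)(void *priv);
	int (*reserve)(void *priv, unsigned int dev_id, uint64_t bytes);
};

/* Hypothetical host bridge handle holding the vendor hooks. */
struct cxl_hb {
	const struct snoop_cache_ops *ops;
	void *priv;
};

/* Core-side entry point: 0 on success, -1 if the request cannot be met. */
static int cxl_snoop_reserve(struct cxl_hb *hb, unsigned int dev_id,
			     uint64_t bytes)
{
	if (!hb->ops || !hb->ops->available || !hb->ops->reserve)
		return -1; /* no HW partitioning support on this HB */

	if (bytes > hb->ops->available(hb->priv))
		return -1;

	return hb->ops->reserve(hb->priv, dev_id, bytes);
}

/* Toy vendor backend, only here to exercise the hooks. */
struct toy_hb {
	uint64_t total;
	uint64_t used;
};

static uint64_t toy_available(void *priv)
{
	const struct toy_hb *t = priv;

	return t->total - t->used;
}

static int toy_reserve(void *priv, unsigned int dev_id, uint64_t bytes)
{
	struct toy_hb *t = priv;

	(void)dev_id; /* a real backend would program a per-device partition */
	if (bytes > t->total - t->used)
		return -1;

	t->used += bytes;
	return 0;
}

static const struct snoop_cache_ops toy_ops = {
	.available = toy_available,
	.reserve = toy_reserve,
};
```

The point of the indirection is that the core only does the accounting and
policy, while the vendor code decides how a reservation maps onto its HW.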

> Should we use the kernel DMA API for CXL.cache allocations? This API deals
> with memory coherency, which is not needed for the CXL.cache case. However,
> it is connected with the IOMMU functionality, which is required for
> CXL.cache if it is enabled.
> 
> I think the solution should be to implement a CXL.cache allocation API
> inside the CXL core dealing with the snoop cache available space, and to
> connect with IOMMU kernel code when it is enabled.
> 
> A security aspect behind DMAs is that a device has (usually) no restrictions
> for memory access. This is true in a system with no IOMMU hardware, and
> CXL.cache is not different in this case. With an IOMMU it is a different
> game, though.
> 
> First of all, the IOMMU will be in place for CXL.io, which implies legacy
> PCIe TLP packets. A CXL.cache operation can not be handled by the IOMMU
> hardware and the spec states ATS is to be used beforehand, that is, the CXL
> device asking the IOMMU hardware about the physical address to work with,
> and keeping that translation internally. The CXL spec specifies ATS service
> extensions for CXL, and some ATS requests can tell the device some addresses
> are only to be used through CXL.io. This implies some sort of knowledge
> about CXL is required by the IOMMU/ATS hardware, which depends on how the
> per device tables are programmed by the Host. However, AFAIK, this is not
> supported yet by any Linux kernel IOMMU vendor support. Note the usual IOMMU
> device/domain tables will/can be used for normal DMA transfers, so IOMMU
> configuration, both in the Host and by the HW, needs to know which parts of
> the domain are for DMAs and which are for CXL.cache.
> 
> Assuming this support will be implemented at some point in the future, the
> questions are, when?, and, how safe is it?
> 
> Can a device issue CXL.cache operations using arbitrary physical addresses?
> It seems there are some cases where the hardware can take control of PCIe
> TLP packets with the ATS bit on. For example, if there is a PCIe bridge in
> the path, and with that bridge using a specific redirection table based on
> configured ATS per device ranges, any TLP with the ATS bit on will be
> redirected based on such a table, implying no redirection if there is no
> table entry. However, that does not seem to be in place for PCIe Root
> Complex implementations. For example, AMD IOMMU documentation states ATS TLP
> packets are not handled at all, implying trusting the device, and if more
> security is required, the IOMMU hardware can

Are you referring to ATS translated requests here? I don't think ATS itself
was designed with security in mind.

> check those TLP ATS packets as well, spoiling the ATS advantage. Note 

Yes, AMD IOMMU has secure ATS support but, as you said, it is pretty
straightforward: when enabled, it basically just checks every translated
request.
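
Conceptually, "check every translated request" boils down to something like
the sketch below: the already-translated address is validated against the
ranges granted to that device before the access is let through. This is an
illustrative software model with hypothetical names, not how AMD or any
other IOMMU implements it in HW.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One physical range a device has been granted (hypothetical model). */
struct pa_range {
	uint64_t start;
	uint64_t len;
};

/* All ranges granted to one device. */
struct dev_grant {
	const struct pa_range *ranges;
	size_t nr;
};

/*
 * Return true only if the translated request [pa, pa + len) lies fully
 * inside one granted range. The comparisons are ordered so that nothing
 * can overflow.
 */
static bool translated_req_allowed(const struct dev_grant *g, uint64_t pa,
				   uint64_t len)
{
	for (size_t i = 0; i < g->nr; i++) {
		const struct pa_range *r = &g->ranges[i];

		if (len <= r->len && pa >= r->start &&
		    pa - r->start <= r->len - len)
			return true;
	}

	return false;
}
```

Doing this lookup on every request is where the performance cost comes
from, since it gives back part of what ATS was meant to save.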

> this is PCIe, so CXL.io will likely keep the functionality, but CXL.cache
> operations follow another path with apparently no further control to enforce
> that the right addresses within the allowed memory ranges per device are
> used.
> 
> Because of this apparent lack of security for IOMMU and CXL.cache, a CXL
> device should not be used by VMs or any other user space controlled driver
> with CXL.cache being enabled. This seems a really serious limitation, so
> maybe I'm missing something here.
> 

I think that, at least for the CXL path, the IOMMU should have a mechanism
similar to secure ATS, and let the user choose whether to enable it.

In reality, many CSPs design their HW themselves and trust that it won't
misbehave; they may want to enable it only for third-party HW.

In the confidential computing world, secure ATS is mandatory, and the
performance drop is the price of security.

> Regarding virtualization, assuming the security problems do not exist or
> will be solved, while CXL.mem can be supported with an ahead mapping by the
> Host, with CXL.cache this needs to be handled when the related driver asks
> for specific memory to access, and then to configure the IOMMU/ATS tables by
> the Host. This implies the emulation needs a backend, which an ahead
> mapping, as currently proposed for CXL.mem, can avoid.
> 
> Finally, if my concerns about the security of CXL.cache with IOMMU are
> unfounded, at least this document should describe how this is solved and how
> the security is enforced by the hardware, and whether the kernel needs to
> handle it specifically (which I really think is the case, at least with
> IOMMU changes managed by the CXL core).
> 
> 
> Summary
> ======
> 
> 
> Next, the proposed tasks to perform for supporting CXL.cache:
>
>          - CXL core handling per device CXL.cache enabling based on CXL Root
>            Complex snoop cache state.
>
>          - CXL core implementing a CXL.cache host memory allocation
>            restricting the physical memory a device can access through
>            CXL.cache.
>
>          - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io
>            requests.
>
>          - Clarify CXL.cache and security with IOMMU.
> 
> 
>


Thread overview: 4+ messages
2024-11-19 16:52 RFC: Kernel CXL cache support (and IOMMU implications) Alejandro Lucero Palau
2024-11-20 22:33 ` Zhi Wang [this message]
2024-12-13 14:15   ` Alejandro Lucero Palau
2024-12-24 15:05     ` Jonathan Cameron
