* RFC: Kernel CXL cache support (and IOMMU implications)
From: Alejandro Lucero Palau @ 2024-11-19 16:52 UTC
To: linux-cxl@vger.kernel.org, iommu

November, 2024

Title: CXL Cache support by the kernel
Author: Alejandro Lucero (alucerop@amd.com)

Version 0.1

Introduction
============

After the LPC where I presented the current status of the Type2 CXL.mem support patchset, and some ideas about supporting CXL.cache, it is time to dig deeper into this second goal and to discuss the security/reliability aspects as well.

It is also important to try to describe how this is going to work and what the kernel needs to know and enforce. Reading the CXL specs with a specific use case in mind can easily lead to assumptions that other readers, with other use cases, would not share. To start with, it is necessary to differentiate two "CXL cache" functionalities when a Type2 device is in place:

1) A Type2 device caching Host memory.

2) The Host caching HDM memory, that is, the memory inside the Type2 CXL device.

The first option is also what a Type1 device can do, and the kernel support needs to manage all those Type1/2 devices per CXL Root Complex knowing the resource limitation, that is, the snoop cache size.

A snoop cache allows the host to track which memory is being used/cached by those devices, enforcing cache coherency. The specs are not clear about some important aspects regarding how the host can enforce the proper use of this by devices, or even whether the snoop cache needs to do so. Pages 786 and 787 of the CXL 3.1 spec describe how system software should deal with CXL cache devices, but this is inside a Hot-plug section.
I think we can assume the Host firmware/BIOS will follow the same approach for enabling CXL cache, and the kernel needs to look at those devices with CXL cache enabled by the BIOS in order to properly handle the available space in the snoop cache.

It is also worth mentioning that the CXL.cache protocol can be used in the two "CXL cache" functionalities listed above. However, the latest CXL spec implies CXL.cache is only used for the first case. What the specs say regarding the number of devices with a cache for host memory:

 - up to 16 Type1 and/or Type2 devices allowed per VH.

can be easily confused with the limitation of just one CXL Type2 device using CXL.cache for enforcing coherency of its HDM. This limitation is overcome by forcing Type2 devices to use HDM-DB, which relies on CXL.mem instead of CXL.cache for HDM cache coherency.

While the Host is assumed to be able to access HDM in a Type2 device, keeping data in the host CPU caches, it is the Type2 device's responsibility to properly manage cache coherency of its HDM. There is nothing the kernel can control here.

Therefore the interesting part, and what this document tries to cover, is Host memory being cached by Type2 or Type1 devices. While the main goal is to discuss how the kernel needs to handle this, and to describe how it should work when CXL devices are used by the system/Host, some comments are made to cover the virtualization case where those CXL devices can potentially be used (device passthrough) by guests/VMs. I try to expose the current security problems where the IOMMU is used for restricting what a guest-controlled CXL.cache device can read/write in Host memory, which I think needs to be clarified by hardware vendors.

Understanding the memory accesses from CXL devices
==================================================

For the sake of presenting the case for kernel CXL.cache support, I'll try to explain how it works (I should say "how I think it works") and the main points to discuss regarding how to implement this support. So, do not take the next explanation as the definitive answer or guide, and if you think there are errors or too much generalization at some points, please help by fixing them or adding further details. Also, consider some parts as just me thinking out loud, which may help other people (or confuse them!).

The CXL.cache protocol allows devices to be part of the coherency ring of the system.

Let's start with a Type2 device reading from a specific host memory address. The final situation is 64 bytes (a cache line) from host memory copied to the device cache, supposedly to be used by the device/accelerator. If the data changes because some host CPU modifies it, the device will be signalled through the coherency ring, so the device will know. The important point here is that the device can be told because the Host knows the device has a copy, or the only copy, of that data/memory. And that is thanks to the snoop cache implemented by the CXL Root Complex.

A device caching host memory can also write to host memory through the cache coherency ring. A device can not only read host memory and keep it, but also modify it. The implications of writes versus reads are not important for the goal of this document. Writes require the device to support more protocol exchange cases, but regarding the snoop cache, this is irrelevant.

Obvious questions arise about how this snoop cache is going to work.

First, take the simple case of just one device caching Host memory. From the specs, the device CXL.cache should not be enabled by the Host if the device cache is bigger than the snoop cache.
However, what prevents a device from doing more memory accesses than the snoop cache can cover? This can be partly addressed with some allocation control for CXL.cache, which is discussed in the next section. But a "rogue" device could try things like this, which, for the case of a single device using the snoop cache and without any other concern about security, is probably fine:

 - With a Type2, the snoop cache will tell the device to release another line, meaning any modified line is sent back to the Host.
 - Any performance problem will only have an impact on the device itself.

Then there is the case of multiple CXL devices caching Host memory in the same CXL Root Complex and therefore the same CXL snoop cache:

* How can the snoop cache track reads from different devices without one device monopolizing the full space?

 - enforcing snoop cache slices by software?
 - allowing specific/limited host ranges by the kernel?

AFAIK, there is no hardware control for avoiding this contention. Note that with the proper checking by the BIOS and by the kernel (for hotplug, or for devices not yet enabled at boot time), the total size of the device caches allowed per CXL Root Complex should not be bigger than the snoop cache size, and therefore theoretically there is no contention at all ... if the devices do the right thing. From software, the only thing we can do is to ensure the CXL.cache accesses from a device are within a range of the same size as the enabled CXL.cache.

Therefore, some memory allocation API is required for dealing with the amount of memory the snoop cache can track and the host memory a device can access. The device needs the physical address to work with, and it is in this required translation from virtual to physical addresses where we can enforce the restriction. Of course, such an API already exists, although not with the checking we need: the kernel DMA API.
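
A minimal sketch of the accounting described above, written as plain C rather than kernel code. It assumes one snoop cache per CXL Root Complex whose capacity is known in bytes, and it models the two checks: a device is enabled only while the sum of enabled device caches fits in the snoop cache, and the host memory handed to a device for CXL.cache never exceeds its enabled cache size. All names (`snoop_budget`, `cache_window`, and the functions) are hypothetical and exist neither in the kernel nor in the CXL spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one snoop cache shared by every device below a
 * CXL Root Complex. Sizes are in bytes. */
struct snoop_budget {
	uint64_t capacity;	/* what the snoop cache can track */
	uint64_t reserved;	/* sum of the enabled device caches */
};

/* Enable CXL.cache for a device only if its cache still fits. */
static bool snoop_budget_enable(struct snoop_budget *sb, uint64_t dev_cache)
{
	/* reserved never exceeds capacity, so this cannot underflow */
	if (dev_cache > sb->capacity - sb->reserved)
		return false;
	sb->reserved += dev_cache;
	return true;
}

/* Give the space back when the device is disabled or removed. */
static void snoop_budget_disable(struct snoop_budget *sb, uint64_t dev_cache)
{
	assert(dev_cache <= sb->reserved);
	sb->reserved -= dev_cache;
}

/* Hypothetical per-device window: host memory made available for
 * CXL.cache is capped at the enabled device cache size. */
struct cache_window {
	uint64_t limit;		/* enabled device cache size */
	uint64_t mapped;	/* host memory already handed out */
};

/* Account for a new allocation of len bytes, refusing it if the
 * window would grow past the limit. */
static bool cache_window_map(struct cache_window *w, uint64_t len)
{
	if (len > w->limit - w->mapped)
		return false;
	w->mapped += len;
	return true;
}
```

With a 4096-byte budget, enabling a 3072-byte device succeeds, a further 2048-byte device is refused, and a 1024-byte device is accepted. This only bounds what software hands out; it does nothing against a device that ignores the addresses it was given.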

(Secure) memory allocation and CXL.cache
========================================

DMAs allow devices to perform read/write operations to system memory without any CPU intervention after the (meta)data about how to perform the DMA is given to the device. CXL.cache is more than DMA because the system memory caches are implicitly involved, but from the point of view of the operating system handling it, it is not too different. The important point here is that there is no restriction on the DMAble memory to be used by a device, but due to the snoop cache limitations, this needs to change for CXL: code aware of the snoop cache state and of what a device requires needs to be consulted in order to properly handle the available space.

Should we use the kernel DMA API for CXL.cache allocations? This API deals with memory coherency, which is not needed for the CXL.cache case. However, it is connected with the IOMMU functionality, which is required for CXL.cache if the IOMMU is enabled.

I think the solution should be to implement a CXL.cache allocation API inside the CXL core dealing with the snoop cache available space, and to connect it with the IOMMU kernel code when the IOMMU is enabled.

A security aspect behind DMAs is that a device has (usually) no restrictions on memory access. This is true in a system with no IOMMU hardware, and CXL.cache is no different in this case. With an IOMMU it is a different game, though.

First of all, the IOMMU will be in place for CXL.io, which implies legacy PCIe TLP packets. A CXL.cache operation can not be handled by the IOMMU hardware, and the spec states that ATS is to be used beforehand, that is, the CXL device asks the IOMMU hardware for the physical address to work with and keeps that translation internally. The CXL spec specifies ATS service extensions for CXL, and some ATS responses can tell the device that some addresses are only to be used through CXL.io. This implies some sort of knowledge about CXL is required by the IOMMU/ATS hardware, which depends on how the per-device tables are programmed by the Host.
However, AFAIK, this is not supported yet by any IOMMU vendor driver in the Linux kernel. Note the usual IOMMU device/domain tables will/can be used for normal DMA transfers, so the IOMMU configuration, both in the Host and in the HW, needs to know which parts of the domain are for DMAs and which are for CXL.cache.

Assuming this support will be implemented at some point in the future, the questions are: when, and how safe is it?

Can a device issue CXL.cache operations using arbitrary physical addresses? It seems there are some cases where the hardware can take control of PCIe TLP packets with the ATS bit on. For example, if there is a PCIe bridge in the path, and that bridge uses a specific redirection table based on configured per-device ATS ranges, any TLP with the ATS bit on will be redirected based on such a table, with no redirection if there is no table entry. However, that does not seem to be in place for PCIe Root Complex implementations. For example, the AMD IOMMU documentation states that ATS TLP packets are not handled at all, which implies trusting the device, and that if more security is required, the IOMMU hardware can check those ATS TLP packets as well, spoiling the ATS advantage. Note this is PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations follow another path, with apparently no further control to enforce that the addresses used are within the memory ranges allowed per device.

Because of this apparent lack of security for IOMMU and CXL.cache, a CXL device should not be used by VMs or by any other user-space-controlled driver with CXL.cache enabled. This seems a really serious limitation, so maybe I'm missing something here.
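
To make the missing control concrete, here is a small software model of the check in question: given the physical ranges a device is allowed to touch, decide whether an already-translated access falls entirely inside one of them. This is only an illustration of the policy, assuming a flat per-device list of ranges; it is not how any IOMMU implements it, and `phys_range` and `access_allowed` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device list entry: a physical range the device is
 * allowed to access. */
struct phys_range {
	uint64_t start;
	uint64_t len;
};

/* Return true only if [addr, addr + len) lies entirely inside one of
 * the allowed ranges. Written to avoid overflow in addr + len. */
static bool access_allowed(const struct phys_range *ranges, size_t n,
			   uint64_t addr, uint64_t len)
{
	assert(ranges != NULL || n == 0);

	if (len == 0)
		return false;

	for (size_t i = 0; i < n; i++) {
		const struct phys_range *r = &ranges[i];
		uint64_t off;

		if (addr < r->start)
			continue;
		off = addr - r->start;
		if (off < r->len && len <= r->len - off)
			return true;
	}
	return false;
}
```

A 64-byte access to the last cache line of an allowed range passes, while the same access shifted by one byte is rejected because it crosses the end of the range. Checking every translated request this way is what costs the ATS advantage mentioned above.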

Regarding virtualization, assuming the security problems do not exist or will be solved: while CXL.mem can be supported with an ahead-of-time mapping by the Host, with CXL.cache the mapping needs to be handled when the related driver asks for specific memory to access, with the Host then configuring the IOMMU/ATS tables. This implies the emulation needs a backend, which an ahead-of-time mapping, as currently proposed for CXL.mem, can avoid.

Finally, if my concerns about the security of CXL.cache with IOMMU are unfounded, at least this document should describe how this is solved and how the security is enforced by the hardware, and whether the kernel needs to handle it specifically (which I really think is the case, at least with IOMMU changes managed by the CXL core).


Summary
=======

The proposed tasks for supporting CXL.cache are:

 - CXL core handling per-device CXL.cache enabling based on the CXL Root Complex snoop cache state.

 - CXL core implementing a CXL.cache host memory allocation restricting the physical memory a device can access through CXL.cache.

 - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io requests.

 - Clarify CXL.cache and security with IOMMU.

* Re: RFC: Kernel CXL cache support (and IOMMU implications)
From: Zhi Wang @ 2024-11-20 22:33 UTC
To: Alejandro Lucero Palau; +Cc: linux-cxl@vger.kernel.org, iommu

On Tue, 19 Nov 2024 16:52:15 +0000
Alejandro Lucero Palau <alucerop@amd.com> wrote:

Thanks so much for the doc. I just quickly went through the doc and here are my comments.

> November, 2024
>
> Tittle: CXL Cache support by the kernel
> Author: Alejandro Lucero (alucerop@amd.com)
>
> Version 0.1
>
> Introduction
> ========
>
> After the LPC where I presented the current status of the Type2 CXL.mem support patchset, and some ideas about supporting CXL.cache, it is time to dig deeper in this second goal, and discussing the security/reliability aspect as well.
>
> It is also important to try to describe how this is going to work and what the kernel needs to know and enforce. Reading the CXL specs when having in mind some specific use case can easily lead to assuming certains aspects with a different perspective from other readers/use cases. To start with, it is necessary to differentiate two "CXL cache" functionalities when a Type2 device is in place:
>
> 1) A Type2 device caching Host memory.
>
> 2) The Host caching HDM memory, that is the memory inside the Type2 CXL device.
>
> The first option is also what a Type1 device can do, and the kernel support needs to manage all those Type1/2 per CXL Root Complex knowing the resources limitation, that is the snooping cache size.
>
> A snoop cache allows the host to track which memory is being used/cached by those devices, enforcing the cache coherency.
The specs are not clear > about some > important aspects regarding how the host can enforce the proper use of > this by > devices or even if the snoop cache needs to do so. At pages 786 and 787 > of CXL > specs 3.1, how the system software should deal with CXL cache devices is > given, > but this is inside a Hot-plug section. I think we can assume the Host > firmware/BIOS will follow same approach for enabling CXL cache, and the > kernel > needs to look at those devices with CXL cache enabled by the BIOS for > properly > handling the available space in the snoop cache. > > It is also worth to mention the CXL.cache protocol can be used in the two > "CXL cache" functionalities listed above. However, the last CXL spec implies > CXL.cache only used for the first case. Some comments about what the > specs say > > regarding number of devices with a cache for host memory: > > - up to 16 Type1 and/or Type2 devices allowed per VH. > > can be easily confused with the limitations of just one CXL Type2 device > using > CXL.cache for enforcing coherency of its HDM. This limitation is > overcome with > forcing Type2 device using HDM-DB, which relies on CXL.mem instead of > CXL.cache > for HDM cache coherency. > > While the Host is assumed to be able to access HDM in a Type2 device, and > keeping data in the host cpu caches, it is the Type2 device > responsibility to > properly manage cache coherency of its HDM. There is nothing the kernel can > control here. > > Therefore the interesting part and what this documents tries to cover is the > Host memory being cached by Type2 or Type1 devices. While the main goal is > discussing how the kernel needs to handle this, and to describe how it > should > work when CXL devices are used by the system/Host, some comments are made to > cover the virtualization case where those CXL devices can potenetially > be used > (device passthrough) by guests/VMs. 
I try to expose the current security > problems where IOMMU is used for restricting what a guest controlled > CXL.cache > device can read/write in Host memory what I think needs to be clarified by > hardware vendors. > > > Understanding the memory accesses from CXL devices > ================================== > > For the sake of presenting the case about kernel CXL.cache support, I'll > try to > explain how it works (I should say "how I think it works") and the main > points > to discuss regarding how to implement this support. So, do not take the next > explanation as the definitive answer or guide, and if you think there > are errors > or maybe too much generalization at some points, please help fixing or > adding > further details. Also, consider some parts as just me thinking out loud, > what > maybe help other people (or confuse them!). > > The CXL.cache protocol allows devices to be part of the coherency ring > of the > system. > > Let's start with a Type2 device reading from a specific host memory > address. The > final situation is 64bytes (cache line) from host memory copied to the > device > cache, supposedly for being used by the device/accelerator. If the data > changes, > because some host cpu modifies it, the device will be signalled by the > coherency > ring, so the device will know. The important point here is the device can be > told because the Host knows the device has a copy or the only copy of that > data/memory. And that is thanks to the snoop cache implemented by the > CXL Root > Complex. > > A device caching host memory can be used as well for writes to host memory > through the cache coherency ring. A device can not just read host memory and > keep it, but it can modified it. The implications of writes versus reads > are not > important for the goal of this document. It requires the device to > support more > protocol exchange cases, but regarding the snoop cache, it is irrelevant. 
>
> There arise obvious questions about how this snoop cache is going to work.
>
> First, with the simple case of just one device caching Host memory. From the specs, the device CXL.cache should not be enabled by the Host if the device cache is bigger than the snoop cache. However, what does preclude a device to do more memory accesses than what the snoop cache can cover? This can be partly explained with some allocation control for CXL.cache what is discussed in the next section. But a "rogue" device could try things like this, what for the case of a single device using the snoop cache and without any other concern about security, is probably fine:
>
> - With a Type2, the snoop cache will tell the device to release another line, meaning any modified line to be sent back to the Host.
> - Any performance problem will only have an impact on the device itself.
>
> Then the case of multiple CXL devices caching Host memory in the same CXL Root Complex and therefore same CXL Snoop Cache:
>
> * How can the snoop cache track reads from different devices without one device monopolizing the full space?
>
> - enforcing snoop cache slices by software?
> - allowing specific/limited host ranges by the kernel?
>

I would like to compare this with the approaches that solve the similar problem for the CPU cache, since the two might be similar in essence.

The CPU cache suffered from a similar problem: a noisy, restless neighbor that keeps poking the cache can cause a performance drop. Nowadays this is solved by a HW mechanism, cache allocation. For Intel, it is called Cache Allocation Technology (CAT), which is a subset of Resource Director Technology (RDT). It can also be used in the virtualization world.

Before SW got this support from the HW, many research papers talked about solving it via page coloring, e.g. allocating VM memory with page color awareness for different VMs. But I don't think those ideas eventually landed in mainline.

Back to this problem, I think SW is probably going to rely on a HW mechanism to solve it nicely, the same as on the CPU side.

> AFAIK, there is not any kind of hardware control for avoiding this contention.
> Note that with the proper checking by the BIOS and by the kernel (for hotplug or those not enabled devices yet during boot time), the size of total device caches allowed per CXL Root Complex should not be bigger than the snoop cache size, and therefore theoretically no contention at all ... if the devices do the right thing. From software the only thing we can do is to ensure the CXL.cache accesses from a device are within a range with same size than the enabled CXL.cache.
>

What would be the consequence if we violate this rule?

> Therefore, some memory allocation API is required for dealing with the amount of memory the snoop cache can track, and the host memory a device can access to.
> The device needs the physical address to work with, and it is in this required translation from virtual to physical addresses where we can enforce the restriction. Of course, such an API does already exist, although not with the checking we need: the kernel DMA API.
>
>
> (Secure) memory allocation and CXL.cache
> ===========================
>
> DMAs allow devices to perform read/write operations to system memory without any cpu intervention after the (meta)data about how to perform the DMA is given to the device. CXL.cache is more than DMA because the system memory caches are implicitly involved but for the sake of handling this by the operating system, not too much different. The important point here is there is no restriction about the DMAble memory to be used by a device, but due to the snoop cache limitations, this needs to change for CXL: code aware of the snoop cache state and what a device requires needs to be consulted for properly handling the available space.
>

As I replied above, I think we probably need a HW mechanism to solve this problem nicely. (Sharing a cache is also a precondition for side-channel attacks, even if here it is a snoop state cache.)

With the HW mechanism, allocating space in the snoop state cache might imply a glue layer of snoop cache management for different CXL HB vendors to plug into the CXL core. So when the CXL driver is initialized, the space in the snoop state cache is allocated.

With that solved, for restricting the device's access to memory (creating/mapping an IOVA for the DMA memory), SW can still leverage the current Linux IOMMU/DMA APIs.

> Should we use the kernel DMA API for CXL.cache allocations? This API deals with memory coherency what is not needed for the CXL.cache case. However, it is connected with the IOMMU functionality what is required for CXL.cache if it is enabled.
>
> I think the solution should be to implement a CXL.cache allocation API inside the CXL core dealing with the snoop cache available space, and to connect with IOMMU kernel code when it is enabled.
>
> A security aspect behind DMAs is a device has (usually) no restrictions for memory access. This is true in a system with no IOMMU hardware, and CXL.cache is not different in this case. With IOMMU is a different game though.
>
> First of all, IOMMU will be in place for CXL.io, what implies legacy TLP PCIe packets. A CXL.cache operation can not be handled by the IOMMU hardware and the spec states ATS to be used beforehand, that is, the CXL device asking the IOMMU hardware about the physical address to work with, and keeping that translation internally. The CXL spec specifies ATS service extensions for CXL, and some ATS requests can tell the device some addresses only to be used through CXL.io.
> This implies some sort of knowledge about CXL is required by the IOMMU/ATS hardware which depends on how the per device tables are programmed by the Host. However, AFAIK, this is not supported yet by any Linux kernel IOMMU vendor support. Note the usual IOMMU device/domain tables will/can be used for normal DMA transfers, so IOMMU configuration, both in the Host and by the HW, needs to know which parts of the domain are for DMAs and which are for CXL.cache.
>
> Assuming this support will be implemented at some point in the future, the questions are, when?, and, how safe is it?
>
> Can a device issue CXL.cache operations using arbitrary physical addresses? It seems there are some cases where the hardware can take control of PCIe TLP packets with the ATS bit on. For example, if there is a PCIe bridge in the path, and with that bridge using a specific redirection table based on configured ATS per device ranges, any TLP with the ATS bit on will be redirected based on such a table, and implying no redirection if no table entry. However, that does not seem to be in place for PCIe Root Complex implementations. For example, AMD IOMMU documentation states ATS TLP packets are not handled at all, implying trusting the device, and if more security is required, the IOMMU hardware can

Are you referring to ATS translated requests here? I think ATS itself was not designed with security in mind.

> check those TLP ATS packets as well, spoiling the ATS advantage. Note

Yes, AMD IOMMU has secure ATS support, but as you said, it is pretty straightforward: it basically just checks every translated request when enabled.

> this is PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations follow another path with apparently no further control to enforce the right addresses within the allowed memory ranges per device are used.
>
> Because this apparently lack of security for IOMMU and CXL.cache, this implies a CXL device should not be used by VMs or any other user space controlled driver with CXL.cache being enabled. This seems a really serious limitation, so maybe I'm missing something here.
>

I think that, at least for the CXL path, the IOMMU should have a mechanism similar to secure ATS, and let the user choose whether to enable it. In reality, many CSPs design the HW themselves and trust that their HW won't misbehave, so they may want to enable it only for 3rd-party HW. In the confidential computing world, secure ATS is mandatory, and the performance drop is the price of security.

> Regarding virtualization, assuming the security problems do not exist or will be solved, while CXL.mem can be supported with an ahead mapping by the Host, with CXL.cache this needs to be handled when the related driver asks for specific memory to access, and then to configure the IOMMU/ATS tables by the Host. This implies the emulation needs a backend, what an ahead mapping, as currently proposed for CXL.mem can avoid.
>
> Finally, if my concerns about the security of CXL.cache with IOMMU are unfounded, at least this document should describe how is this solved and the security enforced by the hardware, and if the kernel requires to handle it specifically (what I really think is the case, at least with IOMMU changes managed by the CXL core).
>
>
> Summary
> ======
>
> Next the proposed tasks to perform for supporting CXL.cache:
>
> - CXL core handling per device CXL.cache enabling based on CXL Root Complex snoop cache state.
>
> - CXL core implementing a CXL.cache host memory allocation restricting the physical memory a a device can access to through CXL.cache.
>
> - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io requests.
>
> - Clarify CXL.cache and security with IOMMU.
>

* Re: RFC: Kernel CXL cache support (and IOMMU implications)
From: Alejandro Lucero Palau @ 2024-12-13 14:15 UTC
To: Zhi Wang; +Cc: linux-cxl@vger.kernel.org, iommu

On 11/20/24 22:33, Zhi Wang wrote:
> On Tue, 19 Nov 2024 16:52:15 +0000
> Alejandro Lucero Palau <alucerop@amd.com> wrote:
>
> Thanks so much for the doc. I just quickly went through the doc and here are my comments.

Hi Zhi,

Thanks for your comments. I did not reply earlier because I was waiting for more feedback, mainly from the IOMMU kernel guys. Maybe CXL support is something most of them have not contemplated, or they are not aware that it may require special handling. I really think the IOMMU/DMA API will need some changes, but this document is for discussing it and maybe proving me wrong.

Let's hope replying to your comments keeps things moving somehow ...

>> November, 2024
>>
>> Tittle: CXL Cache support by the kernel
>> Author: Alejandro Lucero (alucerop@amd.com)
>>
>> Version 0.1
>>
>> Introduction
>> ========
>>
>> After the LPC where I presented the current status of the Type2 CXL.mem support patchset, and some ideas about supporting CXL.cache, it is time to dig deeper in this second goal, and discussing the security/reliability aspect as well.
>>
>> It is also important to try to describe how this is going to work and what the kernel needs to know and enforce. Reading the CXL specs when having in mind some specific use case can easily lead to assuming certains aspects with a different perspective from other readers/use cases. To start with, it is necessary to differentiate two "CXL cache" functionalities when a Type2 device is in place:
>>
>> 1) A Type2 device caching Host memory.
>>
>> 2) The Host caching HDM memory, that is the memory inside the Type2 CXL device.
>> >> The first option is also what a Type1 device can do, and the kernel support >> needs to manage all those Type1/2 per CXL Root Complex knowing the resources >> limitation, that is the snooping cache size. >> >> A snoop cache allows the host to track which memory is being used/cached by >> those devices, enforcing the cache coherency. The specs are not clear >> about some >> important aspects regarding how the host can enforce the proper use of >> this by >> devices or even if the snoop cache needs to do so. At pages 786 and 787 >> of CXL >> specs 3.1, how the system software should deal with CXL cache devices is >> given, >> but this is inside a Hot-plug section. I think we can assume the Host >> firmware/BIOS will follow same approach for enabling CXL cache, and the >> kernel >> needs to look at those devices with CXL cache enabled by the BIOS for >> properly >> handling the available space in the snoop cache. >> >> It is also worth to mention the CXL.cache protocol can be used in the two >> "CXL cache" functionalities listed above. However, the last CXL spec implies >> CXL.cache only used for the first case. Some comments about what the >> specs say >> >> regarding number of devices with a cache for host memory: >> >> - up to 16 Type1 and/or Type2 devices allowed per VH. >> >> can be easily confused with the limitations of just one CXL Type2 device >> using >> CXL.cache for enforcing coherency of its HDM. This limitation is >> overcome with >> forcing Type2 device using HDM-DB, which relies on CXL.mem instead of >> CXL.cache >> for HDM cache coherency. >> >> While the Host is assumed to be able to access HDM in a Type2 device, and >> keeping data in the host cpu caches, it is the Type2 device >> responsibility to >> properly manage cache coherency of its HDM. There is nothing the kernel can >> control here. >> >> Therefore the interesting part and what this documents tries to cover is the >> Host memory being cached by Type2 or Type1 devices. 
While the main goal is >> discussing how the kernel needs to handle this, and to describe how it >> should >> work when CXL devices are used by the system/Host, some comments are made to >> cover the virtualization case where those CXL devices can potenetially >> be used >> (device passthrough) by guests/VMs. I try to expose the current security >> problems where IOMMU is used for restricting what a guest controlled >> CXL.cache >> device can read/write in Host memory what I think needs to be clarified by >> hardware vendors. >> >> >> Understanding the memory accesses from CXL devices >> ================================== >> >> For the sake of presenting the case about kernel CXL.cache support, I'll >> try to >> explain how it works (I should say "how I think it works") and the main >> points >> to discuss regarding how to implement this support. So, do not take the next >> explanation as the definitive answer or guide, and if you think there >> are errors >> or maybe too much generalization at some points, please help fixing or >> adding >> further details. Also, consider some parts as just me thinking out loud, >> what >> maybe help other people (or confuse them!). >> >> The CXL.cache protocol allows devices to be part of the coherency ring >> of the >> system. >> >> Let's start with a Type2 device reading from a specific host memory >> address. The >> final situation is 64bytes (cache line) from host memory copied to the >> device >> cache, supposedly for being used by the device/accelerator. If the data >> changes, >> because some host cpu modifies it, the device will be signalled by the >> coherency >> ring, so the device will know. The important point here is the device can be >> told because the Host knows the device has a copy or the only copy of that >> data/memory. And that is thanks to the snoop cache implemented by the >> CXL Root >> Complex. 
>> >> A device caching host memory can be used as well for writes to host memory >> through the cache coherency ring. A device can not just read host memory and >> keep it, but it can modified it. The implications of writes versus reads >> are not >> important for the goal of this document. It requires the device to >> support more >> protocol exchange cases, but regarding the snoop cache, it is irrelevant. >> >> There arise obvious questions about how this snoop cache is going to work. >> >> First, with the simple case of just one device caching Host memory. From the >> specs, the device CXL.cache should not be enabled by the Host if the device >> cache is bigger than the snoop cache. However, what does preclude a >> device to do >> more memory accesses than what the snoop cache can cover? This can be partly >> explained with some allocation control for CXL.cache what is discussed >> in the >> next section. But a "rogue" device could try things like this, what for >> the case >> of a single device using the snoop cache and without any other concern about >> security, is probably fine: >> >> - With a Type2, the snoop cache will tell the device to release >> another >> line, meaning any modified line to be sent back to the Host. >> - Any performance problem will only have an impact on the >> device itself. >> >> Then the case of multiple CXL devices caching Host memory in the same >> CXL Root >> Complex and therefore same CXL Snoop Cache: >> >> * How can the snoop cache track reads from different devices without one >> device >> monopolizing the full space? >> >> - enforcing snoop cache slices by software? >> - allowing specific/limited host ranges by the kernel? >> > I would like to compare it with the approaches that solves the similar > problem of the CPU cache since they might have similar essence. > > CPU cache suffered from the similar problems that noisy and > restless neighborhood keep poking the cache that might cause performance > drop. 
Nowadays, it is solved by the HW mechanism, cache allocation. For > Intel, it is called cache allocation technology(CAT) which is a subset of > Resource Director Technology(RDT). They can be also used in the > virtualization world. > > Before SW gets the support from the HW, many research papers were talking > about solving it via page color. E.g. allocate the VM memory with page > color awareness for different VMs. But I don't think those ideas eventually > land in the mainline. > > Back to this prob, I think probably SW is going to rely on a HW mechanism > to solve this problem nicely and decently, the same as CPU side.

I agree with the need to rely on HW, which is what the following sentence (in the original doc) tried to summarize.

But we need to know how this is done by HW in order to avoid some undesirable situations if we just blindly enable CXL.cache for those devices advertising it, apparently without any problem regarding the snoop cache size.

>> AFAIK, there is not any kind of hardware control for avoiding this >> contention. >> Note that with the proper checking by the BIOS and by the kernel (for >> hotplug or >> those not enabled devices yet during boot time), the size of total >> device caches >> allowed per CXL Root Complex should not be bigger than the snoop cache >> size, and >> therefore theoretically no contention at all ... if the devices do the right >> thing. From software the only thing we can do is to ensure the CXL.cache >> accesses from a device are within a range with same size than the enabled >> CXL.cache. >> > What would be the consequence if we violate this rule?

Contention, or just one device getting less snoop cache coverage, implying requests from the snoop cache to flush cached data before trying to access more data in the host.

With a full snoop cache, a new access to uncached addresses will trigger some action by the snoop cache.
I'm assuming the device issuing that new access will be the one told to flush cached data first to make space, but I may be wrong.

But a rogue device, being the first one to use CXL.cache, could get most of the snoop cache if no other control exists. Does the snoop cache have a per-device control over the number of tracked cachelines allowed? If so, who configures it properly? I would expect the kernel CXL core to be the one, after doing other validation checks. If the CXL specs do not specify how, we can expect different implementations, and the kernel will need to implement a generic frontend layer with per-vendor backends.

>> Therefore, some memory allocation API is required for dealing with the >> amount of >> memory the snoop cache can track, and the host memory a device can >> access to. >> The device needs the physical address to work with, and it is in this >> required >> translation from virtual to physical addresses where we can enforce the >> restriction. Of course, such an API does already exist, although not >> with the >> checking we need: the kernel DMA API. >> >> >> (Secure) memory allocation and CXL.cache >> =========================== >> >> DMAs allow devices to perform read/write operations to system memory >> without any >> cpu intervention after the (meta)data about how to perform the DMA is >> given to >> the device. CXL.cache is more than DMA because the system memory caches are >> implicitly involved but for the sake of handling this by the operating >> system, >> not too much different. The important point here is there is no restriction >> about the DMAble memory to be used by a device, but due to the snoop cache >> limitations, this needs to change for CXL: code aware of the snoop cache >> state >> and what a device requires needs to be consulted for properly handling the >> available space. >> > As what I replied above, I think we probably need a HW mechanism to solve > this problem nicely and decently.
(Thinking sharing cache is > also a pre-condition of side-channel attack, even here is a snoop state > cahce.) With the HW mechanism, allocating the space of snoop state > cache might imply a glue layer of snoop cache management for different > CXL HB vendors to plug into the CXL core. Just what I did mention above :-) Glad to have someone else seeing the problem. > > So when the CXL driver is initialized, the space of the snoop state cache > is allocated. With that is solved, for restricting the device to access the > memory (creating/mapping an IOVA for the DMA memory), SW can still leverage > the current Linux IOMMU/DMA APIs. > This is my main concern. Note the DMA/IOMMU is likely needed for normal device operations, and that will be through CXL.io. Same mapping should then not be shared for CXL.cache, or it can, but with additional per mapping flags and obviously API changes. The implications here are obviously more important if IOMMU is enabled, at least if we take what the specs say about some ATS/IOMMU mapping only to be allowed by CXL.io. Without IOMMU, it turns into the problem of rogue devices monopolizing the snoop cache. >> Should we use the kernel DMA API for CXL.cache allocations? This API >> deals with >> memory coherency what is not needed for the CXL.cache case. However, it is >> connected with the IOMMU functionality what is required for CXL.cache if >> it is >> enabled. >> >> I think the solution should be to implement a CXL.cache allocation API >> inside >> the CXL core dealing with the snoop cache available space, and to >> connect with >> IOMMU kernel code when it is enabled. >> >> A security aspect behind DMAs is a device has (usually) no restrictions for >> memory access. This is true in a system with no IOMMU hardware, and >> CXL.cache >> is not different in this case. With IOMMU is a different game though. >> >> First of all, IOMMU will be in place for CXL.io, what implies legacy TLP >> PCIe >> packets. 
A CXL.cache operation can not be handled by the IOMMU hardware >> and the >> spec states ATS to be used beforehand, that is, the CXL device asking >> the IOMMU >> hardware about the physical address to work with, and keeping that >> translation >> internally. The CXL spec specifies ATS service extensions for CXL, and >> some ATS >> requests can tell the device some addresses only to be used through >> CXL.io. This >> implies some sort of knowledge about CXL is required by the IOMMU/ATS >> hardware >> which depends on how the per device tables are programmed by the Host. >> However, >> AFAIK, this is not supported yet by any Linux kernel IOMMU vendor >> support. Note >> the usual IOMMU device/domain tables will/can be used for normal DMA >> transfers, >> so IOMMU configuration, both in the Host and by the HW, needs to know >> which parts >> of the domain are for DMAs and which are for CXL.cache. >> >> Assuming this support will be implemented at some point in the future, the >> questions are, when?, and, how safe is it? >> >> Can a device issue CXL.cache operations using arbitrary physical >> addresses? It >> seems there are some cases where the hardware can take control of PCIe TLP >> packets with the ATS bit on. For example, if there is a PCIe bridge in >> the path, >> and with that bridge using a specific redirection table based on >> configured ATS >> per device ranges, any TLP with the ATS bit on will be redirected based >> on such >> a table, and implying no redirection if no table entry. However, that >> does not >> seem to be in place for PCIe Root Complex implementations. For example, AMD >> IOMMU documentation states ATS TLP packets are not handled at all, implying >> trusting the device, and if more security is required, the IOMMU >> hardware can > Are you referring to the ATS translated request here? I think ATS itself > doesn't consider the security in its mind. It seems so, but I guess we agree that is not an option for VMs ... 
Without IOMMU you can not have DMAs from passthrough devices, and if CXL.cache dodges the IOMMU checks (and none other security mechanism in place), CXL.cache should not be allowed in virtualization. >> check those TLP ATS packets as well, spoiling the ATS advantage. Note > Yes, AMD IOMMU has the secure ATS support, but as you said, it is pretty > straight-forward, basically just check every translated request when > enabled. > >> this is >> PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations >> follow another path with apparently no further control to enforce the right >> addresses within the allowed memory ranges per device are used. >> >> Because this apparently lack of security for IOMMU and CXL.cache, this >> implies a >> CXL device should not be used by VMs or any other user space controlled >> driver >> with CXL.cache being enabled. This seems a really serious limitation, so >> maybe >> I'm missing something here. >> > I think at least for CXL path, IOMMU should have the similar mechanism like > secure ATS, and let the user to choose if they want it to be enabled or > not. > > In reality, many CSP design the HW by themselves and trust their HW won't > do messy things, they may want to enable it only on the 3rd party HW. > > For confidential computing world, secure ATS is mandatory, and performance > drop is the price of security. We can let the user to choose ... but in the virtualization world the provider does not want the user to choose ... and if CXL.cache is like DMAs without IOMMU, I would say this is a really good reason. >> Regarding virtualization, assuming the security problems do not exist or >> will be >> solved, while CXL.mem can be supported with an ahead mapping by the >> Host, with >> CXL.cache this needs to be handled when the related driver asks for specific >> memory to access, and then to configure the IOMMU/ATS tables by the >> Host. 
This >> implies the emulation needs a backend, what an ahead mapping, as currently >> proposed for CXL.mem can avoid. >> >> Finally, if my concerns about the security of CXL.cache with IOMMU are >> unfounded, at least this document should describe how is this solved and the >> security enforced by the hardware, and if the kernel requires to handle it >> specifically (what I really think is the case, at least with IOMMU changes >> managed by the CXL core). >> >> >> Summary >> ====== >> >> >> Next the proposed tasks to perform for supporting CXL.cache: >> >> - CXL core handling per device CXL.cache enabling based on CXL Root >> Complex snoop cache state. >> >> - CXL core implementing a CXL.cache host memory allocation >> restricting >> the physical memory a device can access to through CXL.cache. >> >> - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io >> requests. >> >> - Clarify CXL.cache and security with IOMMU.
* Re: RFC: Kernel CXL cache support (and IOMMU implications)
From: Jonathan Cameron @ 2024-12-24 15:05 UTC
To: Alejandro Lucero Palau
Cc: Zhi Wang, linux-cxl@vger.kernel.org, iommu, Lukas Wunner

On Fri, 13 Dec 2024 14:15:18 +0000 Alejandro Lucero Palau <alucerop@amd.com> wrote:

> On 11/20/24 22:33, Zhi Wang wrote: > > On Tue, 19 Nov 2024 16:52:15 +0000 > > Alejandro Lucero Palau <alucerop@amd.com> wrote: > > > > Thanks so much for the doc. I just quickly went through the doc and here > > are my comments. > > > Hi Zhi, > > > Thanks for your comments. I did not reply earlier waiting for more > feedback from, mainly, the IOMMU kernel guys. Maybe CXL support is > something most of them neither have contemplated nor aware of (maybe) > requiring special handling. I really think IOMMU/DMA API will need some > change, but this document is for discussing it and maybe proving me wrong. > > > Let's hope replying to your comments keep things moving somehow ...

Sorry it took me so long to get to this! I replied as I read through it, so thoughts may not be totally coherent.

Key points:

1. If you are doing CXL.cache on a device then either your host should be checking that the device is not accessing something it shouldn't, or you should have done the work to ensure it is part of your trusted compute base (TCB).

2. Using restrictions on memory in the IOMMU page tables to avoid thrashing of coherency tracking resources / cache in the host is a non-starter. That puts a bound on the number used, but at the cost of breaking many use cases. Page table coverage != cachelines in device.
   a) Endpoint should be part of the TCB, trusted not to use more than it is told.
   b) Host should not do rubbish QoS, so the burden should be mainly on the badly behaving device.

3. I'm not sure why (for what is discussed here) there is any problem with VM usecases.
The same model used for ATS etc for PCIe VFs should apply just fine here. Anyhow, that's enough muddying the waters for today. Jonathan > > >> November, 2024 > >> > >> Tittle: CXL Cache support by the kernel > >> Author: Alejandro Lucero (alucerop@amd.com) > >> > >> Version 0.1 > >> > >> Introduction > >> ======== > >> > >> After the LPC where I presented the current status of the Type2 CXL.mem > >> support > >> patchset, and some ideas about supporting CXL.cache, it is time to dig > >> deeper in > >> this second goal, and discussing the security/reliability aspect as well. > >> > >> It is also important to try to describe how this is going to work and > >> what the > >> kernel needs to know and enforce. Reading the CXL specs when having in > >> mind some > >> specific use case can easily lead to assuming certains aspects with a > >> different > >> perspective from other readers/use cases. To start with, it is necessary to > >> differentiate two "CXL cache" functionalities when a Type2 device is in > >> place: > >> > >> 1) A Type2 device caching Host memory. > >> > >> 2) The Host caching HDM memory, that is the memory inside the Type2 CXL > >> device. > >> > >> The first option is also what a Type1 device can do, and the kernel support > >> needs to manage all those Type1/2 per CXL Root Complex knowing the resources > >> limitation, that is the snooping cache size. > >> > >> A snoop cache allows the host to track which memory is being used/cached by > >> those devices, enforcing the cache coherency. The specs are not clear > >> about some > >> important aspects regarding how the host can enforce the proper use of > >> this by > >> devices or even if the snoop cache needs to do so. At pages 786 and 787 > >> of CXL > >> specs 3.1, how the system software should deal with CXL cache devices is > >> given, > >> but this is inside a Hot-plug section. 
I think we can assume the Host > >> firmware/BIOS will follow same approach for enabling CXL cache, and the > >> kernel > >> needs to look at those devices with CXL cache enabled by the BIOS for > >> properly > >> handling the available space in the snoop cache. > >> > >> It is also worth to mention the CXL.cache protocol can be used in the two > >> "CXL cache" functionalities listed above. However, the last CXL spec implies > >> CXL.cache only used for the first case. Some comments about what the > >> specs say > >> > >> regarding number of devices with a cache for host memory: > >> > >> - up to 16 Type1 and/or Type2 devices allowed per VH. > >> > >> can be easily confused with the limitations of just one CXL Type2 device > >> using > >> CXL.cache for enforcing coherency of its HDM.

It has been a while since I read the relevant sections, but CXL 3.0 introduced a cache ID that I thought was precisely to allow for multiple CXL.cache agents per VCS (not the HDM-DB stuff, which for this purpose is replacing bias-based coherency). There may well not be any hosts that support that yet though, and it doesn't really matter for the rest of this discussion.

> >> This limitation is > >> overcome with > >> forcing Type2 device using HDM-DB, which relies on CXL.mem instead of > >> CXL.cache > >> for HDM cache coherency. > >> > >> While the Host is assumed to be able to access HDM in a Type2 device, and > >> keeping data in the host cpu caches, it is the Type2 device > >> responsibility to > >> properly manage cache coherency of its HDM. There is nothing the kernel can > >> control here. > >> > >> Therefore the interesting part and what this documents tries to cover is the > >> Host memory being cached by Type2 or Type1 devices.
While the main goal is > >> discussing how the kernel needs to handle this, and to describe how it > >> should > >> work when CXL devices are used by the system/Host, some comments are made to > >> cover the virtualization case where those CXL devices can potenetially > >> be used > >> (device passthrough) by guests/VMs. I try to expose the current security > >> problems where IOMMU is used for restricting what a guest controlled > >> CXL.cache > >> device can read/write in Host memory what I think needs to be clarified by > >> hardware vendors. > >> > >> > >> Understanding the memory accesses from CXL devices > >> ================================== > >> > >> For the sake of presenting the case about kernel CXL.cache support, I'll > >> try to > >> explain how it works (I should say "how I think it works") and the main > >> points > >> to discuss regarding how to implement this support. So, do not take the next > >> explanation as the definitive answer or guide, and if you think there > >> are errors > >> or maybe too much generalization at some points, please help fixing or > >> adding > >> further details. Also, consider some parts as just me thinking out loud, > >> what > >> maybe help other people (or confuse them!). > >> > >> The CXL.cache protocol allows devices to be part of the coherency ring Probably avoid 'ring' in terminology. Just "coherency of the system" is fine I think. > >> of the > >> system. > >> > >> Let's start with a Type2 device reading from a specific host memory > >> address. The > >> final situation is 64bytes (cache line) from host memory copied to the > >> device > >> cache, supposedly for being used by the device/accelerator. If the data > >> changes, > >> because some host cpu modifies it, the device will be signalled by the > >> coherency > >> ring, so the device will know. The important point here is the device can be > >> told because the Host knows the device has a copy or the only copy of that > >> data/memory. 
And that is thanks to the snoop cache implemented by the

Probably refer to "coherency tracking" rather than saying a snoop cache, which is just one way of doing it, and don't specify where it is. It could be in any number of places depending on system design.

> >> CXL Root > >> Complex. > >> > >> A device caching host memory can be used as well for writes to host memory > >> through the cache coherency ring. A device can not just read host memory and > >> keep it, but it can modified it. The implications of writes versus reads > >> are not > >> important for the goal of this document. It requires the device to > >> support more > >> protocol exchange cases, but regarding the snoop cache, it is irrelevant. > >> > >> There arise obvious questions about how this snoop cache is going to work. > >> > >> First, with the simple case of just one device caching Host memory. From the > >> specs, the device CXL.cache should not be enabled by the Host if the device > >> cache is bigger than the snoop cache. However, what does preclude a > >> device to do > >> more memory accesses than what the snoop cache can cover? This can be partly > >> explained with some allocation control for CXL.cache what is discussed > >> in the > >> next section. But a "rogue" device could try things like this, what for > >> the case > >> of a single device using the snoop cache and without any other concern about > >> security, is probably fine:

If you are letting a device into your host coherency and you haven't done a bunch of stuff to ensure it is not rogue, you are on your own. That stuff is the domain of technologies such as attestation and more basic stuff like supply chain management. Having said that, the host can easily identify such a problem, refuse to do anything that would cause it to lose track, and issue an appropriate RAS event (maybe including isolating the device).
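To make the "refuse rather than lose track" idea concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the class and method names are hypothetical, real hosts do this in hardware, and it assumes a device always tells the host when it gives a line back, so holding more lines than it advertised can be treated as misbehaviour.

```python
from collections import defaultdict


class CoherencyTracker:
    """Toy host-side model of tracking host cachelines held by devices.

    Hypothetical names throughout; this is not how any real host works.
    """

    def __init__(self, advertised_lines):
        # device id -> cache size (in cachelines) the device reported
        # when CXL.cache was enabled for it.
        self.advertised = dict(advertised_lines)
        self.held = defaultdict(set)   # device id -> tracked line addresses
        self.ras_events = []           # (device id, address) of refused requests
        self.isolated = set()          # devices cut off after misbehaving

    def request_line(self, dev, addr):
        """Return True if the request is granted and the line is tracked."""
        if dev in self.isolated:
            return False
        if addr in self.held[dev]:
            return True                # already tracked, nothing new to record
        if len(self.held[dev]) >= self.advertised.get(dev, 0):
            # Assumption: the device would hold more lines than it said it
            # could. Refuse instead of losing track, log it, isolate it.
            self.ras_events.append((dev, addr))
            self.isolated.add(dev)
            return False
        self.held[dev].add(addr)
        return True

    def evict_line(self, dev, addr):
        """The device gives a line back to the host."""
        self.held[dev].discard(addr)


tracker = CoherencyTracker({"accel0": 2})
tracker.request_line("accel0", 0x1000)   # granted
tracker.request_line("accel0", 0x1040)   # granted
tracker.request_line("accel0", 0x1080)   # refused: exceeds advertised size
```

Whether isolation is the right reaction is a policy question; the sketch only shows that the host has enough information to notice.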
> >> > >> - With a Type2, the snoop cache will tell the device to release > >> another > >> line, meaning any modified line to be sent back to the Host. > >> - Any performance problem will only have an impact on the > >> device itself. Agreed the whole sizing thing is a performance question not so much a correctness - though a device might not make forwards progress if it can't get enough data to do what it wants to do. > >> > >> Then the case of multiple CXL devices caching Host memory in the same > >> CXL Root > >> Complex and therefore same CXL Snoop Cache: > >> > >> * How can the snoop cache track reads from different devices without one > >> device > >> monopolizing the full space? > >> > >> - enforcing snoop cache slices by software? > >> - allowing specific/limited host ranges by the kernel? To me, that's a hardware problem. Hardware that doesn't do the QoS handling for this is broken. Sure we can quirk that if needed but I'd do it in the first instance by declaring the hardware so broken we only support one device doing CXL.cache for each set of tracking resources. Seems you say that later :) > >> > > I would like to compare it with the approaches that solves the similar > > problem of the CPU cache since they might have similar essence. > > > > CPU cache suffered from the similar problems that noisy and > > restless neighborhood keep poking the cache that might cause performance > > drop. Nowadays, it is solved by the HW mechanism, cache allocation. For > > Intel, it is called cache allocation technology(CAT) which is a subset of > > Resource Director Technology(RDT). They can be also used in the > > virtualization world. Fine in theory in practice not used all that widely, but agreed similar solutions could be applied here. They are a pain to tune though so I'd expect to see better non configurable solutions for QoS first and the ability to tweak only in a few generations time (could be wrong though!) 
> > > > Before SW gets the support from the HW, many research papers were talking > > about solving it via page color. E.g. allocate the VM memory with page > > color awareness for different VMs. But I don't think those ideas eventually > > land in the mainline. > > > > Back to this prob, I think probably SW is going to rely on a HW mechanism > > to solve this problem nicely and decently, the same as CPU side. > > > I agree with the need of relying on HW, what the following sentence (in > the original doc) tried to summarize. > > But we need how this is done by HW for avoiding some undesirable > situations if we just blindly configure CXL.cache to those devices > advertising it and apparently without no problems regarding the snoop > cache size. I'm not convinced we need to do anything in software. This is no worse than head of line blocking on PCIe (bandwidth to host is often less than that if all devices below some switches want to DMA to host memory at the same time). In theory we can tweak demand by messing around with device specific stuff, or tweaking link controls but in practice does anyone do this in a general purpose system? Don't think so. We rely on sane QoS handling via credit allocations etc and the switch doing something sensible. Sure, a particular implementation might not do this, but that to me is a quirk that we need to handle on a case by case basis. > > > >> AFAIK, there is not any kind of hardware control for avoiding this > >> contention. > >> Note that with the proper checking by the BIOS and by the kernel (for > >> hotplug or > >> those not enabled devices yet during boot time), the size of total > >> device caches > >> allowed per CXL Root Complex should not be bigger than the snoop cache > >> size, and > >> therefore theoretically no contention at all ... if the devices do the right > >> thing. 
From software the only thing we can do is to ensure the CXL.cache > >> accesses from a device are within a range with same size than the enabled > >> CXL.cache. > >> > > What would be the consequence if we violate this rule? > > > Contention or just one device getting less snoop cache coverage implying > requests from the snoop cache for flushing cached data before trying to > access more data in the host. > > With a full snoop cache, a new access to uncached addresses will trigger > some action by the snoop cache. I'm assuming it will be the device where > that new access comes from the one receiving orders for first flushing > cached data for making space, but I may be wrong.

That might be a design choice, but I don't see that being true in general. Imagine a very simple QoS policy that aims to balance devices' use of tracking entries. In that case, if only one agent is sending requests for a bit, it will rightly get all the entries. When a new agent wakes up and sends requests, the eviction will be an entry for the original agent. Lots of other policies will lead to that.

> > But a rogue device, and the first using CXL.cache, could get most of the > snoop cache if no other control existing. Has the snoop cache have this > control per device about amount of tracked cachelines allowed? If so, > who is configuring it properly? I would expect the kernel CXL core being > the one after doing other checks for validation. If the CXL specs do not > specify how, we can expect different implementations, and the kernel > will need to implement a generic frontend layer with per vendor backends.

It's a host implementation thing. Out of scope for CXL. I'd expect we might see RDT / MPAM / whatever your favourite arch uses for resource controls for this, but even those are often soft limits, so if only one agent is doing anything it still gets the whole set of tracking entries.
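The balancing policy described above is easy to model. The following Python sketch is purely illustrative: the names are made up, and the victim choice (oldest entry of whichever agent currently holds the most entries) is just one assumed policy among many a host could implement.

```python
from collections import OrderedDict


class BalancedTracker:
    """Toy tracker with a fixed number of entries shared by several agents.

    Hypothetical policy: when full, evict the oldest entry of whichever
    agent currently holds the most entries.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # (agent, addr) -> None, oldest first

    def count(self, agent):
        """Number of tracking entries currently held by agent."""
        return sum(1 for a, _ in self.entries if a == agent)

    def track(self, agent, addr):
        """Track addr for agent; return the evicted (agent, addr) or None."""
        key = (agent, addr)
        if key in self.entries:
            self.entries.move_to_end(key)   # refresh an existing entry
            return None
        victim = None
        if len(self.entries) >= self.capacity:
            agents = sorted({a for a, _ in self.entries})
            heaviest = max(agents, key=self.count)
            victim = next(k for k in self.entries if k[0] == heaviest)
            del self.entries[victim]
        self.entries[key] = None
        return victim


tracker = BalancedTracker(capacity=4)
for i in range(4):
    tracker.track("A", 0x1000 + 64 * i)   # A is alone and gets every entry
victim = tracker.track("B", 0x8000)       # B wakes up: the victim belongs to A
```

With only agent A active it ends up owning all four entries, which matches the soft-limit behaviour. The first request from B then evicts one of A's entries rather than anything belonging to the requester.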
As before though, I'm unconcerned about rogue devices, because if such a device is plugged in it can do a lot more damage than this! (See your later point on lying about translated addresses.)

> > > >> Therefore, some memory allocation API is required for dealing with the > >> amount of > >> memory the snoop cache can track, and the host memory a device can > >> access to. > >> The device needs the physical address to work with, and it is in this > >> required > >> translation from virtual to physical addresses where we can enforce the > >> restriction. Of course, such an API does already exist, although not > >> with the > >> checking we need: the kernel DMA API. > >> > >> > >> (Secure) memory allocation and CXL.cache > >> =========================== > >> > >> DMAs allow devices to perform read/write operations to system memory > >> without any > >> cpu intervention after the (meta)data about how to perform the DMA is > >> given to > >> the device. CXL.cache is more than DMA because the system memory caches are > >> implicitly involved but for the sake of handling this by the operating > >> system, > >> not too much different. The important point here is there is no restriction > >> about the DMAble memory to be used by a device, but due to the snoop cache > >> limitations, this needs to change for CXL: code aware of the snoop cache > >> state > >> and what a device requires needs to be consulted for properly handling the > >> available space.

I'm not sure how this is connected to DMA mappings etc. What we map in the page tables is an upper bound on what might be cached by the device, but it is not the only thing applying that upper bound. The device is making guarantees not to cache more than a certain amount of memory at a time - likely a tiny subset of what is mapped. Imagine a DB accelerator. Those will typically have DMA mappings for many TB of data, but the query engines will only be using a tiny amount of it at any time.
They will evict the cachelines when they are done with reading particular data (usually as part of a pointer chase). Hence they will use only a few cachelines, but the page tables and indeed the address translation cache will cover much more.

> >> > > As what I replied above, I think we probably need a HW mechanism to solve > > this problem nicely and decently. (Thinking sharing cache is > > also a pre-condition of side-channel attack, even here is a snoop state > > cahce.) With the HW mechanism, allocating the space of snoop state > > cache might imply a glue layer of snoop cache management for different > > CXL HB vendors to plug into the CXL core. > > > Just what I did mention above :-) > > > Glad to have someone else seeing the problem.

I have no problem with an RDT type scheme, but that is about performance, not correctness, and is fine tuning on top of telling the devices we have restricted tracking capacity.

> > > > > > So when the CXL driver is initialized, the space of the snoop state cache > > is allocated. With that is solved, for restricting the device to access the > > memory (creating/mapping an IOVA for the DMA memory), SW can still leverage > > the current Linux IOMMU/DMA APIs. > > > > This is my main concern. Note the DMA/IOMMU is likely needed for normal > device operations, and that will be through CXL.io. Same mapping should > then not be shared for CXL.cache, or it can, but with additional per > mapping flags and obviously API changes. The implications here are > obviously more important if IOMMU is enabled, at least if we take what > the specs say about some ATS/IOMMU mapping only to be allowed by > CXL.io. Without IOMMU, it turns into the problem of rogue devices > monopolizing the snoop cache.

If the snoop cache implementation is that bad, go annoy the hardware folk.
+ Rogue devices are not a thing in production systems, well they are but constraining them is not a problem for this layer of the stack (if FPGA or similar then checks for rogue accesses need to belong in the shell around the bit you allow customers to program - if not cloud vendors etc will not plug in your devices) > > > >> Should we use the kernel DMA API for CXL.cache allocations? This API > >> deals with > >> memory coherency what is not needed for the CXL.cache case. However, it is > >> connected with the IOMMU functionality what is required for CXL.cache if > >> it is > >> enabled. > >> > >> I think the solution should be to implement a CXL.cache allocation API > >> inside > >> the CXL core dealing with the snoop cache available space, and to > >> connect with > >> IOMMU kernel code when it is enabled. Just to check, does anyone care about these devices with iommu disabled? In my opinion that should just be blocked on day one. > >> > >> A security aspect behind DMAs is a device has (usually) no restrictions for > >> memory access. This is true in a system with no IOMMU hardware, and > >> CXL.cache > >> is not different in this case. With IOMMU is a different game though. > >> > >> First of all, IOMMU will be in place for CXL.io, what implies legacy TLP > >> PCIe > >> packets. A CXL.cache operation can not be handled by the IOMMU hardware > >> and the > >> spec states ATS to be used beforehand, that is, the CXL device asking > >> the IOMMU > >> hardware about the physical address to work with, and keeping that > >> translation > >> internally. The CXL spec specifies ATS service extensions for CXL, and > >> some ATS > >> requests can tell the device some addresses only to be used through > >> CXL.io. This > >> implies some sort of knowledge about CXL is required by the IOMMU/ATS > >> hardware > >> which depends on how the per device tables are programmed by the Host. I think that is not (only?) 
about security, it is about correctness as some hosts may not support
CXL.cache accesses to some regions of the host address map, e.g. peer PCI
device BARs for which we are doing UIO or similar.

> >> However, AFAIK, this is not supported yet by any Linux kernel IOMMU
> >> vendor support. Note the usual IOMMU device/domain tables will/can be
> >> used for normal DMA transfers, so IOMMU configuration, both in the Host
> >> and by the HW, needs to know which parts of the domain are for DMAs and
> >> which are for CXL.cache.

Potentially yes. Though there are a bunch of ways to do that which don't
necessarily expose them to the OS, they may be characteristics of the
underlying HPA memory map.

> >>
> >> Assuming this support will be implemented at some point in the future,
> >> the questions are, when?, and, how safe is it?
> >>
> >> Can a device issue CXL.cache operations using arbitrary physical
> >> addresses?

Of course, same as a PCIe device doing DMA with translated addresses.
You rely on attestation and in some cases sanity checks in the host (they
are still subject to the Confidential compute type checks on physical
address space permissions for example). This is what all the fun of device
security is about. You must know and trust devices before you let them do
anything at all to host memory.

> >> It seems there are some cases where the hardware can take control of
> >> PCIe TLP packets with the ATS bit on. For example, if there is a PCIe
> >> bridge in the path, and with that bridge using a specific redirection
> >> table based on configured ATS per device ranges, any TLP with the ATS
> >> bit on will be redirected based on such a table, and implying no
> >> redirection if no table entry. However, that does not seem to be in
> >> place for PCIe Root Complex implementations. For example, AMD
For example, AMD > >> IOMMU documentation states ATS TLP packets are not handled at all, implying > >> trusting the device, and if more security is required, the IOMMU > >> hardware can > > Are you referring to the ATS translated request here? I think ATS itself > > doesn't consider the security in its mind. > > > It seems so, but I guess we agree that is not an option for VMs ... Why not? There is a still quite a bit of infrastructure needed for all the iommufd work to land, but it's getting close. That absolutely allows for ATS - up to the host to check it trusts the device before assigning any part of it to the VM. On at least some architectures the translation is the full 2 stage one to host physical address. > Without IOMMU you can not have DMAs from passthrough devices, and if > CXL.cache dodges the IOMMU checks (and none other security mechanism in > place), CXL.cache should not be allowed in virtualization. No difference to ATS for conventional PCIe. Sure you need to do your security checks. That's standard stuff. See what Lukas has been working on for last few years around CMA + the bits of relevant confidential compute that are actually useful for non CoCo usecases. > > > >> check those TLP ATS packets as well, spoiling the ATS advantage. Note > > Yes, AMD IOMMU has the secure ATS support, but as you said, it is pretty > > straight-forward, basically just check every translated request when > > enabled. Interesting approach. I'd not noticed that before. There are other solutions that keep enough tracking data in host to verify translated requests are fine but if you've checked the device, none of that should be needed. > > > >> this is > >> PCIe, so CXL.io will likely keep the functionality, but CXL.cache operations > >> follow another path with apparently no further control to enforce the right > >> addresses within the allowed memory ranges per device are used. 
> >>
> >> Because this apparently lack of security for IOMMU and CXL.cache, this
> >> implies a CXL device should not be used by VMs or any other user space
> >> controlled driver with CXL.cache being enabled. This seems a really
> >> serious limitation, so maybe I'm missing something here.
> >>
> > I think at least for CXL path, IOMMU should have the similar mechanism like
> > secure ATS, and let the user to choose if they want it to be enabled or
> > not.
> >
> > In reality, many CSP design the HW by themselves and trust their HW won't
> > do messy things, they may want to enable it only on the 3rd party HW.

Exactly. They do a lot of paperwork and security audits. These devices
have to be in the trusted computing base to operate.

> >
> > For confidential computing world, secure ATS is mandatory, and performance
> > drop is the price of security.

That's a choice, but it's not the only one for confidential compute and
from my recollection of other CoCo solutions they don't all do this.

> >
> We can let the user to choose ... but in the virtualization world the
> provider does not want the user to choose ... and if CXL.cache is like
> DMAs without IOMMU, I would say this is a really good reason.

Just require appropriate attestation. That's a userspace policy (see
Lukas' series) + the one Dan has for confidential compute.

> >
> >> Regarding virtualization, assuming the security problems do not exist
> >> or will be solved, while CXL.mem can be supported with an ahead mapping
> >> by the Host, with CXL.cache this needs to be handled when the related
> >> driver asks for specific memory to access, and then to configure the
> >> IOMMU/ATS tables by the Host. This implies the emulation needs a
> >> backend, what an ahead mapping, as currently proposed for CXL.mem can
> >> avoid.
> >>
> >> Finally, if my concerns about the security of CXL.cache with IOMMU are
> >> unfounded, at least this document should describe how is this solved
> >> and the security enforced by the hardware, and if the kernel requires
> >> to handle it specifically (what I really think is the case, at least
> >> with IOMMU changes managed by the CXL core).

It's an implementation specific question. If you are selling a device
that uses CXL.cache, expect to do a lot of paperwork and security audits +
probably reveal a lot of implementation details to your major customers.

We should expose policy control to host userspace and provide it all the
relevant information. I'm not yet convinced this has anything to do with
DMA mapping or the IOMMU core code. There might be some advantages in
locking down accesses via CXL.io vs CXL.cache for some IOMMU designs but
my understanding is that is not a universal thing at all. If userspace
says accept the device after all the certs etc. are checked then it is
part of the trusted compute base and there will not need to be anything in
the IOMMU page tables that is unique to this.

> >>
> >> Summary
> >> ======
> >>
> >> Next the proposed tasks to perform for supporting CXL.cache:
> >>
> >> - CXL core handling per device CXL.cache enabling based on CXL Root
> >>   Complex snoop cache state.

Agreed.

> >>
> >> - CXL core implementing a CXL.cache host memory allocation restricting
> >>   the physical memory a device can access to through CXL.cache.

No on this one. It's a broken solution to a potential hardware solution.

> >>
> >> - IOMMU being CXL aware and dealing with CXL.cache vs CXL.io requests.

Not required in general. May be required for some host IOMMU
architectures. So we may need some hooks.

> >>
> >> - Clarify CXL.cache and security with IOMMU.

Standard device security flows. Policy for what counts as 'secure' may be
tighter, but it's no different to flows for PCIe devices in general.
Bring them into your TCB.

Jonathan

Thread overview: 4+ messages
2024-11-19 16:52 RFC: Kernel CXL cache support (and IOMMU implications) Alejandro Lucero Palau
2024-11-20 22:33 ` Zhi Wang
2024-12-13 14:15   ` Alejandro Lucero Palau
2024-12-24 15:05     ` Jonathan Cameron