* [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable @ 2025-03-10 10:30 ankita
  2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
  0 siblings, 1 reply; 61+ messages in thread
From: ankita @ 2025-03-10 10:30 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson,
	sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm,
	gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm,
	linux-kernel, linux-arm-kernel

From: Ankit Agrawal <ankita@nvidia.com>

Grace-based platforms such as the Grace Hopper/Blackwell Superchips have
CPU-accessible, cache-coherent GPU memory. The GPU device memory is
essentially DDR memory and retains properties such as cacheability,
unaligned accesses, atomics and handling of executable faults. This
requires the device memory to be mapped as NORMAL in stage-2.

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending
on whether the memory region is added to the kernel. The KVM code is
thus restrictive and prevents device memory that is not added to the
kernel from being marked as cacheable. This patch aims to solve that.

A cacheability check is made if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is
MT_NORMAL, it is considered safe to map cacheable, as the KVM S2 will
have the same Normal memory type as the VMA has in the S1, and KVM has
no additional responsibility for safety.

Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a kvm_pte
to a phys_addr and then to a KVA. The cache management thus relies on
the memory being mapped. Since the GPU device memory is not kernel
mapped, exit when FWB is not supported.
The changes are heavily influenced by the discussions between Catalin
Marinas, David Hildenbrand and Jason Gunthorpe [1] on v2. Many thanks
for their valuable suggestions.

Applied over next-20250305 and tested on the Grace Hopper and Grace
Blackwell platforms by booting up a VM, loading the NVIDIA module [2]
and running nvidia-smi in the VM.

To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories which
include all the requisite kernel [3] and QEMU [4] patches in case one
wants to try.

v2 -> v3
1. Restricted the new cacheability check to VM_PFNMAP based on
   David Hildenbrand's (david@redhat.com) suggestion.
2. Removed the MTE checks based on Jason Gunthorpe's (jgg@nvidia.com)
   observation that it is already done earlier in
   kvm_arch_prepare_memory_region.
3. Dropped the pfn_valid() checks based on suggestions by Catalin
   Marinas (catalin.marinas@arm.com).
4. Removed the code for exec fault handling as it is not needed anymore.

v1 -> v2
1. Removed kvm_is_device_pfn() as a determiner for device-type memory;
   instead using pfn_valid().
2. Added handling for MTE.
3. Minor cleanup.

Link: https://lore.kernel.org/all/20241118131958.4609-1-ankita@nvidia.com/ [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [3]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [4]

Ankit Agrawal (1):
  KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         |  2 +-
 arch/arm64/kvm/mmu.c                 | 43 +++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 2 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
  2025-03-10 10:30 [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable ankita
@ 2025-03-10 10:30 ` ankita
  2025-03-10 11:54   ` Marc Zyngier
  0 siblings, 1 reply; 61+ messages in thread
From: ankita @ 2025-03-10 10:30 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson,
	sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm,
	gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm,
	linux-kernel, linux-arm-kernel

From: Ankit Agrawal <ankita@nvidia.com>

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE based on
pfn_is_map_memory (which tracks whether the device memory is added to
the kernel) and ignores the per-VMA flags that indicate the memory
attributes. The KVM code is thus restrictive and allows only memory
that is added to the kernel to be marked as cacheable.

The device memory such as on the Grace Hopper/Blackwell systems is
interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.

Given that the GPU device memory is not added to the kernel (but is
rather VMA mapped through remap_pfn_range() in the nvgrace-gpu module,
which sets VM_PFNMAP), pfn_is_map_memory() is false and thus KVM
prevents such memory from being mapped Normal cacheable. This patch
aims to solve that use case.

A cacheability check is made if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is
MT_NORMAL, it is safe to map cacheable, as the KVM S2 will have the
same Normal memory type as the VMA has in the S1 and KVM has no
additional responsibility for safety. Checking pgprot as NORMAL is
thus a KVM sanity check.

Introduce a new variable cacheable_devmem to indicate a safely
cacheable mapping. Do not set the device variable when cacheable_devmem
is true. This essentially has the effect of setting the stage-2 mapping
as NORMAL through kvm_pgtable_stage2_map().

Add a check for COW VM_PFNMAP and refuse such mappings.

No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.

Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a kvm_pte
to a phys_addr and then to a KVA, see kvm_flush_dcache_to_poc(). The
cache management thus relies on the memory being mapped. So do not
allow !FWB.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         |  2 +-
 arch/arm64/kvm/mmu.c                 | 43 +++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 6b9d274052c7..f21e2fae2bfe 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -507,6 +507,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
  */
 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
 
+/**
+ * stage2_has_fwb() - Determine whether FWB is supported
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*()
+ *
+ * Return: True if FWB is supported.
+ */
+bool stage2_has_fwb(struct kvm_pgtable *pgt);
+
 /**
  * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
  * @vtcr:	Content of the VTCR register.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index df5cc74a7dd0..ee6b98fefd61 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -637,7 +637,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	return vtcr;
 }
 
-static bool stage2_has_fwb(struct kvm_pgtable *pgt)
+bool stage2_has_fwb(struct kvm_pgtable *pgt)
 {
 	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
 		return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1f55b0c7b11d..4b3a03c9d700 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1454,6 +1454,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type(pgprot_t page_prot)
+{
+	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1463,6 +1472,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, mte_allowed;
 	bool device = false, vfio_allow_any_uc = false;
+	bool cacheable_devmem = false;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
@@ -1600,6 +1610,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+	if (vma->vm_flags & VM_PFNMAP) {
+		/* Reject COW VM_PFNMAP */
+		if (is_cow_mapping(vma->vm_flags))
+			return -EINVAL;
+		/*
+		 * If the VM_PFNMAP is set in VMA flags, do a KVM sanity
+		 * check to see if pgprot mapping type is MT_NORMAL and a
+		 * safely cacheable device memory.
+		 */
+		if (mapping_type(vma->vm_page_prot) == MT_NORMAL)
+			cacheable_devmem = true;
+	}
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -1633,8 +1656,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		 *
 		 * In both cases, we don't let transparent_hugepage_adjust()
 		 * change things at the last minute.
+		 *
+		 * Do not set device as the device memory is cacheable. Note
+		 * that such mapping is safe as the KVM S2 will have the same
+		 * Normal memory type as the VMA has in the S1.
 		 */
-		device = true;
+		if (!cacheable_devmem)
+			device = true;
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -1716,6 +1744,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			prot |= KVM_PGTABLE_PROT_X;
 	}
 
+	/*
+	 * When FWB is unsupported KVM needs to do cache flushes
+	 * (via dcache_clean_inval_poc()) of the underlying memory. This is
+	 * only possible if the memory is already mapped into the kernel map.
+	 *
+	 * Outright reject as the cacheable device memory is not present in
+	 * the kernel map and not suitable for cache management.
+	 */
+	if (cacheable_devmem && !stage2_has_fwb(pgt)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita @ 2025-03-10 11:54 ` Marc Zyngier 2025-03-11 3:42 ` Ankit Agrawal 2025-03-18 12:57 ` Jason Gunthorpe 0 siblings, 2 replies; 61+ messages in thread From: Marc Zyngier @ 2025-03-10 11:54 UTC (permalink / raw) To: ankita Cc: jgg, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm, linux-kernel, linux-arm-kernel On Mon, 10 Mar 2025 10:30:08 +0000, <ankita@nvidia.com> wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > Today KVM forces the memory to either NORMAL or DEVICE_nGnRE > based on pfn_is_map_memory (which tracks whether the device memory > is added to the kernel) and ignores the per-VMA flags that indicates the > memory attributes. The KVM code is thus restrictive and allows only for > the memory that is added to the kernel to be marked as cacheable. > > The device memory such as on the Grace Hopper/Blackwell systems > is interchangeable with DDR memory and retains properties such as > cacheability, unaligned accesses, atomics and handling of executable > faults. This requires the device memory to be mapped as NORMAL in > stage-2. > > Given that the GPU device memory is not added to the kernel (but is rather > VMA mapped through remap_pfn_range() in nvgrace-gpu module which sets > VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory > to be mapped Normal cacheable. The patch aims to solve this use case. > > A cachebility check is made if the VM_PFNMAP is set in VMA flags by > consulting the VMA pgprot value. 
If the pgprot mapping type is MT_NORMAL, > it is safe to be mapped cacheable as the KVM S2 will have the same > Normal memory type as the VMA has in the S1 and KVM has no additional > responsibility for safety. Checking pgprot as NORMAL is thus a KVM > sanity check. > > Introduce a new variable cacheable_devmem to indicate a safely > cacheable mapping. Do not set the device variable when cacheable_devmem > is true. This essentially have the effect of setting stage-2 mapping > as NORMAL through kvm_pgtable_stage2_map. > > Add check for COW VM_PFNMAP and refuse such mapping. > > No additional checks for MTE are needed as kvm_arch_prepare_memory_region() > already tests it at an early stage during memslot creation. There would > not even be a fault if the memslot is not created. > > Note when FWB is not enabled, the kernel expects to trivially do > cache management by flushing the memory by linearly converting a > kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). The > cache management thus relies on memory being mapped. So do not allow > !FWB. 
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > Suggested-by: Catalin Marinas <catalin.marinas@arm.com> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Suggested-by: David Hildenbrand <david@redhat.com> > --- > arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++++++- > 3 files changed, 51 insertions(+), 2 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h > index 6b9d274052c7..f21e2fae2bfe 100644 > --- a/arch/arm64/include/asm/kvm_pgtable.h > +++ b/arch/arm64/include/asm/kvm_pgtable.h > @@ -507,6 +507,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size); > */ > u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift); > > +/** > + * stage2_has_fwb() - Determine whether FWB is supported > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*() > + * > + * Return: True if FWB is supported. > + */ > +bool stage2_has_fwb(struct kvm_pgtable *pgt); > + > /** > * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD > * @vtcr: Content of the VTCR register. > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index df5cc74a7dd0..ee6b98fefd61 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -637,7 +637,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift) > return vtcr; > } > > -static bool stage2_has_fwb(struct kvm_pgtable *pgt) > +bool stage2_has_fwb(struct kvm_pgtable *pgt) > { > if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) > return false; > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 1f55b0c7b11d..4b3a03c9d700 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1454,6 +1454,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma) > return vma->vm_flags & VM_MTE_ALLOWED; > } > > +/* > + * Determine the memory region cacheability from VMA's pgprot. 
This > + * is used to set the stage 2 PTEs. > + */ > +static unsigned long mapping_type(pgprot_t page_prot) > +{ > + return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot)); > +} > + > static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > struct kvm_s2_trans *nested, > struct kvm_memory_slot *memslot, unsigned long hva, > @@ -1463,6 +1472,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > bool write_fault, writable, force_pte = false; > bool exec_fault, mte_allowed; > bool device = false, vfio_allow_any_uc = false; > + bool cacheable_devmem = false; > unsigned long mmu_seq; > phys_addr_t ipa = fault_ipa; > struct kvm *kvm = vcpu->kvm; > @@ -1600,6 +1610,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; > > + if (vma->vm_flags & VM_PFNMAP) { > + /* Reject COW VM_PFNMAP */ > + if (is_cow_mapping(vma->vm_flags)) > + return -EINVAL; > + /* > + * If the VM_PFNMAP is set in VMA flags, do a KVM sanity > + * check to see if pgprot mapping type is MT_NORMAL and a > + * safely cacheable device memory. > + */ > + if (mapping_type(vma->vm_page_prot) == MT_NORMAL) > + cacheable_devmem = true; > + } > + > /* Don't use the VMA after the unlock -- it may have vanished */ > vma = NULL; > > @@ -1633,8 +1656,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > * > * In both cases, we don't let transparent_hugepage_adjust() > * change things at the last minute. > + * > + * Do not set device as the device memory is cacheable. Note > + * that such mapping is safe as the KVM S2 will have the same > + * Normal memory type as the VMA has in the S1. 
>  	 */
> -		device = true;
> +		if (!cacheable_devmem)
> +			device = true;
>  	} else if (logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write
> @@ -1716,6 +1744,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			prot |= KVM_PGTABLE_PROT_X;
>  	}
>  
> +	/*
> +	 * When FWB is unsupported KVM needs to do cache flushes
> +	 * (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 * only possible if the memory is already mapped into the kernel map.
> +	 *
> +	 * Outright reject as the cacheable device memory is not present in
> +	 * the kernel map and not suitable for cache management.
> +	 */
> +	if (cacheable_devmem && !stage2_has_fwb(pgt)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +

These new error reasons should at least be complemented by an
equivalent check at the point where the memslot is registered. It
may be OK to blindly return an error at fault time (because userspace
has messed with the mapping behind our back), but there should at
least be something telling a well-behaved userspace that there is a
bunch of combinations we're unwilling to support.

Which brings me to the next point: FWB is not discoverable from
userspace. How do you expect a VMM to know what it can or cannot do?

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-10 11:54 ` Marc Zyngier @ 2025-03-11 3:42 ` Ankit Agrawal 2025-03-11 11:18 ` Marc Zyngier 2025-03-18 12:57 ` Jason Gunthorpe 1 sibling, 1 reply; 61+ messages in thread From: Ankit Agrawal @ 2025-03-11 3:42 UTC (permalink / raw) To: Marc Zyngier Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org >> + /* >> + * When FWB is unsupported KVM needs to do cache flushes >> + * (via dcache_clean_inval_poc()) of the underlying memory. This is >> + * only possible if the memory is already mapped into the kernel map. >> + * >> + * Outright reject as the cacheable device memory is not present in >> + * the kernel map and not suitable for cache management. >> + */ >> + if (cacheable_devmem && !stage2_has_fwb(pgt)) { >> + ret = -EINVAL; >> + goto out_unlock; >> + } >> + > > These new error reasons should at least be complemented by an > equivalent check at the point where the memslot is registered. It Understood. I can add such check in kvm_arch_prepare_memory_region(). 
> maybe OK to blindly return an error at fault time (because userspace
> has messed with the mapping behind our back), but there should at
> least be something telling a well behaved userspace that there is a
> bunch of combination we're unwilling to support.

How about WARN_ON() or BUG() for the faulty situation?

> Which brings me to the next point: FWB is not discoverable from
> userspace. How do you expect a VMM to know what it can or cannot do?

Good point. I am not sure if it can. I suppose you are concerned about
an error during fault handling when !FWB, without the VMM having any
clear indication of the cause?

Perhaps we can gracefully fall back to the default device mapping in
such a case? But that would cause the VM to crash as soon as it makes
an access violating DEVICE_nGnRE.

> M.
> 
> -- 
> Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-11 3:42 ` Ankit Agrawal @ 2025-03-11 11:18 ` Marc Zyngier 2025-03-11 12:07 ` Ankit Agrawal 0 siblings, 1 reply; 61+ messages in thread From: Marc Zyngier @ 2025-03-11 11:18 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, 11 Mar 2025 03:42:23 +0000, Ankit Agrawal <ankita@nvidia.com> wrote: > > >> + /* > >> + * When FWB is unsupported KVM needs to do cache flushes > >> + * (via dcache_clean_inval_poc()) of the underlying memory. This is > >> + * only possible if the memory is already mapped into the kernel map. > >> + * > >> + * Outright reject as the cacheable device memory is not present in > >> + * the kernel map and not suitable for cache management. > >> + */ > >> + if (cacheable_devmem && !stage2_has_fwb(pgt)) { > >> + ret = -EINVAL; > >> + goto out_unlock; > >> + } > >> + > > > > These new error reasons should at least be complemented by an > > equivalent check at the point where the memslot is registered. It > > Understood. I can add such check in kvm_arch_prepare_memory_region(). 
> > maybe OK to blindly return an error at fault time (because userspace
> > has messed with the mapping behind our back), but there should at
> > least be something telling a well behaved userspace that there is a
> > bunch of combination we're unwilling to support.
> 
> How about WARN_ON() or BUG() for the faulty situation?

Absolutely not. Do you really want any user to randomly crash the
kernel because they flip a mapping, which they can do anytime they
want?

The way KVM works is that we return to userspace for the VMM to fix
things. Either by emulating something we can't do in the kernel, or by
fixing things so that the kernel can replay the fault and sort it out.
Either way, this requires some form of fault syndrome so that userspace
has a chance of understanding WTF is going on.

> > Which brings me to the next point: FWB is not discoverable from
> > userspace. How do you expect a VMM to know what it can or cannot do?
> 
> Good point. I am not sure if it can. I suppose you are concerned about error
> during fault handling when !FWB without VMM having any clear indications
> of the cause?

No, I'm concerned that a well-established API (populating a memslot)
works in some cases and doesn't work in others without a clear
indication of *why* we have this behaviour.

To me, this indicates that userspace needs to buy into this new
behaviour, and that behaviour needs to be advertised by a capability,
which is in turn conditional on FWB.

> Perhaps we can gracefully fall back to the default device mapping
> in such case? But that would cause VM to crash as soon as it makes some
> access violating DEVICE_nGnRE.

Which would now be a regression...

My take is that this cacheable PFNMAP contraption must only be exposed
to a guest if FWB is available.
We can't prevent someone from doing an mmap() behind our back, but we
can at least:

- tell userspace whether this is supported
- only handle the fault if userspace has bought into this mode
- report the fault to userspace for it to fix things otherwise

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-11 11:18 ` Marc Zyngier @ 2025-03-11 12:07 ` Ankit Agrawal 2025-03-12 8:21 ` Marc Zyngier 0 siblings, 1 reply; 61+ messages in thread From: Ankit Agrawal @ 2025-03-11 12:07 UTC (permalink / raw) To: Marc Zyngier Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Thanks Marc for the feedback. > No, I'm concerned that a well established API (populating a memslot) > works in some case and doesn't work in another without a clear > indication of *why* we have this behaviour. > > To me, this indicates that userspace needs to buy in this new > behaviour, and that behaviour needs to be advertised by a capability, > which is in turn conditional on FWB. Yes, that makes sense. >>> Perhaps we can gracefully fall back to the default device mapping >>> in such case? But that would cause VM to crash as soon as it makes some >>> access violating DEVICE_nGnRE. > > Which would now be a regression... > > My take is that this cacheable PNFMAP contraption must only be exposed > to a guest if FWB is available. 
We can't prevent someone to do an > mmap() behind our back, but we can at least: > > - tell userspace whether this is supported For my education, what is an accepted way to communicate this? Please let me know if there are any relevant examples that you may be aware of. I suppose just checking for FWB (for PFNMAP) and returning some sort of an error on userspace mmap will not be enough of a hint here? > - only handle the fault if userspace has bought in this mode > - report the fault to userspace for it to fix things otherwise > > M. > > -- > Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-11 12:07 ` Ankit Agrawal @ 2025-03-12 8:21 ` Marc Zyngier 2025-03-17 5:55 ` Ankit Agrawal 0 siblings, 1 reply; 61+ messages in thread From: Marc Zyngier @ 2025-03-12 8:21 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, 11 Mar 2025 12:07:20 +0000, Ankit Agrawal <ankita@nvidia.com> wrote: > > Thanks Marc for the feedback. > > > No, I'm concerned that a well established API (populating a memslot) > > works in some case and doesn't work in another without a clear > > indication of *why* we have this behaviour. > > > > To me, this indicates that userspace needs to buy in this new > > behaviour, and that behaviour needs to be advertised by a capability, > > which is in turn conditional on FWB. > > Yes, that makes sense. > > >>> Perhaps we can gracefully fall back to the default device mapping > >>> in such case? But that would cause VM to crash as soon as it makes some > >>> access violating DEVICE_nGnRE. > > > > Which would now be a regression... > > > > My take is that this cacheable PNFMAP contraption must only be exposed > > to a guest if FWB is available. 
> > We can't prevent someone to do an
> > mmap() behind our back, but we can at least:
> > 
> > - tell userspace whether this is supported
> 
> For my education, what is an accepted way to communicate this? Please let
> me know if there are any relevant examples that you may be aware of.

A KVM capability is what is usually needed.

> I suppose just checking for FWB (for PFNMAP) and returning some sort of
> an error on userspace mmap will not be enough of a hint here?

I don't think checking for FWB at mmap() time is correct. mmap()
shouldn't care about FWB at all, because stage-2 is irrelevant to
mmap(). You also want to be able to perform the same mmap() inside an
EL1 guest, which by definition cannot consider FWB.

This must be checked at the point of memslot creation, with an error
returned at that point. Memslots are all about stage-2, so it makes
sense to check it there.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
  2025-03-12  8:21 ` Marc Zyngier
@ 2025-03-17  5:55 ` Ankit Agrawal
  2025-03-17  9:27   ` Marc Zyngier
  0 siblings, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-03-17 5:55 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
	shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
	Krishnakant Jaju, alex.williamson@redhat.com,
	sebastianene@google.com, coltonlewis@google.com,
	kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
	akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
	ddutile@redhat.com, tabba@google.com, qperret@google.com,
	seanjc@google.com, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org

>> For my education, what is an accepted way to communicate this? Please let
>> me know if there are any relevant examples that you may be aware of.
>
> A KVM capability is what is usually needed.

I see. IIUC, this would involve a corresponding QEMU (usermode) change
to fetch the new KVM cap. Then it could fail in case FWB is not
supported, with some additional conditions (so that the currently
supported configs with !FWB won't break in usermode).

The proposed code change is to map in S2 as NORMAL when the VMA flags
have VM_PFNMAP. However, QEMU cannot know whether the driver is mapping
with PFNMAP or not. So how may QEMU decide whether it is okay to fail
for !FWB or not?

> This must be checked at the point of memslot creation, and return an
> error at that point. Memslots are all about stage-2, so it makes sense
> to check it there.

Ack, will add the check.

> 
> M. 
> > -- > Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-17 5:55 ` Ankit Agrawal @ 2025-03-17 9:27 ` Marc Zyngier 2025-03-17 19:54 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Marc Zyngier @ 2025-03-17 9:27 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Mon, 17 Mar 2025 05:55:55 +0000, Ankit Agrawal <ankita@nvidia.com> wrote: > > >> For my education, what is an accepted way to communicate this? Please let > >> me know if there are any relevant examples that you may be aware of. > > > > A KVM capability is what is usually needed. > > I see. If IIUC, this would involve a corresponding Qemu (usermode) change > to fetch the new KVM cap. Then it could fail in case the FWB is not > supported with some additional conditions (so that the currently supported > configs with !FWB won't break on usermode). > > The proposed code change is to map in S2 as NORMAL when vma flags > has VM_PFNMAP. However, Qemu cannot know that driver is mapping > with PFNMAP or not. So how may Qemu decide whether it is okay to > fail for !FWB or not? This is not about FWB as far as userspace is concerned. This is about PFNMAP as non-device memory. 
If the host doesn't have FWB, then the "PFNMAP as non-device memory" capability doesn't exist, and userspace fails early. Userspace must also have some knowledge of what device it obtains the mapping from, and whether that device requires some extra host capability to be assigned to the guest. You can then check whether the VMA associated with the memslot is PFNMAP or not, if the memslot has been enabled for PFNMAP mappings (either globally or on a per-memslot basis, I don't really care). M. -- Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-17 9:27 ` Marc Zyngier @ 2025-03-17 19:54 ` Catalin Marinas 2025-03-18 9:39 ` Marc Zyngier 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-03-17 19:54 UTC (permalink / raw) To: Marc Zyngier Cc: Ankit Agrawal, Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Mon, Mar 17, 2025 at 09:27:52AM +0000, Marc Zyngier wrote: > On Mon, 17 Mar 2025 05:55:55 +0000, > Ankit Agrawal <ankita@nvidia.com> wrote: > > > > >> For my education, what is an accepted way to communicate this? Please let > > >> me know if there are any relevant examples that you may be aware of. > > > > > > A KVM capability is what is usually needed. > > > > I see. If IIUC, this would involve a corresponding Qemu (usermode) change > > to fetch the new KVM cap. Then it could fail in case the FWB is not > > supported with some additional conditions (so that the currently supported > > configs with !FWB won't break on usermode). > > > > The proposed code change is to map in S2 as NORMAL when vma flags > > has VM_PFNMAP. However, Qemu cannot know that driver is mapping > > with PFNMAP or not. So how may Qemu decide whether it is okay to > > fail for !FWB or not? 
> > This is not about FWB as far as userspace is concerned. This is about > PFNMAP as non-device memory. If the host doesn't have FWB, then the > "PFNMAP as non-device memory" capability doesn't exist, and userspace > fails early. > > Userspace must also have some knowledge of what device it obtains the > mapping from, and whether that device requires some extra host > capability to be assigned to the guest. > > You can then check whether the VMA associated with the memslot is > PFNMAP or not, if the memslot has been enabled for PFNMAP mappings > (either globally or on a per-memslot basis, I don't really care). Trying to page this back in, I think there are three stages: 1. A KVM cap that the VMM can use to check for non-device PFNMAP (or rather cacheable PFNMAP since we already support Normal NC). 2. Memslot registration - we need a way for the VMM to require such cacheable PFNMAP and for KVM to check. Current patch relies on (a) the stage 1 vma attributes which I'm not a fan of. An alternative I suggested was (b) a VM_FORCE_CACHEABLE vma flag, on the assumption that the vfio driver knows if it supports cacheable (it's a bit of a stretch trying to make this generic). Yet another option is (c) a KVM_MEM_CACHEABLE flag that the VMM passes at memslot registration. 3. user_mem_abort() - follows the above logic (whatever we decide), maybe with some extra check and WARN in case we got the logic wrong. The problems in (2) are that we need to know that the device supports cacheable mappings and we don't introduce additional issues or end up with FWB on a PFNMAP that does not support cacheable. Without any vma flag like the current VM_ALLOW_ANY_UNCACHED, the next best thing is relying on the stage 1 attributes. But we don't know them at the memslot registration, only later in step (3) after a GUP on the VMM address space. So in (2), when !FWB, we only want to reject VM_PFNMAP slots if we know they are going to be mapped as cacheable. 
So we need this information somehow, either from the vma->vm_flags or slot->flags. -- Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-17 19:54 ` Catalin Marinas @ 2025-03-18 9:39 ` Marc Zyngier 2025-03-18 12:55 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Marc Zyngier @ 2025-03-18 9:39 UTC (permalink / raw) To: Catalin Marinas Cc: Ankit Agrawal, Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Mon, 17 Mar 2025 19:54:25 +0000, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Mon, Mar 17, 2025 at 09:27:52AM +0000, Marc Zyngier wrote: > > On Mon, 17 Mar 2025 05:55:55 +0000, > > Ankit Agrawal <ankita@nvidia.com> wrote: > > > > > > >> For my education, what is an accepted way to communicate this? Please let > > > >> me know if there are any relevant examples that you may be aware of. > > > > > > > > A KVM capability is what is usually needed. > > > > > > I see. If IIUC, this would involve a corresponding Qemu (usermode) change > > > to fetch the new KVM cap. Then it could fail in case the FWB is not > > > supported with some additional conditions (so that the currently supported > > > configs with !FWB won't break on usermode). > > > > > > The proposed code change is to map in S2 as NORMAL when vma flags > > > has VM_PFNMAP. 
However, Qemu cannot know that driver is mapping > > > with PFNMAP or not. So how may Qemu decide whether it is okay to > > > fail for !FWB or not? > > > > This is not about FWB as far as userspace is concerned. This is about > > PFNMAP as non-device memory. If the host doesn't have FWB, then the > > "PFNMAP as non-device memory" capability doesn't exist, and userspace > > fails early. > > > > Userspace must also have some knowledge of what device it obtains the > > mapping from, and whether that device requires some extra host > > capability to be assigned to the guest. > > > > You can then check whether the VMA associated with the memslot is > > PFNMAP or not, if the memslot has been enabled for PFNMAP mappings > > (either globally or on a per-memslot basis, I don't really care). > > Trying to page this back in, I think there are three stages: > > 1. A KVM cap that the VMM can use to check for non-device PFNMAP (or > rather cacheable PFNMAP since we already support Normal NC). > > 2. Memslot registration - we need a way for the VMM to require such > cacheable PFNMAP and for KVM to check. Current patch relies on (a) > the stage 1 vma attributes which I'm not a fan of. An alternative I > suggested was (b) a VM_FORCE_CACHEABLE vma flag, on the assumption > that the vfio driver knows if it supports cacheable (it's a bit of a > stretch trying to make this generic). Yet another option is (c) a > KVM_MEM_CACHEABLE flag that the VMM passes at memslot registration. > > 3. user_mem_abort() - follows the above logic (whatever we decide), > maybe with some extra check and WARN in case we got the logic wrong. > > The problems in (2) are that we need to know that the device supports > cacheable mappings and we don't introduce additional issues or end up > with FWB on a PFNMAP that does not support cacheable. Without any vma > flag like the current VM_ALLOW_ANY_UNCACHED, the next best thing is > relying on the stage 1 attributes. 
But we don't know them at the memslot > registration, only later in step (3) after a GUP on the VMM address > space. > > So in (2), when !FWB, we only want to reject VM_PFNMAP slots if we know > they are going to be mapped as cacheable. So we need this information > somehow, either from the vma->vm_flags or slot->flags. Yup, that's mostly how I think of it. Obtaining a mapping from the xPU driver must result in VM_PFNMAP being set in the VMA. I don't think that's particularly controversial. The memslot must also be created with a new flag ((2c) in the taxonomy above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This flag is only allowed if (1) is valid. This results in the following behaviours: - If the VMM creates the memslot with the cacheable attribute without (1) being advertised, we fail. - If the VMM creates the memslot without the cacheable attribute, we map as NC, as it is today. What this doesn't do is *automatically* decide for the VMM what attributes to use. The VMM must know what it is doing, and only provide the memslot flag when appropriate. Doing otherwise may eat your data and/or take the machine down (cacheable mapping on a device can be great fun). If you want to address this, then "someone" needs to pass some additional VMA flag that KVM can check. Of course, all of this only caters for well behaved userspace, and we need to gracefully handle (3) when the VMM sneaks in a new VMA that has conflicting attributes. For that, we need a reasonable fault reporting interface that allows userspace to correctly handle it. I don't think this is unique to this case, but also covers things like MTE and other funky stuff that relies on the backing memory having some particular "attributes". An alternative could be to require the VMA to be sealed, which would prevent any overlapping mapping. But I only have looked at that for 2 minutes... Thanks, M. -- Without deviation from the norm, progress is not possible. 
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 9:39 ` Marc Zyngier @ 2025-03-18 12:55 ` Jason Gunthorpe 2025-03-18 19:27 ` Catalin Marinas 2025-03-18 19:30 ` Oliver Upton 0 siblings, 2 replies; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-18 12:55 UTC (permalink / raw) To: Marc Zyngier Cc: Catalin Marinas, Ankit Agrawal, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: > The memslot must also be created with a new flag ((2c) in the taxonomy > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This > flag is only allowed if (1) is valid. > > This results in the following behaviours: > > - If the VMM creates the memslot with the cacheable attribute without > (1) being advertised, we fail. > > - If the VMM creates the memslot without the cacheable attribute, we > map as NC, as it is today. Is that OK though? Now we have the MM page tables mapping this memory as cachable but KVM and the guest is accessing it as non-cached. I thought ARM tried hard to avoid creating such mismatches? This is why the pgprot flags were used to drive this, not an opt-in flag. To prevent userspace from forcing a mismatch. 
> What this doesn't do is *automatically* decide for the VMM what > attributes to use. The VMM must know what it is doing, and only > provide the memslot flag when appropriate. Doing otherwise may eat > your data and/or take the machine down (cacheable mapping on a device > can be great fun). Again, this is why we followed the VMA flags. The thing creating the VMA already made this safety determination when it set pgprot cachable. We should not allow KVM to randomly make any PGPROT cachable! > If you want to address this, then "someone" needs > to pass some additional VMA flag that KVM can check. pgprot does this already, we don't need a new flag to determine if the VMA is cachably mapped. Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 12:55 ` Jason Gunthorpe @ 2025-03-18 19:27 ` Catalin Marinas 2025-03-18 19:35 ` David Hildenbrand 2025-03-18 23:17 ` Jason Gunthorpe 2025-03-18 19:30 ` Oliver Upton 1 sibling, 2 replies; 61+ messages in thread From: Catalin Marinas @ 2025-03-18 19:27 UTC (permalink / raw) To: Jason Gunthorpe Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: > On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: > > The memslot must also be created with a new flag ((2c) in the taxonomy > > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This > > flag is only allowed if (1) is valid. > > > > This results in the following behaviours: > > > > - If the VMM creates the memslot with the cacheable attribute without > > (1) being advertised, we fail. > > > > - If the VMM creates the memslot without the cacheable attribute, we > > map as NC, as it is today. > > Is that OK though? > > Now we have the MM page tables mapping this memory as cachable but KVM > and the guest is accessing it as non-cached. I don't think we should allow this. 
> I thought ARM tried hard to avoid creating such mismatches? This is > why the pgprot flags were used to drive this, not an opt-in flag. To > prevent userspace from forcing a mismatch. We have the vma->vm_page_prot when the memslot is added, so we could use this instead of additional KVM flags. If it's Normal Cacheable and the platform does not support FWB, reject it. If the prot bits say cacheable, it means that the driver was ok with such mapping. Some extra checks for !MTE or MTE_PERM. As additional safety, we could check this again in user_mem_abort() in case the driver played with the vm_page_prot field in the meantime (e.g. in the .fault() callback). I'm not particularly keen on using the vm_page_prot but we probably need to do this anyway to avoid aliases as we can't fully trust the VMM. The alternative is a VM_* flag that says "cacheable everywhere" and we avoid the low-level attributes checking. > > What this doesn't do is *automatically* decide for the VMM what > > attributes to use. The VMM must know what it is doing, and only > > provide the memslot flag when appropriate. Doing otherwise may eat > > your data and/or take the machine down (cacheable mapping on a device > > can be great fun). > > Again, this is why we followed the VMA flags. The thing creating the > VMA already made this safety determination when it set pgprot > cachable. We should not allow KVM to randomly make any PGPROT > cachable! Can this be moved to kvm_arch_prepare_memory_region() and maybe an additional check in user_mem_abort()? Thinking some more about a KVM capability that the VMM can check, I'm not sure what it can do with this. The VMM simply maps something from a device and cannot probe the cacheability - that's a property of the device that's not usually exposed to user by the driver. The VMM just passes this vma to KVM. 
As with the Normal NC, we tried to avoid building device knowledge into the VMM (and ended up with VM_ALLOW_ANY_UNCACHED since the VFIO driver did not allow such user mapping and probably wasn't entirely safe either). I assume with the cacheable pfn mapping, the whole range covered by the vma is entirely safe to be mapped as such in user space. -- Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 19:27 ` Catalin Marinas @ 2025-03-18 19:35 ` David Hildenbrand 2025-03-18 19:40 ` Oliver Upton 2025-03-18 23:17 ` Jason Gunthorpe 1 sibling, 1 reply; 61+ messages in thread From: David Hildenbrand @ 2025-03-18 19:35 UTC (permalink / raw) To: Catalin Marinas, Jason Gunthorpe Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On 18.03.25 20:27, Catalin Marinas wrote: > On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: >> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: >>> The memslot must also be created with a new flag ((2c) in the taxonomy >>> above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This >>> flag is only allowed if (1) is valid. >>> >>> This results in the following behaviours: >>> >>> - If the VMM creates the memslot with the cacheable attribute without >>> (1) being advertised, we fail. >>> >>> - If the VMM creates the memslot without the cacheable attribute, we >>> map as NC, as it is today. >> >> Is that OK though? >> >> Now we have the MM page tables mapping this memory as cachable but KVM >> and the guest is accessing it as non-cached. > > I don't think we should allow this. 
> >> I thought ARM tried hard to avoid creating such mismatches? This is >> why the pgprot flags were used to drive this, not an opt-in flag. To >> prevent userspace from forcing a mismatch. > > We have the vma->vm_page_prot when the memslot is added, so we could use > this instead of additional KVM flags. I thought we try to avoid messing with the VMA when adding memslots, because KVM_CAP_SYNC_MMU allows user space to change the VMAs afterwards without changing the memslot? include/uapi/linux/kvm.h:#define KVM_CAP_SYNC_MMU 16 /* Changes to host mmap are reflected in guest */ -- Cheers, David / dhildenb
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 19:35 ` David Hildenbrand @ 2025-03-18 19:40 ` Oliver Upton 2025-03-20 3:30 ` bibo mao 0 siblings, 1 reply; 61+ messages in thread From: Oliver Upton @ 2025-03-18 19:40 UTC (permalink / raw) To: David Hildenbrand Cc: Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 08:35:38PM +0100, David Hildenbrand wrote: > On 18.03.25 20:27, Catalin Marinas wrote: > > On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: > > > On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: > > > > The memslot must also be created with a new flag ((2c) in the taxonomy > > > > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This > > > > flag is only allowed if (1) is valid. > > > > > > > > This results in the following behaviours: > > > > > > > > - If the VMM creates the memslot with the cacheable attribute without > > > > (1) being advertised, we fail. > > > > > > > > - If the VMM creates the memslot without the cacheable attribute, we > > > > map as NC, as it is today. > > > > > > Is that OK though? 
> > > > > > Now we have the MM page tables mapping this memory as cachable but KVM > > > and the guest is accessing it as non-cached. > > > > I don't think we should allow this. > > > > > I thought ARM tried hard to avoid creating such mismatches? This is > > > why the pgprot flags were used to drive this, not an opt-in flag. To > > > prevent userspace from forcing a mismatch. > > > > We have the vma->vm_page_prot when the memslot is added, so we could use > > this instead of additional KVM flags. > > I thought we try to avoid messing with the VMA when adding memslots; because > KVM_CAP_SYNC_MMU allows user space for changing the VMAs afterwards without > changing the memslot? Any checks on the VMA at memslot creation are done out of courtesy to userspace so it 'fails fast'. We repeat checks on the VMA at the time of fault to handle userspace twiddling VMAs behind our back. VM_MTE_ALLOWED is an example of this. Thanks, Oliver
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 19:40 ` Oliver Upton @ 2025-03-20 3:30 ` bibo mao 2025-03-20 7:24 ` bibo mao 0 siblings, 1 reply; 61+ messages in thread From: bibo mao @ 2025-03-20 3:30 UTC (permalink / raw) To: Oliver Upton, David Hildenbrand Cc: Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On 2025/3/19 上午3:40, Oliver Upton wrote: > On Tue, Mar 18, 2025 at 08:35:38PM +0100, David Hildenbrand wrote: >> On 18.03.25 20:27, Catalin Marinas wrote: >>> On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: >>>> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: >>>>> The memslot must also be created with a new flag ((2c) in the taxonomy >>>>> above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This >>>>> flag is only allowed if (1) is valid. >>>>> >>>>> This results in the following behaviours: >>>>> >>>>> - If the VMM creates the memslot with the cacheable attribute without >>>>> (1) being advertised, we fail. >>>>> >>>>> - If the VMM creates the memslot without the cacheable attribute, we >>>>> map as NC, as it is today. >>>> >>>> Is that OK though? 
>>>> >>>> Now we have the MM page tables mapping this memory as cachable but KVM >>>> and the guest is accessing it as non-cached. >>> >>> I don't think we should allow this. >>> >>>> I thought ARM tried hard to avoid creating such mismatches? This is >>>> why the pgprot flags were used to drive this, not an opt-in flag. To >>>> prevent userspace from forcing a mismatch. >>> >>> We have the vma->vm_page_prot when the memslot is added, so we could use >>> this instead of additional KVM flags. >> >> I thought we try to avoid messing with the VMA when adding memslots; because >> KVM_CAP_SYNC_MMU allows user space for changing the VMAs afterwards without >> changing the memslot? > > Any checks on the VMA at memslot creation is done out of courtesy to > userspace so it 'fails fast'. We repeat checks on the VMA at the time of > fault to handle userspace twiddling VMAs behind our back. Yes, I think it is better to add a cacheable attribute to the memslot; it can be checked against the VMA at memslot creation. The cache attribute can also be abstracted as a cacheable/UC/WC type rather than being arch-specific. Regards Bibo Mao > > VM_MTE_ALLOWED is an example of this. > > Thanks, > Oliver >
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-20 3:30 ` bibo mao @ 2025-03-20 7:24 ` bibo mao 0 siblings, 0 replies; 61+ messages in thread From: bibo mao @ 2025-03-20 7:24 UTC (permalink / raw) To: Oliver Upton, David Hildenbrand Cc: Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On 2025/3/20 上午11:30, bibo mao wrote: > > > On 2025/3/19 上午3:40, Oliver Upton wrote: >> On Tue, Mar 18, 2025 at 08:35:38PM +0100, David Hildenbrand wrote: >>> On 18.03.25 20:27, Catalin Marinas wrote: >>>> On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: >>>>> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: >>>>>> The memslot must also be created with a new flag ((2c) in the >>>>>> taxonomy >>>>>> above) that carries the "Please map VM_PFNMAP VMAs as cacheable". >>>>>> This >>>>>> flag is only allowed if (1) is valid. >>>>>> >>>>>> This results in the following behaviours: >>>>>> >>>>>> - If the VMM creates the memslot with the cacheable attribute without >>>>>> (1) being advertised, we fail. >>>>>> >>>>>> - If the VMM creates the memslot without the cacheable attribute, we >>>>>> map as NC, as it is today. >>>>> >>>>> Is that OK though? 
>>>>> >>>>> Now we have the MM page tables mapping this memory as cachable but KVM >>>>> and the guest is accessing it as non-cached. >>>> >>>> I don't think we should allow this. >>>> >>>>> I thought ARM tried hard to avoid creating such mismatches? This is >>>>> why the pgprot flags were used to drive this, not an opt-in flag. To >>>>> prevent userspace from forcing a mismatch. >>>> >>>> We have the vma->vm_page_prot when the memslot is added, so we could >>>> use >>>> this instead of additional KVM flags. >>> >>> I thought we try to avoid messing with the VMA when adding memslots; >>> because >>> KVM_CAP_SYNC_MMU allows user space for changing the VMAs afterwards >>> without >>> changing the memslot? >> >> Any checks on the VMA at memslot creation is done out of courtesy to >> userspace so it 'fails fast'. We repeat checks on the VMA at the time of >> fault to handle userspace twiddling VMAs behind our back. > yes, I think it is better to add cachable attribute in memslot, it can > be checked on the VMA at memslot creation. Also cache attribute can be > abstracted with cachable/uc/wc type rather than detailed arch specified. Sorry, I did not state this clearly. My meaning is to add a cacheable attribute to the memslot. It is acquired from the prot of the VMA at memslot creation, and checked in the S2 page fault fastpath. So it is unnecessary to look up the vma in the S2 page fault fastpath; the memslot alone is enough. A write-combined cache attribute should also be added, since some GPU memory can be mapped with the WC attribute in the vfio-gpu driver. Regards Bibo Mao > > Regards > Bibo Mao >> >> VM_MTE_ALLOWED is an example of this. >> >> Thanks, >> Oliver >> >
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 19:27 ` Catalin Marinas 2025-03-18 19:35 ` David Hildenbrand @ 2025-03-18 23:17 ` Jason Gunthorpe 2025-03-19 18:03 ` Catalin Marinas 1 sibling, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-18 23:17 UTC (permalink / raw) To: Catalin Marinas Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 07:27:27PM +0000, Catalin Marinas wrote: > Thinking some more about a KVM capability that the VMM can check, I'm > not sure what it can do with this. The VMM simply maps something from a > device and cannot probe the cacheability KVM is mirroring the MM's PTEs to the S2's PTEs. You can think about this differently - KVM currently has a little bug where the MM's PTE's can say cachable but KVM will mirror it to a S2 PTE that is forced non-cachable. KVM will not do any cache flushing to make this safe. Fundamentally that discrepancy is what is being addressed here. Cachable PTEs in the MM should be mirrored to cachable PTEs in the S2. That the issue only arises with non-struct page memory is just part of the triggering condition.. 
> I assume with the cacheable pfn mapping, the whole range covered by the > vma is entirely safe to be mapped as such in user space. Yes Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 23:17 ` Jason Gunthorpe @ 2025-03-19 18:03 ` Catalin Marinas 0 siblings, 0 replies; 61+ messages in thread From: Catalin Marinas @ 2025-03-19 18:03 UTC (permalink / raw) To: Jason Gunthorpe Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 08:17:36PM -0300, Jason Gunthorpe wrote: > On Tue, Mar 18, 2025 at 07:27:27PM +0000, Catalin Marinas wrote: > > Thinking some more about a KVM capability that the VMM can check, I'm > > not sure what it can do with this. The VMM simply maps something from a > > device and cannot probe the cacheability > > KVM is mirroring the MM's PTEs to the S2's PTEs. > > You can think about this differently - KVM currently has a little bug > where the MM's PTE's can say cachable but KVM will mirror it to a S2 > PTE that is forced non-cachable. KVM will not do any cache flushing to > make this safe. > > Fundamentally that discrepancy is what is being addressed > here. Cachable PTEs in the MM should be mirrored to cachable PTEs in > the S2. 
If we treat this as a bug (and I agree it's a potential problem if we assign devices to the guest that have cacheable VMM mappings), the first step would be to reject such memory slots and report some error to the VMM. I don't think it's worth attempting to do any cache maintenance to sync the aliases. Subsequently, this can be allowed on FWB-capable hardware. Whether we want to expose such a capability to the VMM, I guess it's harmless but I doubt the VMM would do anything with it since it doesn't know the device mapping attributes anyway. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 12:55 ` Jason Gunthorpe 2025-03-18 19:27 ` Catalin Marinas @ 2025-03-18 19:30 ` Oliver Upton 2025-03-18 23:09 ` Jason Gunthorpe 1 sibling, 1 reply; 61+ messages in thread From: Oliver Upton @ 2025-03-18 19:30 UTC (permalink / raw) To: Jason Gunthorpe Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: > On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: > > > The memslot must also be created with a new flag ((2c) in the taxonomy > > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This > > flag is only allowed if (1) is valid. > > > > This results in the following behaviours: > > > > - If the VMM creates the memslot with the cacheable attribute without > > (1) being advertised, we fail. > > > > - If the VMM creates the memslot without the cacheable attribute, we > > map as NC, as it is today. > > Is that OK though? > > Now we have the MM page tables mapping this memory as cachable but KVM > and the guest is accessing it as non-cached. > > I thought ARM tried hard to avoid creating such mismatches? 
This is > why the pgprot flags were used to drive this, not an opt-in flag. To > prevent userspace from forcing a mismatch. It's far more problematic the other way around, e.g. the host knows that something needs a Device-* attribute and the VM has done something cacheable. The endpoint for that PA could, for example, fall over when lines pulled in by the guest are written back, which of course can't always be traced back to the offending VM. OTOH, if the host knows that a PA is cacheable and the guest does something non-cacheable, you 'just' have to deal with the usual mismatched attributes problem as laid out in the ARM ARM. > > What this doesn't do is *automatically* decide for the VMM what > > attributes to use. The VMM must know what it is doing, and only > > provide the memslot flag when appropriate. Doing otherwise may eat > > your data and/or take the machine down (cacheable mapping on a device > > can be great fun). > > Again, this is why we followed the VMA flags. The thing creating the > VMA already made this safety determination when it set pgprot > cachable. We should not allow KVM to randomly make any PGPROT > cachable! That doesn't seem to be the suggestion. Userspace should be stating intentions on the memslot with the sort of mapping that it wants to create, and a memslot flag to say "I allow cacheable mappings" seems to fit the bill. Then we have: - Memslot creation fails for any PFNMAP slot with the flag set && !FEAT_FWB - Stage-2 faults fail (exit to userspace) if the above conditions are not met - Stage-2 faults serviced w/ a cacheable mapping if the precondition is satisfied and pgprot on the VMA is cacheable - Stage-2 faults serviced w/ a non-cacheable mapping if flag is not set Seems workable + would prevent KVM from being excessively permissive? Thanks, Oliver ^ permalink raw reply [flat|nested] 61+ messages in thread
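[Editorial note] The behaviours Oliver enumerates above can be captured as a single decision function. The following is a self-contained C sketch of that proposal only; the enum values, flag name, and function are illustrative stand-ins invented here, not KVM's actual uAPI or internals:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative outcomes for a PFNMAP memslot -- not KVM's real encodings. */
enum s2_outcome {
	FAULT_EXITS_TO_USER,	/* flag set but FEAT_FWB absent: reject */
	MAP_CACHEABLE,		/* flag set, FWB present, VMA is cacheable */
	MAP_NONCACHEABLE,	/* today's behaviour otherwise */
};

/*
 * Model of the proposed stage-2 fault policy. Memslot creation would
 * already fail for (flag && !FWB); the same condition is re-checked at
 * fault time because userspace can change VMAs behind KVM's back.
 */
static enum s2_outcome s2_fault_policy(bool slot_allows_cacheable,
				       bool have_fwb,
				       bool vma_pgprot_cacheable)
{
	if (slot_allows_cacheable && !have_fwb)
		return FAULT_EXITS_TO_USER;
	if (slot_allows_cacheable && vma_pgprot_cacheable)
		return MAP_CACHEABLE;
	/* Flag clear, or (assumed here) VMA not cacheable: map as today. */
	return MAP_NONCACHEABLE;
}
```

The last case (flag set but VMA not cacheable) is not spelled out in the list above; mapping it non-cacheable is an assumption made for the sketch.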
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 19:30 ` Oliver Upton @ 2025-03-18 23:09 ` Jason Gunthorpe 2025-03-19 7:01 ` Oliver Upton 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-18 23:09 UTC (permalink / raw) To: Oliver Upton Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 12:30:43PM -0700, Oliver Upton wrote: > On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote: > > On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote: > > > > > The memslot must also be created with a new flag ((2c) in the taxonomy > > > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This > > > flag is only allowed if (1) is valid. > > > > > > This results in the following behaviours: > > > > > > - If the VMM creates the memslot with the cacheable attribute without > > > (1) being advertised, we fail. > > > > > > - If the VMM creates the memslot without the cacheable attribute, we > > > map as NC, as it is today. > > > > Is that OK though? > > > > Now we have the MM page tables mapping this memory as cachable but KVM > > and the guest is accessing it as non-cached. > > > > I thought ARM tried hard to avoid creating such mismatches? 
This is > > why the pgprot flags were used to drive this, not an opt-in flag. To > > prevent userspace from forcing a mismatch. > > It's far more problematic the other way around, e.g. the host knows that > > something needs a Device-* attribute and the VM has done something > > cacheable. The endpoint for that PA could, for example, fall over when > > lines pulled in by the guest are written back, which of course can't > > always be traced back to the offending VM. > > > > OTOH, if the host knows that a PA is cacheable and the guest does > > something non-cacheable, you 'just' have to deal with the usual > > mismatched attributes problem as laid out in the ARM ARM. I think the issue is that KVM doesn't do that usual stuff (ie cache flushing) for memory that doesn't have a struct page backing. So nothing in the hypervisor does any cache flushing, and I believe you end up with a situation where the VMM could have zero'd this cachable memory using cachable stores to sanitize it across VMs, and then KVM can put that memory into the VM as uncached and the VM could then access stale non-zeroed data from a prior VM. Yes? This is a security problem. As I understand things, KVM must either do the cache flushing, or must not allow mismatched attributes, as a matter of security. This is why FWB comes into the picture: because KVM cannot do the cache flushing of PFNMAP VMAs, we force the MM and KVM S2 to be cachable and use S2 FWB to prevent the guest from ever making it uncachable. Thus the cache flushes are not required and everything is security safe. So I think the logic we want here in the fault handler is to:

  Get the mm's PTE
  If it is cachable:
    Check if it has a struct page:
      Yes - KVM flushes it and can use a non-FWB path
      No  - KVM either fails to install it, or installs it using FWB
            to force cachability. KVM never allows degrading cachable
            to non-cachable when it can't do flushing.
  Not cachable:
    Install it with Normal-NC as was previously discussed and merged

> Userspace should be stating intentions on the memslot with the sort of > mapping that it wants to create, and a memslot flag to say "I allow > cacheable mappings" seems to fit the bill. I'm not sure about this; I don't see that userspace has any choice. As above, KVM has to follow whatever is in the PTEs; userspace can't ask for something different here. At best you could make non-struct page cachable memory always fail unless the flag is given - but why? It seems sufficient for fast fail to check if the VMA has PFNMAP and pgprot cachable then !FEAT_FWB fails the memslot. There is no real recovery from this, the VMM is doing something that cannot be supported. > - Stage-2 faults serviced w/ a non-cacheable mapping if flag is not > set As above, I think this creates a bug :\ Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
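[Editorial note] Jason's fault-handler decision tree can be written down as a tiny pure function. This is a sketch of the policy being argued for, not actual KVM code (the names are invented; the real arm64 fault path lives in user_mem_abort() in arch/arm64/kvm/mmu.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stage-2 attribute outcomes -- not KVM's real encodings. */
enum s2_attr {
	S2_REJECT,		/* refuse to install a mapping */
	S2_NORMAL_CACHED,	/* cacheable, after KVM performs CMOs */
	S2_NORMAL_CACHED_FWB,	/* cacheable, FWB makes CMOs unnecessary */
	S2_NORMAL_NC,		/* non-cacheable, as previously merged */
};

/* The decision tree from the mail above, as a pure function. */
static enum s2_attr s2_attr_for_pfn(bool s1_pte_cacheable,
				    bool has_struct_page,
				    bool have_fwb)
{
	if (!s1_pte_cacheable)
		return S2_NORMAL_NC;
	if (has_struct_page)
		return S2_NORMAL_CACHED;	/* flushable via the linear map */
	/* No struct page: never degrade a cacheable S1 to non-cacheable S2. */
	return have_fwb ? S2_NORMAL_CACHED_FWB : S2_REJECT;
}
```

The key invariant is the last branch: without a struct page KVM cannot flush, so the only safe options are FWB-forced cacheability or outright rejection.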
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-18 23:09 ` Jason Gunthorpe @ 2025-03-19 7:01 ` Oliver Upton 2025-03-19 17:04 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Oliver Upton @ 2025-03-19 7:01 UTC (permalink / raw) To: Jason Gunthorpe Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Mar 18, 2025 at 08:09:09PM -0300, Jason Gunthorpe wrote: > > It's far more problematic the other way around, e.g. the host knows that > > something needs a Device-* attribute and the VM has done something > > cacheable. The endpoint for that PA could, for example, fall over when > > lines pulled in by the guest are written back, which of course can't > > always be traced back to the offending VM. > > > > OTOH, if the host knows that a PA is cacheable and the guest does > > something non-cacheable, you 'just' have to deal with the usual > > mismatched attributes problem as laid out in the ARM ARM. > > I think the issue is that KVM doesn't do that usual stuff (ie cache > flushing) for memory that doesn't have a struct page backing. Indeed, I clearly paged that out. What I said is how we arrived at the Device-* v. Normal-NC distinction. 
> So nothing in the hypervisor does any cache flushing and I belive you > end up with a situation where the VMM could have zero'd this cachable > memory using cachable stores to sanitize it across VMs and then KVM > can put that memory into the VM as uncached and the VM could then > access stale non-zeroed data from a prior VM. Yes? This is a security > problem. Pedantic, but KVM only cares about cache maintenance in response to the primary MM, not the VMM. After a stage-2 mapping has been established userspace cannot expect KVM to do cache maintenance on its behalf. You have a very good point that KVM is broken for cacheable PFNMAP'd crap since we demote to something non-cacheable, and maybe that deserves fixing first. Hopefully nobody notices that we've taken away the toys... > So I think the logic we want here in the fault handler is to: > Get the mm's PTE > If it is cachable: > Check if it has a struct page: > Yes - KVM flushes it and can use a non-FWB path > No - KVM either fails to install it, or installs it using FWB > to force cachability. KVM never allows degrading cachable > to non-cachable when it can't do flushing. > Not cachable: > Install it with Normal-NC as was previously discussed and merged We still need to test the VMA flag here to select Normal-NC v. Device. > > Userspace should be stating intentions on the memslot with the sort of > > mapping that it wants to create, and a memslot flag to say "I allow > > cacheable mappings" seems to fit the bill. > > I'm not sure about this, I don't see that the userspace has any > choice. As above, KVM has to follow whatever is in the PTEs, the > userspace can't ask for something different here. At best you could > make non-struct page cachable memory always fail unless the flag is > given - but why? > > It seems sufficient for fast fail to check if the VMA has PFNMAP and > pgprot cachable then !FEAT_FWB fails the memslot. 
There is no real > recovery from this, the VMM is doing something that cannot be > supported. I'm less worried about recovery and more focused on userspace being able to understand what happened. Otherwise we may get folks complaining about the ioctl failing "randomly" on certain machines. But we might need to just expose the FWB-ness of the MMU to userspace since it can already encounter mismatched attributes when poking struct page-backed guest memory. Thanks, Oliver ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-19 7:01 ` Oliver Upton @ 2025-03-19 17:04 ` Jason Gunthorpe 2025-03-19 18:11 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-19 17:04 UTC (permalink / raw) To: Oliver Upton Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > You have a very good point that KVM is broken for cacheable PFNMAP'd > crap since we demote to something non-cacheable, and maybe that > deserves fixing first. Hopefully nobody notices that we've taken away > the toys... Fixing it is either faulting all access attempts or mapping it cachable to the S2 (as this series is trying to do).. > We still need to test the VMA flag here to select Normal-NC v. Device. Yes > I'm less worried about recovery and more focused on userspace being > able to understand what happened. Otherwise we may get folks complaining > about the ioctl failing "randomly" on certain machines. Even today something like qemu doesn't know that VFIO has created a cachable mmap, so it would need a bunch more work to feed everything through as proposed here. 
IMHO the semantics should be simple enough: when the memslot is created, determine whether KVM can support it at all. Non-struct-page cacheable memory requires FWB to support it today. Maybe tomorrow KVM will learn how to cache flush non-struct-page memory and then everything can support it. It just seems wrong to be designing uAPI around limitations of the kernel implementation. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-19 17:04 ` Jason Gunthorpe @ 2025-03-19 18:11 ` Catalin Marinas 2025-03-19 19:22 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-03-19 18:11 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oliver Upton, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > crap since we demote to something non-cacheable, and maybe that > > deserves fixing first. Hopefully nobody notices that we've taken away > > the toys... > > Fixing it is either faulting all access attempts or mapping it > cachable to the S2 (as this series is trying to do).. As I replied earlier, it might be worth doing both - fault on !FWB hardware (or rather reject the memslot creation), cacheable S2 otherwise. > > I'm less worried about recovery and more focused on userspace being > > able to understand what happened. Otherwise we may get folks complaining > > about the ioctl failing "randomly" on certain machines. 
> > Even today something like qemu doesn't know that VFIO has created a > cachable mmap, so it would need a bunch more work to feed everything > through as proposed here. I came to the same conclusion, qemu wouldn't be able to pass any KVM_MEMORY_* flag to request cacheable since it doesn't know what attributes it got from VFIO. You need a side-channel communication between the host VFIO driver and KVM. With Ankit's proposal, that's the Stage 1 MMU attributes for the qemu mapping. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
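[Editorial note] The "side channel" Catalin describes is the stage-1 memory-type attribute that the host driver (e.g. VFIO) encoded into vma->vm_page_prot when the VMM mapped the device. A simplified, self-contained sketch of that check follows; the constants are modelled on arm64's current PTE_ATTRINDX encoding (AttrIndx in pte bits [4:2], MT_NORMAL index 0) but the helper and pgprot_t stand-in are invented here for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Modelled on arm64: the MAIR attribute index lives in pte bits [4:2]. */
#define PTE_ATTRINDX_SHIFT	2
#define PTE_ATTRINDX_MASK	(UINT64_C(7) << PTE_ATTRINDX_SHIFT)
#define MT_NORMAL		0	/* Normal WB cacheable, as on arm64 */

typedef uint64_t pgprot_t;	/* stand-in for the kernel's pgprot_t */

/* Is the VMM's stage-1 mapping Normal (cacheable) memory? */
static bool mapping_is_normal(pgprot_t prot)
{
	return ((prot & PTE_ATTRINDX_MASK) >> PTE_ATTRINDX_SHIFT) == MT_NORMAL;
}
```

This is why no new KVM flag is strictly needed: KVM can consult the stage-1 attributes of the userspace mapping at fault time, which is what the patch under discussion does.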
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-19 18:11 ` Catalin Marinas @ 2025-03-19 19:22 ` Jason Gunthorpe 2025-03-19 21:48 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-19 19:22 UTC (permalink / raw) To: Catalin Marinas Cc: Oliver Upton, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > crap since we demote to something non-cacheable, and maybe that > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > the toys... > > > > Fixing it is either faulting all access attempts or mapping it > > cachable to the S2 (as this series is trying to do).. > > As I replied earlier, it might be worth doing both - fault on !FWB > hardware (or rather reject the memslot creation), cacheable S2 > otherwise. I have no objection, Ankit are you able to make a failure patch? Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-19 19:22 ` Jason Gunthorpe @ 2025-03-19 21:48 ` Catalin Marinas 2025-03-26 8:31 ` Ankit Agrawal 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-03-19 21:48 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oliver Upton, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > crap since we demote to something non-cacheable, and maybe that > > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > > the toys... > > > > > > Fixing it is either faulting all access attempts or mapping it > > > cachable to the S2 (as this series is trying to do).. > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > hardware (or rather reject the memslot creation), cacheable S2 > > otherwise. > > I have no objection, Ankit are you able to make a failure patch? 
I'd wait until the KVM maintainers have their say. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-19 21:48 ` Catalin Marinas @ 2025-03-26 8:31 ` Ankit Agrawal 2025-03-26 14:53 ` Sean Christopherson 0 siblings, 1 reply; 61+ messages in thread From: Ankit Agrawal @ 2025-03-26 8:31 UTC (permalink / raw) To: Catalin Marinas, Jason Gunthorpe Cc: Oliver Upton, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > > crap since we demote to something non-cacheable, and maybe that > > > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > > > the toys... > > > > > > > > Fixing it is either faulting all access attempts or mapping it > > > > cachable to the S2 (as this series is trying to do).. > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > > hardware (or rather reject the memslot creation), cacheable S2 > > > otherwise. 
> > > > I have no objection, Ankit are you able to make a failure patch? > > I'd wait until the KVM maintainers have their say. > Maz, Oliver, any thoughts on this? Can we conclude that we should create this failure patch for memslot creation? > -- > Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 8:31 ` Ankit Agrawal @ 2025-03-26 14:53 ` Sean Christopherson 2025-03-26 15:42 ` Marc Zyngier 0 siblings, 1 reply; 61+ messages in thread From: Sean Christopherson @ 2025-03-26 14:53 UTC (permalink / raw) To: Ankit Agrawal Cc: Catalin Marinas, Jason Gunthorpe, Oliver Upton, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 26, 2025, Ankit Agrawal wrote: > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > > > crap since we demote to something non-cacheable, and maybe that > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > > > > the toys... > > > > > > > > > > Fixing it is either faulting all access attempts or mapping it > > > > > cachable to the S2 (as this series is trying to do).. 
> > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > > > hardware (or rather reject the memslot creation), cacheable S2 > > > > otherwise. > > > > > > I have no objection, Ankit are you able to make a failure patch? > > > > I'd wait until the KVM maintainers have their say. > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure > patch in memslot creation? That's not sufficient. As pointed out multiple times in this thread, any checks done at memslot creation are best effort "courtesies" provided to userspace to avoid terminating running VMs when the memory is faulted in. I.e. checking at memslot creation is optional, checking at fault-in/mapping is not. With that in place, I don't see any need for a memslot flag. IIUC, without FWB, cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB, KVM can simply honor the cacheability based on the VMA. Neither of those requires a memslot flag. A KVM capability to enumerate FWB support would be nice though, e.g. so userspace can assert and bail early without ever hitting an ioctl error. If we want to support existing setups that happen to work by dumb luck or careful configuration, then that should probably be an admin decision to support the "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 14:53 ` Sean Christopherson @ 2025-03-26 15:42 ` Marc Zyngier 2025-03-26 16:10 ` Sean Christopherson 0 siblings, 1 reply; 61+ messages in thread From: Marc Zyngier @ 2025-03-26 15:42 UTC (permalink / raw) To: Sean Christopherson Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, 26 Mar 2025 14:53:34 +0000, Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Mar 26, 2025, Ankit Agrawal wrote: > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > > > > crap since we demote to something non-cacheable, and maybe that > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > > > > > the toys... > > > > > > > > > > > > Fixing it is either faulting all access attempts or mapping it > > > > > > cachable to the S2 (as this series is trying to do).. 
> > > > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > > > > hardware (or rather reject the memslot creation), cacheable S2 > > > > > otherwise. > > > > > > > > I have no objection, Ankit are you able to make a failure patch? > > > > > > I'd wait until the KVM maintainers have their say. > > > > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure > > patch in memslot creation? > > That's not sufficient. As pointed out multiple times in this thread, any checks > done at memslot creation are best effort "courtesies" provided to userspace to > avoid terminating running VMs when the memory is faulted in. > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is > not. > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB, > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB, > KVM can simply honor the cacheability based on the VMA. Neither of those requires Remind me how this works with stuff such as guestmemfd, which, by definition, doesn't have a userspace mapping? > a memslot flag. A KVM capability to enumerate FWB support would be nice though, > e.g. so userspace can assert and bail early without ever hitting an > ioctl error. It's not "nice". It's mandatory. And FWB is definitely *not* something we want to expose as such. > > If we want to support existing setups that happen to work by dumb luck or careful > configuration, then that should probably be an admin decision to support the > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag. No. That's not how we handle an ABI issue. VM migration, with and without FWB, can happen in both directions, and must have clear semantics. So NAK to a kernel parameter. If I have a VM with a device mapped as *device* on FWB host, I must be able to migrate it to non-FWB host, and back. A device mapped as *cacheable* can only be migrated between FWB-capable hosts. 
Importantly, it is *userspace* that is in charge of deciding how the device is mapped at S2. And the memslot flag is the correct abstraction for that. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 15:42 ` Marc Zyngier @ 2025-03-26 16:10 ` Sean Christopherson 2025-03-26 18:02 ` Marc Zyngier 0 siblings, 1 reply; 61+ messages in thread From: Sean Christopherson @ 2025-03-26 16:10 UTC (permalink / raw) To: Marc Zyngier Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 26, 2025, Marc Zyngier wrote: > On Wed, 26 Mar 2025 14:53:34 +0000, > Sean Christopherson <seanjc@google.com> wrote: > > > > On Wed, Mar 26, 2025, Ankit Agrawal wrote: > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > > > > > crap since we demote to something non-cacheable, and maybe that > > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > > > > > > the toys... 
> > > > > > > > > > > > > > Fixing it is either faulting all access attempts or mapping it > > > > > > > cachable to the S2 (as this series is trying to do).. > > > > > > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > > > > > hardware (or rather reject the memslot creation), cacheable S2 > > > > > > otherwise. > > > > > > > > > > I have no objection, Ankit are you able to make a failure patch? > > > > > > > > I'd wait until the KVM maintainers have their say. > > > > > > > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure > > > patch in memslot creation? > > > > That's not sufficient. As pointed out multiple times in this thread, any checks > > done at memslot creation are best effort "courtesies" provided to userspace to > > avoid terminating running VMs when the memory is faulted in. > > > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is > > not. > > > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB, > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB, > > KVM can simply honor the cacheability based on the VMA. Neither of those requires > > Remind me how this work with stuff such as guestmemfd, which, by > definition, doesn't have a userspace mapping? Definitely not through a memslot flag. The cacheability would be a property of the guest_memfd inode, similar to how it's a property of the underlying device in this case. I don't entirely see what guest_memfd has to do with this. One of the big advantages of guest_memfd is that KVM has complete control over the lifecycle of the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there are valid host mappings when memory is unmapped from the guest, and so can't do the necessary maintenance. I agree with Jason's earlier statement that that's a solvable kernel flaw. 
For guest_memfd, KVM already does maintenance operations when memory is reclaimed, for both SNP and TDX. I don't think ARM's cacheability stuff would require any new functionality in guest_memfd. > > a memslot flag. A KVM capability to enumerate FWB support would be nice though, > > e.g. so userspace can assert and bail early without ever hitting an > > ioctl error. > > It's not "nice". It's mandatory. And FWB is definitely *not* something > we want to expose as such. I agree a capability is mandatory if we're adding a memslot flag, but I don't think it's mandatory if this is all handled through kernel plumbing. > > If we want to support existing setups that happen to work by dumb luck or careful > > configuration, then that should probably be an admin decision to support the > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag. > > No. That's not how we handle an ABI issue. VM migration, with and > without FWB, can happen in both direction, and must have clear > semantics. So NAK to a kernel parameter. > > If I have a VM with a device mapped as *device* on FWB host, I must be > able to migrate it to non-FWB host, and back. A device mapped as > *cacheable* can only be migrated between FWB-capable hosts. But I thought the whole problem is that mapping this fancy memory as device is unsafe on non-FWB hosts? If it's safe, then why does KVM need to reject anything in the first place? > Importantly, it is *userspace* that is in charge of deciding how the > device is mapped at S2. And the memslot flag is the correct > abstraction for that. I strongly disagree. Whatever owns the underlying physical memory is in charge, not userspace. For memory that's backed by a VMA, userspace can influence the behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd. I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g. 
to disallow mapping memory as writable even if the VMA is writable. But IMO, adding a memslot flag to control cacheability isn't purely subtractive. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 16:10 ` Sean Christopherson @ 2025-03-26 18:02 ` Marc Zyngier 2025-03-26 18:24 ` Sean Christopherson 0 siblings, 1 reply; 61+ messages in thread From: Marc Zyngier @ 2025-03-26 18:02 UTC (permalink / raw) To: Sean Christopherson Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, 26 Mar 2025 16:10:45 +0000, Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Mar 26, 2025, Marc Zyngier wrote: > > On Wed, 26 Mar 2025 14:53:34 +0000, > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Wed, Mar 26, 2025, Ankit Agrawal wrote: > > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > > > > > > crap since we demote to something non-cacheable, and maybe that > > > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away > > > > > > > > > the toys... 
> > > > > > > > > > > > > > > > Fixing it is either faulting all access attempts or mapping it > > > > > > > > cachable to the S2 (as this series is trying to do).. > > > > > > > > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > > > > > > hardware (or rather reject the memslot creation), cacheable S2 > > > > > > > otherwise. > > > > > > > > > > > > I have no objection, Ankit are you able to make a failure patch? > > > > > > > > > > I'd wait until the KVM maintainers have their say. > > > > > > > > > > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure > > > > patch in memslot creation? > > > > > > That's not sufficient. As pointed out multiple times in this thread, any checks > > > done at memslot creation are best effort "courtesies" provided to userspace to > > > avoid terminating running VMs when the memory is faulted in. > > > > > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is > > > not. > > > > > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB, > > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB, > > > KVM can simply honor the cacheability based on the VMA. Neither of those requires > > > > Remind me how this work with stuff such as guestmemfd, which, by > > definition, doesn't have a userspace mapping? > > Definitely not through a memslot flag. The cacheability would be a property of > the guest_memfd inode, similar to how it's a property of the underlying device > in this case. It's *not* a property of the device. It's a property of the mapping. > I don't entirely see what guest_memfd has to do with this. You were the one mentioning sampling the cacheability via the VMA. As far as I understand guestmemfd, there is no VMA to speak of. > One of the big > advantages of guest_memfd is that KVM has complete control over the lifecycle of > the memory. 
IIUC, the issue with !FWB hosts is that KVM can't guarantee there > are valid host mappings when memory is unmapped from the guest, and so can't do > the necessary maintenance. I agree with Jason's earlier statement that that's a > solvable kernel flaw. > > For guest_memfd, KVM already does maintenance operations when memory is reclaimed, > for both SNP and TDX. I don't think ARM's cacheability stuff would require any > new functionality in guest_memfd. I don't know how you reconcile the lack of host mapping and cache maintenance. The latter cannot take place without the former. > > > > a memslot flag. A KVM capability to enumerate FWB support would be nice though, > > > e.g. so userspace can assert and bail early without ever hitting an > > > ioctl error. > > > > It's not "nice". It's mandatory. And FWB is definitely *not* something > > we want to expose as such. > > I agree a capability is mandatory if we're adding a memslot flag, but I don't > think it's mandatory if this is all handled through kernel plumbing. It is mandatory, full stop. Otherwise, userspace is able to migrate a VM from an FWB host to a non-FWB one, start the VM, blow up on the first page fault. That's not an acceptable outcome. > > > > If we want to support existing setups that happen to work by dumb luck or careful > > > configuration, then that should probably be an admin decision to support the > > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag. > > > > No. That's not how we handle an ABI issue. VM migration, with and > > without FWB, can happen in both direction, and must have clear > > semantics. So NAK to a kernel parameter. > > > > If I have a VM with a device mapped as *device* on FWB host, I must be > > able to migrate it to non-FWB host, and back. A device mapped as > > *cacheable* can only be migrated between FWB-capable hosts. > > But I thought the whole problem is that mapping this fancy memory as device is > unsafe on non-FWB hosts? 
If it's safe, then why does KVM needs to reject anything > in the first place? I don't know where you got that idea. This is all about what memory type is exposed to a guest: - with FWB, no need for CMOs, so cacheable memory is allowed if the device supports it (i.e. it actually exposes memory), and device otherwise. - without FWB, CMOs are required, and we don't have a host mapping for these pages. As a fallback, the mapping is device only, as this doesn't require any CMO by definition. There is no notion of "safety" here. > > Importantly, it is *userspace* that is in charge of deciding how the > > device is mapped at S2. And the memslot flag is the correct > > abstraction for that. > > I strongly disagree. Whatever owns the underlying physical memory is in charge, > not userspace. For memory that's backed by a VMA, userspace can influence the > behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state > from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd. I don't buy that. Userspace needs to know the semantics of the memory it gives to the guest. Or at least discover that the same device plugged into different hosts will have different behaviours. Just letting things rip is not an acceptable outcome. > I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g. > to disallow mapping memory as writable even if the VMA is writable. But IMO, > adding a memslot flag to control cacheability isn't purely substractive. I don't see how that solves the problem at hand: given the presence or absence of FWB, allow userspace to discover as early as possible what behaviour a piece of memory provided by a device will have, and control it to handle migration. M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 18:02 ` Marc Zyngier @ 2025-03-26 18:24 ` Sean Christopherson 2025-03-26 18:51 ` Oliver Upton 2025-03-31 14:56 ` Jason Gunthorpe 0 siblings, 2 replies; 61+ messages in thread From: Sean Christopherson @ 2025-03-26 18:24 UTC (permalink / raw) To: Marc Zyngier Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 26, 2025, Marc Zyngier wrote: > On Wed, 26 Mar 2025 16:10:45 +0000, > Sean Christopherson <seanjc@google.com> wrote: > > > > On Wed, Mar 26, 2025, Marc Zyngier wrote: > > > On Wed, 26 Mar 2025 14:53:34 +0000, > > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Wed, Mar 26, 2025, Ankit Agrawal wrote: > > > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote: > > > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote: > > > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote: > > > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote: > > > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd > > > > > > > > > > crap since we demote to something non-cacheable, and maybe that > > > > > > > > > > deserves fixing 
first. Hopefully nobody notices that we've taken away > > > > > > > > > > the toys... > > > > > > > > > > > > > > > > > > Fixing it is either faulting all access attempts or mapping it > > > > > > > > > cachable to the S2 (as this series is trying to do).. > > > > > > > > > > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB > > > > > > > > hardware (or rather reject the memslot creation), cacheable S2 > > > > > > > > otherwise. > > > > > > > > > > > > > > I have no objection, Ankit are you able to make a failure patch? > > > > > > > > > > > > I'd wait until the KVM maintainers have their say. > > > > > > > > > > > > > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure > > > > > patch in memslot creation? > > > > > > > > That's not sufficient. As pointed out multiple times in this thread, any checks > > > > done at memslot creation are best effort "courtesies" provided to userspace to > > > > avoid terminating running VMs when the memory is faulted in. > > > > > > > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is > > > > not. > > > > > > > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB, > > > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB, > > > > KVM can simply honor the cacheability based on the VMA. Neither of those requires > > > > > > Remind me how this work with stuff such as guestmemfd, which, by > > > definition, doesn't have a userspace mapping? > > > > Definitely not through a memslot flag. The cacheability would be a property of > > the guest_memfd inode, similar to how it's a property of the underlying device > > in this case. > > It's *not* a property of the device. It's a property of the mapping. Sorry, bad phrasing. I was trying to say that the entity that controls the cacheability is ultimately whatever kernel subsystem/driver controls the mapping. 
> > I don't entirely see what guest_memfd has to do with this. > > You were the one mentioning sampling the cacheability via the VMA. As > far as I understand guestmemfd, there is no VMA to speak of. > > > One of the big > > advantages of guest_memfd is that KVM has complete control over the lifecycle of > > the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there > > are valid host mappings when memory is unmapped from the guest, and so can't do > > the necessary maintenance. I agree with Jason's earlier statement that that's a > > solvable kernel flaw. > > > > For guest_memfd, KVM already does maintenance operations when memory is reclaimed, > > for both SNP and TDX. I don't think ARM's cacheability stuff would require any > > new functionality in guest_memfd. > > I don't know how you reconcile the lack of host mapping and cache > maintenance. The latter cannot take place without the former. I assume cache maintenance only requires _a_ mapping to the physical memory. With guest_memfd, KVM has the pfn (which happens to always be struct page memory today), and so can establish a VA=>PA mapping as needed. > > > > a memslot flag. A KVM capability to enumerate FWB support would be nice though, > > > > e.g. so userspace can assert and bail early without ever hitting an > > > > ioctl error. > > > > > > It's not "nice". It's mandatory. And FWB is definitely *not* something > > > we want to expose as such. > > > > I agree a capability is mandatory if we're adding a memslot flag, but I don't > > think it's mandatory if this is all handled through kernel plumbing. > > It is mandatory, full stop. Otherwise, userspace is able to migrate a > VM from an FWB host to a non-FWB one, start the VM, blow up on the > first page fault. That's not an acceptable outcome. 
> > > > > > > If we want to support existing setups that happen to work by dumb luck or careful > > > > configuration, then that should probably be an admin decision to support the > > > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag. > > > > > > No. That's not how we handle an ABI issue. VM migration, with and > > > without FWB, can happen in both direction, and must have clear > > > semantics. So NAK to a kernel parameter. > > > > > > If I have a VM with a device mapped as *device* on FWB host, I must be > > > able to migrate it to non-FWB host, and back. A device mapped as > > > *cacheable* can only be migrated between FWB-capable hosts. > > > > But I thought the whole problem is that mapping this fancy memory as device is > > unsafe on non-FWB hosts? If it's safe, then why does KVM needs to reject anything > > in the first place? > > I don't know where you got that idea. This is all about what memory > type is exposed to a guest: > > - with FWB, no need for CMOs, so cacheable memory is allowed if the > device supports it (i.e. it actually exposes memory), and device > otherwise. > > - without FWB, CMOs are required, and we don't have a host mapping for > these pages. As a fallback, the mapping is device only, as this > doesn't require any CMO by definition. > > There is no notion of "safety" here. Ah, the safety I'm talking about is the CMO requirement. IIUC, not doing CMOs if the memory is cacheable could result in data corruption, i.e. would be a safety issue for the host. But I missed that you were proposing that the !FWB behavior would be to force device mappings. > > > Importantly, it is *userspace* that is in charge of deciding how the > > > device is mapped at S2. And the memslot flag is the correct > > > abstraction for that. > > > > I strongly disagree. Whatever owns the underlying physical memory is in charge, > > not userspace. 
For memory that's backed by a VMA, userspace can influence the > > behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state > > from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd. > > I don't buy that. Userspace needs to know the semantics of the memory > it gives to the guest. Or at least discover that the same device > plugged into to different hosts will have different behaviours. Just > letting things rip is not an acceptable outcome. Agreed, but that doesn't require a memslot flag. A capability to enumerate that KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to have KVM reject memslots that are cacheable in the VMA, but would get device in stage-2, then we can provide that functionality through the capability, i.e. let userspace decide if it wants "fallback to device" vs. "error on creation" on a per-VM basis. What I object to is adding a memslot flag. > > I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g. > > to disallow mapping memory as writable even if the VMA is writable. But IMO, > > adding a memslot flag to control cacheability isn't purely substractive. > > I don't see how that solves the problem at hand: given the presence or > absence of FWB, allow userspace to discover as early as possible what > behaviour a piece of memory provided by a device will have, and > control it to handle migration. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 18:24 ` Sean Christopherson @ 2025-03-26 18:51 ` Oliver Upton 2025-03-31 14:44 ` Jason Gunthorpe 2025-03-31 14:56 ` Jason Gunthorpe 1 sibling, 1 reply; 61+ messages in thread From: Oliver Upton @ 2025-03-26 18:51 UTC (permalink / raw) To: Sean Christopherson Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote: > > > But I thought the whole problem is that mapping this fancy memory as device is > > > unsafe on non-FWB hosts? If it's safe, then why does KVM needs to reject anything > > > in the first place? > > > > I don't know where you got that idea. This is all about what memory > > type is exposed to a guest: > > > > - with FWB, no need for CMOs, so cacheable memory is allowed if the > > device supports it (i.e. it actually exposes memory), and device > > otherwise. > > > > - without FWB, CMOs are required, and we don't have a host mapping for > > these pages. As a fallback, the mapping is device only, as this > > doesn't require any CMO by definition. > > > > There is no notion of "safety" here. > > Ah, the safety I'm talking about is the CMO requirement. 
IIUC, not doing CMOs > if the memory is cacheable could result in data corruption, i.e. would be a safety > issue for the host. But I missed that you were proposing that the !FWB behavior > would be to force device mappings. To Jason's earlier point, you wind up with a security issue the other way around. Supposing the host is using a cacheable mapping to, say, zero the $THING at the other end of the mapping. Without a way to CMO the $THING we cannot make the zeroing visible to a guest with a stage-2 Device-* mapping. Marc, I understand that your proposed fallback is aligned to what we do today, but I'm actually unconvinced that it provides any reliable/correct behavior. We should then wind up with stage-2 memory attribute rules like so: 1) If struct page memory, use a cacheable mapping. CMO for non-FWB. 2) If cacheable PFNMAP: a) With FWB, use a cacheable mapping b) Without FWB, fail. 3) If VM_ALLOW_ANY_UNCACHED, use Normal Non-Cacheable mapping 4) Otherwise, Device-nGnRE I understand 2b breaks ABI, but the 'typical' VFIO usages fall into (3) and (4). > > > > Importantly, it is *userspace* that is in charge of deciding how the > > > > device is mapped at S2. And the memslot flag is the correct > > > > abstraction for that. > > > > > > I strongly disagree. Whatever owns the underlying physical memory is in charge, > > > not userspace. For memory that's backed by a VMA, userspace can influence the > > > behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state > > > from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd. > > > > I don't buy that. Userspace needs to know the semantics of the memory > > it gives to the guest. Or at least discover that the same device > > plugged into to different hosts will have different behaviours. Just > > letting things rip is not an acceptable outcome. > > Agreed, but that doesn't require a memslot flag. 
A capability to enumerate that > KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to > have KVM reject memslots that are cachaeable in the VMA, but would get device in > stage-2, then we can provide that functionality through the capability, i.e. let > userspace decide if it wants "fallback to device" vs. "error on creation" on a > per-VM basis. > > What I object to is adding a memslot flag. A capability that says "I can force cacheable things to be cacheable" is useful beyond even PFNMAP stuff. A pedantic but correct live migration / snapshotting implementation on non-FWB would need to do CMOs in case the VM used a non-WB mapping for memory. Thanks, Oliver ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 18:51 ` Oliver Upton @ 2025-03-31 14:44 ` Jason Gunthorpe 0 siblings, 0 replies; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-31 14:44 UTC (permalink / raw) To: Oliver Upton Cc: Sean Christopherson, Marc Zyngier, Ankit Agrawal, Catalin Marinas, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 26, 2025 at 11:51:57AM -0700, Oliver Upton wrote: > > 1) If struct page memory, use a cacheable mapping. CMO for non-FWB. > > 2) If cacheable PFNMAP: > a) With FWB, use a cacheable mapping > b) Without FWB, fail. > > 3) If VM_ALLOW_ANY_UNCACHED, use Normal Non-Cacheable mapping > > 4) Otherwise, Device-nGnRE > > I understand 2b breaks ABI, but the 'typical' VFIO usages fall into (3) > and (4). +1 (and +1 to Sean's remark about strictly tracking the VMA as well) IMHO not doing 2b is a "security" bug today. Catalin suggested we fix it as a first step to get agreement on this assessment and fix. Once fixed there is no way for KVM to create a S2 with cacheable semantics different from the VMA, or to have missing CMOs. That simplifies the discussion of adding 2a to strictly track cacheable. I also don't see a need for a flag here. 
> A pedantic but correct live migration / snapshotting implementation > on non-FWB would need to do CMOs in case the VM used a non-WB > mapping for memory. From a live migration perspective, we have already built in a lot of things on the VFIO side where a live migration can be attempted and then fail because of late-detected HW incompatibilities. We've sort of punted this for now to the orchestrator and operator. There is an expectation the environment will somehow ensure that live migration machine pools are sufficiently uniform when using VFIO. It is much more restricted than normal no-VFIO VM live migration. Given you can't have VM live migration of MMIO-backed cacheable memory without a VFIO live migration driver, I wouldn't worry too much about these fine details from the KVM side. If /proc/cpuinfo shows FWB, that would be approximately similar discoverability to all the other limitations. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
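Oliver's four-way policy quoted above can be expressed as a small decision function. The sketch below is illustrative only — the enum and function names are invented for the example and are not KVM symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the proposed stage-2 memory-type policy.
 * None of these names exist in KVM. */
enum s2_type { S2_NORMAL_WB, S2_NORMAL_NC, S2_DEVICE_NGNRE, S2_REJECT };

static enum s2_type pick_s2_type(bool struct_page, bool vma_cacheable,
                                 bool allow_any_uncached, bool have_fwb)
{
    if (struct_page)
        return S2_NORMAL_WB;            /* (1) cacheable; CMOs cover !FWB */
    if (vma_cacheable)                  /* (2) cacheable PFNMAP */
        return have_fwb ? S2_NORMAL_WB  /* (2a) */
                        : S2_REJECT;    /* (2b) no CMO path, so fail */
    if (allow_any_uncached)
        return S2_NORMAL_NC;            /* (3) VM_ALLOW_ANY_UNCACHED */
    return S2_DEVICE_NGNRE;             /* (4) default for MMIO */
}
```

As noted in the thread, the 'typical' VFIO usages land in cases (3) and (4), so only case (2b) changes observable behavior.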
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-26 18:24 ` Sean Christopherson 2025-03-26 18:51 ` Oliver Upton @ 2025-03-31 14:56 ` Jason Gunthorpe 2025-04-07 15:20 ` Sean Christopherson 1 sibling, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-31 14:56 UTC (permalink / raw) To: Sean Christopherson Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote: > > I don't know how you reconcile the lack of host mapping and cache > > maintenance. The latter cannot take place without the former. > > I assume cache maintenance only requires _a_ mapping to the physical memory. > With guest_memfd, KVM has the pfn (which happens to always be struct page memory > today), and so can establish a VA=>PA mapping as needed. This is why we are forcing FWB in this work, because we don't have a VA mapping and KVM doesn't have the code to create one on demand. IMHO, I strongly suspect that all CCA capable ARM systems will have FWB, so it may make the most sense to continue that direction when using guest_memfd. Though I don't know what that would mean for pkvm.. 
> > > I agree a capability is mandatory if we're adding a memslot flag, but I don't > > > think it's mandatory if this is all handled through kernel plumbing. > > > > It is mandatory, full stop. Otherwise, userspace is able to migrate a > > VM from an FWB host to a non-FWB one, start the VM, blow up on the > > first page fault. That's not an acceptable outcome. It is fine if you add a protective check during memslot creation. If qemu asks for a memslot for a cacheable VFIO VMA without FWB support then fail it immediately, and that will safely abort the migration. Do not delay until page fault time, though still check again at page fault time. This is good enough for VFIO device live migration as "try and fail" broadly matches the other HW compatibility checks VFIO device live migration does today. > Ah, the safety I'm talking about is the CMO requirement. IIUC, not doing CMOs > if the memory is cacheable could result in data corruption, i.e. would be a safety > issue for the host. But I missed that you were proposing that the !FWB behavior > would be to force device mappings. It only forces device mappings on the VM side, so you still have the safety issue on the host side where the VM will see the physical contents of the memory but the hypervisor doesn't flush or otherwise synchronize. I.e., data could leak from VM A to B because the cache was not flushed and A's data is still in physical memory. > Agreed, but that doesn't require a memslot flag. A capability to enumerate that > KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to > have KVM reject memslots that are cacheable in the VMA, but would get device in > stage-2, then we can provide that functionality through the capability, i.e. let > userspace decide if it wants "fallback to device" vs. "error on creation" on a > per-VM basis. I think we must block "fallback to device" as a security fix. So error on creation is the only option left. 
If that means some scenarios start to fail, and we consider that important to fix, the fix is to have the S2 be cacheable and add the missing CMOs. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
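Jason's "check at creation, check again at fault time, never fall back" rule can be sketched as a single predicate shared by both paths. This is a hedged illustration; check_cacheable_pfnmap() is an invented name, not a kernel function:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* One predicate guarding both kvm_arch_prepare_memory_region() and
 * user_mem_abort(): reject a cacheable PFNMAP when S2FWB is absent
 * instead of silently degrading it to a Device mapping. */
static int check_cacheable_pfnmap(bool vma_pfnmap, bool vma_cacheable,
                                  bool have_fwb)
{
    if (vma_pfnmap && vma_cacheable && !have_fwb)
        return -EINVAL;   /* fail early; a migration aborts cleanly */
    return 0;             /* stage-2 may track the VMA's attributes */
}
```

The memslot-creation failure gives userspace (e.g. a migration destination) a clean early error; the identical fault-time check is defense in depth.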
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-31 14:56 ` Jason Gunthorpe @ 2025-04-07 15:20 ` Sean Christopherson 2025-04-07 16:15 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Sean Christopherson @ 2025-04-07 15:20 UTC (permalink / raw) To: Jason Gunthorpe Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Mon, Mar 31, 2025, Jason Gunthorpe wrote: > On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote: > > > I don't know how you reconcile the lack of host mapping and cache > > > maintenance. The latter cannot take place without the former. > > > > I assume cache maintenance only requires _a_ mapping to the physical memory. > > With guest_memfd, KVM has the pfn (which happens to always be struct page memory > > today), and so can establish a VA=>PA mapping as needed. > > This is why we are forcing FWB in this work, because we don't have a > VA mapping and KVM doesn't have the code to create one on demand. I don't follow. As it exists today, guest_memfd doesn't touch the direct map, i.e. there's already a kernel mapping, KVM doesn't need to create one. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-07 15:20 ` Sean Christopherson @ 2025-04-07 16:15 ` Jason Gunthorpe 2025-04-07 16:43 ` Sean Christopherson 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-07 16:15 UTC (permalink / raw) To: Sean Christopherson Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Mon, Apr 07, 2025 at 08:20:09AM -0700, Sean Christopherson wrote: > On Mon, Mar 31, 2025, Jason Gunthorpe wrote: > > On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote: > > > > I don't know how you reconcile the lack of host mapping and cache > > > > maintenance. The latter cannot take place without the former. > > > > > > I assume cache maintenance only requires _a_ mapping to the physical memory. > > > With guest_memfd, KVM has the pfn (which happens to always be struct page memory > > > today), and so can establish a VA=>PA mapping as needed. > > > > This is why we are forcing FWB in this work, because we don't have a > > VA mapping and KVM doesn't have the code to create one on demand. > > I don't follow. As it exists today, guest_memfd doesn't touch the direct map, > i.e. there's already a kernel mapping, KVM doesn't need to create one. 
This is not about guest_memfd.. When ARM KVM copies a PTE from a VMA into the S2, it may find that the physical address does not have a struct page, in which case it assumes the worst: that it is not in the kernel map and cannot be cache flushed. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
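The constraint Jason describes — no struct page means no known kernel alias to flush through — can be modeled as below. The helper names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* On !FWB hosts KVM flushes by converting kvm_pte -> phys -> kernel VA
 * via the linear map, which only covers struct-page (pfn_valid) memory. */
static bool can_flush(bool pfn_has_struct_page)
{
    /* no struct page => assume the PA is not in the kernel linear map */
    return pfn_has_struct_page;
}

static bool safe_to_map_cacheable(bool pfn_has_struct_page, bool have_fwb)
{
    /* FWB forces stage-2 cacheability regardless of stage-1 attributes,
     * so no CMOs are needed; otherwise KVM must be able to flush itself */
    return have_fwb || can_flush(pfn_has_struct_page);
}
```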
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-07 16:15 ` Jason Gunthorpe @ 2025-04-07 16:43 ` Sean Christopherson 2025-04-16 8:51 ` Ankit Agrawal 0 siblings, 1 reply; 61+ messages in thread From: Sean Christopherson @ 2025-04-07 16:43 UTC (permalink / raw) To: Jason Gunthorpe Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Mon, Apr 07, 2025, Jason Gunthorpe wrote: > On Mon, Apr 07, 2025 at 08:20:09AM -0700, Sean Christopherson wrote: > > On Mon, Mar 31, 2025, Jason Gunthorpe wrote: > > > On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote: > > > > > I don't know how you reconcile the lack of host mapping and cache > > > > > maintenance. The latter cannot take place without the former. > > > > > > > > I assume cache maintenance only requires _a_ mapping to the physical memory. > > > > With guest_memfd, KVM has the pfn (which happens to always be struct page memory > > > > today), and so can establish a VA=>PA mapping as needed. > > > > > > This is why we are forcing FWB in this work, because we don't have a > > > VA mapping and KVM doesn't have the code to create one on demand. > > > > I don't follow. As it exists today, guest_memfd doesn't touch the direct map, > > i.e. 
there's already a kernel mapping, KVM doesn't need to create one. > > This is not about guest_memfd.. Heh, my part of the thread was. I was responding to Marc's comment: : Remind me how this work with stuff such as guestmemfd, which, by : definition, doesn't have a userspace mapping? I'm pretty sure we're on the same page. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-07 16:43 ` Sean Christopherson @ 2025-04-16 8:51 ` Ankit Agrawal 2025-04-21 16:03 ` Ankit Agrawal 2025-04-22 7:49 ` Oliver Upton 0 siblings, 2 replies; 61+ messages in thread From: Ankit Agrawal @ 2025-04-16 8:51 UTC (permalink / raw) To: Sean Christopherson, Jason Gunthorpe Cc: Marc Zyngier, Catalin Marinas, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Hi, summarizing the discussion so far and outlining the next steps. The key points are as follows: 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP: If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin and Sean point out that this may not be required as userspace does not have much choice anyways. KVM has to follow the PTEs and userspace cannot ask for something different. However, Marc points out that enumerating FWB support would allow userspace to discover the support and prevent live-migration across FWB and non-FWB hosts. Jason suggested that this may still be fine as we have already built in VFIO side protection where a live migration can be attempted and then fail because of late-detected HW incompatibilities. 2. 
New memslot flag that the VMM passes at memslot registration: The discussion concluded that this is not necessary and KVM should just follow the VMA pgprot. 3. Fallback path handling for PFNMAP when FWB is not set: The discussion concluded that there shouldn't be any fallback path and the memslot should just fail, i.e. KVM should not allow degrading cacheable to non-cacheable when it can't do flushing. This is to prevent the potential security issue pointed out by Jason (S1 cacheable, S2 non-cacheable). So AIUI, the next step is to send out the updated series with the following patches: 1. Block cacheable PFN maps in memslot creation (kvm_arch_prepare_memory_region) and during fault handling (user_mem_abort()). 2. Enable support for cacheable PFN maps if S2FWB is enabled by following the VMA pgprot (this patch). 3. Add and expose the new KVM cap to expose cacheable PFNMAP support (set to false for !FWB), pending maintainers' feedback on the necessity of this capability. Please let me know if there are any inaccuracies. Thanks Ankit Agrawal ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-16 8:51 ` Ankit Agrawal @ 2025-04-21 16:03 ` Ankit Agrawal 2025-04-22 7:49 ` Oliver Upton 1 sibling, 0 replies; 61+ messages in thread From: Ankit Agrawal @ 2025-04-21 16:03 UTC (permalink / raw) To: Sean Christopherson, Jason Gunthorpe Cc: Marc Zyngier, Catalin Marinas, Oliver Upton, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org > Hi, summarizing the discussion so far and outlining the next steps. The key points > are as follows: > > 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP: > If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin > and Sean point out that this may not be required as userspace does not have > much choice anyways. KVM has to follow the PTEs and userspace cannot ask > for something different. However, Marc points out that enumerating FWB support > would allow userspace to discover the support and prevent live-migration > across FWB and non-FWB hosts. Jason suggested that this may still be fine as > we have already built in VFIO side protection where a live migration can be > attempted and then fail because of late-detected HW incompatibilities. > > 2. 
New memslot flag that VMM passes at memslot registration: > Discussion point that this is not necessary and KVM should just follow the > VMA pgprot. > > 3. Fallback path handling for PFNMAP when the FWB is not set: > Discussion points that there shouldn't be any fallback path and the memslot > should just fail. i.e. KVM should not allow degrading cachable to non-cachable > when it can't do flushing. This is to prevent the potential security issue > pointed by Jason (S1 cacheable, S2 noncacheable). > > So AIU, the next step is to send out the updated series with the following patches: > 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region) > and during fault handling (user_mem_abort()). > > 2. Enable support for cacheable PFN maps if S2FWB is enabled by following > the vma pgprot (this patch). > > 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false > for !FWB), pending maintainers' feedback on the necessity of this capability. Hi, just a humble reminder to take a look at the summary and next steps. Marc, will you be able to confirm if you still think we should have the "cacheable PFNMAP" KVM cap? If yes, I'll send out the series inclusive of that. > Please let me know if there are any inaccuracies. > > Thanks > Ankit Agrawal ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-16 8:51 ` Ankit Agrawal 2025-04-21 16:03 ` Ankit Agrawal @ 2025-04-22 7:49 ` Oliver Upton 2025-04-22 13:54 ` Jason Gunthorpe 2025-04-22 14:53 ` Sean Christopherson 1 sibling, 2 replies; 61+ messages in thread From: Oliver Upton @ 2025-04-22 7:49 UTC (permalink / raw) To: Ankit Agrawal Cc: Sean Christopherson, Jason Gunthorpe, Marc Zyngier, Catalin Marinas, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Hi, On Wed, Apr 16, 2025 at 08:51:05AM +0000, Ankit Agrawal wrote: > Hi, summarizing the discussion so far and outlining the next steps. The key points > are as follows: > 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP: > If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin > and Sean point out that this may not be required as userspace does not have > much choice anyways. KVM has to follow the PTEs and userspace cannot ask > for something different. However, Marc points out that enumerating FWB support > would allow userspace to discover the support and prevent live-migration > across FWB and non-FWB hosts. 
Jason suggested that this may still be fine as > we have already built in VFIO side protection where a live migration can be > attempted and then fail because of late-detected HW incompatibilities. > > 2. New memslot flag that the VMM passes at memslot registration: > The discussion concluded that this is not necessary and KVM should just follow the > VMA pgprot. > > 3. Fallback path handling for PFNMAP when FWB is not set: > The discussion concluded that there shouldn't be any fallback path and the memslot > should just fail, i.e. KVM should not allow degrading cacheable to non-cacheable > when it can't do flushing. This is to prevent the potential security issue > pointed out by Jason (S1 cacheable, S2 non-cacheable). > > > So AIUI, the next step is to send out the updated series with the following patches: > 1. Block cacheable PFN maps in memslot creation (kvm_arch_prepare_memory_region) > and during fault handling (user_mem_abort()). Yes, we need to prevent the creation of stage-2 mappings to PFNMAP memory that uses cacheable attributes in the host stage-1. I believe we have alignment that this is a bugfix. > 2. Enable support for cacheable PFN maps if S2FWB is enabled by following > the VMA pgprot (this patch). > > 3. Add and expose the new KVM cap to expose cacheable PFNMAP support (set to false > for !FWB), pending maintainers' feedback on the necessity of this capability. Regarding UAPI: I'm still convinced that we need the VMM to buy in to this behavior. And no, it doesn't matter if this is some VFIO-based mapping or kernel-managed memory. The reality is that userspace is an equal participant in remaining coherent with the guest. Whether or not FWB is employed for a particular region of IPA space is useful information for userspace deciding what it needs to do to access guest memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for example) doing live migration of the VM. 
So this KVM CAP needs to be paired with a memslot flag. - The capability says KVM is able to enforce Write-Back at stage-2 - The memslot flag says userspace expects a particular GFN range to guarantee Write-Back semantics. This can be applied to 'normal', kernel-managed memory and PFNMAP thingies that have cacheable attributes at host stage-1. - Under no situation do we allow userspace to create a non-cacheable mapping at stage-2 for something PFNMAP cacheable at stage-1. No matter what, my understanding is that we all agree that the driver which provided the host stage-1 mapping is the authoritative source for memory attributes compatible with a given device. The accompanying UAPI is necessary for the VMM to understand how to handle arbitrary cacheable mappings provided to the VM. Thanks, Oliver ^ permalink raw reply [flat|nested] 61+ messages in thread
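Oliver's capability-plus-flag contract can be modeled as follows. KVM_MEM_WB and the return-code policy are hypothetical — this sketches the proposed UAPI semantics, not existing KVM behavior:

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_MEM_WB (1u << 3)   /* hypothetical memslot flag, not real UAPI */

/* Toy model of memslot registration under the proposed contract:
 * the cap says KVM can enforce Write-Back at stage-2, the flag says
 * userspace expects it, and cacheable PFNMAP requires the flag. */
static int set_memslot(bool pfnmap_cacheable, bool cap_s2_wb,
                       unsigned int flags)
{
    bool want_wb = flags & KVM_MEM_WB;

    if (want_wb && !cap_s2_wb)
        return -1;   /* KVM cannot guarantee Write-Back at stage-2 */
    if (pfnmap_cacheable && !want_wb)
        return -1;   /* never degrade cacheable stage-1 to device */
    return 0;        /* kernel-managed memory may carry the flag
                      * purely for discoverability */
}
```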
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 7:49 ` Oliver Upton @ 2025-04-22 13:54 ` Jason Gunthorpe 2025-04-22 16:50 ` Catalin Marinas 2025-04-22 14:53 ` Sean Christopherson 1 sibling, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-22 13:54 UTC (permalink / raw) To: Oliver Upton Cc: Ankit Agrawal, Sean Christopherson, Marc Zyngier, Catalin Marinas, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote: > The reality is that userspace is an equal participant in remaining coherent with > the guest. Whether or not FWB is employed for a particular region of IPA > space is useful information for userspace deciding what it needs to do to access guest > memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for > 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for > example) doing live migration of the VM. Really? How does it work today then? Is this another existing problem? Userspace is doing CMOs during live migration that are not necessary? > So this KVM CAP needs to be paired with a memslot flag. 
> > - The capability says KVM is able to enforce Write-Back at stage-2 Sure > - The memslot flag says userspace expects a particular GFN range to guarantee > Write-Back semantics. This can be applied to 'normal', kernel-managed memory > and PFNMAP thingies that have cacheable attributes at host stage-1. Userspace doesn't actually know if it has a cacheable mapping from VFIO though :( I don't really see a point in this. If KVM has the cap then userspace should assume the S2FWB behavior for all cacheable memslots. What should happen if you have S2FWB but don't pass the flag? For normal kernel memory it should still use S2FWB. Thus for cacheable PFNMAP it makes sense that it should also still use S2FWB without the flag? So, if you set the flag and don't have S2FWB it will fail the memslot, but then why not just rely on userspace to read the CAP and not create the memslot in the first place? If you don't set the flag then it should go ahead and use S2FWB anyhow and not fail. It doesn't make a lot of sense to me and brings more complexity to force userspace to discover the cacheability of the VFIO side. > - Under no situation do we allow userspace to create a non-cacheable mapping at > stage-2 for something PFNMAP cacheable at stage-1. Yes. memslot creation should fail, and page fault should fail. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 13:54 ` Jason Gunthorpe @ 2025-04-22 16:50 ` Catalin Marinas 2025-04-22 17:03 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-04-22 16:50 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025 at 10:54:52AM -0300, Jason Gunthorpe wrote: > On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote: > > - The memslot flag says userspace expects a particular GFN range to guarantee > > Write-Back semantics. This can be applied to 'normal', kernel-managed memory > > and PFNMAP thingies that have cacheable attributes at host stage-1. > > Userspace doesn't actaully know if it has a cachable mapping from VFIO > though :( Yes, that's why I couldn't figure out how useful a memory slot flag would be. A CAP is definitely useful and some way of preventing migrating between hosts with different capabilities for VM_PFNMAP is needed. However, I think we only want to prevent migration _if_ there's a cacheable VM_PFNMAP, otherwise KVM handles coherency via CMOs already. So, for the above, the VMM needs to know that it somehow got into such situation. 
If it knows the device (VFIO) capabilities and that the user mapping is Cacheable, coupled with the new KVM CAP, it can infer that Stage 2 will be S2FWB, no need for a memory slot flag. If it doesn't have such information, maybe a new memory slot flag can be used to probe what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it fails, Stage 2 is Device/NC and the VMM can attempt again without the WB flag. It's a bit of a stretch for the KVM API but IIUC there's no option to query the properties of a memory slot. FWIW, I don't think we need a memory slot flag for host kernel-managed memory, it's S2FWB already. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
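Catalin's probing idea — use the memslot flag itself to discover what stage-2 will be — could look roughly like this from the VMM side. KVM_MEM_PFNMAP_WB is the proposed, not-yet-existing flag, and set_memslot() is a stand-in for KVM_SET_USER_MEMORY_REGION:

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_MEM_PFNMAP_WB (1u << 4)   /* proposed flag, illustrative value */

/* Stand-in for KVM_SET_USER_MEMORY_REGION: the WB request only succeeds
 * when the kernel can give a Write-Back stage-2 (i.e. S2FWB present). */
static int set_memslot(unsigned int flags, bool host_has_fwb)
{
    if ((flags & KVM_MEM_PFNMAP_WB) && !host_has_fwb)
        return -1;
    return 0;
}

/* Probe: ask for Write-Back; if that fails, conclude stage-2 would be
 * Device/NC and register the slot again without the flag. */
static bool stage2_will_be_wb(bool host_has_fwb)
{
    if (set_memslot(KVM_MEM_PFNMAP_WB, host_has_fwb) == 0)
        return true;
    (void)set_memslot(0, host_has_fwb);   /* retry without the WB flag */
    return false;
}
```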
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 16:50 ` Catalin Marinas @ 2025-04-22 17:03 ` Jason Gunthorpe 2025-04-22 21:28 ` Oliver Upton 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-22 17:03 UTC (permalink / raw) To: Catalin Marinas Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025 at 05:50:32PM +0100, Catalin Marinas wrote: > So, for the above, the VMM needs to know that it somehow got into such > situation. If it knows the device (VFIO) capabilities and that the user > mapping is Cacheable, coupled with the new KVM CAP, it can infer that > Stage 2 will be S2FWB, no need for a memory slot flag. So long as the memslot creation fails for cachable PFNMAP without S2FWB the VMM is fine. qemu will begin its first steps to startup the migration destination and immediately fail. The migration will be aborted before it even gets started on the source side. As I said before, the present situation requires the site's orchestration to manage compatibility for live migration of VFIO devices. We only expect that the migration will abort early if the site has made a configuration error. 
> have such information, maybe a new memory slot flag can be used to probe > what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it > fails, Stage 2 is Device/NC and can attempt again without the WB flag. > It's a bit of a stretch for the KVM API but IIUC there's no option to > query the properties of a memory slot. I don't know of any use case for something like this. If VFIO gives the VMM a cacheable mapping there is no fallback to WB. The operator could use a different VFIO device, one that doesn't need a cacheable mapping, but the VMM can't flip the VFIO device between modes on the fly. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 17:03 ` Jason Gunthorpe @ 2025-04-22 21:28 ` Oliver Upton 2025-04-22 23:35 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Oliver Upton @ 2025-04-22 21:28 UTC (permalink / raw) To: Jason Gunthorpe Cc: Catalin Marinas, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025 at 10:54:52AM -0300, Jason Gunthorpe wrote: > On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote: > > The reality is that userspace is an equal participant in remaining coherent with > > the guest. Whether or not FWB is employed for a particular region of IPA > > space is useful information for userspace deciding what it needs to do to access guest > > memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for > > 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for > > example) doing live migration of the VM. > > Really? How does it work today then? Is this another existing problem? > Userspace is doing CMOs during live migration that are not necessary? Yes, this is a pre-existing problem. 
I'm not aware of a live migration implementation that handles !S2FWB correctly; they all assume guest accesses are done through a cacheable alias. So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd probably want to know it can elide CMOs on something that actually bears the feature. > > - The memslot flag says userspace expects a particular GFN range to guarantee > > Write-Back semantics. This can be applied to 'normal', kernel-managed memory > > and PFNMAP thingies that have cacheable attributes at host stage-1. > > Userspace doesn't actually know if it has a cachable mapping from VFIO > though :( That seems like a shortcoming on the VFIO side, not a KVM issue. What if userspace wants to do atomics on some VFIO mapping, doesn't it need to know that it has something with WB? > I don't really see a point in this. If the KVM has the cap then > userspace should assume the S2FWB behavior for all cachable memslots. Wait, so userspace simultaneously doesn't know the cacheability at host stage-1 but *does* for stage-2? This is why I contend that userspace needs a mechanism to discover the memory attributes on a given memslot. Without it there's no way of knowing what's a cacheable memslot. Along those lines, how is the VMM going to describe that cacheable PFNMAP region to the guest? > What should happen if you have S2FWB but don't pass the flag? For > normal kernel memory it should still use S2FWB. Thus for cachable > PFNMAP it makes sense that it should also still use S2FWB without the > flag? For kernel-managed memory, I agree. Accepting the flag for a memslot containing such memory would solely be for discoverability. OTOH, cacheable PFNMAP is a new feature and I see no issue compelling the use of a new bit with it. On Tue, Apr 22, 2025 at 02:03:24PM -0300, Jason Gunthorpe wrote: > On Tue, Apr 22, 2025 at 05:50:32PM +0100, Catalin Marinas wrote: > > > So, for the above, the VMM needs to know that it somehow got into such > > situation.
If it knows the device (VFIO) capabilities and that the user > > mapping is Cacheable, coupled with the new KVM CAP, it can infer that > > Stage 2 will be S2FWB, no need for a memory slot flag. > > So long as the memslot creation fails for cachable PFNMAP without > S2FWB the VMM is fine. qemu will begin its first steps to startup the > migration destination and immediately fail. The migration will be > aborted before it even gets started on the source side. > > As I said before, the present situation requires the site's > orchestration to manage compatibility for live migration of VFIO > devices. We only expect that the migration will abort early if the > site has made a configuration error. > > > have such information, maybe a new memory slot flag can be used to probe > > what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it > > fails, Stage 2 is Device/NC and can attempt again with the WB flag. > > It's a bit of a stretch for the KVM API but IIUC there's no option to > > query the properties of a memory slot. > > I don't know of any use case for something like this. If VFIO gives > the VMM a cachable mapping there is no fallback to WB. > > The operator could use a different VFIO device, one that doesn't need > cachable, but the VMM can't flip the VFIO device between modes on the > fly. I agree with you that in the context of a VFIO device userspace doesn't have any direct influence on the resulting memory attributes. The entire reason I'm dragging my feet about this is I'm concerned we've papered over the complexity of memory attributes (regardless of provenance) for way too long. KVM's done enough to make this dance 'work' in the context of kernel-managed memory, but adding more implicit KVM behavior for cacheable thingies makes the KVM UAPI even more unintelligible (as if it weren't already). So this flag isn't about giving userspace any degree of control over memory attributes. 
Just a way to know that things it _expects_ to be treated as cacheable can be guaranteed to use cacheable attributes in the VM. Thanks, Oliver ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 21:28 ` Oliver Upton @ 2025-04-22 23:35 ` Jason Gunthorpe 2025-04-23 10:45 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-22 23:35 UTC (permalink / raw) To: Oliver Upton Cc: Catalin Marinas, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote: > So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd > probably want to know it can elide CMOs on something that actually bears > the feature. OK > > > - The memslot flag says userspace expects a particular GFN range to guarantee > > > Write-Back semantics. This can be applied to 'normal', kernel-managed memory > > > and PFNMAP thingies that have cacheable attributes at host stage-1. > > > > Userspace doesn't actaully know if it has a cachable mapping from VFIO > > though :( > > That seems like a shortcoming on the VFIO side, not a KVM issue. What if > userspace wants to do atomics on some VFIO mapping, doesn't it need to > know that it has something with WB? VFIO is almost always un cachable. So far there is only one device merged, and CXL is coming, that would even support/require cachable. 
VFIO has always had this sort of programming model where userspace needs to know details about the device it is using, so it hasn't really been an issue so far that the kernel doesn't tell userspace what cachability it got. We could add it, but then we are adding new kernel code, qemu code and kvm code that is actually pretty pointless. > > I don't really see a point in this. If the KVM has the cap then > > userspace should assume the S2FWB behavior for all cachable memslots. > > Wait, so userspace simultaneously doesn't know the cacheability at host > stage-1 but *does* for stage-2? No, it doesn't know either. The point is the VMM doesn't care about any of this. It just wants to connect KVM to VFIO and have the kernel internally negotiate the details of how that works. There is zero value in the VMM being aware that KVM/VFIO is using cachable or non-cachable mappings because it will never touch this memory anyhow, and arguably it would be happier if it wasn't even in a VMA in the first place. > This is why I contend that userspace needs a mechanism to discover > the memory attributes on a given memslot. Without it there's no way > of knowing what's a cacheable memslot. If it cares about this then it should know by virtue of having put a cachable VMA into the memslot. > Along those lines, how is the VMM going to describe that cacheable > PFNMAP region to the guest? Heh. It creates a virtual PCI device in the guest and the space is mapped to a virtual BAR. When the guest driver binds to this device it will map the virtual BAR as cachable instead of as IO because it knows to do that based on the vPCI device ID. If something goes weird and the guest tries to use UC instead of cachable the S2FWB will block it and the guest will probably malfunction. CXL will probably have the VMM understand things a bit more and will generate the various ACPI tables CXL uses. > > What should happen if you have S2FWB but don't pass the flag?
For > > normal kernel memory it should still use S2FWB. Thus for cachable > > PFNMAP it makes sense that it should also still use S2FWB without the > > flag? > > For kernel-managed memory, I agree. Accepting the flag for a memslot > containing such memory would solely be for discoverability. > > OTOH, cacheable PFNMAP is a new feature and I see no issue compelling > the use of a new bit with it. Feels weird to me. Now the VMM has to discover if the KVM supports the new flag and only use it on new KVM versions, just to accomplish... nothing? > The entire reason I'm dragging my feet about this is I'm concerned we've > papered over the complexity of memory attributes (regardless of > provenance) for way too long. KVM's done enough to make this dance 'work' > in the context of kernel-managed memory, but adding more implicit KVM > behavior for cacheable thingies makes the KVM UAPI even more > unintelligible (as if it weren't already). It is very complex. I'm not sure adding a flag in this case is making it any simpler though. If you had a flag from day 0 that said 'this is an MMIO mapping, do MMIO stuff', that would be a lot clearer about how it should be used. But here we have a flag that doesn't seem well defined. When should the VMM set this flag? > So this flag isn't about giving userspace any degree of control over > memory attributes. Just a way to know that things it _expects_ to be > treated as cacheable can be guaranteed to use cacheable attributes in > the VM. Can you get some agreement with Sean? He seems very strongly opposed to this direction. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 23:35 ` Jason Gunthorpe @ 2025-04-23 10:45 ` Catalin Marinas 2025-04-23 12:02 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-04-23 10:45 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025 at 08:35:56PM -0300, Jason Gunthorpe wrote: > On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote: > > Wait, so userspace simultaneously doesn't know the cacheability at host > > stage-1 but *does* for stage-2? > > No, it doesn't know either. The point is the VMM doesn't care about > any of this. It just wants to connect KVM to VFIO and have the kernel > internally negotiate the details of how that works. > > There is zero value in the VMM being aware that KVM/VFIO is using > cachable or non-cachable mappings because it will never touch this > memory anyhow, and arguably it would be happier if it wasn't even in a > VMA in the first place. I think this was discussed at some point in the past - what about ignoring the VMA altogether and using an fd-based approach? 
Keep the current VFIO+vma ABI non-cacheable and introduce a new one for cacheable I/O mappings, e.g. KVM_MEM_IOFD. Is there a way for VFIO to communicate the attributes to KVM on the type of mapping in the absence of a VMA (e.g. via the inode)? If not: Stage 2 could default to non-cacheable and we can add a KVM_MEMORY_ATTRIBUTE_IO_WB as an option to ioctl(KVM_SET_MEMORY_ATTRIBUTES) (setup requiring two ioctls). The latter could fail if S2FWB is not available. But I agree, it does mean building knowledge of the device in the VMM (it might not be that bad). -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-23 10:45 ` Catalin Marinas @ 2025-04-23 12:02 ` Jason Gunthorpe 2025-04-23 12:26 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-23 12:02 UTC (permalink / raw) To: Catalin Marinas Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Apr 23, 2025 at 11:45:04AM +0100, Catalin Marinas wrote: > On Tue, Apr 22, 2025 at 08:35:56PM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote: > > > Wait, so userspace simultaneously doesn't know the cacheability at host > > > stage-1 but *does* for stage-2? > > > > No, it doesn't know either. The point is the VMM doesn't care about > > any of this. It just wants to connect KVM to VFIO and have the kernel > > internally negotiate the details of how that works. > > > > There is zero value in the VMM being aware that KVM/VFIO is using > > cachable or non-cachable mappings because it will never touch this > > memory anyhow, and arguably it would be happier if it wasn't even in a > > VMA in the first place. > > I think this was discussed at some point in the past - what about > ignoring the VMA altogether and using an fd-based approach? 
Keep the > current VFIO+vma ABI non-cacheable and introduce a new one for cacheable > I/O mappings, e.g. KVM_MEM_IOFD. Is there a way for VFIO to communicate > the attributes to KVM on the type of mapping in the absence of a VMA > (e.g. via the inode)? If not: I hope to see that someday. The CC people are working in that direction, but we are still far from getting enough agreement from all the stakeholders on how that will work. > Stage 2 could default to non-cacheable and we can add a > KVM_MEMORY_ATTRIBUTE_IO_WB as an option to I don't think we should ever allow KVM to create a non-cachable mapping of an otherwise cachable object?? Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-23 12:02 ` Jason Gunthorpe @ 2025-04-23 12:26 ` Catalin Marinas 2025-04-23 13:03 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-04-23 12:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Apr 23, 2025 at 09:02:43AM -0300, Jason Gunthorpe wrote: > On Wed, Apr 23, 2025 at 11:45:04AM +0100, Catalin Marinas wrote: > > On Tue, Apr 22, 2025 at 08:35:56PM -0300, Jason Gunthorpe wrote: > > > On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote: > > > > Wait, so userspace simultaneously doesn't know the cacheability at host > > > > stage-1 but *does* for stage-2? > > > > > > No, it doesn't know either. The point is the VMM doesn't care about > > > any of this. It just wants to connect KVM to VFIO and have the kernel > > > internally negotiate the details of how that works. > > > > > > There is zero value in the VMM being aware that KVM/VFIO is using > > > cachable or non-cachable mappings because it will never touch this > > > memory anyhow, and arguably it would be happier if it wasn't even in a > > > VMA in the first place. 
> > > > I think this was discussed at some point in the past - what about > > ignoring the VMA altogether and using an fd-based approach? Keep the > > current VFIO+vma ABI non-cacheable and introduce a new one for cacheable > > I/O mappings, e.g. KVM_MEM_IOFD. Is there a way for VFIO to communicate > > the attributes to KVM on the type of mapping in the absence of a VMA > > (e.g. via the inode)? If not: > > I hope to see that someday. The CC people are working in that > direction, but we are still far from getting enough agreement from all > the stakeholders on how that will work. > > > Stage 2 could default to non-cacheable and we can add a > > KVM_MEMORY_ATTRIBUTE_IO_WB as an option to > > I don't think we should ever allow KVM to create a non-cachable > mapping of an otherwise cachable object?? Ah, good point. So we want a strict cacheable mapping even if there is no user/VMM mapping alias. Is it because KVM cannot do cache maintenance on the PFNMAP or because the device does not support other memory types? Both valid reasons though. In this case KVM still needs to know the properties of the device. Not sure how it could best do this without a vma. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-23 12:26 ` Catalin Marinas @ 2025-04-23 13:03 ` Jason Gunthorpe 2025-04-29 10:47 ` Ankit Agrawal 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-23 13:03 UTC (permalink / raw) To: Catalin Marinas Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, Apr 23, 2025 at 01:26:51PM +0100, Catalin Marinas wrote: > Ah, good point. So we want a strict cacheable mapping even if there is > no user/VMM mapping alias. Is it because KVM cannot do cache maintenance > on the PFNMAP or because the device does not support other memory types? > Both valid reasons though. Lack of cache maintenance is the main problem for the vfio-grace device > In this case KVM still needs to know the properties of the device. Not > sure how it could best do this without a vma. Well, the idea I hope to succeed with would annotate this kind of information inside the page list that would be exchanged through the FD. 
There is a monstrous thread here about this topic: https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/ I can't find it in the huge thread, but I did explain some thoughts on how it could work. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-23 13:03 ` Jason Gunthorpe @ 2025-04-29 10:47 ` Ankit Agrawal 2025-04-29 13:27 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Ankit Agrawal @ 2025-04-29 10:47 UTC (permalink / raw) To: Jason Gunthorpe, Catalin Marinas Cc: Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org >> In this case KVM still needs to know the properties of the device. Not >> sure how it could best do this without a vma. > > Well, the idea I hope to succeed with would annotate this kind of > information inside the page list that would be exchanged through the > FD. > > There is an monstrous thread here about this topic: > > > > https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/ > > I can't find it in the huge thread but I did explain some thoughts on > how it could work > > Jason Hi, Based on the recent discussions, I gather that having a KVM_CAP to expose the support for cacheable PFNMAP to the VMM would be useful from VM migration point of view. However it appears that the memslot flag isn't a must-have. The memslot flag cannot influence the KVM code anyways. 
For FWB, the PFNMAP would be cacheable and userspace should just assume S2FWB behavior; it would be a security bug otherwise, as Jason pointed out earlier (S1 cacheable, S2 non-cacheable). For !FWB, a cacheable PFNMAP could not be allowed and the VMM shouldn't attempt to create the memslot at all, by referring to the cap. Also, can we take the fd-based communication path between VFIO and KVM separately? I am planning to send out the series with the following implementation. Please let me know if there are any disagreements or concerns. 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region) and during fault handling (user_mem_abort()). 2. Enable support for cacheable PFN maps if S2FWB is enabled, by following the vma pgprot (this patch). 3. Add a new KVM cap to expose cacheable PFNMAP support (set to false for !FWB). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 10:47 ` Ankit Agrawal @ 2025-04-29 13:27 ` Catalin Marinas 2025-04-29 14:14 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-04-29 13:27 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 29, 2025 at 10:47:58AM +0000, Ankit Agrawal wrote: > >> In this case KVM still needs to know the properties of the device. Not > >> sure how it could best do this without a vma. > > > > Well, the idea I hope to succeed with would annotate this kind of > > information inside the page list that would be exchanged through the > > FD. > > > > There is an monstrous thread here about this topic: > > > > > > > > https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/ > > > > I can't find it in the huge thread but I did explain some thoughts on > > how it could work > > > > Jason > > Hi, > > Based on the recent discussions, I gather that having a KVM_CAP > to expose the support for cacheable PFNMAP to the VMM would be > useful from VM migration point of view. > > However it appears that the memslot flag isn't a must-have. 
The memslot > flag cannot influence the KVM code anyways. For FWB, the PFNMAP would > be cacheable and userspace should just assume S2FWB behavior; it would > be a security bug otherwise as Jason pointed earlier (S1 cacheable, > S2 noncacheable). For !FWB, a cacheable PFNMAP could not be allowed > and VMM shouldn't attempt to create memslot at all by referring the cap. > > Also, can we take the fd based communication path between VFIO > and KVM separately? > > I am planning to send out the series with the following implementation. > Please let me know if there are any disagreements or concerns. > > 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region) > and during fault handling (user_mem_abort()). I forgot the details here. I think it makes sense in general but as the first patch, we currently block cacheable PFNMAP anyway. Probably what you meant already but - this patch should block the PFNMAP slot if there's a cacheable S1 alias. > 2. Enable support for cacheable PFN maps if S2FWB is enabled by following > the vma pgprot (this patch). > 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false > for !FWB). I'll defer the memslot flag decision to the KVM maintainers. If we had one, it will enforce (2) or reject it as per (1) depending on the S1 attributes. Without the memslot flag, I assume at least the VMM will have to enable KVM_CAP_ARM_WB_PFNMAP (or whatever it will be called) to get the new behaviour. BTW, we should reject exec mappings as well (they probably fail for S1 VFIO since set_pte_at() will try to do cache maintenance). -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 13:27 ` Catalin Marinas @ 2025-04-29 14:14 ` Jason Gunthorpe 2025-04-29 16:03 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-29 14:14 UTC (permalink / raw) To: Catalin Marinas Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote: > BTW, we should reject exec mappings as well (they probably fail for S1 > VFIO since set_pte_at() will try to do cache maintenance). To be clear the S2 should leave the mapping as execute allowed though. Only the VM knows how it will use this memory and VM's do actually execute out of the cachable PFNMAP VMA today. The VM will set any execute deny/allow on its S1 table according to how it uses the memory. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 14:14 ` Jason Gunthorpe @ 2025-04-29 16:03 ` Catalin Marinas 2025-04-29 16:44 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-04-29 16:03 UTC (permalink / raw) To: Jason Gunthorpe Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote: > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote: > > BTW, we should reject exec mappings as well (they probably fail for S1 > > VFIO since set_pte_at() will try to do cache maintenance). > > To be clear the S2 should leave the mapping as execute allowed > though. Only the VM knows how it will use this memory and VM's do > actually execute out of the cachable PFNMAP VMA today. The VM will set > any execute deny/allow on its S1 table according to how it uses the > memory. If S2 is executable, wouldn't KVM try to invalidate the I-cache and it won't have an alias to do this? Unless it doesn't end up in stage2_map_walker_try_leaf() or the walk has been flagged as skipping the CMO. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 16:03 ` Catalin Marinas @ 2025-04-29 16:44 ` Jason Gunthorpe 2025-04-29 18:09 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-29 16:44 UTC (permalink / raw) To: Catalin Marinas Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 29, 2025 at 05:03:18PM +0100, Catalin Marinas wrote: > On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote: > > > BTW, we should reject exec mappings as well (they probably fail for S1 > > > VFIO since set_pte_at() will try to do cache maintenance). > > > > To be clear the S2 should leave the mapping as execute allowed > > though. Only the VM knows how it will use this memory and VM's do > > actually execute out of the cachable PFNMAP VMA today. The VM will set > > any execute deny/allow on its S1 table according to how it uses the > > memory. > > If S2 is executable, wouldn't KVM try to invalidate the I-cache and it > won't have an alias to do this? Unless it doesn't end up in > stage2_map_walker_try_leaf() or the walk has been flagged as skipping > the CMO. 
Okay, that does seem to have been overlooked a bit. The answer I got back is: Cachable PFNMAP is also relying on ARM64_HAS_CACHE_DIC: similar to how S2FWB allows KVM to avoid flushing the D cache, that CPU cap allows KVM to avoid flushing the icache and turns icache_inval_pou() into a NOP. Ankit, you should add that check as well. Thanks, Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
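[Editor's note: the CMO-elision argument above can be modeled as a small predicate. This is an illustrative sketch only, not the kernel code — the function name and parameters are invented; S2FWB, CTR_EL0.DIC and icache_inval_pou() are the real arm64 concepts being modeled.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the argument in the message above: with no kernel alias of the
 * GPU memory, KVM cannot perform cache maintenance at all. S2FWB lets KVM
 * skip the D-cache flush, and CTR_EL0.DIC (ARM64_HAS_CACHE_DIC) turns
 * icache_inval_pou() into a NOP, so a cacheable PFNMAP that the guest may
 * execute from needs both capabilities.
 */
static bool cacheable_pfnmap_safe(bool has_s2fwb, bool has_cache_dic,
                                  bool exec)
{
    if (!has_s2fwb)
        return false;          /* would need a D-cache flush via an alias */
    if (exec && !has_cache_dic)
        return false;          /* would need icache_inval_pou() via an alias */
    return true;
}
```

Under this model, a non-executable cacheable PFNMAP needs only S2FWB, while keeping the stage-2 executable (as Jason argues above) additionally requires DIC.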
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 16:44 ` Jason Gunthorpe @ 2025-04-29 18:09 ` Catalin Marinas 2025-04-29 18:19 ` Jason Gunthorpe 0 siblings, 1 reply; 61+ messages in thread From: Catalin Marinas @ 2025-04-29 18:09 UTC (permalink / raw) To: Jason Gunthorpe Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 29, 2025 at 01:44:30PM -0300, Jason Gunthorpe wrote: > On Tue, Apr 29, 2025 at 05:03:18PM +0100, Catalin Marinas wrote: > > On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote: > > > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote: > > > > BTW, we should reject exec mappings as well (they probably fail for S1 > > > > VFIO since set_pte_at() will try to do cache maintenance). > > > > > > To be clear the S2 should leave the mapping as execute allowed > > > though. Only the VM knows how it will use this memory and VM's do > > > actually execute out of the cachable PFNMAP VMA today. The VM will set > > > any execute deny/allow on its S1 table according to how it uses the > > > memory. > > > > If S2 is executable, wouldn't KVM try to invalidate the I-cache and it > > won't have an alias to do this? 
Unless it doesn't end up in > > stage2_map_walker_try_leaf() or the walk has been flagged as skipping > > the CMO. > > Okay, that does seem to have been overlooked a bit. The answer I got > back is: > > Cachable PFNMAP is also relying on ARM64_HAS_CACHE_DIC also, simlar to > how S2FWB allows KVM to avoid flushing the D cache, that CPU cap > allows KVM to avoid flushing the icache and turns icache_inval_pou() > into a NOP. Another CAP for executable PFNMAP then? I feel like this is a different use-case (e.g. more like general purpose CXL attached memory) than the GPU one. Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be restricting some CPUs. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 18:09 ` Catalin Marinas @ 2025-04-29 18:19 ` Jason Gunthorpe 2025-05-07 15:26 ` Ankit Agrawal 0 siblings, 1 reply; 61+ messages in thread From: Jason Gunthorpe @ 2025-04-29 18:19 UTC (permalink / raw) To: Catalin Marinas Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 29, 2025 at 07:09:42PM +0100, Catalin Marinas wrote: > On Tue, Apr 29, 2025 at 01:44:30PM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 29, 2025 at 05:03:18PM +0100, Catalin Marinas wrote: > > > On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote: > > > > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote: > > > > > BTW, we should reject exec mappings as well (they probably fail for S1 > > > > > VFIO since set_pte_at() will try to do cache maintenance). > > > > > > > > To be clear the S2 should leave the mapping as execute allowed > > > > though. Only the VM knows how it will use this memory and VM's do > > > > actually execute out of the cachable PFNMAP VMA today. The VM will set > > > > any execute deny/allow on its S1 table according to how it uses the > > > > memory. 
> > > > > > If S2 is executable, wouldn't KVM try to invalidate the I-cache and it > > > won't have an alias to do this? Unless it doesn't end up in > > > stage2_map_walker_try_leaf() or the walk has been flagged as skipping > > > the CMO. > > > > Okay, that does seem to have been overlooked a bit. The answer I got > > back is: > > > > Cachable PFNMAP is also relying on ARM64_HAS_CACHE_DIC also, simlar to > > how S2FWB allows KVM to avoid flushing the D cache, that CPU cap > > allows KVM to avoid flushing the icache and turns icache_inval_pou() > > into a NOP. > > Another CAP for executable PFNMAP then? IDK, either that or a more general cap 'support PFNMAP VMAs'? > I feel like this is a different > use-case (e.g. more like general purpose CXL attached memory) than the > GPU one. The GPUs we have today pretty much pretend to be CXL attached memory so they can and do execute from it. > Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be > restricting some CPUs. Yes, it will further narrow the CPUs down. However, we just did this discussion for BBML2 + SMMUv3 SVA. I think the same argument holds. If someone is crazy enough to build a CPU with CXLish support and uses an old core without DIC, IDC and S2FWB then they are going to have a bunch of work to fix the SW to support it. Right now we know of no system that exists like this.. Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-29 18:19 ` Jason Gunthorpe @ 2025-05-07 15:26 ` Ankit Agrawal 2025-05-09 12:47 ` Catalin Marinas 0 siblings, 1 reply; 61+ messages in thread From: Ankit Agrawal @ 2025-05-07 15:26 UTC (permalink / raw) To: Jason Gunthorpe, Catalin Marinas Cc: Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org >> Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be >> restricting some CPUs. > > Yes, it will further narrow the CPUs down. > > However, we just did this discussion for BBML2 + SMMUv3 SVA. I think > the same argument holds. If someone is crazy enough to build a CPU > with CXLish support and uses an old core without DIC, IDC and S2FWB > then they are going to have a bunch of work to fix the SW to support > it. Right now we know of no system that exists like this.. > > Jason Catalin, do you agree if I can go ahead and add the check for ARM64_HAS_CACHE_DIC? >> Another CAP for executable PFNMAP then? > > IDK, either that or a more general cap 'support PFNMAP VMAs'? I think it would be good to have the generic cap that is safe for executable as well. >> However it appears that the memslot flag isn't a must-have. The memslot >> flag cannot influence the KVM code anyways. 
For FWB, the PFNMAP would >> be cacheable and userspace should just assume S2FWB behavior; it would >> be a security bug otherwise as Jason pointed out earlier (S1 cacheable, >> S2 noncacheable). For !FWB, a cacheable PFNMAP could not be allowed >> and VMM shouldn't attempt to create memslot at all by referring to the cap. ... >> 2. Enable support for cacheable PFN maps if S2FWB is enabled by following >> the vma pgprot (this patch). >> 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false >> for !FWB). > > I'll defer the memslot flag decision to the KVM maintainers. If we had > one, it will enforce (2) or reject it as per (1) depending on the S1 > attributes. Marc, Oliver, let me know your thoughts on whether we should add the memslot flag and I'll implement accordingly. ^ permalink raw reply [flat|nested] 61+ messages in thread
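[Editor's note: the attribute decision sketched in the quoted plan (follow the VMA pgprot when S2FWB is available, fail rather than degrade otherwise) can be summarized as below. All names here are illustrative; this is a model of the discussion, not the proposed patch.]

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes when mapping a PFNMAP VMA at stage-2. */
enum s2_attr { S2_NORMAL_WB, S2_DEVICE, S2_REJECT };

/*
 * Model of the plan above: a non-cacheable VMA keeps today's device
 * mapping; a cacheable (MT_NORMAL pgprot) VMA may only be mapped
 * Normal-WB when S2FWB removes the need for CMOs, since the memory has
 * no kernel alias to flush. Without FWB the mapping must fail outright:
 * degrading S1-cacheable to S2-noncacheable is the security bug noted
 * in the thread.
 */
static enum s2_attr stage2_attr_for_pfnmap(bool vma_pgprot_is_normal,
                                           bool has_s2fwb)
{
    if (!vma_pgprot_is_normal)
        return S2_DEVICE;
    if (!has_s2fwb)
        return S2_REJECT;
    return S2_NORMAL_WB;
}
```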
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-05-07 15:26 ` Ankit Agrawal @ 2025-05-09 12:47 ` Catalin Marinas 0 siblings, 0 replies; 61+ messages in thread From: Catalin Marinas @ 2025-05-09 12:47 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, Oliver Upton, Sean Christopherson, Marc Zyngier, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Wed, May 07, 2025 at 03:26:05PM +0000, Ankit Agrawal wrote: > >> Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be > >> restricting some CPUs. > > > > Yes, it will further narrow the CPUs down. > > > > However, we just did this discussion for BBML2 + SMMUv3 SVA. I think > > the same argument holds. If someone is crazy enough to build a CPU > > with CXLish support and uses an old core without DIC, IDC and S2FWB > > then they are going to have a bunch of work to fix the SW to support > > it. Right now we know of no system that exists like this.. > > > > Jason > > Catalin, do you agree if I can go ahead and add the check for > ARM64_HAS_CACHE_DIC? As long as we don't leave out some hardware that has FWB but not DIC, that's fine by me. -- Catalin ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-04-22 7:49 ` Oliver Upton 2025-04-22 13:54 ` Jason Gunthorpe @ 2025-04-22 14:53 ` Sean Christopherson 1 sibling, 0 replies; 61+ messages in thread From: Sean Christopherson @ 2025-04-22 14:53 UTC (permalink / raw) To: Oliver Upton Cc: Ankit Agrawal, Jason Gunthorpe, Marc Zyngier, Catalin Marinas, joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com, sebastianene@google.com, coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org On Tue, Apr 22, 2025, Oliver Upton wrote: > Hi, > > On Wed, Apr 16, 2025 at 08:51:05AM +0000, Ankit Agrawal wrote: > > Hi, summarizing the discussion so far and outlining the next steps. The key points > > are as follows: > > 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP: > > If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin > > and Sean point out that this may not be required as userspace does not have > > much choice anyways. KVM has to follow the PTEs and userspace cannot ask > > for something different. However, Marc points out that enumerating FWB support > > would allow userspace to discover the support and prevent live-migration > > across FWB and non-FWB hosts.
Jason suggested that this may still be fine as > > we have already built in VFIO side protection where a live migration can be > > attempted and then fail because of late-detected HW incompatibilities. > > > > 2. New memslot flag that VMM passes at memslot registration: > > Discussion point that this is not necessary and KVM should just follow the > > VMA pgprot. > > > > 3. Fallback path handling for PFNMAP when the FWB is not set: > > Discussion points that there shouldn't be any fallback path and the memslot > > should just fail. i.e. KVM should not allow degrading cachable to non-cachable > > when it can't do flushing. This is to prevent the potential security issue > > pointed by Jason (S1 cacheable, S2 noncacheable). > > > > > > So AIU, the next step is to send out the updated series with the following patches: > > 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region) > > and during fault handling (user_mem_abort()). > > Yes, we need to prevent the creation of stage-2 mappings to PFNMAP memory > that uses cacheable attributes in the host stage-1. I believe we have alignment > that this is a bugfix. > > > 2. Enable support for cacheable PFN maps if S2FWB is enabled by following > > the vma pgprot (this patch). > > > > 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false > > for !FWB), pending maintainers' feedback on the necessity of this capability. > > Regarding UAPI: I'm still convinced that we need the VMM to buy in to this > behavior. And no, it doesn't matter if this is some VFIO-based mapping > or kernel-managed memory. > > The reality is that userspace is an equal participant in remaining coherent with > the guest. Whether or not FWB is employed for a particular region of IPA > space is useful information for userspace deciding what it needs to do to access guest > memory. 
Ignoring the Nvidia widget for a second, userspace also needs to know this for > 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for > example) doing live migration of the VM. > > So this KVM CAP needs to be paired with a memslot flag. > > - The capability says KVM is able to enforce Write-Back at stage-2 > > - The memslot flag says userspace expects a particular GFN range to guarantee > Write-Back semantics. This can be applied to 'normal', kernel-managed memory > and PFNMAP thingies that have cacheable attributes at host stage-1. I am very strongly opposed to adding a memslot flag. IMO, it sets a terrible precedent, and I am struggling to understand why a per-VM CAP isn't sufficient protection for the VMM. > - Under no situation do we allow userspace to create non-cacheable mapping at > stage-2 for something PFNMAP cacheable at stage-1. > > No matter what, my understanding is that we all agree the driver which provided the > host stage-1 mapping is the authoritative source for memory attributes compatible > with a given device. The accompanying UAPI is necessary for the VMM to understand how > to handle arbitrary cacheable mappings provided to the VM. > > Thanks, > Oliver ^ permalink raw reply [flat|nested] 61+ messages in thread
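[Editor's note: the UAPI Oliver sketches above — a per-VM capability paired with a memslot flag — can be modeled as follows. This is purely illustrative: the struct and function names are invented, no such memslot flag exists in the posted patch, and whether the flag is warranted at all is exactly what Sean disputes in this message.]

```c
#include <assert.h>
#include <stdbool.h>

/* Per-VM view: "KVM is able to enforce Write-Back at stage-2" (S2FWB). */
struct vm_model {
    bool cap_wb_enforced;
};

/* Per-memslot view: the VMM's opt-in plus the backing VMA's attributes. */
struct memslot_model {
    bool flag_require_wb;       /* VMM expects Write-Back for this GFN range */
    bool s1_cacheable_pfnmap;   /* backing VMA is a cacheable PFNMAP */
};

/* Returns 0 if the memslot may be created, -1 if it must be rejected. */
static int memslot_check(const struct vm_model *vm,
                         const struct memslot_model *slot)
{
    /* Never allow a non-cacheable stage-2 alias of S1-cacheable PFNMAP. */
    if (slot->s1_cacheable_pfnmap && !vm->cap_wb_enforced)
        return -1;
    /* The flag can only be honoured when the capability is present. */
    if (slot->flag_require_wb && !vm->cap_wb_enforced)
        return -1;
    return 0;
}
```

The point of contention in the thread is only the second check: Sean argues the per-VM capability alone should suffice, while Oliver wants the explicit per-memslot buy-in.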
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags 2025-03-10 11:54 ` Marc Zyngier 2025-03-11 3:42 ` Ankit Agrawal @ 2025-03-18 12:57 ` Jason Gunthorpe 1 sibling, 0 replies; 61+ messages in thread From: Jason Gunthorpe @ 2025-03-18 12:57 UTC (permalink / raw) To: Marc Zyngier Cc: ankita, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm, linux-kernel, linux-arm-kernel On Mon, Mar 10, 2025 at 11:54:52AM +0000, Marc Zyngier wrote: > Which brings me to the next point: FWB is not discoverable from > userspace. How do you expect a VMM to know what it can or cannot do? If you add a check at memslot creation time then simply fail the memslot. There is nothing the VMM can do except completely fail. There is no alternative path here where a cachable VMA can be safely made non-cachable for the guest? Jason ^ permalink raw reply [flat|nested] 61+ messages in thread
end of thread, other threads:[~2025-05-09 12:47 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz | follow: Atom feed)
2025-03-10 10:30 [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable ankita
2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-03-10 11:54 ` Marc Zyngier
2025-03-11 3:42 ` Ankit Agrawal
2025-03-11 11:18 ` Marc Zyngier
2025-03-11 12:07 ` Ankit Agrawal
2025-03-12 8:21 ` Marc Zyngier
2025-03-17 5:55 ` Ankit Agrawal
2025-03-17 9:27 ` Marc Zyngier
2025-03-17 19:54 ` Catalin Marinas
2025-03-18 9:39 ` Marc Zyngier
2025-03-18 12:55 ` Jason Gunthorpe
2025-03-18 19:27 ` Catalin Marinas
2025-03-18 19:35 ` David Hildenbrand
2025-03-18 19:40 ` Oliver Upton
2025-03-20 3:30 ` bibo mao
2025-03-20 7:24 ` bibo mao
2025-03-18 23:17 ` Jason Gunthorpe
2025-03-19 18:03 ` Catalin Marinas
2025-03-18 19:30 ` Oliver Upton
2025-03-18 23:09 ` Jason Gunthorpe
2025-03-19 7:01 ` Oliver Upton
2025-03-19 17:04 ` Jason Gunthorpe
2025-03-19 18:11 ` Catalin Marinas
2025-03-19 19:22 ` Jason Gunthorpe
2025-03-19 21:48 ` Catalin Marinas
2025-03-26 8:31 ` Ankit Agrawal
2025-03-26 14:53 ` Sean Christopherson
2025-03-26 15:42 ` Marc Zyngier
2025-03-26 16:10 ` Sean Christopherson
2025-03-26 18:02 ` Marc Zyngier
2025-03-26 18:24 ` Sean Christopherson
2025-03-26 18:51 ` Oliver Upton
2025-03-31 14:44 ` Jason Gunthorpe
2025-03-31 14:56 ` Jason Gunthorpe
2025-04-07 15:20 ` Sean Christopherson
2025-04-07 16:15 ` Jason Gunthorpe
2025-04-07 16:43 ` Sean Christopherson
2025-04-16 8:51 ` Ankit Agrawal
2025-04-21 16:03 ` Ankit Agrawal
2025-04-22 7:49 ` Oliver Upton
2025-04-22 13:54 ` Jason Gunthorpe
2025-04-22 16:50 ` Catalin Marinas
2025-04-22 17:03 ` Jason Gunthorpe
2025-04-22 21:28 ` Oliver Upton
2025-04-22 23:35 ` Jason Gunthorpe
2025-04-23 10:45 ` Catalin Marinas
2025-04-23 12:02 ` Jason Gunthorpe
2025-04-23 12:26 ` Catalin Marinas
2025-04-23 13:03 ` Jason Gunthorpe
2025-04-29 10:47 ` Ankit Agrawal
2025-04-29 13:27 ` Catalin Marinas
2025-04-29 14:14 ` Jason Gunthorpe
2025-04-29 16:03 ` Catalin Marinas
2025-04-29 16:44 ` Jason Gunthorpe
2025-04-29 18:09 ` Catalin Marinas
2025-04-29 18:19 ` Jason Gunthorpe
2025-05-07 15:26 ` Ankit Agrawal
2025-05-09 12:47 ` Catalin Marinas
2025-04-22 14:53 ` Sean Christopherson
2025-03-18 12:57 ` Jason Gunthorpe