* [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable
@ 2025-03-10 10:30 ankita
2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
0 siblings, 1 reply; 61+ messages in thread
From: ankita @ 2025-03-10 10:30 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson,
sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm,
gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm,
linux-kernel, linux-arm-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Grace-based platforms such as the Grace Hopper/Blackwell Superchips have
CPU-accessible, cache-coherent GPU memory. The GPU device memory is
essentially DDR memory and retains properties such as cacheability,
unaligned accesses, atomics and handling of executable faults. This
requires the device memory to be mapped as NORMAL in stage-2.
Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending
on whether the memory region is added to the kernel. The KVM code is
thus restrictive and prevents device memory that is not added to the
kernel from being marked as cacheable. This patch aims to solve that.
A cacheability check is made, if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is MT_NORMAL,
it is considered safe to map the memory as cacheable, since the KVM S2
will have the same Normal memory type as the VMA has in the S1 and KVM
has no additional responsibility for safety.
Note that when FWB is not enabled, the kernel expects to do cache
management trivially by flushing the memory after linearly converting a
kvm_pte to a phys_addr and then to a KVA. Cache management thus relies
on the memory being kernel mapped. Since the GPU device memory is not
kernel mapped, bail out when FWB is not supported.
The changes are heavily influenced by the discussions between
Catalin Marinas, David Hildenbrand and Jason Gunthorpe [1] on v2.
Many thanks for their valuable suggestions.
Applied over next-20250305 and tested on the Grace Hopper and
Grace Blackwell platforms by booting up a VM, loading the NVIDIA
module [2] and running nvidia-smi in the VM.
To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen.
(nicolinc@nvidia.com). NVIDIA has provided git repositories which
include all the requisite kernel [3] and Qemu [4] patches in case
one wants to try them.
v2 -> v3
1. Restricted the new changes to check for cacheability to VM_PFNMAP
based on David Hildenbrand's (david@redhat.com) suggestion.
2. Removed the MTE checks based on Jason Gunthorpe's (jgg@nvidia.com)
observation that it is already done earlier in
kvm_arch_prepare_memory_region().
3. Dropped the pfn_valid() checks based on suggestions by
Catalin Marinas (catalin.marinas@arm.com).
4. Removed the code for exec fault handling as it is not needed
anymore.
v1 -> v2
1. Removed kvm_is_device_pfn() as a determiner for device-type memory;
instead using pfn_valid().
2. Added handling for MTE.
3. Minor cleanup.
Link: https://lore.kernel.org/all/20241118131958.4609-1-ankita@nvidia.com/ [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [3]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [4]
Ankit Agrawal (1):
KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++
arch/arm64/kvm/hyp/pgtable.c | 2 +-
arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++++++-
3 files changed, 51 insertions(+), 2 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 61+ messages in thread
* [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-10 10:30 [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable ankita
@ 2025-03-10 10:30 ` ankita
2025-03-10 11:54 ` Marc Zyngier
0 siblings, 1 reply; 61+ messages in thread
From: ankita @ 2025-03-10 10:30 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson,
sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm,
gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm,
linux-kernel, linux-arm-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory() (which tracks whether the device memory
is added to the kernel) and ignores the per-VMA flags that indicate the
memory attributes. The KVM code is thus restrictive and allows only
memory that is added to the kernel to be marked as cacheable.
The device memory such as on the Grace Hopper/Blackwell systems
is interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.
Given that the GPU device memory is not added to the kernel (but is rather
VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
from being mapped Normal cacheable. This patch aims to solve that use case.
A cacheability check is made, if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is MT_NORMAL,
it is safe to map the memory as cacheable, since the KVM S2 will have the
same Normal memory type as the VMA has in the S1 and KVM has no additional
responsibility for safety. Checking pgprot as NORMAL is thus a KVM
sanity check.
Introduce a new variable cacheable_devmem to indicate a safely
cacheable mapping. Do not set the device variable when cacheable_devmem
is true. This essentially has the effect of setting the stage-2 mapping
as NORMAL through kvm_pgtable_stage2_map().
Add a check for COW VM_PFNMAP and refuse such mappings.
No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.
Note that when FWB is not enabled, the kernel expects to do cache
management trivially by flushing the memory after linearly converting a
kvm_pte to a phys_addr and then to a KVA, see kvm_flush_dcache_to_poc().
Cache management thus relies on the memory being kernel mapped. So do
not allow !FWB.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++
arch/arm64/kvm/hyp/pgtable.c | 2 +-
arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++++++-
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 6b9d274052c7..f21e2fae2bfe 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -507,6 +507,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
*/
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
+/**
+ * stage2_has_fwb() - Determine whether FWB is supported
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*()
+ *
+ * Return: True if FWB is supported.
+ */
+bool stage2_has_fwb(struct kvm_pgtable *pgt);
+
/**
* kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
* @vtcr: Content of the VTCR register.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index df5cc74a7dd0..ee6b98fefd61 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -637,7 +637,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
return vtcr;
}
-static bool stage2_has_fwb(struct kvm_pgtable *pgt)
+bool stage2_has_fwb(struct kvm_pgtable *pgt)
{
if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1f55b0c7b11d..4b3a03c9d700 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1454,6 +1454,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type(pgprot_t page_prot)
+{
+ return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1463,6 +1472,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
bool write_fault, writable, force_pte = false;
bool exec_fault, mte_allowed;
bool device = false, vfio_allow_any_uc = false;
+ bool cacheable_devmem = false;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
struct kvm *kvm = vcpu->kvm;
@@ -1600,6 +1610,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+ if (vma->vm_flags & VM_PFNMAP) {
+ /* Reject COW VM_PFNMAP */
+ if (is_cow_mapping(vma->vm_flags))
+ return -EINVAL;
+ /*
+ * If the VM_PFNMAP is set in VMA flags, do a KVM sanity
+ * check to see if pgprot mapping type is MT_NORMAL and a
+ * safely cacheable device memory.
+ */
+ if (mapping_type(vma->vm_page_prot) == MT_NORMAL)
+ cacheable_devmem = true;
+ }
+
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1633,8 +1656,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*
* In both cases, we don't let transparent_hugepage_adjust()
* change things at the last minute.
+ *
+ * Do not set device as the device memory is cacheable. Note
+ * that such mapping is safe as the KVM S2 will have the same
+ * Normal memory type as the VMA has in the S1.
*/
- device = true;
+ if (!cacheable_devmem)
+ device = true;
} else if (logging_active && !write_fault) {
/*
* Only actually map the page as writable if this was a write
@@ -1716,6 +1744,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
prot |= KVM_PGTABLE_PROT_X;
}
+ /*
+ * When FWB is unsupported KVM needs to do cache flushes
+ * (via dcache_clean_inval_poc()) of the underlying memory. This is
+ * only possible if the memory is already mapped into the kernel map.
+ *
+ * Outright reject as the cacheable device memory is not present in
+ * the kernel map and not suitable for cache management.
+ */
+ if (cacheable_devmem && !stage2_has_fwb(pgt)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
/*
* Under the premise of getting a FSC_PERM fault, we just need to relax
* permissions only if vma_pagesize equals fault_granule. Otherwise,
--
2.34.1
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
@ 2025-03-10 11:54 ` Marc Zyngier
2025-03-11 3:42 ` Ankit Agrawal
2025-03-18 12:57 ` Jason Gunthorpe
0 siblings, 2 replies; 61+ messages in thread
From: Marc Zyngier @ 2025-03-10 11:54 UTC (permalink / raw)
To: ankita
Cc: jgg, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson,
sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm,
gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm,
linux-kernel, linux-arm-kernel
On Mon, 10 Mar 2025 10:30:08 +0000,
<ankita@nvidia.com> wrote:
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
> based on pfn_is_map_memory() (which tracks whether the device memory
> is added to the kernel) and ignores the per-VMA flags that indicate the
> memory attributes. The KVM code is thus restrictive and allows only
> memory that is added to the kernel to be marked as cacheable.
>
> The device memory such as on the Grace Hopper/Blackwell systems
> is interchangeable with DDR memory and retains properties such as
> cacheability, unaligned accesses, atomics and handling of executable
> faults. This requires the device memory to be mapped as NORMAL in
> stage-2.
>
> Given that the GPU device memory is not added to the kernel (but is rather
> VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
> VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
> from being mapped Normal cacheable. This patch aims to solve that use case.
>
> A cacheability check is made, if VM_PFNMAP is set in the VMA flags, by
> consulting the VMA pgprot value. If the pgprot mapping type is MT_NORMAL,
> it is safe to map the memory as cacheable, since the KVM S2 will have the
> same Normal memory type as the VMA has in the S1 and KVM has no additional
> responsibility for safety. Checking pgprot as NORMAL is thus a KVM
> sanity check.
>
> Introduce a new variable cacheable_devmem to indicate a safely
> cacheable mapping. Do not set the device variable when cacheable_devmem
> is true. This essentially has the effect of setting the stage-2 mapping
> as NORMAL through kvm_pgtable_stage2_map().
>
> Add a check for COW VM_PFNMAP and refuse such mappings.
>
> No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
> already tests it at an early stage during memslot creation. There would
> not even be a fault if the memslot is not created.
>
> Note that when FWB is not enabled, the kernel expects to do cache
> management trivially by flushing the memory after linearly converting a
> kvm_pte to a phys_addr and then to a KVA, see kvm_flush_dcache_to_poc().
> Cache management thus relies on the memory being kernel mapped. So do
> not allow !FWB.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++
> arch/arm64/kvm/hyp/pgtable.c | 2 +-
> arch/arm64/kvm/mmu.c | 43 +++++++++++++++++++++++++++-
> 3 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 6b9d274052c7..f21e2fae2bfe 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -507,6 +507,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
> */
> u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
>
> +/**
> + * stage2_has_fwb() - Determine whether FWB is supported
> + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*()
> + *
> + * Return: True if FWB is supported.
> + */
> +bool stage2_has_fwb(struct kvm_pgtable *pgt);
> +
> /**
> * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
> * @vtcr: Content of the VTCR register.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index df5cc74a7dd0..ee6b98fefd61 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -637,7 +637,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> return vtcr;
> }
>
> -static bool stage2_has_fwb(struct kvm_pgtable *pgt)
> +bool stage2_has_fwb(struct kvm_pgtable *pgt)
> {
> if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1f55b0c7b11d..4b3a03c9d700 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1454,6 +1454,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> return vma->vm_flags & VM_MTE_ALLOWED;
> }
>
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> + return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_s2_trans *nested,
> struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1463,6 +1472,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> bool write_fault, writable, force_pte = false;
> bool exec_fault, mte_allowed;
> bool device = false, vfio_allow_any_uc = false;
> + bool cacheable_devmem = false;
> unsigned long mmu_seq;
> phys_addr_t ipa = fault_ipa;
> struct kvm *kvm = vcpu->kvm;
> @@ -1600,6 +1610,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>
> vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>
> + if (vma->vm_flags & VM_PFNMAP) {
> + /* Reject COW VM_PFNMAP */
> + if (is_cow_mapping(vma->vm_flags))
> + return -EINVAL;
> + /*
> + * If the VM_PFNMAP is set in VMA flags, do a KVM sanity
> + * check to see if pgprot mapping type is MT_NORMAL and a
> + * safely cacheable device memory.
> + */
> + if (mapping_type(vma->vm_page_prot) == MT_NORMAL)
> + cacheable_devmem = true;
> + }
> +
> /* Don't use the VMA after the unlock -- it may have vanished */
> vma = NULL;
>
> @@ -1633,8 +1656,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> *
> * In both cases, we don't let transparent_hugepage_adjust()
> * change things at the last minute.
> + *
> + * Do not set device as the device memory is cacheable. Note
> + * that such mapping is safe as the KVM S2 will have the same
> + * Normal memory type as the VMA has in the S1.
> */
> - device = true;
> + if (!cacheable_devmem)
> + device = true;
> } else if (logging_active && !write_fault) {
> /*
> * Only actually map the page as writable if this was a write
> @@ -1716,6 +1744,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> prot |= KVM_PGTABLE_PROT_X;
> }
>
> + /*
> + * When FWB is unsupported KVM needs to do cache flushes
> + * (via dcache_clean_inval_poc()) of the underlying memory. This is
> + * only possible if the memory is already mapped into the kernel map.
> + *
> + * Outright reject as the cacheable device memory is not present in
> + * the kernel map and not suitable for cache management.
> + */
> + if (cacheable_devmem && !stage2_has_fwb(pgt)) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }
> +
These new error reasons should at least be complemented by an
equivalent check at the point where the memslot is registered. It
may be OK to blindly return an error at fault time (because userspace
has messed with the mapping behind our back), but there should at
least be something telling a well-behaved userspace that there is a
bunch of combinations we're unwilling to support.
Which brings me to the next point: FWB is not discoverable from
userspace. How do you expect a VMM to know what it can or cannot do?
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-10 11:54 ` Marc Zyngier
@ 2025-03-11 3:42 ` Ankit Agrawal
2025-03-11 11:18 ` Marc Zyngier
2025-03-18 12:57 ` Jason Gunthorpe
1 sibling, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-03-11 3:42 UTC (permalink / raw)
To: Marc Zyngier
Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
>> + /*
>> + * When FWB is unsupported KVM needs to do cache flushes
>> + * (via dcache_clean_inval_poc()) of the underlying memory. This is
>> + * only possible if the memory is already mapped into the kernel map.
>> + *
>> + * Outright reject as the cacheable device memory is not present in
>> + * the kernel map and not suitable for cache management.
>> + */
>> + if (cacheable_devmem && !stage2_has_fwb(pgt)) {
>> + ret = -EINVAL;
>> + goto out_unlock;
>> + }
>> +
>
> These new error reasons should at least be complemented by an
> equivalent check at the point where the memslot is registered. It
Understood. I can add such a check in kvm_arch_prepare_memory_region().
> maybe OK to blindly return an error at fault time (because userspace
> has messed with the mapping behind our back), but there should at
> least be something telling a well behaved userspace that there is a
> bunch of combination we're unwilling to support.
How about WARN_ON() or BUG() for the faulty situation?
> Which brings me to the next point: FWB is not discoverable from
> userspace. How do you expect a VMM to know what it can or cannot do?
Good point. I am not sure if it can. I suppose you are concerned about an
error during fault handling when !FWB, without the VMM having any clear
indication of the cause? Perhaps we can gracefully fall back to the default
device mapping in such a case? But that would cause the VM to crash as soon
as it makes an access violating DEVICE_nGnRE.
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-11 3:42 ` Ankit Agrawal
@ 2025-03-11 11:18 ` Marc Zyngier
2025-03-11 12:07 ` Ankit Agrawal
0 siblings, 1 reply; 61+ messages in thread
From: Marc Zyngier @ 2025-03-11 11:18 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, 11 Mar 2025 03:42:23 +0000,
Ankit Agrawal <ankita@nvidia.com> wrote:
>
> >> + /*
> >> + * When FWB is unsupported KVM needs to do cache flushes
> >> + * (via dcache_clean_inval_poc()) of the underlying memory. This is
> >> + * only possible if the memory is already mapped into the kernel map.
> >> + *
> >> + * Outright reject as the cacheable device memory is not present in
> >> + * the kernel map and not suitable for cache management.
> >> + */
> >> + if (cacheable_devmem && !stage2_has_fwb(pgt)) {
> >> + ret = -EINVAL;
> >> + goto out_unlock;
> >> + }
> >> +
> >
> > These new error reasons should at least be complemented by an
> > equivalent check at the point where the memslot is registered. It
>
> Understood. I can add such check in kvm_arch_prepare_memory_region().
>
>
> > maybe OK to blindly return an error at fault time (because userspace
> > has messed with the mapping behind our back), but there should at
> > least be something telling a well behaved userspace that there is a
> > bunch of combination we're unwilling to support.
>
> How about WARN_ON() or BUG() for the faulty situation?
Absolutely not. Do you really want any user to randomly crash the
kernel because they flip a mapping, which they can do anytime they
want?
The way KVM works is that we return to userspace for the VMM to fix
things. Either by emulating something we can't do in the kernel, or by
fixing things so that the kernel can replay the fault and sort it out.
Either way, this requires some form of fault syndrome so that userspace
has a chance of understanding WTF is going on.
> > Which brings me to the next point: FWB is not discoverable from
> > userspace. How do you expect a VMM to know what it can or cannot do?
>
> Good point. I am not sure if it can. I suppose you are concerned about error
> during fault handling when !FWB without VMM having any clear indications
> of the cause?
No, I'm concerned that a well established API (populating a memslot)
works in some case and doesn't work in another without a clear
indication of *why* we have this behaviour.
To me, this indicates that userspace needs to buy in this new
behaviour, and that behaviour needs to be advertised by a capability,
which is in turn conditional on FWB.
> Perhaps we can gracefully fall back to the default device mapping
> in such case? But that would cause VM to crash as soon as it makes some
> access violating DEVICE_nGnRE.
Which would now be a regression...
My take is that this cacheable PFNMAP contraption must only be exposed
to a guest if FWB is available. We can't prevent someone from doing an
mmap() behind our back, but we can at least:
- tell userspace whether this is supported
- only handle the fault if userspace has bought in this mode
- report the fault to userspace for it to fix things otherwise
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-11 11:18 ` Marc Zyngier
@ 2025-03-11 12:07 ` Ankit Agrawal
2025-03-12 8:21 ` Marc Zyngier
0 siblings, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-03-11 12:07 UTC (permalink / raw)
To: Marc Zyngier
Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
Thanks Marc for the feedback.
> No, I'm concerned that a well established API (populating a memslot)
> works in some case and doesn't work in another without a clear
> indication of *why* we have this behaviour.
>
> To me, this indicates that userspace needs to buy in this new
> behaviour, and that behaviour needs to be advertised by a capability,
> which is in turn conditional on FWB.
Yes, that makes sense.
>>> Perhaps we can gracefully fall back to the default device mapping
>>> in such case? But that would cause VM to crash as soon as it makes some
>>> access violating DEVICE_nGnRE.
>
> Which would now be a regression...
>
> My take is that this cacheable PNFMAP contraption must only be exposed
> to a guest if FWB is available. We can't prevent someone to do an
> mmap() behind our back, but we can at least:
>
> - tell userspace whether this is supported
For my education, what is an accepted way to communicate this? Please let
me know if there are any relevant examples that you may be aware of.
I suppose just checking for FWB (for PFNMAP) and returning some sort of
an error on userspace mmap will not be enough of a hint here?
> - only handle the fault if userspace has bought in this mode
> - report the fault to userspace for it to fix things otherwise
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-11 12:07 ` Ankit Agrawal
@ 2025-03-12 8:21 ` Marc Zyngier
2025-03-17 5:55 ` Ankit Agrawal
0 siblings, 1 reply; 61+ messages in thread
From: Marc Zyngier @ 2025-03-12 8:21 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, 11 Mar 2025 12:07:20 +0000,
Ankit Agrawal <ankita@nvidia.com> wrote:
>
> Thanks Marc for the feedback.
>
> > No, I'm concerned that a well established API (populating a memslot)
> > works in some case and doesn't work in another without a clear
> > indication of *why* we have this behaviour.
> >
> > To me, this indicates that userspace needs to buy in this new
> > behaviour, and that behaviour needs to be advertised by a capability,
> > which is in turn conditional on FWB.
>
> Yes, that makes sense.
>
> >>> Perhaps we can gracefully fall back to the default device mapping
> >>> in such case? But that would cause VM to crash as soon as it makes some
> >>> access violating DEVICE_nGnRE.
> >
> > Which would now be a regression...
> >
> > My take is that this cacheable PFNMAP contraption must only be exposed
> > to a guest if FWB is available. We can't prevent someone from doing an
> > mmap() behind our back, but we can at least:
> >
> > - tell userspace whether this is supported
>
> For my education, what is an accepted way to communicate this? Please let
> me know if there are any relevant examples that you may be aware of.
A KVM capability is what is usually needed.
>
> I suppose just checking for FWB (for PFNMAP) and returning some sort of
> an error on userspace mmap will not be enough of a hint here?
I don't think checking for FWB at mmap() time is correct. mmap()
shouldn't care about FWB at all, because stage-2 is irrelevant to
mmap(). You also want to be able to perform the same mmap() inside an
EL1 guest, which by definition cannot consider FWB.
This must be checked at the point of memslot creation, and return an
error at that point. Memslots are all about stage-2, so it makes sense
to check it there.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-12 8:21 ` Marc Zyngier
@ 2025-03-17 5:55 ` Ankit Agrawal
2025-03-17 9:27 ` Marc Zyngier
0 siblings, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-03-17 5:55 UTC (permalink / raw)
To: Marc Zyngier
Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
>> For my education, what is an accepted way to communicate this? Please let
>> me know if there are any relevant examples that you may be aware of.
>
> A KVM capability is what is usually needed.
I see. IIUC, this would involve a corresponding Qemu (usermode) change
to fetch the new KVM cap. Then it could fail in case FWB is not
supported, with some additional conditions (so that the currently
supported configs with !FWB won't break in usermode).
The proposed code change is to map in S2 as NORMAL when the vma flags
have VM_PFNMAP. However, Qemu cannot know whether the driver is mapping
with PFNMAP. So how can Qemu decide whether it is okay to fail
for !FWB?
> This must be checked at the point of memslot creation, and return an
> error at that point. Memslots are all about stage-2, so it makes sense
> to check it there.
Ack, will add the check.
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-17 5:55 ` Ankit Agrawal
@ 2025-03-17 9:27 ` Marc Zyngier
2025-03-17 19:54 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Marc Zyngier @ 2025-03-17 9:27 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Mon, 17 Mar 2025 05:55:55 +0000,
Ankit Agrawal <ankita@nvidia.com> wrote:
>
> >> For my education, what is an accepted way to communicate this? Please let
> >> me know if there are any relevant examples that you may be aware of.
> >
> > A KVM capability is what is usually needed.
>
> I see. If IIUC, this would involve a corresponding Qemu (usermode) change
> to fetch the new KVM cap. Then it could fail in case the FWB is not
> supported with some additional conditions (so that the currently supported
> configs with !FWB won't break on usermode).
>
> The proposed code change is to map in S2 as NORMAL when vma flags
> has VM_PFNMAP. However, Qemu cannot know that driver is mapping
> with PFNMAP or not. So how may Qemu decide whether it is okay to
> fail for !FWB or not?
This is not about FWB as far as userspace is concerned. This is about
PFNMAP as non-device memory. If the host doesn't have FWB, then the
"PFNMAP as non-device memory" capability doesn't exist, and userspace
fails early.
Userspace must also have some knowledge of what device it obtains the
mapping from, and whether that device requires some extra host
capability to be assigned to the guest.
You can then check whether the VMA associated with the memslot is
PFNMAP or not, if the memslot has been enabled for PFNMAP mappings
(either globally or on a per-memslot basis, I don't really care).
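The capability/flag interaction Marc lays out can be summarised in a small decision function. This is a sketch under the thread's assumptions; the names (`cap_cacheable_pfnmap`, `map_pfnmap`, the `s2_memtype` values) are invented for illustration:

```c
#include <stdbool.h>

enum s2_memtype { S2_REJECT, S2_NORMAL_NC, S2_NORMAL_CACHEABLE };

/* Hypothetical: the capability only exists when the host has FWB. */
static bool cap_cacheable_pfnmap(bool host_has_fwb)
{
	return host_has_fwb;
}

/*
 * Behaviour table from the thread: a memslot flag requests cacheable
 * PFNMAP mappings; using the flag without the capability fails, and
 * without the flag the mapping stays Normal-NC, as it is today.
 */
enum s2_memtype map_pfnmap(bool slot_wants_cacheable, bool host_has_fwb)
{
	if (slot_wants_cacheable && !cap_cacheable_pfnmap(host_has_fwb))
		return S2_REJECT;		/* flag used without the cap */
	if (slot_wants_cacheable)
		return S2_NORMAL_CACHEABLE;
	return S2_NORMAL_NC;			/* unchanged default */
}
```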
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-17 9:27 ` Marc Zyngier
@ 2025-03-17 19:54 ` Catalin Marinas
2025-03-18 9:39 ` Marc Zyngier
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-03-17 19:54 UTC (permalink / raw)
To: Marc Zyngier
Cc: Ankit Agrawal, Jason Gunthorpe, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Mon, Mar 17, 2025 at 09:27:52AM +0000, Marc Zyngier wrote:
> On Mon, 17 Mar 2025 05:55:55 +0000,
> Ankit Agrawal <ankita@nvidia.com> wrote:
> >
> > >> For my education, what is an accepted way to communicate this? Please let
> > >> me know if there are any relevant examples that you may be aware of.
> > >
> > > A KVM capability is what is usually needed.
> >
> > I see. If IIUC, this would involve a corresponding Qemu (usermode) change
> > to fetch the new KVM cap. Then it could fail in case the FWB is not
> > supported with some additional conditions (so that the currently supported
> > configs with !FWB won't break on usermode).
> >
> > The proposed code change is to map in S2 as NORMAL when vma flags
> > has VM_PFNMAP. However, Qemu cannot know that driver is mapping
> > with PFNMAP or not. So how may Qemu decide whether it is okay to
> > fail for !FWB or not?
>
> This is not about FWB as far as userspace is concerned. This is about
> PFNMAP as non-device memory. If the host doesn't have FWB, then the
> "PFNMAP as non-device memory" capability doesn't exist, and userspace
> fails early.
>
> Userspace must also have some knowledge of what device it obtains the
> mapping from, and whether that device requires some extra host
> capability to be assigned to the guest.
>
> You can then check whether the VMA associated with the memslot is
> PFNMAP or not, if the memslot has been enabled for PFNMAP mappings
> (either globally or on a per-memslot basis, I don't really care).
Trying to page this back in, I think there are three stages:
1. A KVM cap that the VMM can use to check for non-device PFNMAP (or
rather cacheable PFNMAP since we already support Normal NC).
2. Memslot registration - we need a way for the VMM to require such
cacheable PFNMAP and for KVM to check. Current patch relies on (a)
the stage 1 vma attributes which I'm not a fan of. An alternative I
suggested was (b) a VM_FORCE_CACHEABLE vma flag, on the assumption
that the vfio driver knows if it supports cacheable (it's a bit of a
stretch trying to make this generic). Yet another option is (c) a
KVM_MEM_CACHEABLE flag that the VMM passes at memslot registration.
3. user_mem_abort() - follows the above logic (whatever we decide),
maybe with some extra check and WARN in case we got the logic wrong.
The problems in (2) are that we need to know that the device supports
cacheable mappings and we don't introduce additional issues or end up
with FWB on a PFNMAP that does not support cacheable. Without any vma
flag like the current VM_ALLOW_ANY_UNCACHED, the next best thing is
relying on the stage 1 attributes. But we don't know them at the memslot
registration, only later in step (3) after a GUP on the VMM address
space.
So in (2), when !FWB, we only want to reject VM_PFNMAP slots if we know
they are going to be mapped as cacheable. So we need this information
somehow, either from the vma->vm_flags or slot->flags.
--
Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-17 19:54 ` Catalin Marinas
@ 2025-03-18 9:39 ` Marc Zyngier
2025-03-18 12:55 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Marc Zyngier @ 2025-03-18 9:39 UTC (permalink / raw)
To: Catalin Marinas
Cc: Ankit Agrawal, Jason Gunthorpe, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Mon, 17 Mar 2025 19:54:25 +0000,
Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Mon, Mar 17, 2025 at 09:27:52AM +0000, Marc Zyngier wrote:
> > On Mon, 17 Mar 2025 05:55:55 +0000,
> > Ankit Agrawal <ankita@nvidia.com> wrote:
> > >
> > > >> For my education, what is an accepted way to communicate this? Please let
> > > >> me know if there are any relevant examples that you may be aware of.
> > > >
> > > > A KVM capability is what is usually needed.
> > >
> > > I see. If IIUC, this would involve a corresponding Qemu (usermode) change
> > > to fetch the new KVM cap. Then it could fail in case the FWB is not
> > > supported with some additional conditions (so that the currently supported
> > > configs with !FWB won't break on usermode).
> > >
> > > The proposed code change is to map in S2 as NORMAL when vma flags
> > > has VM_PFNMAP. However, Qemu cannot know that driver is mapping
> > > with PFNMAP or not. So how may Qemu decide whether it is okay to
> > > fail for !FWB or not?
> >
> > This is not about FWB as far as userspace is concerned. This is about
> > PFNMAP as non-device memory. If the host doesn't have FWB, then the
> > "PFNMAP as non-device memory" capability doesn't exist, and userspace
> > fails early.
> >
> > Userspace must also have some knowledge of what device it obtains the
> > mapping from, and whether that device requires some extra host
> > capability to be assigned to the guest.
> >
> > You can then check whether the VMA associated with the memslot is
> > PFNMAP or not, if the memslot has been enabled for PFNMAP mappings
> > (either globally or on a per-memslot basis, I don't really care).
>
> Trying to page this back in, I think there are three stages:
>
> 1. A KVM cap that the VMM can use to check for non-device PFNMAP (or
> rather cacheable PFNMAP since we already support Normal NC).
>
> 2. Memslot registration - we need a way for the VMM to require such
> cacheable PFNMAP and for KVM to check. Current patch relies on (a)
> the stage 1 vma attributes which I'm not a fan of. An alternative I
> suggested was (b) a VM_FORCE_CACHEABLE vma flag, on the assumption
> that the vfio driver knows if it supports cacheable (it's a bit of a
> stretch trying to make this generic). Yet another option is (c) a
> KVM_MEM_CACHEABLE flag that the VMM passes at memslot registration.
>
> 3. user_mem_abort() - follows the above logic (whatever we decide),
> maybe with some extra check and WARN in case we got the logic wrong.
>
> The problems in (2) are that we need to know that the device supports
> cacheable mappings and we don't introduce additional issues or end up
> with FWB on a PFNMAP that does not support cacheable. Without any vma
> flag like the current VM_ALLOW_ANY_UNCACHED, the next best thing is
> relying on the stage 1 attributes. But we don't know them at the memslot
> registration, only later in step (3) after a GUP on the VMM address
> space.
>
> So in (2), when !FWB, we only want to reject VM_PFNMAP slots if we know
> they are going to be mapped as cacheable. So we need this information
> somehow, either from the vma->vm_flags or slot->flags.
Yup, that's mostly how I think of it.
Obtaining a mapping from the xPU driver must result in VM_PFNMAP being
set in the VMA. I don't think that's particularly controversial.
The memslot must also be created with a new flag ((2c) in the taxonomy
above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
flag is only allowed if (1) is valid.
This results in the following behaviours:
- If the VMM creates the memslot with the cacheable attribute without
(1) being advertised, we fail.
- If the VMM creates the memslot without the cacheable attribute, we
map as NC, as it is today.
What this doesn't do is *automatically* decide for the VMM what
attributes to use. The VMM must know what it is doing, and only
provide the memslot flag when appropriate. Doing otherwise may eat
your data and/or take the machine down (cacheable mapping on a device
can be great fun). If you want to address this, then "someone" needs
to pass some additional VMA flag that KVM can check.
Of course, all of this only caters for well behaved userspace, and we
need to gracefully handle (3) when the VMM sneaks in a new VMA that
has conflicting attributes.
For that, we need a reasonable fault reporting interface that allows
userspace to correctly handle it. I don't think this is unique to this
case, but also covers things like MTE and other funky stuff that
relies on the backing memory having some particular "attributes".
An alternative could be to require the VMA to be sealed, which would
prevent any overlapping mapping. But I only have looked at that for 2
minutes...
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 9:39 ` Marc Zyngier
@ 2025-03-18 12:55 ` Jason Gunthorpe
2025-03-18 19:27 ` Catalin Marinas
2025-03-18 19:30 ` Oliver Upton
0 siblings, 2 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 12:55 UTC (permalink / raw)
To: Marc Zyngier
Cc: Catalin Marinas, Ankit Agrawal, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
> The memslot must also be created with a new flag ((2c) in the taxonomy
> above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
> flag is only allowed if (1) is valid.
>
> This results in the following behaviours:
>
> - If the VMM creates the memslot with the cacheable attribute without
> (1) being advertised, we fail.
>
> - If the VMM creates the memslot without the cacheable attribute, we
> map as NC, as it is today.
Is that OK though?
Now we have the MM page tables mapping this memory as cacheable, but
KVM and the guest are accessing it as non-cached.
I thought ARM tried hard to avoid creating such mismatches? This is
why the pgprot flags were used to drive this, not an opt-in flag. To
prevent userspace from forcing a mismatch.
> What this doesn't do is *automatically* decide for the VMM what
> attributes to use. The VMM must know what it is doing, and only
> provide the memslot flag when appropriate. Doing otherwise may eat
> your data and/or take the machine down (cacheable mapping on a device
> can be great fun).
Again, this is why we followed the VMA flags. The thing creating the
VMA already made this safety determination when it set pgprot
cacheable. We should not allow KVM to randomly make any PGPROT
cacheable!
> If you want to address this, then "someone" needs
> to pass some additional VMA flag that KVM can check.
pgprot does this already; we don't need a new flag to determine if the
VMA is cacheably mapped.
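Jason's model, where the VMA's pgprot already carries the driver's safety decision and stage-2 simply follows it, can be sketched like this. The enum values are hypothetical stand-ins for the arm64 MT_* pgprot memory types, not real kernel identifiers:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the stage-1 pgprot memory type. */
enum vma_memtype { VMA_DEVICE_nGnRE, VMA_NORMAL_NC, VMA_NORMAL };
enum stage2_type { STAGE2_DEVICE, STAGE2_NORMAL_NC, STAGE2_NORMAL };

/*
 * The driver encoded its safety determination in vm_page_prot when it
 * created the mapping; stage-2 derives its attribute from that instead
 * of trusting a userspace-supplied flag, so userspace can never force
 * an attribute the driver did not already allow.
 */
enum stage2_type s2_from_vma_pgprot(enum vma_memtype s1)
{
	switch (s1) {
	case VMA_NORMAL:	return STAGE2_NORMAL;	/* cacheable allowed */
	case VMA_NORMAL_NC:	return STAGE2_NORMAL_NC;
	default:		return STAGE2_DEVICE;
	}
}
```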
Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-10 11:54 ` Marc Zyngier
2025-03-11 3:42 ` Ankit Agrawal
@ 2025-03-18 12:57 ` Jason Gunthorpe
1 sibling, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 12:57 UTC (permalink / raw)
To: Marc Zyngier
Cc: ankita, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, zhiw, mochs, udhoke, dnigam, alex.williamson,
sebastianene, coltonlewis, kevin.tian, yi.l.liu, ardb, akpm,
gshan, linux-mm, ddutile, tabba, qperret, seanjc, kvmarm,
linux-kernel, linux-arm-kernel
On Mon, Mar 10, 2025 at 11:54:52AM +0000, Marc Zyngier wrote:
> Which brings me to the next point: FWB is not discoverable from
> userspace. How do you expect a VMM to know what it can or cannot do?
If you add a check at memslot creation time then simply fail the
memslot.
There is nothing the VMM can do except completely fail. There is no
alternative path here where a cachable VMA can be safely made
non-cachable for the guest?
Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 12:55 ` Jason Gunthorpe
@ 2025-03-18 19:27 ` Catalin Marinas
2025-03-18 19:35 ` David Hildenbrand
2025-03-18 23:17 ` Jason Gunthorpe
2025-03-18 19:30 ` Oliver Upton
1 sibling, 2 replies; 61+ messages in thread
From: Catalin Marinas @ 2025-03-18 19:27 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
> > The memslot must also be created with a new flag ((2c) in the taxonomy
> > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
> > flag is only allowed if (1) is valid.
> >
> > This results in the following behaviours:
> >
> > - If the VMM creates the memslot with the cacheable attribute without
> > (1) being advertised, we fail.
> >
> > - If the VMM creates the memslot without the cacheable attribute, we
> > map as NC, as it is today.
>
> Is that OK though?
>
> Now we have the MM page tables mapping this memory as cachable but KVM
> and the guest is accessing it as non-cached.
I don't think we should allow this.
> I thought ARM tried hard to avoid creating such mismatches? This is
> why the pgprot flags were used to drive this, not an opt-in flag. To
> prevent userspace from forcing a mismatch.
We have the vma->vm_page_prot when the memslot is added, so we could use
this instead of additional KVM flags. If it's Normal Cacheable and the
platform does not support FWB, reject it. If the prot bits say
cacheable, it means that the driver was OK with such a mapping. Some
extra checks for !MTE or MTE_PERM.
As additional safety, we could check this again in user_mem_abort() in
case the driver played with the vm_page_prot field in the meantime (e.g.
in the .fault() callback).
I'm not particularly keen on using the vm_page_prot but we probably need
to do this anyway to avoid aliases as we can't fully trust the VMM. The
alternative is a VM_* flag that says "cacheable everywhere" and we avoid
the low-level attributes checking.
> > What this doesn't do is *automatically* decide for the VMM what
> > attributes to use. The VMM must know what it is doing, and only
> > provide the memslot flag when appropriate. Doing otherwise may eat
> > your data and/or take the machine down (cacheable mapping on a device
> > can be great fun).
>
> Again, this is why we followed the VMA flags. The thing creating the
> VMA already made this safety determination when it set pgprot
> cachable. We should not allow KVM to randomly make any PGPROT
> cachable!
Can this be moved to kvm_arch_prepare_memory_region() and maybe an
additional check in user_mem_abort()?
Thinking some more about a KVM capability that the VMM can check, I'm
not sure what it can do with this. The VMM simply maps something from a
device and cannot probe the cacheability - that's a property of the
device that's not usually exposed to user by the driver. The VMM just
passes this vma to KVM. As with the Normal NC, we tried to avoid
building device knowledge into the VMM (and ended up with
VM_ALLOW_ANY_UNCACHED since the VFIO driver did not allow such user
mapping and probably wasn't entirely safe either).
I assume with the cacheable pfn mapping, the whole range covered by the
vma is entirely safe to be mapped as such in user space.
--
Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 12:55 ` Jason Gunthorpe
2025-03-18 19:27 ` Catalin Marinas
@ 2025-03-18 19:30 ` Oliver Upton
2025-03-18 23:09 ` Jason Gunthorpe
1 sibling, 1 reply; 61+ messages in thread
From: Oliver Upton @ 2025-03-18 19:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
>
> > The memslot must also be created with a new flag ((2c) in the taxonomy
> > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
> > flag is only allowed if (1) is valid.
> >
> > This results in the following behaviours:
> >
> > - If the VMM creates the memslot with the cacheable attribute without
> > (1) being advertised, we fail.
> >
> > - If the VMM creates the memslot without the cacheable attribute, we
> > map as NC, as it is today.
>
> Is that OK though?
>
> Now we have the MM page tables mapping this memory as cachable but KVM
> and the guest is accessing it as non-cached.
>
> I thought ARM tried hard to avoid creating such mismatches? This is
> why the pgprot flags were used to drive this, not an opt-in flag. To
> prevent userspace from forcing a mismatch.
It's far more problematic the other way around, e.g. the host knows that
something needs a Device-* attribute and the VM has done something
cacheable. The endpoint for that PA could, for example, fall over when
lines pulled in by the guest are written back, which of course can't
always be traced back to the offending VM.
OTOH, if the host knows that a PA is cacheable and the guest does
something non-cacheable, you 'just' have to deal with the usual
mismatched attributes problem as laid out in the ARM ARM.
> > What this doesn't do is *automatically* decide for the VMM what
> > attributes to use. The VMM must know what it is doing, and only
> > provide the memslot flag when appropriate. Doing otherwise may eat
> > your data and/or take the machine down (cacheable mapping on a device
> > can be great fun).
>
> Again, this is why we followed the VMA flags. The thing creating the
> VMA already made this safety determination when it set pgprot
> cachable. We should not allow KVM to randomly make any PGPROT
> cachable!
That doesn't seem to be the suggestion.
Userspace should be stating intentions on the memslot with the sort of
mapping that it wants to create, and a memslot flag to say "I allow
cacheable mappings" seems to fit the bill.
Then we have:
- Memslot creation fails for any PFNMAP slot with the flag set &&
!FEAT_FWB
- Stage-2 faults fail (exit to userspace) if the above conditions are
not met
- Stage-2 faults serviced w/ a cacheable mapping if the precondition is
satisfied and pgprot on the VMA is cacheable
- Stage-2 faults serviced w/ a non-cacheable mapping if flag is not
set
Seems workable + would prevent KVM from being excessively permissive?
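Oliver's four behaviours form a small decision table, which can be sketched as a single function. This is only a model of the proposal under its stated assumptions; `handle_pfnmap` and the result names are invented for illustration:

```c
#include <stdbool.h>

enum pfnmap_result {
	SLOT_CREATE_FAILS,	/* rejected at memslot registration   */
	EXIT_TO_USERSPACE,	/* stage-2 fault: intent not satisfied */
	MAP_CACHEABLE,		/* flag set and VMA pgprot cacheable  */
	MAP_NONCACHEABLE,	/* flag not set: behaves as today     */
};

/* Decision table for a PFNMAP slot, per the thread's proposal. */
enum pfnmap_result handle_pfnmap(bool slot_flag_cacheable,
				 bool has_fwb,
				 bool vma_pgprot_cacheable)
{
	if (slot_flag_cacheable && !has_fwb)
		return SLOT_CREATE_FAILS;	/* fails fast, no VM yet */
	if (slot_flag_cacheable && vma_pgprot_cacheable)
		return MAP_CACHEABLE;
	if (slot_flag_cacheable)
		return EXIT_TO_USERSPACE;	/* flag vs. pgprot mismatch */
	return MAP_NONCACHEABLE;
}
```

Note the flag states userspace *intent*; the VMA pgprot still gates whether a cacheable mapping is actually installed, so KVM never upgrades attributes on its own.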
Thanks,
Oliver
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 19:27 ` Catalin Marinas
@ 2025-03-18 19:35 ` David Hildenbrand
2025-03-18 19:40 ` Oliver Upton
2025-03-18 23:17 ` Jason Gunthorpe
1 sibling, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-03-18 19:35 UTC (permalink / raw)
To: Catalin Marinas, Jason Gunthorpe
Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On 18.03.25 20:27, Catalin Marinas wrote:
> On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
>> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
>>> The memslot must also be created with a new flag ((2c) in the taxonomy
>>> above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
>>> flag is only allowed if (1) is valid.
>>>
>>> This results in the following behaviours:
>>>
>>> - If the VMM creates the memslot with the cacheable attribute without
>>> (1) being advertised, we fail.
>>>
>>> - If the VMM creates the memslot without the cacheable attribute, we
>>> map as NC, as it is today.
>>
>> Is that OK though?
>>
>> Now we have the MM page tables mapping this memory as cachable but KVM
>> and the guest is accessing it as non-cached.
>
> I don't think we should allow this.
>
>> I thought ARM tried hard to avoid creating such mismatches? This is
>> why the pgprot flags were used to drive this, not an opt-in flag. To
>> prevent userspace from forcing a mismatch.
>
> We have the vma->vm_page_prot when the memslot is added, so we could use
> this instead of additional KVM flags.
I thought we try to avoid messing with the VMA when adding memslots,
because KVM_CAP_SYNC_MMU allows user space to change the VMAs
afterwards without changing the memslot?
include/uapi/linux/kvm.h:#define KVM_CAP_SYNC_MMU 16 /* Changes to host
mmap are reflected in guest */
--
Cheers,
David / dhildenb
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 19:35 ` David Hildenbrand
@ 2025-03-18 19:40 ` Oliver Upton
2025-03-20 3:30 ` bibo mao
0 siblings, 1 reply; 61+ messages in thread
From: Oliver Upton @ 2025-03-18 19:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Ankit Agrawal,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 08:35:38PM +0100, David Hildenbrand wrote:
> On 18.03.25 20:27, Catalin Marinas wrote:
> > On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
> > > > The memslot must also be created with a new flag ((2c) in the taxonomy
> > > > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
> > > > flag is only allowed if (1) is valid.
> > > >
> > > > This results in the following behaviours:
> > > >
> > > > - If the VMM creates the memslot with the cacheable attribute without
> > > > (1) being advertised, we fail.
> > > >
> > > > - If the VMM creates the memslot without the cacheable attribute, we
> > > > map as NC, as it is today.
> > >
> > > Is that OK though?
> > >
> > > Now we have the MM page tables mapping this memory as cachable but KVM
> > > and the guest is accessing it as non-cached.
> >
> > I don't think we should allow this.
> >
> > > I thought ARM tried hard to avoid creating such mismatches? This is
> > > why the pgprot flags were used to drive this, not an opt-in flag. To
> > > prevent userspace from forcing a mismatch.
> >
> > We have the vma->vm_page_prot when the memslot is added, so we could use
> > this instead of additional KVM flags.
>
> I thought we try to avoid messing with the VMA when adding memslots; because
> KVM_CAP_SYNC_MMU allows user space for changing the VMAs afterwards without
> changing the memslot?
Any checks on the VMA at memslot creation are done out of courtesy to
userspace so it 'fails fast'. We repeat the checks on the VMA at fault
time to handle userspace twiddling VMAs behind our back.
VM_MTE_ALLOWED is an example of this.
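The two-phase pattern Oliver describes, a courtesy check at memslot creation plus an authoritative re-check at fault time, can be sketched as follows. The `vma_state` struct and both function names are hypothetical, standing in for the real VMA lookups:

```c
#include <stdbool.h>

/* Hypothetical VMA snapshot; userspace may remap behind KVM's back. */
struct vma_state {
	bool pfnmap;	/* VM_PFNMAP set            */
	bool cacheable;	/* pgprot says cacheable    */
};

/* Courtesy check at memslot creation: fail fast on an obviously bad VMA. */
bool slot_create_ok(const struct vma_state *v, bool has_fwb)
{
	return !(v->pfnmap && v->cacheable && !has_fwb);
}

/*
 * The authoritative check runs again on every stage-2 fault, mirroring
 * how VM_MTE_ALLOWED is revalidated: the VMA seen now may differ from
 * the one that existed when the memslot was registered.
 */
bool fault_time_ok(const struct vma_state *now, bool has_fwb)
{
	return slot_create_ok(now, has_fwb);	/* same rule, fresh VMA */
}
```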
Thanks,
Oliver
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 19:30 ` Oliver Upton
@ 2025-03-18 23:09 ` Jason Gunthorpe
2025-03-19 7:01 ` Oliver Upton
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 23:09 UTC (permalink / raw)
To: Oliver Upton
Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 12:30:43PM -0700, Oliver Upton wrote:
> On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
> >
> > > The memslot must also be created with a new flag ((2c) in the taxonomy
> > > above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
> > > flag is only allowed if (1) is valid.
> > >
> > > This results in the following behaviours:
> > >
> > > - If the VMM creates the memslot with the cacheable attribute without
> > > (1) being advertised, we fail.
> > >
> > > - If the VMM creates the memslot without the cacheable attribute, we
> > > map as NC, as it is today.
> >
> > Is that OK though?
> >
> > Now we have the MM page tables mapping this memory as cachable but KVM
> > and the guest is accessing it as non-cached.
> >
> > I thought ARM tried hard to avoid creating such mismatches? This is
> > why the pgprot flags were used to drive this, not an opt-in flag. To
> > prevent userspace from forcing a mismatch.
>
> It's far more problematic the other way around, e.g. the host knows that
> something needs a Device-* attribute and the VM has done something
> cacheable. The endpoint for that PA could, for example, fall over when
> lines pulled in by the guest are written back, which of course can't
> always be traced back to the offending VM.
>
> OTOH, if the host knows that a PA is cacheable and the guest does
> something non-cacheable, you 'just' have to deal with the usual
> mismatched attributes problem as laid out in the ARM ARM.
I think the issue is that KVM doesn't do that usual stuff (ie cache
flushing) for memory that doesn't have a struct page backing.
So nothing in the hypervisor does any cache flushing, and I believe you
end up with a situation where the VMM could have zeroed this cachable
memory using cachable stores to sanitize it across VMs, and then KVM
can put that memory into the VM as uncached, and the VM could then
access stale non-zeroed data from a prior VM. Yes? This is a security
problem.
As I understand things KVM must either do the cache flushing, or must
not allow mismatched attributes, as a matter of security.
This is why FWB comes into the picture because KVM cannot do the cache
flushing of PFNMAP VMAs. So we force the MM and KVM S2 to be cachable
and use S2 FWB to prevent the guest from ever making it
uncachable. Thus the cache flushes are not required and everything is
security safe.
So I think the logic we want here in the fault handler is to:
Get the mm's PTE
If it is cachable:
Check if it has a struct page:
Yes - KVM flushes it and can use a non-FWB path
No - KVM either fails to install it, or installs it using FWB
to force cachability. KVM never allows degrading cachable
to non-cachable when it can't do flushing.
Not cachable:
Install it with Normal-NC as was previously discussed and merged
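As a rough sketch only, the decision tree above can be written out in C. All names here (the `s2_policy` function, the enum values, the boolean inputs) are hypothetical, invented for illustration, and do not correspond to actual KVM arm64 symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stage-2 mapping outcomes; not real KVM enums. */
enum s2_mapping {
	S2_NORMAL_WB,		/* cacheable; KVM performs the cache flush */
	S2_NORMAL_WB_FWB,	/* cacheable; S2 FWB forces cacheability */
	S2_NORMAL_NC,		/* Normal non-cacheable */
	S2_FAIL,		/* refuse to install the mapping */
};

/* Sketch of the fault-handler policy: mirror a cacheable mm PTE to a
 * cacheable S2 PTE, never silently degrade it to non-cacheable. */
static enum s2_mapping s2_policy(bool pte_cacheable, bool has_struct_page,
				 bool has_fwb)
{
	if (pte_cacheable) {
		if (has_struct_page)
			return S2_NORMAL_WB;	/* KVM can flush it */
		if (has_fwb)
			return S2_NORMAL_WB_FWB; /* FWB keeps it cacheable */
		/* No struct page and no FWB: KVM cannot flush and must
		 * not allow a cacheable/non-cacheable mismatch. */
		return S2_FAIL;
	}
	/* Not cacheable: Normal-NC, as previously discussed and merged. */
	return S2_NORMAL_NC;
}
```

The key property the sketch encodes is that the only two safe outcomes for a cacheable mm PTE are "KVM flushes" or "FWB pins cacheability"; everything else is a refusal.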
> Userspace should be stating intentions on the memslot with the sort of
> mapping that it wants to create, and a memslot flag to say "I allow
> cacheable mappings" seems to fit the bill.
I'm not sure about this, I don't see that the userspace has any
choice. As above, KVM has to follow whatever is in the PTEs, the
userspace can't ask for something different here. At best you could
make non-struct page cachable memory always fail unless the flag is
given - but why?
It seems sufficient for the fast fail to check whether the VMA is PFNMAP
with a cachable pgprot; if so, !FEAT_FWB fails the memslot. There is no
real recovery from this; the VMM is doing something that cannot be
supported.
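That fast-fail check could look something like the following. This is a hedged illustration only; the function name, parameters, and errno choice are invented and do not match the real KVM memslot-creation code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical "fail fast" check at memslot creation: a cacheable
 * VM_PFNMAP mapping cannot be supported without FEAT_S2FWB, because
 * KVM cannot do cache maintenance on non-struct-page memory and must
 * not degrade it to non-cacheable. Better to reject the memslot early
 * than to fault at runtime. */
static int check_pfnmap_memslot(bool vma_is_pfnmap, bool pgprot_cacheable,
				bool cpu_has_fwb)
{
	if (vma_is_pfnmap && pgprot_cacheable && !cpu_has_fwb)
		return -EINVAL;
	return 0;
}
```

Note that, per the rest of the thread, this creation-time check is only a courtesy to userspace; the authoritative check still has to happen at fault time, since the VMA can change after the memslot is created.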
> - Stage-2 faults serviced w/ a non-cacheable mapping if flag is not
> set
As above, I think this creates a bug :\
Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 19:27 ` Catalin Marinas
2025-03-18 19:35 ` David Hildenbrand
@ 2025-03-18 23:17 ` Jason Gunthorpe
2025-03-19 18:03 ` Catalin Marinas
1 sibling, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 23:17 UTC (permalink / raw)
To: Catalin Marinas
Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 07:27:27PM +0000, Catalin Marinas wrote:
> Thinking some more about a KVM capability that the VMM can check, I'm
> not sure what it can do with this. The VMM simply maps something from a
> device and cannot probe the cacheability
KVM is mirroring the MM's PTEs to the S2's PTEs.
You can think about this differently - KVM currently has a little bug
where the MM's PTEs can say cachable but KVM will mirror it to an S2
PTE that is forced non-cachable. KVM will not do any cache flushing to
make this safe.
Fundamentally that discrepancy is what is being addressed
here. Cachable PTEs in the MM should be mirrored to cachable PTEs in
the S2.
That the issue only arises with non-struct page memory is just part of
the triggering condition..
> I assume with the cacheable pfn mapping, the whole range covered by the
> vma is entirely safe to be mapped as such in user space.
Yes
Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 23:09 ` Jason Gunthorpe
@ 2025-03-19 7:01 ` Oliver Upton
2025-03-19 17:04 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Oliver Upton @ 2025-03-19 7:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 08:09:09PM -0300, Jason Gunthorpe wrote:
> > It's far more problematic the other way around, e.g. the host knows that
> > something needs a Device-* attribute and the VM has done something
> > cacheable. The endpoint for that PA could, for example, fall over when
> > lines pulled in by the guest are written back, which of course can't
> > always be traced back to the offending VM.
> >
> > OTOH, if the host knows that a PA is cacheable and the guest does
> > something non-cacheable, you 'just' have to deal with the usual
> > mismatched attributes problem as laid out in the ARM ARM.
>
> I think the issue is that KVM doesn't do that usual stuff (ie cache
> flushing) for memory that doesn't have a struct page backing.
Indeed, I clearly paged that out. What I said is how we arrived at the
Device-* v. Normal-NC distinction.
> So nothing in the hypervisor does any cache flushing, and I believe you
> end up with a situation where the VMM could have zeroed this cachable
> memory using cachable stores to sanitize it across VMs, and then KVM
> can put that memory into the VM as uncached, and the VM could then
> access stale non-zeroed data from a prior VM. Yes? This is a security
> problem.
Pedantic, but KVM only cares about cache maintenance in response to the
primary MM, not the VMM. After a stage-2 mapping has been established
userspace cannot expect KVM to do cache maintenance on its behalf.
You have a very good point that KVM is broken for cacheable PFNMAP'd
crap since we demote to something non-cacheable, and maybe that
deserves fixing first. Hopefully nobody notices that we've taken away
the toys...
> So I think the logic we want here in the fault handler is to:
> Get the mm's PTE
> If it is cachable:
> Check if it has a struct page:
> Yes - KVM flushes it and can use a non-FWB path
> No - KVM either fails to install it, or installs it using FWB
> to force cachability. KVM never allows degrading cachable
> to non-cachable when it can't do flushing.
> Not cachable:
> Install it with Normal-NC as was previously discussed and merged
We still need to test the VMA flag here to select Normal-NC v. Device.
> > Userspace should be stating intentions on the memslot with the sort of
> > mapping that it wants to create, and a memslot flag to say "I allow
> > cacheable mappings" seems to fit the bill.
>
> I'm not sure about this, I don't see that the userspace has any
> choice. As above, KVM has to follow whatever is in the PTEs, the
> userspace can't ask for something different here. At best you could
> make non-struct page cachable memory always fail unless the flag is
> given - but why?
>
> It seems sufficient for the fast fail to check whether the VMA is PFNMAP
> with a cachable pgprot; if so, !FEAT_FWB fails the memslot. There is no
> real recovery from this; the VMM is doing something that cannot be
> supported.
I'm less worried about recovery and more focused on userspace being
able to understand what happened. Otherwise we may get folks complaining
about the ioctl failing "randomly" on certain machines.
But we might need to just expose the FWB-ness of the MMU to userspace
since it can already encounter mismatched attributes when poking struct
page-backed guest memory.
Thanks,
Oliver
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-19 7:01 ` Oliver Upton
@ 2025-03-19 17:04 ` Jason Gunthorpe
2025-03-19 18:11 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 17:04 UTC (permalink / raw)
To: Oliver Upton
Cc: Marc Zyngier, Catalin Marinas, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> You have a very good point that KVM is broken for cacheable PFNMAP'd
> crap since we demote to something non-cacheable, and maybe that
> deserves fixing first. Hopefully nobody notices that we've taken away
> the toys...
Fixing it is either faulting all access attempts or mapping it
cachable to the S2 (as this series is trying to do)..
> We still need to test the VMA flag here to select Normal-NC v. Device.
Yes
> I'm less worried about recovery and more focused on userspace being
> able to understand what happened. Otherwise we may get folks complaining
> about the ioctl failing "randomly" on certain machines.
Even today something like qemu doesn't know that VFIO has created a
cachable mmap, so it would need a bunch more work to feed everything
through as proposed here.
IMHO the semantics should be simple enough: when the memslot is created,
determine if KVM can support it at all. Non-struct-page cacheable
memory requires FWB to support it today.
Maybe tomorrow KVM will learn how to cache flush non-struct page
memory and then everything can support it.
It just seems wrong to be making uAPI based around limitations of the
kernel implementation..
Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 23:17 ` Jason Gunthorpe
@ 2025-03-19 18:03 ` Catalin Marinas
0 siblings, 0 replies; 61+ messages in thread
From: Catalin Marinas @ 2025-03-19 18:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Marc Zyngier, Ankit Agrawal, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Mar 18, 2025 at 08:17:36PM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 18, 2025 at 07:27:27PM +0000, Catalin Marinas wrote:
> > Thinking some more about a KVM capability that the VMM can check, I'm
> > not sure what it can do with this. The VMM simply maps something from a
> > device and cannot probe the cacheability
>
> KVM is mirroring the MM's PTEs to the S2's PTEs.
>
> You can think about this differently - KVM currently has a little bug
> where the MM's PTE's can say cachable but KVM will mirror it to a S2
> PTE that is forced non-cachable. KVM will not do any cache flushing to
> make this safe.
>
> Fundamentally that discrepancy is what is being addressed
> here. Cachable PTEs in the MM should be mirrored to cachable PTEs in
> the S2.
If we treat this as a bug (and I agree it's a potential problem if we
assign devices to the guest that have cacheable VMM mappings), the first
step would be to reject such memory slots and report an error to the VMM.
I don't think it's worth attempting to do any cache maintenance to sync
the aliases.
Subsequently, this can be allowed on FWB-capable hardware. Whether
we want to expose such a capability to the VMM, I guess it's harmless but
I doubt the VMM would do anything with it since it doesn't know the
device mapping attributes anyway.
--
Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-19 17:04 ` Jason Gunthorpe
@ 2025-03-19 18:11 ` Catalin Marinas
2025-03-19 19:22 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-03-19 18:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Oliver Upton, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > crap since we demote to something non-cacheable, and maybe that
> > deserves fixing first. Hopefully nobody notices that we've taken away
> > the toys...
>
> Fixing it is either faulting all access attempts or mapping it
> cachable to the S2 (as this series is trying to do)..
As I replied earlier, it might be worth doing both - fault on !FWB
hardware (or rather reject the memslot creation), cacheable S2
otherwise.
> > I'm less worried about recovery and more focused on userspace being
> > able to understand what happened. Otherwise we may get folks complaining
> > about the ioctl failing "randomly" on certain machines.
>
> Even today something like qemu doesn't know that VFIO has created a
> cachable mmap, so it would need a bunch more work to feed everything
> through as proposed here.
I came to the same conclusion, qemu wouldn't be able to pass any
KVM_MEMORY_* flag to request cacheable since it doesn't know what
attributes it got from VFIO. You need a side-channel communication
between the host VFIO driver and KVM. With Ankit's proposal, that's the
Stage 1 MMU attributes for the qemu mapping.
--
Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-19 18:11 ` Catalin Marinas
@ 2025-03-19 19:22 ` Jason Gunthorpe
2025-03-19 21:48 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 19:22 UTC (permalink / raw)
To: Catalin Marinas
Cc: Oliver Upton, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > crap since we demote to something non-cacheable, and maybe that
> > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > the toys...
> >
> > Fixing it is either faulting all access attempts or mapping it
> > cachable to the S2 (as this series is trying to do)..
>
> As I replied earlier, it might be worth doing both - fault on !FWB
> hardware (or rather reject the memslot creation), cacheable S2
> otherwise.
I have no objection, Ankit are you able to make a failure patch?
Jason
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-19 19:22 ` Jason Gunthorpe
@ 2025-03-19 21:48 ` Catalin Marinas
2025-03-26 8:31 ` Ankit Agrawal
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-03-19 21:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Oliver Upton, Marc Zyngier, Ankit Agrawal, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > crap since we demote to something non-cacheable, and maybe that
> > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > the toys...
> > >
> > > Fixing it is either faulting all access attempts or mapping it
> > > cachable to the S2 (as this series is trying to do)..
> >
> > As I replied earlier, it might be worth doing both - fault on !FWB
> > hardware (or rather reject the memslot creation), cacheable S2
> > otherwise.
>
> I have no objection, Ankit are you able to make a failure patch?
I'd wait until the KVM maintainers have their say.
--
Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-18 19:40 ` Oliver Upton
@ 2025-03-20 3:30 ` bibo mao
2025-03-20 7:24 ` bibo mao
0 siblings, 1 reply; 61+ messages in thread
From: bibo mao @ 2025-03-20 3:30 UTC (permalink / raw)
To: Oliver Upton, David Hildenbrand
Cc: Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Ankit Agrawal,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On 2025/3/19 3:40 AM, Oliver Upton wrote:
> On Tue, Mar 18, 2025 at 08:35:38PM +0100, David Hildenbrand wrote:
>> On 18.03.25 20:27, Catalin Marinas wrote:
>>> On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
>>>>> The memslot must also be created with a new flag ((2c) in the taxonomy
>>>>> above) that carries the "Please map VM_PFNMAP VMAs as cacheable". This
>>>>> flag is only allowed if (1) is valid.
>>>>>
>>>>> This results in the following behaviours:
>>>>>
>>>>> - If the VMM creates the memslot with the cacheable attribute without
>>>>> (1) being advertised, we fail.
>>>>>
>>>>> - If the VMM creates the memslot without the cacheable attribute, we
>>>>> map as NC, as it is today.
>>>>
>>>> Is that OK though?
>>>>
>>>> Now we have the MM page tables mapping this memory as cachable but KVM
>>>> and the guest is accessing it as non-cached.
>>>
>>> I don't think we should allow this.
>>>
>>>> I thought ARM tried hard to avoid creating such mismatches? This is
>>>> why the pgprot flags were used to drive this, not an opt-in flag. To
>>>> prevent userspace from forcing a mismatch.
>>>
>>> We have the vma->vm_page_prot when the memslot is added, so we could use
>>> this instead of additional KVM flags.
>>
>> I thought we try to avoid messing with the VMA when adding memslots; because
>> KVM_CAP_SYNC_MMU allows user space for changing the VMAs afterwards without
>> changing the memslot?
>
> Any checks on the VMA at memslot creation are done out of courtesy to
> userspace so it 'fails fast'. We repeat checks on the VMA at the time of
> fault to handle userspace twiddling VMAs behind our back.
Yes, I think it is better to add a cachable attribute to the memslot; it
can be checked against the VMA at memslot creation. Also, the cache
attribute can be abstracted as a cachable/uc/wc type rather than being
arch specific.
Regards
Bibo Mao
>
> VM_MTE_ALLOWED is an example of this.
>
> Thanks,
> Oliver
>
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-20 3:30 ` bibo mao
@ 2025-03-20 7:24 ` bibo mao
0 siblings, 0 replies; 61+ messages in thread
From: bibo mao @ 2025-03-20 7:24 UTC (permalink / raw)
To: Oliver Upton, David Hildenbrand
Cc: Catalin Marinas, Jason Gunthorpe, Marc Zyngier, Ankit Agrawal,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On 2025/3/20 11:30 AM, bibo mao wrote:
>
>
> On 2025/3/19 上午3:40, Oliver Upton wrote:
>> On Tue, Mar 18, 2025 at 08:35:38PM +0100, David Hildenbrand wrote:
>>> On 18.03.25 20:27, Catalin Marinas wrote:
>>>> On Tue, Mar 18, 2025 at 09:55:27AM -0300, Jason Gunthorpe wrote:
>>>>> On Tue, Mar 18, 2025 at 09:39:30AM +0000, Marc Zyngier wrote:
>>>>>> The memslot must also be created with a new flag ((2c) in the
>>>>>> taxonomy
>>>>>> above) that carries the "Please map VM_PFNMAP VMAs as cacheable".
>>>>>> This
>>>>>> flag is only allowed if (1) is valid.
>>>>>>
>>>>>> This results in the following behaviours:
>>>>>>
>>>>>> - If the VMM creates the memslot with the cacheable attribute without
>>>>>> (1) being advertised, we fail.
>>>>>>
>>>>>> - If the VMM creates the memslot without the cacheable attribute, we
>>>>>> map as NC, as it is today.
>>>>>
>>>>> Is that OK though?
>>>>>
>>>>> Now we have the MM page tables mapping this memory as cachable but KVM
>>>>> and the guest is accessing it as non-cached.
>>>>
>>>> I don't think we should allow this.
>>>>
>>>>> I thought ARM tried hard to avoid creating such mismatches? This is
>>>>> why the pgprot flags were used to drive this, not an opt-in flag. To
>>>>> prevent userspace from forcing a mismatch.
>>>>
>>>> We have the vma->vm_page_prot when the memslot is added, so we could
>>>> use
>>>> this instead of additional KVM flags.
>>>
>>> I thought we try to avoid messing with the VMA when adding memslots;
>>> because
>>> KVM_CAP_SYNC_MMU allows user space for changing the VMAs afterwards
>>> without
>>> changing the memslot?
>>
>> Any checks on the VMA at memslot creation are done out of courtesy to
>> userspace so it 'fails fast'. We repeat checks on the VMA at the time of
>> fault to handle userspace twiddling VMAs behind our back.
> yes, I think it is better to add cachable attribute in memslot, it can
> be checked on the VMA at memslot creation. Also cache attribute can be
> abstracted with cachable/uc/wc type rather than detailed arch specified.
Sorry, I did not state this clearly. My meaning is to add a cachable
attribute to the memslot. It is acquired from the prot of the VMA at
memslot creation, and checked in the S2 page fault fastpath.
So it is unnecessary to find the vma in the S2 page fault fastpath; the
memslot is enough. A write-combined cache attribute should also be added,
since some GPU memory can be mapped with the WC attribute in the vfio-gpu
driver.
Regards
Bibo Mao
>
> Regards
> Bibo Mao
>>
>> VM_MTE_ALLOWED is an example of this.
>>
>> Thanks,
>> Oliver
>>
>
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-19 21:48 ` Catalin Marinas
@ 2025-03-26 8:31 ` Ankit Agrawal
2025-03-26 14:53 ` Sean Christopherson
0 siblings, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-03-26 8:31 UTC (permalink / raw)
To: Catalin Marinas, Jason Gunthorpe
Cc: Oliver Upton, Marc Zyngier, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
> On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > the toys...
> > > >
> > > > Fixing it is either faulting all access attempts or mapping it
> > > > cachable to the S2 (as this series is trying to do)..
> > >
> > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > hardware (or rather reject the memslot creation), cacheable S2
> > > otherwise.
> >
> > I have no objection, Ankit are you able to make a failure patch?
>
> I'd wait until the KVM maintainers have their say.
>
Maz, Oliver, any thoughts on this? Can we conclude that we should create
this failure patch for memslot creation?
> --
> Catalin
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 8:31 ` Ankit Agrawal
@ 2025-03-26 14:53 ` Sean Christopherson
2025-03-26 15:42 ` Marc Zyngier
0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2025-03-26 14:53 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Catalin Marinas, Jason Gunthorpe, Oliver Upton, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > the toys...
> > > > >
> > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > cachable to the S2 (as this series is trying to do)..
> > > >
> > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > otherwise.
> > >
> > > I have no objection, Ankit are you able to make a failure patch?
> >
> > I'd wait until the KVM maintainers have their say.
> >
>
> Maz, Oliver any thoughts on this? Can we conclude to create this failure
> patch in memslot creation?
That's not sufficient. As pointed out multiple times in this thread, any checks
done at memslot creation are best effort "courtesies" provided to userspace to
avoid terminating running VMs when the memory is faulted in.
I.e. checking at memslot creation is optional, checking at fault-in/mapping is
not.
With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
KVM can simply honor the cacheability based on the VMA. Neither of those requires
a memslot flag. A KVM capability to enumerate FWB support would be nice though,
e.g. so userspace can assert and bail early without ever hitting an ioctl error.
If we want to support existing setups that happen to work by dumb luck or careful
configuration, then that should probably be an admin decision to support the
"unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 14:53 ` Sean Christopherson
@ 2025-03-26 15:42 ` Marc Zyngier
2025-03-26 16:10 ` Sean Christopherson
0 siblings, 1 reply; 61+ messages in thread
From: Marc Zyngier @ 2025-03-26 15:42 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, 26 Mar 2025 14:53:34 +0000,
Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > > the toys...
> > > > > >
> > > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > > cachable to the S2 (as this series is trying to do)..
> > > > >
> > > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > > otherwise.
> > > >
> > > > I have no objection, Ankit are you able to make a failure patch?
> > >
> > > I'd wait until the KVM maintainers have their say.
> > >
> >
> > Maz, Oliver any thoughts on this? Can we conclude to create this failure
> > patch in memslot creation?
>
> That's not sufficient. As pointed out multiple times in this thread, any checks
> done at memslot creation are best effort "courtesies" provided to userspace to
> avoid terminating running VMs when the memory is faulted in.
>
> I.e. checking at memslot creation is optional, checking at fault-in/mapping is
> not.
>
> With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
> cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
> KVM can simply honor the cacheability based on the VMA. Neither of those requires
Remind me how this works with stuff such as guest_memfd, which, by
definition, doesn't have a userspace mapping?
> a memslot flag. A KVM capability to enumerate FWB support would be nice though,
> e.g. so userspace can assert and bail early without ever hitting an
> ioctl error.
It's not "nice". It's mandatory. And FWB is definitely *not* something
we want to expose as such.
>
> If we want to support existing setups that happen to work by dumb luck or careful
> configuration, then that should probably be an admin decision to support the
> "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
No. That's not how we handle an ABI issue. VM migration, with and
without FWB, can happen in both directions, and must have clear
semantics. So NAK to a kernel parameter.
If I have a VM with a device mapped as *device* on FWB host, I must be
able to migrate it to non-FWB host, and back. A device mapped as
*cacheable* can only be migrated between FWB-capable hosts.
Importantly, it is *userspace* that is in charge of deciding how the
device is mapped at S2. And the memslot flag is the correct
abstraction for that.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 15:42 ` Marc Zyngier
@ 2025-03-26 16:10 ` Sean Christopherson
2025-03-26 18:02 ` Marc Zyngier
0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2025-03-26 16:10 UTC (permalink / raw)
To: Marc Zyngier
Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 26, 2025, Marc Zyngier wrote:
> On Wed, 26 Mar 2025 14:53:34 +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > > > the toys...
> > > > > > >
> > > > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > > > cachable to the S2 (as this series is trying to do)..
> > > > > >
> > > > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > > > otherwise.
> > > > >
> > > > > I have no objection, Ankit are you able to make a failure patch?
> > > >
> > > > I'd wait until the KVM maintainers have their say.
> > > >
> > >
> > > Maz, Oliver any thoughts on this? Can we conclude to create this failure
> > > patch in memslot creation?
> >
> > That's not sufficient. As pointed out multiple times in this thread, any checks
> > done at memslot creation are best effort "courtesies" provided to userspace to
> > avoid terminating running VMs when the memory is faulted in.
> >
> > I.e. checking at memslot creation is optional, checking at fault-in/mapping is
> > not.
> >
> > With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
> > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
> > KVM can simply honor the cacheability based on the VMA. Neither of those requires
>
> Remind me how this work with stuff such as guestmemfd, which, by
> definition, doesn't have a userspace mapping?
Definitely not through a memslot flag. The cacheability would be a property of
the guest_memfd inode, similar to how it's a property of the underlying device
in this case.
I don't entirely see what guest_memfd has to do with this. One of the big
advantages of guest_memfd is that KVM has complete control over the lifecycle of
the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there
are valid host mappings when memory is unmapped from the guest, and so can't do
the necessary maintenance. I agree with Jason's earlier statement that that's a
solvable kernel flaw.
For guest_memfd, KVM already does maintenance operations when memory is reclaimed,
for both SNP and TDX. I don't think ARM's cacheability stuff would require any
new functionality in guest_memfd.
> > a memslot flag. A KVM capability to enumerate FWB support would be nice though,
> > e.g. so userspace can assert and bail early without ever hitting an
> > ioctl error.
>
> It's not "nice". It's mandatory. And FWB is definitely *not* something
> we want to expose as such.
I agree a capability is mandatory if we're adding a memslot flag, but I don't
think it's mandatory if this is all handled through kernel plumbing.
> > If we want to support existing setups that happen to work by dumb luck or careful
> > configuration, then that should probably be an admin decision to support the
> > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
>
> No. That's not how we handle an ABI issue. VM migration, with and
> without FWB, can happen in both direction, and must have clear
> semantics. So NAK to a kernel parameter.
>
> If I have a VM with a device mapped as *device* on FWB host, I must be
> able to migrate it to non-FWB host, and back. A device mapped as
> *cacheable* can only be migrated between FWB-capable hosts.
But I thought the whole problem is that mapping this fancy memory as device is
unsafe on non-FWB hosts? If it's safe, then why does KVM need to reject anything
in the first place?
> Importantly, it is *userspace* that is in charge of deciding how the
> device is mapped at S2. And the memslot flag is the correct
> abstraction for that.
I strongly disagree. Whatever owns the underlying physical memory is in charge,
not userspace. For memory that's backed by a VMA, userspace can influence the
behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state
from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd.
I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g.
to disallow mapping memory as writable even if the VMA is writable. But IMO,
adding a memslot flag to control cacheability isn't purely subtractive.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 16:10 ` Sean Christopherson
@ 2025-03-26 18:02 ` Marc Zyngier
2025-03-26 18:24 ` Sean Christopherson
0 siblings, 1 reply; 61+ messages in thread
From: Marc Zyngier @ 2025-03-26 18:02 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, 26 Mar 2025 16:10:45 +0000,
Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 26, 2025, Marc Zyngier wrote:
> > On Wed, 26 Mar 2025 14:53:34 +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > > > > the toys...
> > > > > > > >
> > > > > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > > > > cachable to the S2 (as this series is trying to do)..
> > > > > > >
> > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > > > > otherwise.
> > > > > >
> > > > > > I have no objection, Ankit are you able to make a failure patch?
> > > > >
> > > > > I'd wait until the KVM maintainers have their say.
> > > > >
> > > >
> > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure
> > > > patch in memslot creation?
> > >
> > > That's not sufficient. As pointed out multiple times in this thread, any checks
> > > done at memslot creation are best effort "courtesies" provided to userspace to
> > > avoid terminating running VMs when the memory is faulted in.
> > >
> > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is
> > > not.
> > >
> > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
> > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
> > > KVM can simply honor the cacheability based on the VMA. Neither of those requires
> >
> > Remind me how this work with stuff such as guestmemfd, which, by
> > definition, doesn't have a userspace mapping?
>
> Definitely not through a memslot flag. The cacheability would be a property of
> the guest_memfd inode, similar to how it's a property of the underlying device
> in this case.
It's *not* a property of the device. It's a property of the mapping.
> I don't entirely see what guest_memfd has to do with this.
You were the one mentioning sampling the cacheability via the VMA. As
far as I understand guest_memfd, there is no VMA to speak of.
> One of the big
> advantages of guest_memfd is that KVM has complete control over the lifecycle of
> the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there
> are valid host mappings when memory is unmapped from the guest, and so can't do
> the necessary maintenance. I agree with Jason's earlier statement that that's a
> solvable kernel flaw.
>
> For guest_memfd, KVM already does maintenance operations when memory is reclaimed,
> for both SNP and TDX. I don't think ARM's cacheability stuff would require any
> new functionality in guest_memfd.
I don't know how you reconcile the lack of host mapping and cache
maintenance. The latter cannot take place without the former.
>
> > > a memslot flag. A KVM capability to enumerate FWB support would be nice though,
> > > e.g. so userspace can assert and bail early without ever hitting an
> > > ioctl error.
> >
> > It's not "nice". It's mandatory. And FWB is definitely *not* something
> > we want to expose as such.
>
> I agree a capability is mandatory if we're adding a memslot flag, but I don't
> think it's mandatory if this is all handled through kernel plumbing.
It is mandatory, full stop. Otherwise, userspace is able to migrate a
VM from an FWB host to a non-FWB one, start the VM, blow up on the
first page fault. That's not an acceptable outcome.
>
> > > If we want to support existing setups that happen to work by dumb luck or careful
> > > configuration, then that should probably be an admin decision to support the
> > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
> >
> > No. That's not how we handle an ABI issue. VM migration, with and
> > without FWB, can happen in both direction, and must have clear
> > semantics. So NAK to a kernel parameter.
> >
> > If I have a VM with a device mapped as *device* on FWB host, I must be
> > able to migrate it to non-FWB host, and back. A device mapped as
> > *cacheable* can only be migrated between FWB-capable hosts.
>
> But I thought the whole problem is that mapping this fancy memory as device is
> unsafe on non-FWB hosts? If it's safe, then why does KVM needs to reject anything
> in the first place?
I don't know where you got that idea. This is all about what memory
type is exposed to a guest:
- with FWB, no need for CMOs, so cacheable memory is allowed if the
device supports it (i.e. it actually exposes memory), and device
otherwise.
- without FWB, CMOs are required, and we don't have a host mapping for
these pages. As a fallback, the mapping is device only, as this
doesn't require any CMO by definition.
There is no notion of "safety" here.
> > Importantly, it is *userspace* that is in charge of deciding how the
> > device is mapped at S2. And the memslot flag is the correct
> > abstraction for that.
>
> I strongly disagree. Whatever owns the underlying physical memory is in charge,
> not userspace. For memory that's backed by a VMA, userspace can influence the
> behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state
> from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd.
I don't buy that. Userspace needs to know the semantics of the memory
it gives to the guest. Or at least discover that the same device
plugged into different hosts will have different behaviours. Just
letting things rip is not an acceptable outcome.
> I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g.
> to disallow mapping memory as writable even if the VMA is writable. But IMO,
> adding a memslot flag to control cacheability isn't purely substractive.
I don't see how that solves the problem at hand: given the presence or
absence of FWB, allow userspace to discover as early as possible what
behaviour a piece of memory provided by a device will have, and
control it to handle migration.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 18:02 ` Marc Zyngier
@ 2025-03-26 18:24 ` Sean Christopherson
2025-03-26 18:51 ` Oliver Upton
2025-03-31 14:56 ` Jason Gunthorpe
0 siblings, 2 replies; 61+ messages in thread
From: Sean Christopherson @ 2025-03-26 18:24 UTC (permalink / raw)
To: Marc Zyngier
Cc: Ankit Agrawal, Catalin Marinas, Jason Gunthorpe, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 26, 2025, Marc Zyngier wrote:
> On Wed, 26 Mar 2025 16:10:45 +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Mar 26, 2025, Marc Zyngier wrote:
> > > On Wed, 26 Mar 2025 14:53:34 +0000,
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Wed, Mar 26, 2025, Ankit Agrawal wrote:
> > > > > > On Wed, Mar 19, 2025 at 04:22:46PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, Mar 19, 2025 at 06:11:02PM +0000, Catalin Marinas wrote:
> > > > > > > > On Wed, Mar 19, 2025 at 02:04:29PM -0300, Jason Gunthorpe wrote:
> > > > > > > > > On Wed, Mar 19, 2025 at 12:01:29AM -0700, Oliver Upton wrote:
> > > > > > > > > > You have a very good point that KVM is broken for cacheable PFNMAP'd
> > > > > > > > > > crap since we demote to something non-cacheable, and maybe that
> > > > > > > > > > deserves fixing first. Hopefully nobody notices that we've taken away
> > > > > > > > > > the toys...
> > > > > > > > >
> > > > > > > > > Fixing it is either faulting all access attempts or mapping it
> > > > > > > > > cachable to the S2 (as this series is trying to do)..
> > > > > > > >
> > > > > > > > As I replied earlier, it might be worth doing both - fault on !FWB
> > > > > > > > hardware (or rather reject the memslot creation), cacheable S2
> > > > > > > > otherwise.
> > > > > > >
> > > > > > > I have no objection, Ankit are you able to make a failure patch?
> > > > > >
> > > > > > I'd wait until the KVM maintainers have their say.
> > > > > >
> > > > >
> > > > > Maz, Oliver any thoughts on this? Can we conclude to create this failure
> > > > > patch in memslot creation?
> > > >
> > > > That's not sufficient. As pointed out multiple times in this thread, any checks
> > > > done at memslot creation are best effort "courtesies" provided to userspace to
> > > > avoid terminating running VMs when the memory is faulted in.
> > > >
> > > > I.e. checking at memslot creation is optional, checking at fault-in/mapping is
> > > > not.
> > > >
> > > > With that in place, I don't see any need for a memslot flag. IIUC, without FWB,
> > > > cacheable pfn-mapped memory is broken and needs to be disallowed. But with FWB,
> > > > KVM can simply honor the cacheability based on the VMA. Neither of those requires
> > >
> > > Remind me how this work with stuff such as guestmemfd, which, by
> > > definition, doesn't have a userspace mapping?
> >
> > Definitely not through a memslot flag. The cacheability would be a property of
> > the guest_memfd inode, similar to how it's a property of the underlying device
> > in this case.
>
> It's *not* a property of the device. It's a property of the mapping.
Sorry, bad phrasing. I was trying to say that the entity that controls the
cacheability is ultimately whatever kernel subsystem/driver controls the mapping.
> > I don't entirely see what guest_memfd has to do with this.
>
> You were the one mentioning sampling the cacheability via the VMA. As
> far as I understand guestmemfd, there is no VMA to speak of.
>
> > One of the big
> > advantages of guest_memfd is that KVM has complete control over the lifecycle of
> > the memory. IIUC, the issue with !FWB hosts is that KVM can't guarantee there
> > are valid host mappings when memory is unmapped from the guest, and so can't do
> > the necessary maintenance. I agree with Jason's earlier statement that that's a
> > solvable kernel flaw.
> >
> > For guest_memfd, KVM already does maintenance operations when memory is reclaimed,
> > for both SNP and TDX. I don't think ARM's cacheability stuff would require any
> > new functionality in guest_memfd.
>
> I don't know how you reconcile the lack of host mapping and cache
> maintenance. The latter cannot take place without the former.
I assume cache maintenance only requires _a_ mapping to the physical memory.
With guest_memfd, KVM has the pfn (which happens to always be struct page memory
today), and so can establish a VA=>PA mapping as needed.
> > > > a memslot flag. A KVM capability to enumerate FWB support would be nice though,
> > > > e.g. so userspace can assert and bail early without ever hitting an
> > > > ioctl error.
> > >
> > > It's not "nice". It's mandatory. And FWB is definitely *not* something
> > > we want to expose as such.
> >
> > I agree a capability is mandatory if we're adding a memslot flag, but I don't
> > think it's mandatory if this is all handled through kernel plumbing.
>
> It is mandatory, full stop. Otherwise, userspace is able to migrate a
> VM from an FWB host to a non-FWB one, start the VM, blow up on the
> first page fault. That's not an acceptable outcome.
>
> >
> > > > If we want to support existing setups that happen to work by dumb luck or careful
> > > > configuration, then that should probably be an admin decision to support the
> > > > "unsafe" behavior, i.e. an off-by-default KVM module param, not a memslot flag.
> > >
> > > No. That's not how we handle an ABI issue. VM migration, with and
> > > without FWB, can happen in both direction, and must have clear
> > > semantics. So NAK to a kernel parameter.
> > >
> > > If I have a VM with a device mapped as *device* on FWB host, I must be
> > > able to migrate it to non-FWB host, and back. A device mapped as
> > > *cacheable* can only be migrated between FWB-capable hosts.
> >
> > But I thought the whole problem is that mapping this fancy memory as device is
> > unsafe on non-FWB hosts? If it's safe, then why does KVM needs to reject anything
> > in the first place?
>
> I don't know where you got that idea. This is all about what memory
> type is exposed to a guest:
>
> - with FWB, no need for CMOs, so cacheable memory is allowed if the
> device supports it (i.e. it actually exposes memory), and device
> otherwise.
>
> - without FWB, CMOs are required, and we don't have a host mapping for
> these pages. As a fallback, the mapping is device only, as this
> doesn't require any CMO by definition.
>
> There is no notion of "safety" here.
Ah, the safety I'm talking about is the CMO requirement. IIUC, not doing CMOs
if the memory is cacheable could result in data corruption, i.e. would be a safety
issue for the host. But I missed that you were proposing that the !FWB behavior
would be to force device mappings.
> > > Importantly, it is *userspace* that is in charge of deciding how the
> > > device is mapped at S2. And the memslot flag is the correct
> > > abstraction for that.
> >
> > I strongly disagree. Whatever owns the underlying physical memory is in charge,
> > not userspace. For memory that's backed by a VMA, userspace can influence the
> > behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state
> > from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd.
>
> I don't buy that. Userspace needs to know the semantics of the memory
> it gives to the guest. Or at least discover that the same device
> plugged into to different hosts will have different behaviours. Just
> letting things rip is not an acceptable outcome.
Agreed, but that doesn't require a memslot flag. A capability to enumerate that
KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to
have KVM reject memslots that are cacheable in the VMA, but would get device in
stage-2, then we can provide that functionality through the capability, i.e. let
userspace decide if it wants "fallback to device" vs. "error on creation" on a
per-VM basis.
What I object to is adding a memslot flag.
> > I have no objection to adding KVM uAPI to let userspace add _restrictions_, e.g.
> > to disallow mapping memory as writable even if the VMA is writable. But IMO,
> > adding a memslot flag to control cacheability isn't purely substractive.
>
> I don't see how that solves the problem at hand: given the presence or
> absence of FWB, allow userspace to discover as early as possible what
> behaviour a piece of memory provided by a device will have, and
> control it to handle migration.
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 18:24 ` Sean Christopherson
@ 2025-03-26 18:51 ` Oliver Upton
2025-03-31 14:44 ` Jason Gunthorpe
2025-03-31 14:56 ` Jason Gunthorpe
1 sibling, 1 reply; 61+ messages in thread
From: Oliver Upton @ 2025-03-26 18:51 UTC (permalink / raw)
To: Sean Christopherson
Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Jason Gunthorpe,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote:
> > > But I thought the whole problem is that mapping this fancy memory as device is
> > > unsafe on non-FWB hosts? If it's safe, then why does KVM needs to reject anything
> > > in the first place?
> >
> > I don't know where you got that idea. This is all about what memory
> > type is exposed to a guest:
> >
> > - with FWB, no need for CMOs, so cacheable memory is allowed if the
> > device supports it (i.e. it actually exposes memory), and device
> > otherwise.
> >
> > - without FWB, CMOs are required, and we don't have a host mapping for
> > these pages. As a fallback, the mapping is device only, as this
> > doesn't require any CMO by definition.
> >
> > There is no notion of "safety" here.
>
> Ah, the safety I'm talking about is the CMO requirement. IIUC, not doing CMOs
> if the memory is cacheable could result in data corruption, i.e. would be a safety
> issue for the host. But I missed that you were proposing that the !FWB behavior
> would be to force device mappings.
To Jason's earlier point, you wind up with a security issue the other
way around.
Supposing the host is using a cacheable mapping to, say, zero the $THING
at the other end of the mapping. Without a way to CMO the $THING we
cannot make the zeroing visible to a guest with a stage-2 Device-* mapping.
Marc, I understand that your proposed fallback is aligned to what we
do today, but I'm actually unconvinced that it provides any reliable/correct
behavior. We would then wind up with stage-2 memory attribute rules
like so:
1) If struct page memory, use a cacheable mapping. CMO for non-FWB.
2) If cacheable PFNMAP:
a) With FWB, use a cacheable mapping
b) Without FWB, fail.
3) If VM_ALLOW_ANY_UNCACHED, use Normal Non-Cacheable mapping
4) Otherwise, Device-nGnRE
I understand 2b breaks ABI, but the 'typical' VFIO usages fall into (3)
and (4).
> > > > Importantly, it is *userspace* that is in charge of deciding how the
> > > > device is mapped at S2. And the memslot flag is the correct
> > > > abstraction for that.
> > >
> > > I strongly disagree. Whatever owns the underlying physical memory is in charge,
> > > not userspace. For memory that's backed by a VMA, userspace can influence the
> > > behavior through mmap(), mprotect(), etc., but ultimately KVM needs to pull state
> > > from mm/, via the VMA. Or in the guest_memfd case, from guest_memfd.
> >
> > I don't buy that. Userspace needs to know the semantics of the memory
> > it gives to the guest. Or at least discover that the same device
> > plugged into to different hosts will have different behaviours. Just
> > letting things rip is not an acceptable outcome.
>
> Agreed, but that doesn't require a memslot flag. A capability to enumerate that
> KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to
> have KVM reject memslots that are cacheable in the VMA, but would get device in
> stage-2, then we can provide that functionality through the capability, i.e. let
> userspace decide if it wants "fallback to device" vs. "error on creation" on a
> per-VM basis.
>
> What I object to is adding a memslot flag.
A capability that says "I can force cacheable things to be cacheable" is
useful beyond even PFNMAP stuff. A pedantic but correct live migration /
snapshotting implementation on non-FWB would need to do CMOs in case the
VM used a non-WB mapping for memory.
Thanks,
Oliver
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 18:51 ` Oliver Upton
@ 2025-03-31 14:44 ` Jason Gunthorpe
0 siblings, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-31 14:44 UTC (permalink / raw)
To: Oliver Upton
Cc: Sean Christopherson, Marc Zyngier, Ankit Agrawal, Catalin Marinas,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 26, 2025 at 11:51:57AM -0700, Oliver Upton wrote:
>
> 1) If struct page memory, use a cacheable mapping. CMO for non-FWB.
>
> 2) If cacheable PFNMAP:
> a) With FWB, use a cacheable mapping
> b) Without FWB, fail.
>
> 3) If VM_ALLOW_ANY_UNCACHED, use Normal Non-Cacheable mapping
>
> 4) Otherwise, Device-nGnRE
>
> I understand 2b breaks ABI, but the 'typical' VFIO usages fall into (3)
> and (4).
+1 (and +1 to Sean's remark about strictly tracking the VMA as well)
IMHO not doing 2b is a "security" bug today. Catalin suggested we fix
it as a first step to get agreement on this assessment and fix. Once
fixed, there is no way for KVM to create a S2 with cacheable semantics
different from the VMA, or to have missing CMOs.
That simplifies the discussion of adding 2a to strictly track
cacheable.
I also don't see a need for a flag here.
> A pedantic but correct live migration / snapshotting implementation
> on non-FWB would need to do CMOs in case the VM used a non-WB
> mapping for memory.
From a live migration perspective, we have already built in a lot of
things on the VFIO side where a live migration can be attempted and
then fail because of late-detected HW incompatibilities. We've sort of
punted this for now to the orchestrator and operator. There is an
expectation the environment will somehow ensure that live migration
machine pools are sufficiently uniform when using VFIO. It is much
more restricted than normal no-VFIO VM live migration.
Given you can't have VM live migration of MMIO-backed cacheable memory
without a VFIO live migration driver, I wouldn't worry too much about
these fine details from the KVM side. If /proc/cpuinfo shows the FWB
that would be approximately similar discoverability to all the other
limitations.
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-26 18:24 ` Sean Christopherson
2025-03-26 18:51 ` Oliver Upton
@ 2025-03-31 14:56 ` Jason Gunthorpe
2025-04-07 15:20 ` Sean Christopherson
1 sibling, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-03-31 14:56 UTC (permalink / raw)
To: Sean Christopherson
Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote:
> > I don't know how you reconcile the lack of host mapping and cache
> > maintenance. The latter cannot take place without the former.
>
> I assume cache maintenance only requires _a_ mapping to the physical memory.
> With guest_memfd, KVM has the pfn (which happens to always be struct page memory
> today), and so can establish a VA=>PA mapping as needed.
This is why we are forcing FWB in this work, because we don't have a
VA mapping and KVM doesn't have the code to create one on demand.
IMHO, I strongly suspect that all CCA capable ARM systems will have
FWB, so it may make the most sense to continue that direction when
using guest_memfd. Though I don't know what that would mean for pKVM.
> > > I agree a capability is mandatory if we're adding a memslot flag, but I don't
> > > think it's mandatory if this is all handled through kernel plumbing.
> >
> > It is mandatory, full stop. Otherwise, userspace is able to migrate a
> > VM from an FWB host to a non-FWB one, start the VM, blow up on the
> > first page fault. That's not an acceptable outcome.
It is fine if you add a protective check during memslot creation. If
QEMU asks for a memslot for a cacheable VFIO VMA without FWB support
then fail it immediately, and that will safely abort the migration. Do
not delay until page fault time, though still check again at page
fault time.
This is good enough for VFIO device live migration as "try and fail"
broadly matches the other HW compatibility checks VFIO device live
migration does today.
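The two-site check described above might look roughly like this. All names are illustrative only; in KVM the check would be shared between kvm_arch_prepare_memory_region() and the stage-2 fault handler (user_mem_abort()).

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative predicate: without FWB, KVM cannot perform CMOs on
 * unmapped PFNMAP memory, so a cacheable stage-2 mapping is refused. */
static bool cacheable_pfnmap_ok(bool vma_is_cacheable_pfnmap, bool has_fwb)
{
    return !vma_is_cacheable_pfnmap || has_fwb;
}

/* Fail memslot registration up front so a migration aborts safely. */
static int prepare_memslot(bool vma_is_cacheable_pfnmap, bool has_fwb)
{
    return cacheable_pfnmap_ok(vma_is_cacheable_pfnmap, has_fwb) ? 0 : -EINVAL;
}

/* Re-check at fault time rather than silently degrading to device. */
static int stage2_fault(bool vma_is_cacheable_pfnmap, bool has_fwb)
{
    return cacheable_pfnmap_ok(vma_is_cacheable_pfnmap, has_fwb) ? 0 : -EFAULT;
}
```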
> Ah, the safety I'm talking about is the CMO requirement. IIUC, not doing CMOs
> if the memory is cacheable could result in data corruption, i.e. would be a safety
> issue for the host. But I missed that you were proposing that the !FWB behavior
> would be to force device mappings.
It only forces device mappings on the VM side, so you still have the
safety issue on the host side where the VM will see the physical
contents of the memory but the hypervisor doesn't flush or otherwise
synchronize. I.e. data could leak from VM A to B because the cache
was not flushed and A's data is still in physical memory.
> Agreed, but that doesn't require a memslot flag. A capability to enumerate that
> KVM can do cacheable mappings for PFNMAP memory would suffice. And if we want to
> have KVM reject memslots that are cacheable in the VMA, but would get device in
> stage-2, then we can provide that functionality through the capability, i.e. let
> userspace decide if it wants "fallback to device" vs. "error on creation" on a
> per-VM basis.
I think we must block "fallback to device" as a security fix. So
error on creation becomes the only option.
If that means some scenarios start to fail, and we consider that as
important to fix, the fix is to have the S2 be cacheable and add the
missing CMOs.
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-03-31 14:56 ` Jason Gunthorpe
@ 2025-04-07 15:20 ` Sean Christopherson
2025-04-07 16:15 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2025-04-07 15:20 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Mon, Mar 31, 2025, Jason Gunthorpe wrote:
> On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote:
> > > I don't know how you reconcile the lack of host mapping and cache
> > > maintenance. The latter cannot take place without the former.
> >
> > I assume cache maintenance only requires _a_ mapping to the physical memory.
> > With guest_memfd, KVM has the pfn (which happens to always be struct page memory
> > today), and so can establish a VA=>PA mapping as needed.
>
> This is why we are forcing FWB in this work, because we don't have a
> VA mapping and KVM doesn't have the code to create one on demand.
I don't follow. As it exists today, guest_memfd doesn't touch the direct map,
i.e. there's already a kernel mapping, KVM doesn't need to create one.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-07 15:20 ` Sean Christopherson
@ 2025-04-07 16:15 ` Jason Gunthorpe
2025-04-07 16:43 ` Sean Christopherson
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-07 16:15 UTC (permalink / raw)
To: Sean Christopherson
Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Mon, Apr 07, 2025 at 08:20:09AM -0700, Sean Christopherson wrote:
> On Mon, Mar 31, 2025, Jason Gunthorpe wrote:
> > On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote:
> > > > I don't know how you reconcile the lack of host mapping and cache
> > > > maintenance. The latter cannot take place without the former.
> > >
> > > I assume cache maintenance only requires _a_ mapping to the physical memory.
> > > With guest_memfd, KVM has the pfn (which happens to always be struct page memory
> > > today), and so can establish a VA=>PA mapping as needed.
> >
> > This is why we are forcing FWB in this work, because we don't have a
> > VA mapping and KVM doesn't have the code to create one on demand.
>
> I don't follow. As it exists today, guest_memfd doesn't touch the direct map,
> i.e. there's already a kernel mapping, KVM doesn't need to create one.
This is not about guest_memfd..
When ARM KVM copies a PTE from a VMA into the S2 it may find
that the physical address does not have a struct page, in which case
it assumes the worst: that it is not in the kernel map and cannot be
cache flushed.
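A simplified model of that worst-case assumption, stubbed so it is self-contained; pfn_valid() in the kernel reports whether a PFN has a struct page, and is faked here.

```c
#include <stdbool.h>

/* Stub of pfn_valid(): pretend only low PFNs have struct pages. */
static bool pfn_valid_stub(unsigned long pfn)
{
    return pfn < 0x100000;
}

/* No struct page => assume the PFN has no kernel alias through which
 * VA-based cache maintenance could be done, so CMOs are impossible and
 * a cacheable stage-2 mapping must be refused whenever CMOs are needed. */
static bool can_cmo_pfn(unsigned long pfn)
{
    return pfn_valid_stub(pfn);
}
```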
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-07 16:15 ` Jason Gunthorpe
@ 2025-04-07 16:43 ` Sean Christopherson
2025-04-16 8:51 ` Ankit Agrawal
0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2025-04-07 16:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Marc Zyngier, Ankit Agrawal, Catalin Marinas, Oliver Upton,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Mon, Apr 07, 2025, Jason Gunthorpe wrote:
> On Mon, Apr 07, 2025 at 08:20:09AM -0700, Sean Christopherson wrote:
> > On Mon, Mar 31, 2025, Jason Gunthorpe wrote:
> > > On Wed, Mar 26, 2025 at 11:24:32AM -0700, Sean Christopherson wrote:
> > > > > I don't know how you reconcile the lack of host mapping and cache
> > > > > maintenance. The latter cannot take place without the former.
> > > >
> > > > I assume cache maintenance only requires _a_ mapping to the physical memory.
> > > > With guest_memfd, KVM has the pfn (which happens to always be struct page memory
> > > > today), and so can establish a VA=>PA mapping as needed.
> > >
> > > This is why we are forcing FWB in this work, because we don't have a
> > > VA mapping and KVM doesn't have the code to create one on demand.
> >
> > I don't follow. As it exists today, guest_memfd doesn't touch the direct map,
> > i.e. there's already a kernel mapping, KVM doesn't need to create one.
>
> This is not about guest_memfd..
Heh, my part of the thread was. I was responding to Marc's comment:
: Remind me how this work with stuff such as guestmemfd, which, by
: definition, doesn't have a userspace mapping?
I'm pretty sure we're on the same page.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-07 16:43 ` Sean Christopherson
@ 2025-04-16 8:51 ` Ankit Agrawal
2025-04-21 16:03 ` Ankit Agrawal
2025-04-22 7:49 ` Oliver Upton
0 siblings, 2 replies; 61+ messages in thread
From: Ankit Agrawal @ 2025-04-16 8:51 UTC (permalink / raw)
To: Sean Christopherson, Jason Gunthorpe
Cc: Marc Zyngier, Catalin Marinas, Oliver Upton, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
Hi, summarizing the discussion so far and outlining the next steps. The key points
are as follows:
1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP:
If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin
and Sean point out that this may not be required as userspace does not have
much choice anyway. KVM has to follow the PTEs and userspace cannot ask
for something different. However, Marc notes that enumerating FWB support
would allow userspace to discover the support and prevent live migration
across FWB and non-FWB hosts. Jason suggested that this may still be fine as
we have already built in VFIO side protection where a live migration can be
attempted and then fail because of late-detected HW incompatibilities.
2. New memslot flag that VMM passes at memslot registration:
Discussion point that this is not necessary and KVM should just follow the
VMA pgprot.
3. Fallback path handling for PFNMAP when the FWB is not set:
Discussion points that there shouldn't be any fallback path and the memslot
should just fail, i.e. KVM should not allow degrading cacheable to non-cacheable
when it can't do flushing. This is to prevent the potential security issue
pointed out by Jason (S1 cacheable, S2 non-cacheable).
So AIU, the next step is to send out the updated series with the following patches:
1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region)
and during fault handling (user_mem_abort()).
2. Enable support for cacheable PFN maps if S2FWB is enabled by following
the VMA pgprot (this patch).
3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
for !FWB), pending maintainers' feedback on the necessity of this capability.
Please let me know if there are any inaccuracies.
Thanks
Ankit Agrawal
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-16 8:51 ` Ankit Agrawal
@ 2025-04-21 16:03 ` Ankit Agrawal
2025-04-22 7:49 ` Oliver Upton
1 sibling, 0 replies; 61+ messages in thread
From: Ankit Agrawal @ 2025-04-21 16:03 UTC (permalink / raw)
To: Sean Christopherson, Jason Gunthorpe
Cc: Marc Zyngier, Catalin Marinas, Oliver Upton, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com, will@kernel.org,
ryan.roberts@arm.com, shahuang@redhat.com, lpieralisi@kernel.org,
david@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
> Hi, summarizing the discussion so far and outlining the next steps. The key points
> are as follows:
>
> 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP:
> If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin
> and Sean point out that this may not be required as userspace does not have
> much choice anyway. KVM has to follow the PTEs and userspace cannot ask
> for something different. However, Marc notes that enumerating FWB support
> would allow userspace to discover the support and prevent live migration
> across FWB and non-FWB hosts. Jason suggested that this may still be fine as
> we have already built in VFIO side protection where a live migration can be
> attempted and then fail because of late-detected HW incompatibilities.
>
> 2. New memslot flag that VMM passes at memslot registration:
> Discussion point that this is not necessary and KVM should just follow the
> VMA pgprot.
>
> 3. Fallback path handling for PFNMAP when the FWB is not set:
> Discussion points that there shouldn't be any fallback path and the memslot
> should just fail, i.e. KVM should not allow degrading cacheable to non-cacheable
> when it can't do flushing. This is to prevent the potential security issue
> pointed out by Jason (S1 cacheable, S2 non-cacheable).
>
> So AIU, the next step is to send out the updated series with the following patches:
> 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region)
> and during fault handling (user_mem_abort()).
>
> 2. Enable support for cacheable PFN maps if S2FWB is enabled by following
> the vma pgprot (this patch).
>
> 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
> for !FWB), pending maintainers' feedback on the necessity of this capability.
Hi, just a humble reminder to take a look at the summary and next steps.
Marc, will you be able to confirm if you still think we should have the
"cacheable PFNMAP" KVM cap?
If yes, I'll send out the series inclusive of that.
> Please let me know if there are any inaccuracies.
>
> Thanks
> Ankit Agrawal
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-16 8:51 ` Ankit Agrawal
2025-04-21 16:03 ` Ankit Agrawal
@ 2025-04-22 7:49 ` Oliver Upton
2025-04-22 13:54 ` Jason Gunthorpe
2025-04-22 14:53 ` Sean Christopherson
1 sibling, 2 replies; 61+ messages in thread
From: Oliver Upton @ 2025-04-22 7:49 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Sean Christopherson, Jason Gunthorpe, Marc Zyngier,
Catalin Marinas, joey.gouly@arm.com, suzuki.poulose@arm.com,
yuzenghui@huawei.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke, Dheeraj Nigam,
Krishnakant Jaju, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
Hi,
On Wed, Apr 16, 2025 at 08:51:05AM +0000, Ankit Agrawal wrote:
> Hi, summarizing the discussion so far and outlining the next steps. The key points
> are as follows:
> 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP:
> If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin
> and Sean point out that this may not be required as userspace does not have
> much choice anyway. KVM has to follow the PTEs and userspace cannot ask
> for something different. However, Marc notes that enumerating FWB support
> would allow userspace to discover the support and prevent live migration
> across FWB and non-FWB hosts. Jason suggested that this may still be fine as
> we have already built in VFIO side protection where a live migration can be
> attempted and then fail because of late-detected HW incompatibilities.
>
> 2. New memslot flag that VMM passes at memslot registration:
> Discussion point that this is not necessary and KVM should just follow the
> VMA pgprot.
>
> 3. Fallback path handling for PFNMAP when the FWB is not set:
> Discussion points that there shouldn't be any fallback path and the memslot
> should just fail, i.e. KVM should not allow degrading cacheable to non-cacheable
> when it can't do flushing. This is to prevent the potential security issue
> pointed out by Jason (S1 cacheable, S2 non-cacheable).
>
>
> So AIU, the next step is to send out the updated series with the following patches:
> 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region)
> and during fault handling (user_mem_abort()).
Yes, we need to prevent the creation of stage-2 mappings to PFNMAP memory
that uses cacheable attributes in the host stage-1. I believe we have alignment
that this is a bugfix.
> 2. Enable support for cacheable PFN maps if S2FWB is enabled by following
> the vma pgprot (this patch).
>
> 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
> for !FWB), pending maintainers' feedback on the necessity of this capability.
Regarding UAPI: I'm still convinced that we need the VMM to buy in to this
behavior. And no, it doesn't matter if this is some VFIO-based mapping
or kernel-managed memory.
The reality is that userspace is an equal participant in remaining coherent with
the guest. Whether or not FWB is employed for a particular region of IPA
space is useful information for userspace deciding what it needs to do to access guest
memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for
'normal', kernel-managed memory so it understands what CMOs may be necessary when (for
example) doing live migration of the VM.
So this KVM CAP needs to be paired with a memslot flag.
- The capability says KVM is able to enforce Write-Back at stage-2
- The memslot flag says userspace expects a particular GFN range to guarantee
Write-Back semantics. This can be applied to 'normal', kernel-managed memory
and PFNMAP thingies that have cacheable attributes at host stage-1.
- Under no situation do we allow userspace to create non-cacheable mapping at
stage-2 for something PFNMAP cacheable at stage-1.
No matter what, my understanding is that we all agree the driver which provided the
host stage-1 mapping is the authoritative source for memory attributes compatible
with a given device. The accompanying UAPI is necessary for the VMM to understand how
to handle arbitrary cacheable mappings provided to the VM.
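A rough sketch of the proposed CAP/flag interaction follows. The capability and flag semantics here are invented for illustration and are not existing KVM UAPI.

```c
#include <stdbool.h>

enum slot_outcome { SLOT_OK_WB, SLOT_OK_DEFAULT, SLOT_REJECT };

/* Hypothetical VMM-side decision: the capability advertises that KVM can
 * enforce Write-Back at stage-2; the memslot flag requests that guarantee
 * for a particular GFN range. Neither exists in current UAPI. */
static enum slot_outcome create_slot(bool cap_wb_enforce, bool flag_force_wb)
{
    if (flag_force_wb && !cap_wb_enforce)
        return SLOT_REJECT;      /* never silently degrade to non-WB */
    if (flag_force_wb)
        return SLOT_OK_WB;       /* KVM guarantees Write-Back (FWB) */
    return SLOT_OK_DEFAULT;      /* normal attribute selection */
}
```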
Thanks,
Oliver
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 7:49 ` Oliver Upton
@ 2025-04-22 13:54 ` Jason Gunthorpe
2025-04-22 16:50 ` Catalin Marinas
2025-04-22 14:53 ` Sean Christopherson
1 sibling, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-22 13:54 UTC (permalink / raw)
To: Oliver Upton
Cc: Ankit Agrawal, Sean Christopherson, Marc Zyngier, Catalin Marinas,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote:
> The reality is that userspace is an equal participant in remaining coherent with
> the guest. Whether or not FWB is employed for a particular region of IPA
> space is useful information for userspace deciding what it needs to do to access guest
> memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for
> 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for
> example) doing live migration of the VM.
Really? How does it work today then? Is this another existing problem?
Userspace is doing CMOs during live migration that are not necessary?
> So this KVM CAP needs to be paired with a memslot flag.
>
> - The capability says KVM is able to enforce Write-Back at stage-2
Sure
> - The memslot flag says userspace expects a particular GFN range to guarantee
> Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> and PFNMAP thingies that have cacheable attributes at host stage-1.
Userspace doesn't actually know if it has a cacheable mapping from VFIO
though :(
I don't really see a point in this. If KVM has the cap then
userspace should assume the S2FWB behavior for all cacheable memslots.
What should happen if you have S2FWB but don't pass the flag? For
normal kernel memory it should still use S2FWB. Thus for cacheable
PFNMAP it makes sense that it should also still use S2FWB without the
flag?
So, if you set the flag and don't have S2FWB it will fail the memslot,
but then why not just rely on userspace to read the CAP and not create
the memslot in the first place?
If you don't set the flag then it should go ahead and use S2FWB anyway
and not fail.
It doesn't make a lot of sense to me and brings more complexity to
force userspace to discover the cacheability of the VFIO side.
> - Under no situation do we allow userspace to create non-cacheable mapping at
> stage-2 for something PFNMAP cacheable at stage-1.
Yes. memslot creation should fail, and page fault should fail.
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 7:49 ` Oliver Upton
2025-04-22 13:54 ` Jason Gunthorpe
@ 2025-04-22 14:53 ` Sean Christopherson
1 sibling, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2025-04-22 14:53 UTC (permalink / raw)
To: Oliver Upton
Cc: Ankit Agrawal, Jason Gunthorpe, Marc Zyngier, Catalin Marinas,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025, Oliver Upton wrote:
> Hi,
>
> On Wed, Apr 16, 2025 at 08:51:05AM +0000, Ankit Agrawal wrote:
> > Hi, summarizing the discussion so far and outlining the next steps. The key points
> > are as follows:
> > 1. KVM cap to expose whether the kernel supports mapping cacheable PFNMAP:
> > If the host doesn't have FWB, then the capability doesn't exist. Jason, Oliver, Catalin
> > and Sean point out that this may not be required as userspace does not have
> > much choice anyway. KVM has to follow the PTEs and userspace cannot ask
> > for something different. However, Marc notes that enumerating FWB support
> > would allow userspace to discover the support and prevent live migration
> > across FWB and non-FWB hosts. Jason suggested that this may still be fine as
> > we have already built in VFIO side protection where a live migration can be
> > attempted and then fail because of late-detected HW incompatibilities.
> >
> > 2. New memslot flag that VMM passes at memslot registration:
> > Discussion point that this is not necessary and KVM should just follow the
> > VMA pgprot.
> >
> > 3. Fallback path handling for PFNMAP when FWB is not set:
> > The discussion points to there not being any fallback path; the memslot
> > should just fail, i.e. KVM should not allow degrading cacheable to non-cacheable
> > when it can't do flushing. This is to prevent the potential security issue
> > pointed out by Jason (S1 cacheable, S2 non-cacheable).
> >
> >
> > So AIUI, the next step is to send out the updated series with the following patches:
> > 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region)
> > and during fault handling (user_mem_abort()).
>
> Yes, we need to prevent the creation of stage-2 mappings to PFNMAP memory
> that uses cacheable attributes in the host stage-1. I believe we have alignment
> that this is a bugfix.
>
> > 2. Enable support for cacheable PFN maps if S2FWB is enabled by following
> > the vma pgprot (this patch).
> >
> > 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
> > for !FWB), pending maintainers' feedback on the necessity of this capability.
>
> Regarding UAPI: I'm still convinced that we need the VMM to buy in to this
> behavior. And no, it doesn't matter if this is some VFIO-based mapping
> or kernel-managed memory.
>
> The reality is that userspace is an equal participant in remaining coherent with
> the guest. Whether or not FWB is employed for a particular region of IPA
> space is useful information for userspace deciding what it needs to do to access guest
> memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for
> 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for
> example) doing live migration of the VM.
>
> So this KVM CAP needs to be paired with a memslot flag.
>
> - The capability says KVM is able to enforce Write-Back at stage-2
>
> - The memslot flag says userspace expects a particular GFN range to guarantee
> Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> and PFNMAP thingies that have cacheable attributes at host stage-1.
I am very strongly opposed to adding a memslot flag. IMO, it sets a terrible
precedent, and I am struggling to understand why a per-VM CAP isn't sufficient
protection for the VMM.
> - Under no situation do we allow userspace to create non-cacheable mapping at
> stage-2 for something PFNMAP cacheable at stage-1.
>
> No matter what, my understanding is that we all agree the driver which provided the
> host stage-1 mapping is the authoritative source for memory attributes compatible
> with a given device. The accompanying UAPI is necessary for the VMM to understand how
> to handle arbitrary cacheable mappings provided to the VM.
>
> Thanks,
> Oliver
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 13:54 ` Jason Gunthorpe
@ 2025-04-22 16:50 ` Catalin Marinas
2025-04-22 17:03 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-04-22 16:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025 at 10:54:52AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote:
> > - The memslot flag says userspace expects a particular GFN range to guarantee
> > Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> > and PFNMAP thingies that have cacheable attributes at host stage-1.
>
> Userspace doesn't actually know if it has a cacheable mapping from VFIO
> though :(
Yes, that's why I couldn't figure out how useful a memory slot flag
would be.
A CAP is definitely useful, and some way of preventing migration between
hosts with different capabilities for VM_PFNMAP is needed. However, I
think we only want to prevent migration _if_ there's a cacheable
VM_PFNMAP; otherwise KVM handles coherency via CMOs already.
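Written out, the migration-compatibility rule above is tiny. A sketch (the helper name and boolean inputs are invented for illustration, not real KVM or VMM code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the rule: only a cacheable VM_PFNMAP region ties the VM to
 * FWB-capable destination hosts. For everything else KVM keeps
 * coherency via CMOs, so migrating to a !FWB host remains fine.
 */
bool must_block_migration(bool vm_has_cacheable_pfnmap, bool dst_has_fwb)
{
	return vm_has_cacheable_pfnmap && !dst_has_fwb;
}
```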
So, for the above, the VMM needs to know that it somehow got into such
situation. If it knows the device (VFIO) capabilities and that the user
mapping is Cacheable, coupled with the new KVM CAP, it can infer that
Stage 2 will be S2FWB, no need for a memory slot flag. If it doesn't
have such information, maybe a new memory slot flag can be used to probe
what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it
fails, Stage 2 is Device/NC and the VMM can attempt again without the WB flag.
It's a bit of a stretch for the KVM API but IIUC there's no option to
query the properties of a memory slot.
FWIW, I don't think we need a memory slot flag for host kernel-managed
memory; it's S2FWB already.
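If such a probe flag were added, the VMM-side flow would look roughly like this. This is only a sketch: KVM_MEM_PFNMAP_WB is a name proposed in this thread, not an existing UAPI bit, and set_memslot() stands in for a KVM_SET_USER_MEMORY_REGION ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Proposed flag from this thread; not a merged UAPI bit. */
#define KVM_MEM_PFNMAP_WB (1u << 3)

/*
 * Stand-in for memslot registration: a kernel with S2FWB accepts the
 * WB request, a kernel without it rejects the flag.
 */
int set_memslot(bool host_has_fwb, unsigned int flags)
{
	if ((flags & KVM_MEM_PFNMAP_WB) && !host_has_fwb)
		return -EINVAL;	/* cannot guarantee Write-Back at stage 2 */
	return 0;
}

/*
 * The probe sequence: ask for KVM_MEM_PFNMAP_WB first; if that fails,
 * stage 2 will be Device/NC and the slot is registered without the flag.
 */
bool probe_stage2_wb(bool host_has_fwb)
{
	if (set_memslot(host_has_fwb, KVM_MEM_PFNMAP_WB) == 0)
		return true;		/* stage 2 will be Normal-WB */
	assert(set_memslot(host_has_fwb, 0) == 0);
	return false;			/* stage 2 is Device/NC */
}
```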
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 16:50 ` Catalin Marinas
@ 2025-04-22 17:03 ` Jason Gunthorpe
2025-04-22 21:28 ` Oliver Upton
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-22 17:03 UTC (permalink / raw)
To: Catalin Marinas
Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025 at 05:50:32PM +0100, Catalin Marinas wrote:
> So, for the above, the VMM needs to know that it somehow got into such
> situation. If it knows the device (VFIO) capabilities and that the user
> mapping is Cacheable, coupled with the new KVM CAP, it can infer that
> Stage 2 will be S2FWB, no need for a memory slot flag.
So long as the memslot creation fails for cacheable PFNMAP without
S2FWB, the VMM is fine. QEMU will begin its first steps to start up the
migration destination and immediately fail. The migration will be
aborted before it even gets started on the source side.
As I said before, the present situation requires the site's
orchestration to manage compatibility for live migration of VFIO
devices. We only expect that the migration will abort early if the
site has made a configuration error.
> have such information, maybe a new memory slot flag can be used to probe
> what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it
> fails, Stage 2 is Device/NC and the VMM can attempt again without the WB flag.
> It's a bit of a stretch for the KVM API but IIUC there's no option to
> query the properties of a memory slot.
I don't know of any use case for something like this. If VFIO gives
the VMM a cacheable mapping there is no fallback; it has to be WB.
The operator could use a different VFIO device, one that doesn't need
cacheable, but the VMM can't flip the VFIO device between modes on the
fly.
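The fail-early behaviour above can be modelled as a tiny decision function. This is a sketch of the intended policy only, not the actual kvm_arch_prepare_memory_region() code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Sketch of the check a !S2FWB host would apply at memslot creation:
 * a PFNMAP region that is cacheable at host stage 1 is rejected
 * outright, never silently degraded to Device/NC (the S1-cacheable /
 * S2-non-cacheable alias would be a coherency and security problem).
 */
int prepare_pfnmap_memslot(bool s1_cacheable, bool host_has_fwb)
{
	if (s1_cacheable && !host_has_fwb)
		return -EINVAL;	/* fail early, e.g. on a migration destination */
	return 0;
}
```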
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 17:03 ` Jason Gunthorpe
@ 2025-04-22 21:28 ` Oliver Upton
2025-04-22 23:35 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Oliver Upton @ 2025-04-22 21:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Catalin Marinas, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025 at 10:54:52AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote:
> > The reality is that userspace is an equal participant in remaining coherent with
> > the guest. Whether or not FWB is employed for a particular region of IPA
> > space is useful information for userspace deciding what it needs to do to access guest
> > memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for
> > 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for
> > example) doing live migration of the VM.
>
> Really? How does it work today then? Is this another existing problem?
> Userspace is doing CMOs during live migration that are not necessary?
Yes, this is a pre-existing problem. I'm not aware of a live migration
implementation that handles !S2FWB correctly; they assume all guest
accesses are done through a cacheable alias.
So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd
probably want to know it can elide CMOs on something that actually bears
the feature.
> > - The memslot flag says userspace expects a particular GFN range to guarantee
> > Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> > and PFNMAP thingies that have cacheable attributes at host stage-1.
>
> Userspace doesn't actually know if it has a cacheable mapping from VFIO
> though :(
That seems like a shortcoming on the VFIO side, not a KVM issue. What if
userspace wants to do atomics on some VFIO mapping, doesn't it need to
know that it has something with WB?
> I don't really see a point in this. If the KVM has the cap then
> userspace should assume the S2FWB behavior for all cachable memslots.
Wait, so userspace simultaneously doesn't know the cacheability at host
stage-1 but *does* for stage-2? This is why I contend that userspace
needs a mechanism to discover the memory attributes on a given memslot.
Without it there's no way of knowing what's a cacheable memslot.
Along those lines, how is the VMM going to describe that cacheable
PFNMAP region to the guest?
> What should happen if you have S2FWB but don't pass the flag? For
> normal kernel memory it should still use S2FWB. Thus for cachable
> PFNMAP it makes sense that it should also still use S2FWB without the
> flag?
For kernel-managed memory, I agree. Accepting the flag for a memslot
containing such memory would solely be for discoverability.
OTOH, cacheable PFNMAP is a new feature and I see no issue compelling
the use of a new bit with it.
On Tue, Apr 22, 2025 at 02:03:24PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 05:50:32PM +0100, Catalin Marinas wrote:
>
> > So, for the above, the VMM needs to know that it somehow got into such
> > situation. If it knows the device (VFIO) capabilities and that the user
> > mapping is Cacheable, coupled with the new KVM CAP, it can infer that
> > Stage 2 will be S2FWB, no need for a memory slot flag.
>
> So long as the memslot creation fails for cachable PFNMAP without
> S2FWB the VMM is fine. qemu will begin its first steps to startup the
> migration destination and immediately fail. The migration will be
> aborted before it even gets started on the source side.
>
> As I said before, the present situation requires the site's
> orchestration to manage compatibility for live migration of VFIO
> devices. We only expect that the migration will abort early if the
> site has made a configuration error.
>
> > have such information, maybe a new memory slot flag can be used to probe
> > what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it
> > fails, Stage 2 is Device/NC and the VMM can attempt again without the WB flag.
> > It's a bit of a stretch for the KVM API but IIUC there's no option to
> > query the properties of a memory slot.
>
> I don't know of any use case for something like this. If VFIO gives
> the VMM a cachable mapping there is no fallback to WB.
>
> The operator could use a different VFIO device, one that doesn't need
> cachable, but the VMM can't flip the VFIO device between modes on the
> fly.
I agree with you that in the context of a VFIO device userspace doesn't
have any direct influence on the resulting memory attributes.
The entire reason I'm dragging my feet about this is I'm concerned we've
papered over the complexity of memory attributes (regardless of
provenance) for way too long. KVM's done enough to make this dance 'work'
in the context of kernel-managed memory, but adding more implicit KVM
behavior for cacheable thingies makes the KVM UAPI even more
unintelligible (as if it weren't already).
So this flag isn't about giving userspace any degree of control over
memory attributes. It is just a way to know that things it _expects_ to
be treated as cacheable are guaranteed to use cacheable attributes in
the VM.
Thanks,
Oliver
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 21:28 ` Oliver Upton
@ 2025-04-22 23:35 ` Jason Gunthorpe
2025-04-23 10:45 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-22 23:35 UTC (permalink / raw)
To: Oliver Upton
Cc: Catalin Marinas, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote:
> So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd
> probably want to know it can elide CMOs on something that actually bears
> the feature.
OK
> > > - The memslot flag says userspace expects a particular GFN range to guarantee
> > > Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> > > and PFNMAP thingies that have cacheable attributes at host stage-1.
> >
> > Userspace doesn't actually know if it has a cacheable mapping from VFIO
> > though :(
>
> That seems like a shortcoming on the VFIO side, not a KVM issue. What if
> userspace wants to do atomics on some VFIO mapping, doesn't it need to
> know that it has something with WB?
VFIO is almost always uncacheable. So far there is only one device
merged (and CXL is coming) that would even support/require cacheable.
VFIO has always had this sort of programming model where userspace
needs to know details about the device it is using, so it hasn't really
been an issue so far that the kernel doesn't tell userspace what
cacheability it got. We could add it, but then we are adding new kernel
code, QEMU code and KVM code that is actually pretty pointless.
> > I don't really see a point in this. If the KVM has the cap then
> > userspace should assume the S2FWB behavior for all cachable memslots.
>
> Wait, so userspace simultaneously doesn't know the cacheability at host
> stage-1 but *does* for stage-2?
No, it doesn't know either. The point is the VMM doesn't care about
any of this. It just wants to connect KVM to VFIO and have the kernel
internally negotiate the details of how that works.
There is zero value in the VMM being aware that KVM/VFIO is using
cacheable or non-cacheable mappings because it will never touch this
memory anyhow, and arguably it would be happier if it wasn't even in a
VMA in the first place.
> This is why I contend that userspace needs a mechanism to discover
> the memory attributes on a given memslot. Without it there's no way
> of knowing what's a cacheable memslot.
If it cares about this then it should know by virtue of having put a
cacheable VMA into the memslot.
> Along those lines, how is the VMM going to describe that cacheable
> PFNMAP region to the guest?
Heh. It creates a virtual PCI device in the guest and the space is
mapped to a virtual BAR. When the guest driver binds to this device it
will map the virtual BAR as cacheable instead of as I/O because it knows
to do that based on the vPCI device ID.
If something goes weird and the guest tries to use UC instead of
cacheable, S2FWB will block it and the guest will probably
malfunction.
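The "S2FWB will block it" behaviour follows from how stage-1 and stage-2 attributes combine. A much-simplified model (the real Arm combining rules are more detailed; Device subtypes and Write-Through are folded away here into a single strength ordering):

```c
#include <assert.h>
#include <stdbool.h>

/* Memory types ordered weakest to strongest caching; a simplification
 * of the Arm memory type hierarchy. */
enum memtype { MT_DEVICE, MT_NORMAL_NC, MT_NORMAL_WB };

/*
 * Combined attribute seen by the guest: without FWB the more
 * restrictive of stage 1 and stage 2 wins; with FWB and a Normal-WB
 * stage-2 encoding, the guest's stage-1 choice is overridden and the
 * access is Write-Back regardless.
 */
enum memtype effective_memtype(enum memtype guest_s1, enum memtype s2,
			       bool fwb_forces_wb)
{
	if (fwb_forces_wb)
		return MT_NORMAL_WB;	/* guest cannot demote to UC/Device */
	return guest_s1 < s2 ? guest_s1 : s2;
}
```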
CXL will probably have the VMM understand things a bit more and will
generate the various ACPI tables CXL uses.
> > What should happen if you have S2FWB but don't pass the flag? For
> > normal kernel memory it should still use S2FWB. Thus for cachable
> > PFNMAP it makes sense that it should also still use S2FWB without the
> > flag?
>
> For kernel-managed memory, I agree. Accepting the flag for a memslot
> containing such memory would solely be for discoverability.
>
> OTOH, cacheable PFNMAP is a new feature and I see no issue compelling
> the use of a new bit with it.
Feels weird to me. Now the VMM has to discover whether KVM supports the
new flag and only use it on new KVM versions, just to accomplish...
nothing?
> The entire reason I'm dragging my feet about this is I'm concerned we've
> papered over the complexity of memory attributes (regardless of
> provenance) for way too long. KVM's done enough to make this dance 'work'
> in the context of kernel-managed memory, but adding more implicit KVM
> behavior for cacheable thingies makes the KVM UAPI even more
> unintelligible (as if it weren't already).
It is very complex. I'm not sure adding a flag in this case is making
it any simpler though.
If you had a flag from day 0 that said 'this is an MMIO mapping, do MMIO
stuff' that would be a lot clearer about how it should be used.
But here we have a flag that doesn't seem well defined. When should
the VMM set this flag?
> So this flag isn't about giving userspace any degree of control over
> memory attributes. Just a way to know for things it _expects_ to be
> treated as cacheable can be guaranteed to use cacheable attributes in
> the VM.
Can you get some agreement with Sean? He seems very strongly opposed to
this direction.
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-22 23:35 ` Jason Gunthorpe
@ 2025-04-23 10:45 ` Catalin Marinas
2025-04-23 12:02 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-04-23 10:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 22, 2025 at 08:35:56PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote:
> > Wait, so userspace simultaneously doesn't know the cacheability at host
> > stage-1 but *does* for stage-2?
>
> No, it doesn't know either. The point is the VMM doesn't care about
> any of this. It just wants to connect KVM to VFIO and have the kernel
> internally negotiate the details of how that works.
>
> There is zero value in the VMM being aware that KVM/VFIO is using
> cachable or non-cachable mappings because it will never touch this
> memory anyhow, and arguably it would be happier if it wasn't even in a
> VMA in the first place.
I think this was discussed at some point in the past - what about
ignoring the VMA altogether and using an fd-based approach? Keep the
current VFIO+vma ABI non-cacheable and introduce a new one for cacheable
I/O mappings, e.g. KVM_MEM_IOFD. Is there a way for VFIO to communicate
to KVM the attributes of such a mapping in the absence of a VMA
(e.g. via the inode)? If not:
Stage 2 could default to non-cacheable and we can add a
KVM_MEMORY_ATTRIBUTE_IO_WB as an option to
ioctl(KVM_SET_MEMORY_ATTRIBUTES) (setup requiring two ioctls). The
latter could fail if S2FWB is not available. But I agree, it does mean
building knowledge of the device in the VMM (it might not be that bad).
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-23 10:45 ` Catalin Marinas
@ 2025-04-23 12:02 ` Jason Gunthorpe
2025-04-23 12:26 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 12:02 UTC (permalink / raw)
To: Catalin Marinas
Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Apr 23, 2025 at 11:45:04AM +0100, Catalin Marinas wrote:
> On Tue, Apr 22, 2025 at 08:35:56PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote:
> > > Wait, so userspace simultaneously doesn't know the cacheability at host
> > > stage-1 but *does* for stage-2?
> >
> > No, it doesn't know either. The point is the VMM doesn't care about
> > any of this. It just wants to connect KVM to VFIO and have the kernel
> > internally negotiate the details of how that works.
> >
> > There is zero value in the VMM being aware that KVM/VFIO is using
> > cachable or non-cachable mappings because it will never touch this
> > memory anyhow, and arguably it would be happier if it wasn't even in a
> > VMA in the first place.
>
> I think this was discussed at some point in the past - what about
> ignoring the VMA altogether and using an fd-based approach? Keep the
> current VFIO+vma ABI non-cacheable and introduce a new one for cacheable
> I/O mappings, e.g. KVM_MEM_IOFD. Is there a way for VFIO to communicate
> the attributes to KVM on the type of mapping in the absence of a VMA
> (e.g. via the inode)? If not:
I hope to see that someday. The CC people are working in that
direction, but we are still far from getting enough agreement from all
the stakeholders on how that will work.
> Stage 2 could default to non-cacheable and we can add a
> KVM_MEMORY_ATTRIBUTE_IO_WB as an option to
I don't think we should ever allow KVM to create a non-cacheable
mapping of an otherwise cacheable object?
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-23 12:02 ` Jason Gunthorpe
@ 2025-04-23 12:26 ` Catalin Marinas
2025-04-23 13:03 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-04-23 12:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Apr 23, 2025 at 09:02:43AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 11:45:04AM +0100, Catalin Marinas wrote:
> > On Tue, Apr 22, 2025 at 08:35:56PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote:
> > > > Wait, so userspace simultaneously doesn't know the cacheability at host
> > > > stage-1 but *does* for stage-2?
> > >
> > > No, it doesn't know either. The point is the VMM doesn't care about
> > > any of this. It just wants to connect KVM to VFIO and have the kernel
> > > internally negotiate the details of how that works.
> > >
> > > There is zero value in the VMM being aware that KVM/VFIO is using
> > > cachable or non-cachable mappings because it will never touch this
> > > memory anyhow, and arguably it would be happier if it wasn't even in a
> > > VMA in the first place.
> >
> > I think this was discussed at some point in the past - what about
> > ignoring the VMA altogether and using an fd-based approach? Keep the
> > current VFIO+vma ABI non-cacheable and introduce a new one for cacheable
> > I/O mappings, e.g. KVM_MEM_IOFD. Is there a way for VFIO to communicate
> > the attributes to KVM on the type of mapping in the absence of a VMA
> > (e.g. via the inode)? If not:
>
> I hope to see that someday. The CC people are working in that
> direction, but we are still far from getting enough agreement from all
> the stakeholders on how that will work.
>
> > Stage 2 could default to non-cacheable and we can add a
> > KVM_MEMORY_ATTRIBUTE_IO_WB as an option to
>
> I don't think we should ever allow KVM to create a non-cachable
> mapping of an otherwise cachable object??
Ah, good point. So we want a strict cacheable mapping even if there is
no user/VMM mapping alias. Is it because KVM cannot do cache maintenance
on the PFNMAP or because the device does not support other memory types?
Both valid reasons though.
In this case KVM still needs to know the properties of the device. Not
sure how it could best do this without a vma.
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-23 12:26 ` Catalin Marinas
@ 2025-04-23 13:03 ` Jason Gunthorpe
2025-04-29 10:47 ` Ankit Agrawal
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 13:03 UTC (permalink / raw)
To: Catalin Marinas
Cc: Oliver Upton, Ankit Agrawal, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, Apr 23, 2025 at 01:26:51PM +0100, Catalin Marinas wrote:
> Ah, good point. So we want a strict cacheable mapping even if there is
> no user/VMM mapping alias. Is it because KVM cannot do cache maintenance
> on the PFNMAP or because the device does not support other memory types?
> Both valid reasons though.
Lack of cache maintenance is the main problem for the vfio-grace device.
> In this case KVM still needs to know the properties of the device. Not
> sure how it could best do this without a vma.
Well, the idea I hope to succeed with would annotate this kind of
information inside the page list that would be exchanged through the
FD.
There is a monstrous thread here about this topic:
https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/
I can't find it in the huge thread, but I did explain some thoughts on
how it could work.
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-23 13:03 ` Jason Gunthorpe
@ 2025-04-29 10:47 ` Ankit Agrawal
2025-04-29 13:27 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-04-29 10:47 UTC (permalink / raw)
To: Jason Gunthorpe, Catalin Marinas
Cc: Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
>> In this case KVM still needs to know the properties of the device. Not
>> sure how it could best do this without a vma.
>
> Well, the idea I hope to succeed with would annotate this kind of
> information inside the page list that would be exchanged through the
> FD.
>
> There is an monstrous thread here about this topic:
>
>
>
> https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/
>
> I can't find it in the huge thread but I did explain some thoughts on
> how it could work
>
> Jason
Hi,
Based on the recent discussions, I gather that having a KVM_CAP
to expose the support for cacheable PFNMAP to the VMM would be
useful from a VM migration point of view.
However, it appears that the memslot flag isn't a must-have; the
flag cannot influence the KVM code anyway. For FWB, the PFNMAP would
be cacheable and userspace should just assume S2FWB behavior; it would
be a security bug otherwise, as Jason pointed out earlier (S1 cacheable,
S2 non-cacheable). For !FWB, a cacheable PFNMAP cannot be allowed,
and the VMM shouldn't attempt to create the memslot at all, going by the cap.
Also, can we take the fd-based communication path between VFIO
and KVM separately?
I am planning to send out the series with the following implementation.
Please let me know if there are any disagreements or concerns.
1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region)
and during fault handling (user_mem_abort()).
2. Enable support for cacheable PFN maps if S2FWB is enabled by following
the vma pgprot (this patch).
3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
for !FWB).
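As a rough illustration of the proposed three steps (a sketch only; the enum and helper names below are hypothetical and do not match the actual KVM symbols), the decision logic amounts to:

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not the real KVM symbols. */
enum mem_type { MT_NORMAL, MT_NORMAL_NC, MT_DEVICE_nGnRE };

/*
 * Steps 1 and 2: a cacheable (MT_NORMAL in the VMA pgprot) PFNMAP is
 * only allowed when the CPU has S2FWB; without FWB, KVM would need a
 * kernel alias of the memory to do cache maintenance, which a PFNMAP
 * does not have.
 */
static bool pfnmap_mapping_allowed(enum mem_type vma_type, bool has_fwb)
{
	if (vma_type != MT_NORMAL)
		return true;	/* non-cacheable PFNMAP: no new restriction */
	return has_fwb;		/* reject cacheable PFNMAP on !FWB */
}

/*
 * Step 3: the new capability reports whether a cacheable PFNMAP can
 * ever be accepted on this system (false for !FWB).
 */
static bool cap_cacheable_pfnmap(bool has_fwb)
{
	return has_fwb;
}
```

In this model the VMM would query the capability first and only attempt a cacheable PFNMAP memslot when it reports true, matching the fault-time check.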
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 10:47 ` Ankit Agrawal
@ 2025-04-29 13:27 ` Catalin Marinas
2025-04-29 14:14 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-04-29 13:27 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 29, 2025 at 10:47:58AM +0000, Ankit Agrawal wrote:
> >> In this case KVM still needs to know the properties of the device. Not
> >> sure how it could best do this without a vma.
> >
> > Well, the idea I hope to succeed with would annotate this kind of
> > information inside the page list that would be exchanged through the
> > FD.
> >
> > There is a monstrous thread here about this topic:
> >
> >
> >
> > https://lore.kernel.org/all/20250107142719.179636-1-yilun.xu@linux.intel.com/
> >
> > I can't find it in the huge thread but I did explain some thoughts on
> > how it could work
> >
> > Jason
>
> Hi,
>
> Based on the recent discussions, I gather that having a KVM_CAP
> to expose the support for cacheable PFNMAP to the VMM would be
> useful from VM migration point of view.
>
> However it appears that the memslot flag isn't a must-have. The memslot
> flag cannot influence the KVM code anyways. For FWB, the PFNMAP would
> be cacheable and userspace should just assume S2FWB behavior; it would
> be a security bug otherwise as Jason pointed out earlier (S1 cacheable,
> S2 noncacheable). For !FWB, a cacheable PFNMAP could not be allowed
> and VMM shouldn't attempt to create memslot at all by referring the cap.
>
> Also, can we take the fd based communication path between VFIO
> and KVM separately?
>
> I am planning to send out the series with the following implementation.
> Please let me know if there are any disagreements or concerns.
>
> 1. Block cacheable PFN map in memslot creation (kvm_arch_prepare_memory_region)
> and during fault handling (user_mem_abort()).
I forgot the details here. I think it makes sense in general, though
with the first patch we currently block cacheable PFNMAP anyway.
Probably what you meant already, but this patch should block the
PFNMAP slot if there's a cacheable S1 alias.
> 2. Enable support for cacheable PFN maps if S2FWB is enabled by following
> the vma pgprot (this patch).
> 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
> for !FWB).
I'll defer the memslot flag decision to the KVM maintainers. If we had
one, it would enforce (2) or reject it as per (1) depending on the S1
attributes.
Without the memslot flag, I assume at least the VMM will have to enable
KVM_CAP_ARM_WB_PFNMAP (or whatever it will be called) to get the new
behaviour.
BTW, we should reject exec mappings as well (they probably fail for S1
VFIO since set_pte_at() will try to do cache maintenance).
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 13:27 ` Catalin Marinas
@ 2025-04-29 14:14 ` Jason Gunthorpe
2025-04-29 16:03 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-29 14:14 UTC (permalink / raw)
To: Catalin Marinas
Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote:
> BTW, we should reject exec mappings as well (they probably fail for S1
> VFIO since set_pte_at() will try to do cache maintenance).
To be clear, the S2 should leave the mapping as execute-allowed
though. Only the VM knows how it will use this memory, and VMs do
actually execute out of the cacheable PFNMAP VMA today. The VM will set
any execute deny/allow in its S1 table according to how it uses the
memory.
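A tiny model of this point (illustrative only; all names here are invented): the permission a guest instruction fetch effectively gets is the AND of the guest's own S1 permission and KVM's S2 permission, so leaving the S2 entry execute-allowed simply delegates the decision to the guest:

```c
#include <stdbool.h>

/*
 * Illustrative model, not kernel code: an instruction fetch in the
 * guest is permitted only if both translation stages allow execution.
 */
static bool guest_can_exec(bool s1_xn, bool s2_xn)
{
	return !s1_xn && !s2_xn;	/* xn == execute-never bit set */
}

/*
 * KVM's choice for the cacheable PFNMAP S2 entry in this model:
 * execute allowed, so the guest's S1 XN bit alone decides.
 */
static const bool s2_xn_for_pfnmap = false;
```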
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 14:14 ` Jason Gunthorpe
@ 2025-04-29 16:03 ` Catalin Marinas
2025-04-29 16:44 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-04-29 16:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote:
> > BTW, we should reject exec mappings as well (they probably fail for S1
> > VFIO since set_pte_at() will try to do cache maintenance).
>
> To be clear the S2 should leave the mapping as execute allowed
> though. Only the VM knows how it will use this memory and VMs do
> actually execute out of the cacheable PFNMAP VMA today. The VM will set
> any execute deny/allow on its S1 table according to how it uses the
> memory.
If S2 is executable, wouldn't KVM try to invalidate the I-cache, even
though it won't have an alias to do this? Unless it doesn't end up in
stage2_map_walker_try_leaf() or the walk has been flagged as skipping
the CMO.
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 16:03 ` Catalin Marinas
@ 2025-04-29 16:44 ` Jason Gunthorpe
2025-04-29 18:09 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-29 16:44 UTC (permalink / raw)
To: Catalin Marinas
Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 29, 2025 at 05:03:18PM +0100, Catalin Marinas wrote:
> On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote:
> > > BTW, we should reject exec mappings as well (they probably fail for S1
> > > VFIO since set_pte_at() will try to do cache maintenance).
> >
> > To be clear the S2 should leave the mapping as execute allowed
> > though. Only the VM knows how it will use this memory and VMs do
> > actually execute out of the cacheable PFNMAP VMA today. The VM will set
> > any execute deny/allow on its S1 table according to how it uses the
> > memory.
>
> If S2 is executable, wouldn't KVM try to invalidate the I-cache and it
> won't have an alias to do this? Unless it doesn't end up in
> stage2_map_walker_try_leaf() or the walk has been flagged as skipping
> the CMO.
Okay, that does seem to have been overlooked a bit. The answer I got
back is:
Cacheable PFNMAP also relies on ARM64_HAS_CACHE_DIC: similar to
how S2FWB allows KVM to avoid flushing the D-cache, that CPU cap
allows KVM to avoid flushing the I-cache and turns icache_inval_pou()
into a NOP.
Ankit, you should add that check as well.
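Put together, the cache-maintenance reasoning in this thread can be sketched as follows (a model only; the struct and helper names are invented for illustration, not real kernel symbols):

```c
#include <stdbool.h>

/* Invented names for illustration. */
struct cpu_caps {
	bool s2fwb;	/* FEAT_S2FWB: stage 2 forces write-back */
	bool dic;	/* CTR_EL0.DIC: no I-cache CMO needed to PoU */
};

/* With S2FWB the D-cache clean at map time can be skipped; with DIC,
 * icache_inval_pou() is a NOP, so the I-cache CMO for executable
 * mappings can be skipped too. */
static bool needs_dcache_cmo(struct cpu_caps c)
{
	return !c.s2fwb;
}

static bool needs_icache_cmo(struct cpu_caps c, bool exec)
{
	return exec && !c.dic;
}

/*
 * A cacheable PFNMAP has no kernel alias through which KVM could do
 * either CMO, so (keeping the S2 mapping executable, per the
 * discussion above) it is only safe when neither CMO is required.
 */
static bool cacheable_pfnmap_safe(struct cpu_caps c)
{
	return !needs_dcache_cmo(c) && !needs_icache_cmo(c, true);
}
```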
Thanks,
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 16:44 ` Jason Gunthorpe
@ 2025-04-29 18:09 ` Catalin Marinas
2025-04-29 18:19 ` Jason Gunthorpe
0 siblings, 1 reply; 61+ messages in thread
From: Catalin Marinas @ 2025-04-29 18:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 29, 2025 at 01:44:30PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 29, 2025 at 05:03:18PM +0100, Catalin Marinas wrote:
> > On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote:
> > > > BTW, we should reject exec mappings as well (they probably fail for S1
> > > > VFIO since set_pte_at() will try to do cache maintenance).
> > >
> > > To be clear the S2 should leave the mapping as execute allowed
> > > though. Only the VM knows how it will use this memory and VMs do
> > > actually execute out of the cacheable PFNMAP VMA today. The VM will set
> > > any execute deny/allow on its S1 table according to how it uses the
> > > memory.
> >
> > If S2 is executable, wouldn't KVM try to invalidate the I-cache and it
> > won't have an alias to do this? Unless it doesn't end up in
> > stage2_map_walker_try_leaf() or the walk has been flagged as skipping
> > the CMO.
>
> Okay, that does seem to have been overlooked a bit. The answer I got
> back is:
>
> Cacheable PFNMAP is also relying on ARM64_HAS_CACHE_DIC, similar to
> how S2FWB allows KVM to avoid flushing the D cache, that CPU cap
> allows KVM to avoid flushing the icache and turns icache_inval_pou()
> into a NOP.
Another CAP for executable PFNMAP then? I feel like this is a different
use-case (e.g. more like general-purpose CXL-attached memory) than the
GPU one. Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't), we may be
restricting some CPUs.
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 18:09 ` Catalin Marinas
@ 2025-04-29 18:19 ` Jason Gunthorpe
2025-05-07 15:26 ` Ankit Agrawal
0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2025-04-29 18:19 UTC (permalink / raw)
To: Catalin Marinas
Cc: Ankit Agrawal, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Tue, Apr 29, 2025 at 07:09:42PM +0100, Catalin Marinas wrote:
> On Tue, Apr 29, 2025 at 01:44:30PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 29, 2025 at 05:03:18PM +0100, Catalin Marinas wrote:
> > > On Tue, Apr 29, 2025 at 11:14:37AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Apr 29, 2025 at 02:27:02PM +0100, Catalin Marinas wrote:
> > > > > BTW, we should reject exec mappings as well (they probably fail for S1
> > > > > VFIO since set_pte_at() will try to do cache maintenance).
> > > >
> > > > To be clear the S2 should leave the mapping as execute allowed
> > > > though. Only the VM knows how it will use this memory and VMs do
> > > > actually execute out of the cacheable PFNMAP VMA today. The VM will set
> > > > any execute deny/allow on its S1 table according to how it uses the
> > > > memory.
> > >
> > > If S2 is executable, wouldn't KVM try to invalidate the I-cache and it
> > > won't have an alias to do this? Unless it doesn't end up in
> > > stage2_map_walker_try_leaf() or the walk has been flagged as skipping
> > > the CMO.
> >
> > Okay, that does seem to have been overlooked a bit. The answer I got
> > back is:
> >
> > Cacheable PFNMAP is also relying on ARM64_HAS_CACHE_DIC, similar to
> > how S2FWB allows KVM to avoid flushing the D cache, that CPU cap
> > allows KVM to avoid flushing the icache and turns icache_inval_pou()
> > into a NOP.
>
> Another CAP for executable PFNMAP then?
IDK, either that or a more general cap 'support PFNMAP VMAs'?
> I feel like this is a different
> use-case (e.g. more like general purpose CXL attached memory) than the
> GPU one.
The GPUs we have today pretty much pretend to be CXL-attached memory,
so they can and do execute from it.
> Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be
> restricting some CPUs.
Yes, it will further narrow the CPUs down.
However, we just had this discussion for BBML2 + SMMUv3 SVA. I think
the same argument holds. If someone is crazy enough to build a CPU
with CXL-ish support using an old core without DIC, IDC and S2FWB,
then they are going to have a bunch of work to fix the SW to support
it. Right now we know of no system like this.
Jason
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-04-29 18:19 ` Jason Gunthorpe
@ 2025-05-07 15:26 ` Ankit Agrawal
2025-05-09 12:47 ` Catalin Marinas
0 siblings, 1 reply; 61+ messages in thread
From: Ankit Agrawal @ 2025-05-07 15:26 UTC (permalink / raw)
To: Jason Gunthorpe, Catalin Marinas
Cc: Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
>> Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be
>> restricting some CPUs.
>
> Yes, it will further narrow the CPUs down.
>
> However, we just did this discussion for BBML2 + SMMUv3 SVA. I think
> the same argument holds. If someone is crazy enough to build a CPU
> with CXLish support and uses an old core without DIC, IDC and S2FWB
> then they are going to have a bunch of work to fix the SW to support
> it. Right now we know of no system that exists like this..
>
> Jason
Catalin, do you agree that I can go ahead and add the check for
ARM64_HAS_CACHE_DIC?
>> Another CAP for executable PFNMAP then?
>
> IDK, either that or a more general cap 'support PFNMAP VMAs'?
I think it would be good to have a generic cap that is safe for
executable mappings as well.
>> However it appears that the memslot flag isn't a must-have. The memslot
>> flag cannot influence the KVM code anyways. For FWB, the PFNMAP would
>> be cacheable and userspace should just assume S2FWB behavior; it would
>> be a security bug otherwise as Jason pointed out earlier (S1 cacheable,
>> S2 noncacheable). For !FWB, a cacheable PFNMAP could not be allowed
>> and VMM shouldn't attempt to create memslot at all by referring the cap.
...
>> 2. Enable support for cacheable PFN maps if S2FWB is enabled by following
>> the vma pgprot (this patch).
>> 3. Add and expose the new KVM cap to expose cacheable PFNMAP (set to false
>> for !FWB).
>
> I'll defer the memslot flag decision to the KVM maintainers. If we had
> one, it will enforce (2) or reject it as per (1) depending on the S1
> attributes.
Marc, Oliver, let me know your thoughts on whether we should add the
memslot flag, and I'll implement accordingly.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-05-07 15:26 ` Ankit Agrawal
@ 2025-05-09 12:47 ` Catalin Marinas
0 siblings, 0 replies; 61+ messages in thread
From: Catalin Marinas @ 2025-05-09 12:47 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, Oliver Upton, Sean Christopherson, Marc Zyngier,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
will@kernel.org, ryan.roberts@arm.com, shahuang@redhat.com,
lpieralisi@kernel.org, david@redhat.com, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams, Zhi Wang, Matt Ochs,
Uday Dhoke, Dheeraj Nigam, Krishnakant Jaju,
alex.williamson@redhat.com, sebastianene@google.com,
coltonlewis@google.com, kevin.tian@intel.com, yi.l.liu@intel.com,
ardb@kernel.org, akpm@linux-foundation.org, gshan@redhat.com,
linux-mm@kvack.org, ddutile@redhat.com, tabba@google.com,
qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
On Wed, May 07, 2025 at 03:26:05PM +0000, Ankit Agrawal wrote:
> >> Unless FWB implies CTR_EL0.DIC (AFAIK, it doesn't) we may be
> >> restricting some CPUs.
> >
> > Yes, it will further narrow the CPUs down.
> >
> > However, we just did this discussion for BBML2 + SMMUv3 SVA. I think
> > the same argument holds. If someone is crazy enough to build a CPU
> > with CXLish support and uses an old core without DIC, IDC and S2FWB
> > then they are going to have a bunch of work to fix the SW to support
> > it. Right now we know of no system that exists like this..
> >
> > Jason
>
> Catalin, do you agree that I can go ahead and add the check for
> ARM64_HAS_CACHE_DIC?
As long as we don't leave out some hardware that has FWB but not DIC,
that's fine by me.
--
Catalin
^ permalink raw reply [flat|nested] 61+ messages in thread
end of thread, other threads:[~2025-05-09 12:47 UTC | newest]
Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-10 10:30 [PATCH v3 0/1] KVM: arm64: Map GPU device memory as cacheable ankita
2025-03-10 10:30 ` [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-03-10 11:54 ` Marc Zyngier
2025-03-11 3:42 ` Ankit Agrawal
2025-03-11 11:18 ` Marc Zyngier
2025-03-11 12:07 ` Ankit Agrawal
2025-03-12 8:21 ` Marc Zyngier
2025-03-17 5:55 ` Ankit Agrawal
2025-03-17 9:27 ` Marc Zyngier
2025-03-17 19:54 ` Catalin Marinas
2025-03-18 9:39 ` Marc Zyngier
2025-03-18 12:55 ` Jason Gunthorpe
2025-03-18 19:27 ` Catalin Marinas
2025-03-18 19:35 ` David Hildenbrand
2025-03-18 19:40 ` Oliver Upton
2025-03-20 3:30 ` bibo mao
2025-03-20 7:24 ` bibo mao
2025-03-18 23:17 ` Jason Gunthorpe
2025-03-19 18:03 ` Catalin Marinas
2025-03-18 19:30 ` Oliver Upton
2025-03-18 23:09 ` Jason Gunthorpe
2025-03-19 7:01 ` Oliver Upton
2025-03-19 17:04 ` Jason Gunthorpe
2025-03-19 18:11 ` Catalin Marinas
2025-03-19 19:22 ` Jason Gunthorpe
2025-03-19 21:48 ` Catalin Marinas
2025-03-26 8:31 ` Ankit Agrawal
2025-03-26 14:53 ` Sean Christopherson
2025-03-26 15:42 ` Marc Zyngier
2025-03-26 16:10 ` Sean Christopherson
2025-03-26 18:02 ` Marc Zyngier
2025-03-26 18:24 ` Sean Christopherson
2025-03-26 18:51 ` Oliver Upton
2025-03-31 14:44 ` Jason Gunthorpe
2025-03-31 14:56 ` Jason Gunthorpe
2025-04-07 15:20 ` Sean Christopherson
2025-04-07 16:15 ` Jason Gunthorpe
2025-04-07 16:43 ` Sean Christopherson
2025-04-16 8:51 ` Ankit Agrawal
2025-04-21 16:03 ` Ankit Agrawal
2025-04-22 7:49 ` Oliver Upton
2025-04-22 13:54 ` Jason Gunthorpe
2025-04-22 16:50 ` Catalin Marinas
2025-04-22 17:03 ` Jason Gunthorpe
2025-04-22 21:28 ` Oliver Upton
2025-04-22 23:35 ` Jason Gunthorpe
2025-04-23 10:45 ` Catalin Marinas
2025-04-23 12:02 ` Jason Gunthorpe
2025-04-23 12:26 ` Catalin Marinas
2025-04-23 13:03 ` Jason Gunthorpe
2025-04-29 10:47 ` Ankit Agrawal
2025-04-29 13:27 ` Catalin Marinas
2025-04-29 14:14 ` Jason Gunthorpe
2025-04-29 16:03 ` Catalin Marinas
2025-04-29 16:44 ` Jason Gunthorpe
2025-04-29 18:09 ` Catalin Marinas
2025-04-29 18:19 ` Jason Gunthorpe
2025-05-07 15:26 ` Ankit Agrawal
2025-05-09 12:47 ` Catalin Marinas
2025-04-22 14:53 ` Sean Christopherson
2025-03-18 12:57 ` Jason Gunthorpe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).