* [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable
@ 2025-05-24 1:39 ankita
2025-05-24 1:39 ` [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: ankita @ 2025-05-24 1:39 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Grace-based platforms such as the Grace Hopper/Blackwell Superchips have
CPU-accessible, cache-coherent GPU memory. The GPU device memory is
essentially DDR memory and retains properties such as cacheability,
unaligned accesses, atomics and handling of executable faults. This
requires the device memory to be mapped as NORMAL in stage-2.
Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending
on whether the memory region is added to the kernel. The KVM code is
thus restrictive and prevents device memory that is not added to the
kernel from being marked as cacheable. This series aims to solve that.
A cacheability check is made when VM_PFNMAP is set in the VMA flags by
consulting the VMA pgprot value. If the pgprot mapping type is
cacheable, it is considered safe to map the region cacheable, as the
KVM S2 will then have the same Normal memory type as the VMA has in
the S1 and KVM has no additional responsibility for safety.
Note that when FWB (Force Write Back) is not enabled, the kernel expects
to do cache management by flushing the memory, linearly converting a
kvm_pte to a phys_addr and then to a KVA. Cache management thus relies
on the memory being kernel mapped. Since the GPU device memory is not
kernel mapped, exit when FWB is not supported. Similarly,
ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache by turning
icache_inval_pou() into a NOP. Cacheable PFNMAP is therefore made
contingent on these two hardware features.
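For illustration, a minimal sketch of that non-FWB flush path (simplified;
helper names as used in arch/arm64 KVM, not the literal kernel code):

	/* Sketch: without FWB, cache maintenance needs a kernel VA */
	static void stage2_flush_sketch(kvm_pte_t pte, size_t size)
	{
		phys_addr_t pa = kvm_pte_to_phys(pte);	/* stage-2 PTE -> phys */
		void *va = __va(pa);	/* only valid if pa is in the linear map;
					 * not the case for unmapped GPU memory */

		kvm_flush_dcache_to_poc(va, size);	/* CMO on the derived KVA */
	}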
The ability to safely map PFNMAP memory as cacheable is exposed through
a KVM capability. Userspace is expected to query it and then set a new
memslot flag if it desires such a mapping.
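A rough sketch of the intended handshake from the VMM side (hypothetical
userspace code, error handling omitted; only the cap and flag names come
from this series):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int add_gpu_memslot(int vm_fd, __u32 slot, __u64 gpa,
				   __u64 size, __u64 uaddr)
	{
		struct kvm_userspace_memory_region2 region = {
			.slot = slot,
			.guest_phys_addr = gpa,
			.memory_size = size,
			.userspace_addr = uaddr,
		};

		/* Query the capability before setting the new memslot flag */
		if (ioctl(vm_fd, KVM_CHECK_EXTENSION,
			  KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED) > 0)
			region.flags |= KVM_MEM_ENABLE_CACHEABLE_PFNMAP;

		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
	}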
The changes are heavily influenced by the v3 discussions [1] among
maintainers Marc Zyngier and Oliver Upton, as well as Jason Gunthorpe,
Catalin Marinas, David Hildenbrand and Sean Christopherson.
Many thanks for their valuable suggestions.
Applied over next-20250407 and tested on the Grace Hopper and
Grace Blackwell platforms by booting up a VM, loading the NVIDIA
module [2] and running nvidia-smi in the VM.
To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories which
include all the requisite kernel [3] and QEMU [4] patches in case
one wants to try.
v5 -> v6
1. 2/5 updated to add kvm_arch_supports_cacheable_pfnmap weak
definition to avoid build warnings. (Donald Dutile).
v4 -> v5
1. Invert the check to allow MT_DEVICE_* or NORMAL_NC instead of
disallowing MT_NORMAL in 1/5. (Catalin Marinas)
2. Removed usage of stage2_has_fwb and directly using the FWB
cap check. (Oliver Upton)
3. Introduced kvm_arch_supports_cacheable_pfnmap to check if
the prereq features are present. (David Hildenbrand)
v3 -> v4
1. Fixed a security bug due to mismatched attributes between S1 and
S2 mapping, and moved the fix to a separate patch. Suggestion by
Jason Gunthorpe (jgg@nvidia.com).
2. New minor patch to change the scope of the FWB support indicator
function.
3. Patch to introduce a new memslot flag. Suggestion by Oliver Upton
(oliver.upton@linux.dev) and Marc Zyngier (maz@kernel.org)
4. Patch to introduce a new KVM cap to expose cacheable PFNMAP support.
Suggestion by Marc Zyngier (maz@kernel.org).
5. Added checks for ARM64_HAS_CACHE_DIC. Suggestion by Catalin Marinas
(catalin.marinas@arm.com)
v2 -> v3
1. Restricted the new changes to check for cacheability to VM_PFNMAP
based on David Hildenbrand's (david@redhat.com) suggestion.
2. Removed the MTE checks based on Jason Gunthorpe's (jgg@nvidia.com)
observation that it is already done earlier in
kvm_arch_prepare_memory_region.
3. Dropped the pfn_valid() checks based on suggestions by
Catalin Marinas (catalin.marinas@arm.com).
4. Removed the code for exec fault handling as it is not needed
anymore.
v1 -> v2
1. Removed kvm_is_device_pfn() as the determiner for device type
memory. Instead using pfn_valid().
2. Added handling for MTE.
3. Minor cleanup.
Link: https://lore.kernel.org/all/20250310103008.3471-1-ankita@nvidia.com [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [3]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [4]
v5 Link: https://lore.kernel.org/all/20250523154445.3779-1-ankita@nvidia.com/
Ankit Agrawal (5):
KVM: arm64: Block cacheable PFNMAP mapping
KVM: arm64: New function to determine hardware cache management
support
kvm: arm64: New memslot flag to indicate cacheable mapping
KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
KVM: arm64: Expose new KVM cap for cacheable PFNMAP
Documentation/virt/kvm/api.rst | 17 ++++++++-
arch/arm64/kvm/arm.c | 7 ++++
arch/arm64/kvm/mmu.c | 70 +++++++++++++++++++++++++++++++++-
include/linux/kvm_host.h | 2 +
include/uapi/linux/kvm.h | 2 +
virt/kvm/kvm_main.c | 11 +++++-
6 files changed, 105 insertions(+), 4 deletions(-)
--
2.34.1
* [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
2025-05-24 1:39 [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
@ 2025-05-24 1:39 ` ankita
2025-05-26 15:25 ` Jason Gunthorpe
2025-06-06 18:11 ` Sean Christopherson
2025-05-24 1:39 ` [PATCH v6 2/5] KVM: arm64: New function to determine hardware cache management support ankita
` (3 subsequent siblings)
4 siblings, 2 replies; 19+ messages in thread
From: ankita @ 2025-05-24 1:39 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Fixes a security bug due to mismatched attributes between S1 and
S2 mapping.
Currently, it is possible for a region to be cacheable in S1, but mapped
non-cached in S2. This creates a potential issue where the VMM may
sanitize cacheable memory across VMs using cacheable stores, ensuring
it is zeroed. However, if KVM subsequently assigns this memory to a VM
as uncached, the VM could end up accessing stale, non-zeroed data from
a previous VM, leading to unintended data exposure. This is a security
risk.
Block such attribute-mismatch cases by returning -EINVAL when userspace
tries to map PFNMAP cacheable. Only allow NORMAL_NC and DEVICE_*.
CC: Oliver Upton <oliver.upton@linux.dev>
CC: Sean Christopherson <seanjc@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2feb6c6b63af..305a0e054f81 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type_noncacheable(pgprot_t page_prot)
+{
+ unsigned long mt = FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+
+ return (mt == MT_NORMAL_NC || mt == MT_DEVICE_nGnRnE ||
+ mt == MT_DEVICE_nGnRE);
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+ if ((vma->vm_flags & VM_PFNMAP) &&
+ !mapping_type_noncacheable(vma->vm_page_prot))
+ return -EINVAL;
+
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -2207,6 +2223,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
ret = -EINVAL;
break;
}
+
+ /* Cacheable PFNMAP is not allowed */
+ if (!mapping_type_noncacheable(vma->vm_page_prot)) {
+ ret = -EINVAL;
+ break;
+ }
}
hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
--
2.34.1
* [PATCH v6 2/5] KVM: arm64: New function to determine hardware cache management support
2025-05-24 1:39 [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
2025-05-24 1:39 ` [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
@ 2025-05-24 1:39 ` ankita
2025-05-27 0:25 ` Jason Gunthorpe
2025-05-24 1:39 ` [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
` (2 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: ankita @ 2025-05-24 1:39 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
The hardware can safely map PFNMAP as cacheable if it is capable of
managing the cache. This can be determined by the presence of the FWB
(Force Write Back) and CACHE_DIC features.
When FWB is not enabled, the kernel expects to do cache management by
flushing the memory, linearly converting a kvm_pte to a phys_addr and
then to a KVA. Cache management thus relies on the memory being kernel
mapped. Since the GPU device memory is not kernel mapped, exit when
FWB is not supported. Similarly, ARM64_HAS_CACHE_DIC allows KVM to
avoid flushing the icache by turning icache_inval_pou() into a NOP.
Cacheable PFNMAP is therefore contingent on these two hardware features.
Introduce a new function to check for the presence of these features.
CC: David Hildenbrand <david@redhat.com>
CC: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 12 ++++++++++++
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 5 +++++
3 files changed, 19 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 305a0e054f81..124655da02ca 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1287,6 +1287,18 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
kvm_nested_s2_wp(kvm);
}
+/**
+ * kvm_arch_supports_cacheable_pfnmap() - Determine whether hardware
+ * supports cache management.
+ *
+ * Return: True if FWB and DIC are supported.
+ */
+bool kvm_arch_supports_cacheable_pfnmap(void)
+{
+ return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
+ cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
+}
+
static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
{
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 291d49b9bf05..390f147d8f31 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1231,6 +1231,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
/* flush memory translations pointing to 'slot' */
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot);
+/* hardware supports cache management */
+bool kvm_arch_supports_cacheable_pfnmap(void);
int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn,
struct page **pages, int nr_pages);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e85b33a92624..c7ecca504cdd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1526,6 +1526,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
#define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+bool __weak kvm_arch_supports_cacheable_pfnmap(void)
+{
+ return false;
+}
+
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region2 *mem)
{
--
2.34.1
* [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-05-24 1:39 [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
2025-05-24 1:39 ` [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
2025-05-24 1:39 ` [PATCH v6 2/5] KVM: arm64: New function to determine hardware cache management support ankita
@ 2025-05-24 1:39 ` ankita
2025-05-27 0:26 ` Jason Gunthorpe
2025-05-24 1:39 ` [PATCH v6 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-05-24 1:39 ` [PATCH v6 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
4 siblings, 1 reply; 19+ messages in thread
From: ankita @ 2025-05-24 1:39 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Introduce a new memslot flag KVM_MEM_ENABLE_CACHEABLE_PFNMAP
as a tool for userspace to indicate that it expects a particular
PFN range to be mapped cacheable.
This will serve as a guide for KVM to activate the code that
allows cacheable PFNMAP.
CC: Oliver Upton <oliver.upton@linux.dev>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6ae8ad8934b..9defefe7bdf0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
+#define KVM_MEM_ENABLE_CACHEABLE_PFNMAP (1UL << 3)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c7ecca504cdd..cddda7f21413 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1524,7 +1524,8 @@ static void kvm_replace_memslot(struct kvm *kvm,
* only allows these.
*/
#define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
- (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+ (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | \
+ KVM_MEM_ENABLE_CACHEABLE_PFNMAP)
bool __weak kvm_arch_supports_cacheable_pfnmap(void)
{
--
2.34.1
* [PATCH v6 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-05-24 1:39 [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
` (2 preceding siblings ...)
2025-05-24 1:39 ` [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
@ 2025-05-24 1:39 ` ankita
2025-06-06 18:14 ` Sean Christopherson
2025-05-24 1:39 ` [PATCH v6 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
4 siblings, 1 reply; 19+ messages in thread
From: ankita @ 2025-05-24 1:39 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory (which tracks whether the device memory
is added to the kernel) and ignores the per-VMA flags that indicate
the memory attributes. The KVM code is thus restrictive and allows
only memory that is added to the kernel to be marked as cacheable.
The device memory such as on the Grace Hopper/Blackwell systems
is interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.
Given that the GPU device memory is not added to the kernel (but is rather
VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
VM_PFNMAP), pfn_is_map_memory() is false and KVM thus prevents such memory
from being mapped Normal cacheable. This patch aims to solve that use case.
Note when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory by linearly converting a
kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). The
cache management thus relies on memory being mapped. Moreover
ARM64_HAS_CACHE_DIC CPU cap allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. These two capabilities
are thus a requirement of the cacheable PFNMAP feature. Make use of
kvm_arch_supports_cacheable_pfnmap() to check them.
A cacheability check is made when VM_PFNMAP is set in the VMA flags by
consulting the VMA pgprot value. If the pgprot mapping type is cacheable,
it is safe to map it S2 cacheable, as the KVM S2 will have the same
Normal memory type as the VMA has in the S1 and KVM has no additional
responsibility for safety. Checking pgprot as NORMAL is thus a KVM
sanity check.
Introduce a new variable cacheable_devmem to indicate a safely
cacheable mapping. Do not set the device variable when cacheable_devmem
is true. This essentially has the effect of setting the stage-2 mapping
as NORMAL through kvm_pgtable_stage2_map.
Add a check for COW VM_PFNMAP and refuse such mappings.
No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.
CC: Oliver Upton <oliver.upton@linux.dev>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 39 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 124655da02ca..c505efc4d174 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1499,6 +1499,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
bool write_fault, writable, force_pte = false;
bool exec_fault, mte_allowed;
bool device = false, vfio_allow_any_uc = false;
+ bool cacheable_devmem = false;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
struct kvm *kvm = vcpu->kvm;
@@ -1636,9 +1637,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
- if ((vma->vm_flags & VM_PFNMAP) &&
- !mapping_type_noncacheable(vma->vm_page_prot))
- return -EINVAL;
+ if (vma->vm_flags & VM_PFNMAP) {
+ /* Reject COW VM_PFNMAP */
+ if (is_cow_mapping(vma->vm_flags))
+ return -EINVAL;
+
+ /*
+ * If VM_PFNMAP is set in the VMA flags, do a KVM sanity
+ * check to see if the pgprot mapping type is MT_NORMAL, i.e.
+ * safely cacheable device memory.
+ */
+ if (!mapping_type_noncacheable(vma->vm_page_prot))
+ cacheable_devmem = true;
+ }
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1671,10 +1682,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
* and must not be upgraded.
*
- * In both cases, we don't let transparent_hugepage_adjust()
+ * Do not set device as the device memory is cacheable. Such a
+ * mapping is safe as the KVM S2 will have the same Normal memory
+ * type as the VMA has in the S1. Don't let transparent_hugepage_adjust()
* change things at the last minute.
*/
- device = true;
+ if (!cacheable_devmem)
+ device = true;
} else if (logging_active && !write_fault) {
/*
* Only actually map the page as writable if this was a write
@@ -1756,6 +1770,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
prot |= KVM_PGTABLE_PROT_X;
}
+ /*
+ * When FWB is unsupported KVM needs to do cache flushes
+ * (via dcache_clean_inval_poc()) of the underlying memory. This is
+ * only possible if the memory is already mapped into the kernel map.
+ *
+ * Outright reject as the cacheable device memory is not present in
+ * the kernel map and not suitable for cache management.
+ */
+ if (cacheable_devmem && !kvm_arch_supports_cacheable_pfnmap()) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
/*
* Under the premise of getting a FSC_PERM fault, we just need to relax
* permissions only if vma_pagesize equals fault_granule. Otherwise,
@@ -2236,8 +2263,13 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
break;
}
- /* Cacheable PFNMAP is not allowed */
- if (!mapping_type_noncacheable(vma->vm_page_prot)) {
+ /*
+ * Cacheable PFNMAP is allowed only if the hardware
+ * supports it and userspace asks for it.
+ */
+ if (!mapping_type_noncacheable(vma->vm_page_prot) &&
+ (!(new->flags & KVM_MEM_ENABLE_CACHEABLE_PFNMAP) ||
+ !kvm_arch_supports_cacheable_pfnmap())) {
ret = -EINVAL;
break;
}
--
2.34.1
* [PATCH v6 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP
2025-05-24 1:39 [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
` (3 preceding siblings ...)
2025-05-24 1:39 ` [PATCH v6 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
@ 2025-05-24 1:39 ` ankita
4 siblings, 0 replies; 19+ messages in thread
From: ankita @ 2025-05-24 1:39 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Introduce a new KVM capability to expose to userspace whether
cacheable mapping of PFNMAP is supported.
The ability to safely map PFNMAP as cacheable is contingent
on S2FWB and ARM64_HAS_CACHE_DIC. S2FWB allows KVM to avoid flushing
the D-cache, while ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the
icache by turning icache_inval_pou() into a NOP. The cap is false if
those requirements are missing; this is checked using
kvm_arch_supports_cacheable_pfnmap().
This capability allows userspace to discover the support. It is
used in conjunction with the KVM_MEM_ENABLE_CACHEABLE_PFNMAP
memslot flag; userspace is required to query this capability before
it can set the memslot flag.
This cap could also be used by userspace to prevent live-migration
across FWB and non-FWB hosts.
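As a sketch of that last point (hypothetical VMM logic; only the cap name
comes from this series, the rest is illustrative):

	/* Refuse to pair an FWB source with a non-FWB destination. The
	 * destination's value would be learned out of band. */
	bool src_cacheable_pfnmap =
		ioctl(vm_fd, KVM_CHECK_EXTENSION,
		      KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED) > 0;

	if (src_cacheable_pfnmap != dest_cacheable_pfnmap)
		abort_migration();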
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Oliver Upton <oliver.upton@linux.dev>
CC: David Hildenbrand <david@redhat.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
Documentation/virt/kvm/api.rst | 17 ++++++++++++++++-
arch/arm64/kvm/arm.c | 7 +++++++
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 3 +++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 47c7c3f92314..ad4c5e131977 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8478,7 +8478,7 @@ ENOSYS for the others.
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
-------------------------------------
:Architectures: arm64
@@ -8496,6 +8496,21 @@ aforementioned registers before the first KVM_RUN. These registers are VM
scoped, meaning that the same set of values are presented on all vCPUs in a
given VM.
+7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+-------------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: None
+
+This capability indicates to userspace whether a PFNMAP memory region
+can be safely mapped as cacheable. This relies on the presence of
+force write back (FWB) feature support on the hardware.
+
+Userspace can query this capability and subsequently set the
+KVM_MEM_ENABLE_CACHEABLE_PFNMAP memslot flag, forming a handshake to
+activate the feature.
+
8. Other capabilities.
======================
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 68fec8c95fee..ea34b08237c4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -402,6 +402,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
r = BIT(0);
break;
+ case KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED:
+ if (!kvm)
+ r = -EINVAL;
+ else
+ r = kvm_arch_supports_cacheable_pfnmap();
+ break;
+
default:
r = 0;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9defefe7bdf0..fb868586d73d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -931,6 +931,7 @@ struct kvm_enable_cap {
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
#define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
+#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 240
struct kvm_irq_routing_irqchip {
__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cddda7f21413..25af7292810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1537,6 +1537,9 @@ static int check_memory_region_flags(struct kvm *kvm,
{
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+ if (kvm_arch_supports_cacheable_pfnmap())
+ valid_flags |= KVM_MEM_ENABLE_CACHEABLE_PFNMAP;
+
if (kvm_arch_has_private_mem(kvm))
valid_flags |= KVM_MEM_GUEST_MEMFD;
--
2.34.1
* Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
2025-05-24 1:39 ` [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
@ 2025-05-26 15:25 ` Jason Gunthorpe
2025-05-27 4:04 ` Ankit Agrawal
2025-06-06 18:11 ` Sean Christopherson
1 sibling, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-05-26 15:25 UTC (permalink / raw)
To: ankita
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
On Sat, May 24, 2025 at 01:39:39AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Fixes a security bug due to mismatched attributes between S1 and
> S2 mapping.
>
> Currently, it is possible for a region to be cacheable in S1, but
> mapped
"cachable in the userspace VMA"
Jason
* Re: [PATCH v6 2/5] KVM: arm64: New function to determine hardware cache management support
2025-05-24 1:39 ` [PATCH v6 2/5] KVM: arm64: New function to determine hardware cache management support ankita
@ 2025-05-27 0:25 ` Jason Gunthorpe
0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2025-05-27 0:25 UTC (permalink / raw)
To: ankita
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
On Sat, May 24, 2025 at 01:39:40AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
How about:
VM_PFNMAP VMAs are allowed to contain PTEs which point to physical
addresses that do not have a struct page and may not be in the kernel
direct map.
However, ARM64 KVM relies on a simple conversion from physaddr to a
kernel virtual address when it does cache maintenance, as the CMO
instructions work on virtual addresses. This simple approach does not
work for physical addresses from VM_PFNMAP since those addresses may
not have a kernel virtual address, or it may be difficult to find it.
Fortunately, if the ARM64 CPU has two features, S2FWB and CACHE DIC,
then KVM no longer needs to do cache flushing and NOPs all the
CMOs. This has the effect of no longer requiring a KVA for addresses
mapped into the S2.
Add a new function, kvm_arch_supports_cacheable_pfnmap(), to report
this capability. From a core perspective it means the arch can accept
a cacheable VM_PFNMAP as a memslot. From an ARM64 perspective it means
that no KVA is required.
> +/**
> + * kvm_arch_supports_cacheable_pfnmap() - Determine whether hardware
> + * supports cache management.
> + *
> + * Return: True if FWB and DIC are supported.
I would elaborate some of the above commit message here so people
understand why FWB and DIC are connected to this.
Jason
* Re: [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-05-24 1:39 ` [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
@ 2025-05-27 0:26 ` Jason Gunthorpe
2025-05-27 4:33 ` Ankit Agrawal
0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-05-27 0:26 UTC (permalink / raw)
To: ankita
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
kvmarm, linux-kernel, linux-arm-kernel, maobibo
On Sat, May 24, 2025 at 01:39:41AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Introduce a new memslot flag KVM_MEM_ENABLE_CACHEABLE_PFNMAP
> as a tool for userspace to indicate that it expects a particular
> PFN range to be mapped cacheable.
>
> This will serve as a guide for KVM to activate the code that
> allows cacheable PFNMAP.
>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> CC: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletion(-)
I thought we agreed not to do this? Sean was strongly against it
right?
There is no easy way for VFIO to know to set it, and the kernel will
not allow switching a cacheable VMA to non-cacheable anyhow.
So all it does is make it harder to create a memslot.
Jason
* Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
2025-05-26 15:25 ` Jason Gunthorpe
@ 2025-05-27 4:04 ` Ankit Agrawal
0 siblings, 0 replies; 19+ messages in thread
From: Ankit Agrawal @ 2025-05-27 4:04 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
>> Fixes a security bug due to mismatched attributes between S1 and
>> S2 mapping.
>>
>> Currently, it is possible for a region to be cacheable in S1, but
>> mapped
>
> "cachable in the userspace VMA"
Thanks Jason for catching that! Will update the text.
* Re: [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-05-27 0:26 ` Jason Gunthorpe
@ 2025-05-27 4:33 ` Ankit Agrawal
2025-06-02 4:42 ` Ankit Agrawal
2025-06-06 17:57 ` Sean Christopherson
0 siblings, 2 replies; 19+ messages in thread
From: Ankit Agrawal @ 2025-05-27 4:33 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
> I thought we agreed not to do this? Sean was strongly against it
> right?
> There is no easy way for VFIO to know to set it, and the kernel will
> not allow switching a cacheable VMA to non-cacheable anyhow.
> So all it does is make it harder to create a memslot.
Oliver had mentioned earlier that he would still prefer a memslot flag, as
the VMM should convey its intent through that flag:
https://lore.kernel.org/all/aAdKCGCuwlUeUXKY@linux.dev/
Oliver, could you please confirm whether you are convinced about not having
this flag? Can we rely on MT_NORMAL in the vma mapping to convey this?
* Re: [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-05-27 4:33 ` Ankit Agrawal
@ 2025-06-02 4:42 ` Ankit Agrawal
2025-06-06 17:57 ` Sean Christopherson
1 sibling, 0 replies; 19+ messages in thread
From: Ankit Agrawal @ 2025-06-02 4:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
seanjc@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
>> I thought we agreed not to do this? Sean was strongly against it
>> right?
>>
>> There is no easy way for VFIO to know to set it, and the kernel will
>> not allow switching a cacheable VMA to non-cacheable anyhow.
>>
>> So all it does is make it harder to create a memslot.
> Oliver had mentioned earlier that he would still prefer a memslot flag, as
> the VMM should convey its intent through that flag:
> https://lore.kernel.org/all/aAdKCGCuwlUeUXKY@linux.dev/
>
> Oliver, could you please confirm whether you are convinced about not having
> this flag? Can we rely on MT_NORMAL in the vma mapping to convey this?
A humble reminder for Oliver to comment so we can conclude whether we
really need this memslot flag.
* Re: [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-05-27 4:33 ` Ankit Agrawal
2025-06-02 4:42 ` Ankit Agrawal
@ 2025-06-06 17:57 ` Sean Christopherson
2025-06-13 19:38 ` Oliver Upton
1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-06-06 17:57 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
On Tue, May 27, 2025, Ankit Agrawal wrote:
> > I thought we agreed not to do this? Sean was strongly against it
> > right?
Yes. NAK, at least for this implementation.
IMO, this has no business being in KVM's uAPI, and it's a mess. KVM x86
unconditionally supports cacheable PFNMAP mappings, yet this:
bool __weak kvm_arch_supports_cacheable_pfnmap(void)
{
return false;
}
if (kvm_arch_supports_cacheable_pfnmap())
valid_flags |= KVM_MEM_ENABLE_CACHEABLE_PFNMAP;
means x86 will disallow KVM_MEM_ENABLE_CACHEABLE_PFNMAP. Which is fine-ish from
a uAPI perspective, as the flag is documented as arm64-only, and we can state that
all other architectures always allow cacheable mappings. But even that is a mess,
because KVM won't _guarantee_ the final mapping is cacheable.
On AMD, there's simply no sane way to force WB (KVM can't override guest PAT,
i.e. the memtype requested/set by the guest's stage-1 page tables).
On Intel, after years of pain, we _finally_ got KVM out of a mess where KVM was
forcing WB for all non-MMIO memory. Only to have to immediately revert and add
KVM_X86_QUIRK_IGNORE_GUEST_PAT because buggy guest drivers were relying on KVM's
behavior :-(
So there's zero chance of this memslot flag ever being supported on x86. Which,
again, is fine for uAPI. But for internal code it's going to be all kinds of
confusing, because kvm_arch_supports_cacheable_pfnmap() is a flat out lie.
And as proposed, the memslot flag also doesn't actually address Oliver's want:
The memslot flag says userspace expects a particular GFN range to guarantee
^^^^^^^^^
Write-Back semantics.
IIUC, what Oliver wants is:
if (mapping_type_noncacheable(vma->vm_page_prot)) {
if (new->flags & KVM_MEM_FORCE_CACHEABLE_PFNMAP)
return -EINVAL;
} else {
if (!kvm_arch_supports_cacheable_pfnmap()))
return -EINVAL;
}
That's at least a bit more palatable, as it doesn't create impossible situations
on x86, e.g. x86 simply doesn't support letting userspace force a cacheable mapping.
And Oliver also stated:
Whether or not FWB is employed for a particular region of IPA space is useful
information for userspace deciding what it needs to do to access guest memory.
The above would only cover half of that, i.e. wouldn't prevent userspace from
getting surprised by a WB mapping. So I think it would need to be this?
if (mapping_type_noncacheable(vma->vm_page_prot) !=
!(new->flags & KVM_MEM_FORCE_CACHEABLE_PFNMAP))
return -EINVAL;
Which I don't hate as much, but I still don't love it, as it's overly specific,
e.g. only helps with PFNMAP memory, and pushes a sanity check from userspace into KVM.
Which is another complaint with this uAPI: it effectively assumes/implies PFNMAP is
device memory, but that's simply not true. There are zero guarantees with respect
to what actually lies behind any given PFNMAP. It could be device memory, but
it could also be regular RAM, or something in between.
I would much prefer we have a way for userspace to query the effective memtype for a
range of memory, either for a VMA or for a KVM mapping, and let _userspace_ do
whatever sanity checks it wants. That seems like it would be more generally
useful, and would be feasible to support on multiple architectures. Though I'd
probably prefer to avoid even that, e.g. in favor of providing enough information
in other ways so that userspace can (somewhat easily) deduce how KVM will behave
for a given mapping.
> > There is no easy way for VFIO to know to set it, and the kernel will
> > not allow switching a cacheable VMA to non-cacheable anyhow.
>
> > So all it does is make it harder to create a memslot.
>
> Oliver had mentioned earlier that he would still prefer a memslot flag, as
> the VMM should convey its intent through that flag:
>
> https://lore.kernel.org/all/aAdKCGCuwlUeUXKY@linux.dev/
> Oliver, could you please confirm whether you are convinced about not having
> this flag? Can we rely on MT_NORMAL in the vma mapping to convey this?
Is MT_NORMAL visible and/or controllable by userspace?
* Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
2025-05-24 1:39 ` [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
2025-05-26 15:25 ` Jason Gunthorpe
@ 2025-06-06 18:11 ` Sean Christopherson
2025-06-09 12:24 ` Jason Gunthorpe
1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-06-06 18:11 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, kvmarm,
linux-kernel, linux-arm-kernel, maobibo
On Sat, May 24, 2025, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Fixes a security bug due to mismatched attributes between S1 and
> S2 mapping.
>
> Currently, it is possible for a region to be cacheable in S1, but mapped
> non-cached in S2. This creates a potential issue where the VMM may
> sanitize cacheable memory across VMs using cacheable stores, ensuring
> it is zeroed. However, if KVM subsequently assigns this memory to a VM
> as uncached, the VM could end up accessing stale, non-zeroed data from
> a previous VM, leading to unintended data exposure. This is a security
> risk.
>
> Block such attribute-mismatch cases by returning -EINVAL when userspace
> tries to map PFNMAP cacheable. Only allow NORMAL_NC and DEVICE_*.
>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Sean Christopherson <seanjc@google.com>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> arch/arm64/kvm/mmu.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 2feb6c6b63af..305a0e054f81 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> return vma->vm_flags & VM_MTE_ALLOWED;
> }
>
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type_noncacheable(pgprot_t page_prot)
Return a bool. And given that all the usages query cacheability, maybe invert
this predicate?
> +{
> + unsigned long mt = FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +
> + return (mt == MT_NORMAL_NC || mt == MT_DEVICE_nGnRnE ||
> + mt == MT_DEVICE_nGnRE);
> +}
No need for the parentheses. And since the values are clumped together, maybe
use a switch statement to let the compiler optimize the checks (though I'm
guessing modern compilers will optimize either way).
E.g.
static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
{
switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
case MT_NORMAL_NC:
case MT_DEVICE_nGnRnE:
case MT_DEVICE_nGnRE:
return false;
default:
return true;
}
}
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_s2_trans *nested,
> struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>
> vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>
> + if ((vma->vm_flags & VM_PFNMAP) &&
> + !mapping_type_noncacheable(vma->vm_page_prot))
I don't think this is correct, and there's a very real chance this will break
existing setups. PFNMAP memory isn't strictly device memory, and IIUC, KVM
forces DEVICE/NORMAL_NC based on kvm_is_device_pfn(), not based on VM_PFNMAP.
if (kvm_is_device_pfn(pfn)) {
/*
* If the page was identified as device early by looking at
* the VMA flags, vma_pagesize is already representing the
* largest quantity we can map. If instead it was mapped
* via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
* and must not be upgraded.
*
* In both cases, we don't let transparent_hugepage_adjust()
* change things at the last minute.
*/
device = true;
}
if (device) {
if (vfio_allow_any_uc)
prot |= KVM_PGTABLE_PROT_NORMAL_NC;
else
prot |= KVM_PGTABLE_PROT_DEVICE;
} else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
(!nested || kvm_s2_trans_executable(nested))) {
prot |= KVM_PGTABLE_PROT_X;
}
which gets morphed into the hardware memtype attributes as:
switch (prot & (KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_NORMAL_NC)) {
case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
return -EINVAL;
case KVM_PGTABLE_PROT_DEVICE:
if (prot & KVM_PGTABLE_PROT_X)
return -EINVAL;
attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
break;
case KVM_PGTABLE_PROT_NORMAL_NC:
if (prot & KVM_PGTABLE_PROT_X)
return -EINVAL;
attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
break;
default:
attr = KVM_S2_MEMATTR(pgt, NORMAL);
}
E.g. if the admin hides RAM from the kernel and manages it in userspace via
/dev/mem, this will break (I think).
So I believe what you want is something like this:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index eeda92330ade..4129ab5ac871 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
+{
+ switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
+ case MT_NORMAL_NC:
+ case MT_DEVICE_nGnRnE:
+ case MT_DEVICE_nGnRE:
+ return false;
+ default:
+ return true;
+ }
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1473,7 +1485,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
{
int ret = 0;
bool write_fault, writable, force_pte = false;
- bool exec_fault, mte_allowed;
+ bool exec_fault, mte_allowed, is_vma_cacheable;
bool device = false, vfio_allow_any_uc = false;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
@@ -1615,6 +1627,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+ is_vma_cacheable = kvm_vma_is_cacheable(vma);
+
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1639,6 +1653,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
if (kvm_is_device_pfn(pfn)) {
+ if (is_vma_cacheable)
+ return -EINVAL;
+
/*
* If the page was identified as device early by looking at
* the VMA flags, vma_pagesize is already representing the
@@ -1722,6 +1739,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
prot |= KVM_PGTABLE_PROT_X;
if (device) {
+ if (is_vma_cacheable) {
+ ret = -EINVAL;
+ goto out;
+ }
+
if (vfio_allow_any_uc)
prot |= KVM_PGTABLE_PROT_NORMAL_NC;
else
* Re: [PATCH v6 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-05-24 1:39 ` [PATCH v6 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
@ 2025-06-06 18:14 ` Sean Christopherson
0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-06-06 18:14 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, kvmarm,
linux-kernel, linux-arm-kernel, maobibo
On Sat, May 24, 2025, ankita@nvidia.com wrote:
> @@ -1636,9 +1637,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>
> vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>
> - if ((vma->vm_flags & VM_PFNMAP) &&
> - !mapping_type_noncacheable(vma->vm_page_prot))
> - return -EINVAL;
> + if (vma->vm_flags & VM_PFNMAP) {
> + /* Reject COW VM_PFNMAP */
> + if (is_cow_mapping(vma->vm_flags))
> + return -EINVAL;
> +
> + /*
> + * If VM_PFNMAP is set in the VMA flags, do a KVM sanity
> + * check to see if the pgprot mapping type is MT_NORMAL, i.e.
> + * safely cacheable device memory.
> + */
> + if (!mapping_type_noncacheable(vma->vm_page_prot))
> + cacheable_devmem = true;
> + }
>
> /* Don't use the VMA after the unlock -- it may have vanished */
> vma = NULL;
> @@ -1671,10 +1682,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> * and must not be upgraded.
> *
> - * In both cases, we don't let transparent_hugepage_adjust()
> + * Do not set device as the device memory is cacheable. Such a
> + * mapping is safe as the KVM S2 will have the same Normal memory
> + * type as the VMA has in the S1. Don't let transparent_hugepage_adjust()
> * change things at the last minute.
> */
> - device = true;
> + if (!cacheable_devmem)
> + device = true;
I doubt this is correct. "device" is used for more than just the memtype. E.g.
hugepage adjustments, MTE, etc. all consult "device". I.e. don't conflate device
with VM_PFNMAP.
* Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
2025-06-06 18:11 ` Sean Christopherson
@ 2025-06-09 12:24 ` Jason Gunthorpe
2025-06-09 14:21 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-06-09 12:24 UTC (permalink / raw)
To: Sean Christopherson
Cc: ankita, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, kvmarm,
linux-kernel, linux-arm-kernel, maobibo
On Fri, Jun 06, 2025 at 11:11:56AM -0700, Sean Christopherson wrote:
> > @@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >
> > vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> >
> > + if ((vma->vm_flags & VM_PFNMAP) &&
> > + !mapping_type_noncacheable(vma->vm_page_prot))
>
> I don't think this is correct, and there's a very real chance this will break
> existing setups. PFNMAP memory isn't strictly device memory, and IIUC, KVM
> forces DEVICE/NORMAL_NC based on kvm_is_device_pfn(), not based on VM_PFNMAP.
kvm_is_device_pfn() effectively means KVM can't use CMOs on that
PFN. It doesn't really mean anything more.
PFNMAP says the same thing, or at least from a mm perspective we don't
want drivers taking PFNMAP memory and then trying to guess if there
are struct pages/KVAs for it. PFNMAP memory is supposed to be fully
opaque.
Though that confusion seems to be a separate issue from this patch.
> if (kvm_is_device_pfn(pfn)) {
> /*
> * If the page was identified as device early by looking at
> * the VMA flags, vma_pagesize is already representing the
> * largest quantity we can map. If instead it was mapped
> * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> * and must not be upgraded.
> *
> * In both cases, we don't let transparent_hugepage_adjust()
> * change things at the last minute.
> */
> device = true;
"device" here is sort of a mis-nomer, it is really just trying to
setup the S2 so that CMOs are not going go to be done.
Calling it 'disable_cmo' would sure make this code clearer..
> @@ -1639,6 +1653,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
>
> if (kvm_is_device_pfn(pfn)) {
> + if (is_vma_cacheable)
> + return -EINVAL;
> +
eg
if (!kvm_can_use_cmo_pfn(pfn)) {
if (is_vma_cacheable)
return -EINVAL;
> * If the page was identified as device early by looking at
> * the VMA flags, vma_pagesize is already representing the
> @@ -1722,6 +1739,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> prot |= KVM_PGTABLE_PROT_X;
>
> if (device) {
> + if (is_vma_cacheable) {
> + ret = -EINVAL;
> + goto out;
> + }
if (disable_cmo) {
if (is_vma_cacheable)
return -EINVAL;
Makes a lot more sense, right? If KVM can't do CMOs then it should not
attempt to use memory mapped into the VMA as cacheable.
> if (vfio_allow_any_uc)
> prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> else
>
Regardless, this seems good for this patch at least.
Jason
* Re: [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping
2025-06-09 12:24 ` Jason Gunthorpe
@ 2025-06-09 14:21 ` Sean Christopherson
0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-06-09 14:21 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: ankita, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, kvmarm,
linux-kernel, linux-arm-kernel, maobibo
On Mon, Jun 09, 2025, Jason Gunthorpe wrote:
> On Fri, Jun 06, 2025 at 11:11:56AM -0700, Sean Christopherson wrote:
> > > @@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >
> > > vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > >
> > > + if ((vma->vm_flags & VM_PFNMAP) &&
> > > + !mapping_type_noncacheable(vma->vm_page_prot))
> >
> > I don't think this is correct, and there's a very real chance this will break
> > existing setups. PFNMAP memory isn't strictly device memory, and IIUC, KVM
> > forces DEVICE/NORMAL_NC based on kvm_is_device_pfn(), not based on VM_PFNMAP.
>
> kvm_is_device_pfn() effectively means KVM can't use CMOs on that
> PFN. It doesn't really mean anything more.
Ah, kvm_is_device_pfn() isn't actually detecting device memory; it's simply
detecting memory that isn't in the direct map.
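For reference, the arm64 helper at this point boils down to roughly the
following (arch/arm64/kvm/mmu.c):

	/*
	 * Roughly the current arm64 definition: "device" really means
	 * "no linear-map alias", i.e. nothing KVM can CMO through.
	 */
	static bool kvm_is_device_pfn(kvm_pfn_t pfn)
	{
		return !pfn_is_map_memory(pfn);
	}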
> PFNMAP says the same thing, or at least from a mm perspective we don't
> want drivers taking PFNMAP memory and then trying to guess if there
> are struct pages/KVAs for it. PFNMAP memory is supposed to be fully
> opaque.
>
> Though that confusion seems to be a separate issue from this patch.
>
> > if (kvm_is_device_pfn(pfn)) {
> > /*
> > * If the page was identified as device early by looking at
> > * the VMA flags, vma_pagesize is already representing the
> > * largest quantity we can map. If instead it was mapped
> > * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
> > * and must not be upgraded.
> > *
> > * In both cases, we don't let transparent_hugepage_adjust()
> > * change things at the last minute.
> > */
> > device = true;
>
> "device" here is sort of a mis-nomer, it is really just trying to
> setup the S2 so that CMOs are not going go to be done.
>
> Calling it 'disable_cmo' would sure make this code clearer..
>
> > @@ -1639,6 +1653,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > return -EFAULT;
> >
> > if (kvm_is_device_pfn(pfn)) {
> > + if (is_vma_cacheable)
> > + return -EINVAL;
> > +
>
> e.g.:
>
> if (!kvm_can_use_cmo_pfn(pfn)) {
> if (is_vma_cacheable)
> return -EINVAL;
>
> > * If the page was identified as device early by looking at
> > * the VMA flags, vma_pagesize is already representing the
> > @@ -1722,6 +1739,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > prot |= KVM_PGTABLE_PROT_X;
> >
> > if (device) {
> > + if (is_vma_cacheable) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
>
> if (disable_cmo) {
> if (is_vma_cacheable)
> return -EINVAL;
>
> Makes a lot more sense, right? If KVM can't do CMOs then it should not
> attempt to use memory mapped into the VMA as cacheable.
Yes, for sure.
> > if (vfio_allow_any_uc)
> > prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> > else
> >
>
> Regardless, this seems good for this patch at least.
>
> Jason
* Re: [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-06-06 17:57 ` Sean Christopherson
@ 2025-06-13 19:38 ` Oliver Upton
2025-06-16 11:37 ` Ankit Agrawal
0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2025-06-13 19:38 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ankit Agrawal, Jason Gunthorpe, maz@kernel.org,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
Hey,
Sorry for going AWOL on this for so long, buried under work for a while.
On Fri, Jun 06, 2025 at 10:57:34AM -0700, Sean Christopherson wrote:
> I would much prefer we have a way for userspace to query the effective memtype
> for a range of memory, either for a VMA or for a KVM mapping, and let
> _userspace_ do whatever sanity checks it wants. That seems like it would be
> more generally useful, and would be feasible to support on multiple
> architectures. Though I'd probably prefer to avoid even that, e.g. in favor of
> providing enough information in other ways so that userspace can (somewhat
> easily) deduce how KVM will behave for a given mapping.
Agreed, and really userspace needs to know what it has in its own
stage-1 for that to make sense. The idea with a memslot flag is that
you'd get a 'handshake' with KVM, although that only works for a single
memory type.
What's really needed is a fine-grained enumeration, as the architecture
allows an implementation to break uniprocessor semantics + coherency for _any_
deviation in memory attributes (e.g. Device-nGnRE v. Device-nGnRnE), although
in practice it's usually a Normal-* v. Device-* mismatch that we actually
expose to the VMM.
So, in the absence of a complete solution, I guess we can forgo the
memslot flag. OTOH, the KVM cap is still useful since even now we do the
wrong thing with cacheable PFNMAP, so KVM_SET_USER_MEMORY_REGION
accepting a VMA doesn't mean much.
Burden is on the VMM to decide what that means in the context of $THING
it wants to install into a memslot.
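As a sketch of that VMM-side handshake (the capability name and value
below are placeholders; the real constant is whatever patch 5/5 adds to
<linux/kvm.h>):

	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Assumed name/value, standing in for the constant patch 5/5 defines. */
	#ifndef KVM_CAP_ARM_CACHEABLE_PFNMAP
	#define KVM_CAP_ARM_CACHEABLE_PFNMAP 243
	#endif

	/* True iff KVM advertises safe cacheable mapping of PFNMAP memory. */
	static bool vmm_supports_cacheable_pfnmap(int kvm_fd)
	{
		return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
			     KVM_CAP_ARM_CACHEABLE_PFNMAP) > 0;
	}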
> > > There is no easy way for VFIO to know to set it, and the kernel will
> > > not allow switching a cachable VMA to non-cachable anyhow.
> >
> > > So all it does is make it harder to create a memslot.
> >
> > Oliver had mentioned earlier that he would still prefer a memslot flag, as
> > the VMM should convey its intent through that flag:
> >
> > https://lore.kernel.org/all/aAdKCGCuwlUeUXKY@linux.dev/
> > Oliver, could you please confirm whether you are on board with not having
> > this flag? Can we rely on MT_NORMAL in the vma mapping to convey this?
Yes, following the VMA's memory attributes is the right thing to do. To
be clear, this is something I'd really like to have settled for 6.17.
> Is MT_NORMAL visible and/or controllable by userspace?
Generally speaking, no.
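The kernel derives it from the VMA itself rather than from anything
userspace asserts directly; condensed from the hunk quoted earlier in
the thread (mapping_type_noncacheable() being the helper this series
introduces):

	/* Cacheable iff the S1 mapping the VMM already holds is cacheable. */
	bool is_vma_cacheable = (vma->vm_flags & VM_PFNMAP) &&
				!mapping_type_noncacheable(vma->vm_page_prot);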
Thanks,
Oliver
* Re: [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
2025-06-13 19:38 ` Oliver Upton
@ 2025-06-16 11:37 ` Ankit Agrawal
0 siblings, 0 replies; 19+ messages in thread
From: Ankit Agrawal @ 2025-06-16 11:37 UTC (permalink / raw)
To: Oliver Upton, Sean Christopherson
Cc: Jason Gunthorpe, maz@kernel.org, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
ddutile@redhat.com, tabba@google.com, qperret@google.com,
kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
> So, in the absence of a complete solution, I guess we can forgo the
> memslot flag. OTOH, the KVM cap is still useful since even now we do the
> wrong thing with cacheable PFNMAP, so KVM_SET_USER_MEMORY_REGION
> accepting a VMA doesn't mean much.
Thanks Sean, Jason and Oliver for sharing your thoughts. I'll remove this
flag and keep the KVM cap in the next version I post.
>> > https://lore.kernel.org/all/aAdKCGCuwlUeUXKY@linux.dev/
>> > Oliver, could you please confirm whether you are on board with not having
>> > this flag? Can we rely on MT_NORMAL in the vma mapping to convey this?
>
> Yes, following the VMA's memory attributes is the right thing to do. To
> be clear, this is something I'd really like to have settled for 6.17.
Ack.
Thread overview: 19+ messages
2025-05-24 1:39 [PATCH v6 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
2025-05-24 1:39 ` [PATCH v6 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
2025-05-26 15:25 ` Jason Gunthorpe
2025-05-27 4:04 ` Ankit Agrawal
2025-06-06 18:11 ` Sean Christopherson
2025-06-09 12:24 ` Jason Gunthorpe
2025-06-09 14:21 ` Sean Christopherson
2025-05-24 1:39 ` [PATCH v6 2/5] KVM: arm64: New function to determine hardware cache management support ankita
2025-05-27 0:25 ` Jason Gunthorpe
2025-05-24 1:39 ` [PATCH v6 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
2025-05-27 0:26 ` Jason Gunthorpe
2025-05-27 4:33 ` Ankit Agrawal
2025-06-02 4:42 ` Ankit Agrawal
2025-06-06 17:57 ` Sean Christopherson
2025-06-13 19:38 ` Oliver Upton
2025-06-16 11:37 ` Ankit Agrawal
2025-05-24 1:39 ` [PATCH v6 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-06-06 18:14 ` Sean Christopherson
2025-05-24 1:39 ` [PATCH v6 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita