linux-mm.kvack.org archive mirror
* [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable
@ 2025-05-23 15:44 ankita
  2025-05-23 15:44 ` [PATCH v5 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: ankita @ 2025-05-23 15:44 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
	kvmarm, linux-kernel, linux-arm-kernel, maobibo

From: Ankit Agrawal <ankita@nvidia.com>

Grace based platforms such as Grace Hopper/Blackwell Superchips have
CPU accessible cache coherent GPU memory. The GPU device memory is
essentially a DDR memory and retains properties such as cacheability,
unaligned accesses, atomics and handling of executable faults. This
requires the device memory to be mapped as NORMAL in stage-2.

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending
on whether the memory region is added to the kernel. The KVM code is
thus restrictive and prevents device memory that is not added to the
kernel from being marked as cacheable. This series aims to solve that.

A cacheability check is made, if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is
cacheable, it is considered safe to map the memory cacheable at S2:
the KVM S2 will then have the same Normal memory type as the VMA has
in the S1, and KVM has no additional responsibility for safety.

Note when FWB (Force Write Back) is not enabled, the kernel expects to
trivially do cache management by flushing the memory, linearly
converting a kvm_pte to a phys_addr and then to a KVA. Cache management
thus relies on the memory being kernel mapped. Since the GPU device
memory is not kernel mapped, bail out when FWB is not supported.
Similarly, ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. Cacheable PFNMAP support is
therefore made contingent on these two hardware features.

The ability to safely create such cacheable PFNMAP mappings is exposed
through a KVM capability. Userspace is expected to query it and, if
present, set a new memslot flag when it desires such a mapping.

The changes are heavily influenced by the discussions among
maintainers Marc Zyngier and Oliver Upton besides Jason Gunthorpe,
Catalin Marinas, David Hildenbrand, Sean Christopherson [1] in v3.
Many thanks for their valuable suggestions.

Applied over next-20250407 and tested on the Grace Hopper and
Grace Blackwell platforms by booting up a VM, loading the NVIDIA
module [2] and running nvidia-smi in the VM.

To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories which
include all the requisite kernel [3] and QEMU [4] patches in case
one wants to try.

v4 -> v5
1. Invert the check to allow MT_DEVICE_* or NORMAL_NC instead of
disallowing MT_NORMAL in 1/5. (Catalin Marinas)
2. Removed usage of stage2_has_fwb and directly using the FWB
cap check. (Oliver Upton)
3. Introduced kvm_arch_supports_cacheable_pfnmap to check if
the prereq features are present. (David Hildenbrand)

v3 -> v4
1. Moved the fix for the security bug due to mismatched attributes
between the S1 and S2 mappings into a separate patch. Suggestion by
Jason Gunthorpe (jgg@nvidia.com).
2. New minor patch to change the scope of the FWB support indicator
function.
3. Patch to introduce a new memslot flag. Suggestion by Oliver Upton
(oliver.upton@linux.dev) and Marc Zyngier (maz@kernel.org)
4. Patch to introduce a new KVM cap to expose cacheable PFNMAP support.
Suggestion by Marc Zyngier (maz@kernel.org).
5. Added checks for ARM64_HAS_CACHE_DIC. Suggestion by Catalin Marinas
(catalin.marinas@arm.com)

v2 -> v3
1. Restricted the new changes to check for cacheability to VM_PFNMAP
   based on David Hildenbrand's (david@redhat.com) suggestion.
2. Removed the MTE checks based on Jason Gunthorpe's (jgg@nvidia.com)
   observation that it is already done earlier in
   kvm_arch_prepare_memory_region().
3. Dropped the pfn_valid() checks based on suggestions by
   Catalin Marinas (catalin.marinas@arm.com).
4. Removed the code for exec fault handling as it is not needed
   anymore.

v1 -> v2
1. Removed kvm_is_device_pfn() as the determiner for device type
   memory; using pfn_valid() instead.
2. Added handling for MTE.
3. Minor cleanup.

Link: https://lore.kernel.org/all/20250310103008.3471-1-ankita@nvidia.com [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [3]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [4]

v4 Link: https://lore.kernel.org/all/20250310103008.3471-1-ankita@nvidia.com

Ankit Agrawal (5):
  KVM: arm64: Block cacheable PFNMAP mapping
  KVM: arm64: New function to determine hardware cache management
    support
  kvm: arm64: New memslot flag to indicate cacheable mapping
  KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
  KVM: arm64: Expose new KVM cap for cacheable PFNMAP

 Documentation/virt/kvm/api.rst | 17 ++++++++-
 arch/arm64/kvm/arm.c           |  7 ++++
 arch/arm64/kvm/mmu.c           | 70 +++++++++++++++++++++++++++++++++-
 include/linux/kvm_host.h       |  2 +
 include/uapi/linux/kvm.h       |  2 +
 virt/kvm/kvm_main.c            | 11 +++++-
 6 files changed, 105 insertions(+), 4 deletions(-)

-- 
2.34.1




* [PATCH v5 1/5] KVM: arm64: Block cacheable PFNMAP mapping
  2025-05-23 15:44 [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
@ 2025-05-23 15:44 ` ankita
  2025-05-23 15:44 ` [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support ankita
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: ankita @ 2025-05-23 15:44 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
	kvmarm, linux-kernel, linux-arm-kernel, maobibo

From: Ankit Agrawal <ankita@nvidia.com>

Fix a security bug caused by mismatched attributes between the S1 and
S2 mappings.

Currently, it is possible for a region to be cacheable in S1 but mapped
non-cacheable in S2. This creates a potential issue: the VMM may
sanitize cacheable memory across VMs using cacheable stores, ensuring
it is zeroed. However, if KVM subsequently assigns this memory to a VM
as uncached, the VM could end up accessing stale, non-zeroed data from
a previous VM, leading to unintended data exposure. This is a security
risk.

Block such attribute mismatches by returning -EINVAL when userspace
tries to map a PFNMAP region cacheable. Only allow NORMAL_NC and
DEVICE_*.

CC: Oliver Upton <oliver.upton@linux.dev>
CC: Sean Christopherson <seanjc@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/kvm/mmu.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2feb6c6b63af..305a0e054f81 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1466,6 +1466,18 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Check whether the VMA's pgprot maps the memory non-cacheable
+ * (Device or Normal-NC). This is used to validate the stage 2 PTEs.
+ */
+static bool mapping_type_noncacheable(pgprot_t page_prot)
+{
+	unsigned long mt = FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+
+	return (mt == MT_NORMAL_NC || mt == MT_DEVICE_nGnRnE ||
+		mt == MT_DEVICE_nGnRE);
+}
+
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1612,6 +1624,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+	if ((vma->vm_flags & VM_PFNMAP) &&
+	    !mapping_type_noncacheable(vma->vm_page_prot))
+		return -EINVAL;
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -2207,6 +2223,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				ret = -EINVAL;
 				break;
 			}
+
+			/* Cacheable PFNMAP is not allowed */
+			if (!mapping_type_noncacheable(vma->vm_page_prot)) {
+				ret = -EINVAL;
+				break;
+			}
 		}
 		hva = min(reg_end, vma->vm_end);
 	} while (hva < reg_end);
-- 
2.34.1




* [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support
  2025-05-23 15:44 [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
  2025-05-23 15:44 ` [PATCH v5 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
@ 2025-05-23 15:44 ` ankita
  2025-05-23 19:30   ` Donald Dutile
  2025-05-23 15:44 ` [PATCH v5 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: ankita @ 2025-05-23 15:44 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
	kvmarm, linux-kernel, linux-arm-kernel, maobibo

From: Ankit Agrawal <ankita@nvidia.com>

The hardware can safely map PFNMAP as cacheable if it is capable of
managing the cache. This is determined by the presence of the FWB
(Force Write Back) and CACHE_DIC features.

When FWB is not enabled, the kernel expects to trivially do cache
management by flushing the memory, linearly converting a kvm_pte to a
phys_addr and then to a KVA. Cache management thus relies on the memory
being kernel mapped. Since the GPU device memory is not kernel mapped,
bail out when FWB is not supported. Similarly, ARM64_HAS_CACHE_DIC
allows KVM to avoid flushing the icache and turns icache_inval_pou()
into a NOP. Cacheable PFNMAP support is thus contingent on these two
hardware features.

Introduce a new function to check for the presence of these features.

CC: David Hildenbrand <david@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/kvm/mmu.c     | 12 ++++++++++++
 include/linux/kvm_host.h |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 305a0e054f81..124655da02ca 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1287,6 +1287,18 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_nested_s2_wp(kvm);
 }
 
+/**
+ * kvm_arch_supports_cacheable_pfnmap() - Determine whether hardware
+ *      supports cache management.
+ *
+ * Return: True if both FWB and DIC are supported.
+ */
+bool kvm_arch_supports_cacheable_pfnmap(void)
+{
+	return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
+	       cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
+}
+
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 {
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 291d49b9bf05..3750d216d456 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1231,6 +1231,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
 /* flush memory translations pointing to 'slot' */
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot);
+/* whether hardware supports cache management for cacheable PFNMAP */
+bool kvm_arch_supports_cacheable_pfnmap(void);
 
 int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn,
 		       struct page **pages, int nr_pages);
-- 
2.34.1




* [PATCH v5 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping
  2025-05-23 15:44 [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
  2025-05-23 15:44 ` [PATCH v5 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
  2025-05-23 15:44 ` [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support ankita
@ 2025-05-23 15:44 ` ankita
  2025-05-23 15:44 ` [PATCH v5 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
  2025-05-23 15:44 ` [PATCH v5 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
  4 siblings, 0 replies; 10+ messages in thread
From: ankita @ 2025-05-23 15:44 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
	kvmarm, linux-kernel, linux-arm-kernel, maobibo

From: Ankit Agrawal <ankita@nvidia.com>

Introduce a new memslot flag KVM_MEM_ENABLE_CACHEABLE_PFNMAP
as a tool for userspace to indicate that it expects a particular
PFN range to be mapped cacheable.

This serves as a signal for KVM to activate the code that allows
cacheable PFNMAP.

CC: Oliver Upton <oliver.upton@linux.dev>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6ae8ad8934b..9defefe7bdf0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_ENABLE_CACHEABLE_PFNMAP	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e85b33a92624..5e0532c3abc2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1524,7 +1524,8 @@ static void kvm_replace_memslot(struct kvm *kvm,
  * only allows these.
  */
 #define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
-	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | \
+	 KVM_MEM_ENABLE_CACHEABLE_PFNMAP)
 
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
-- 
2.34.1




* [PATCH v5 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
  2025-05-23 15:44 [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
                   ` (2 preceding siblings ...)
  2025-05-23 15:44 ` [PATCH v5 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
@ 2025-05-23 15:44 ` ankita
  2025-05-23 15:44 ` [PATCH v5 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
  4 siblings, 0 replies; 10+ messages in thread
From: ankita @ 2025-05-23 15:44 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
	kvmarm, linux-kernel, linux-arm-kernel, maobibo

From: Ankit Agrawal <ankita@nvidia.com>

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory() (which tracks whether the memory is
added to the kernel) and ignores the per-VMA flags that indicate the
memory attributes. The KVM code is thus restrictive and allows only
memory that is added to the kernel to be marked as cacheable.

The device memory such as on the Grace Hopper/Blackwell systems
is interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.

Given that the GPU device memory is not added to the kernel (it is
instead VMA mapped through remap_pfn_range() in the nvgrace-gpu module,
which sets VM_PFNMAP), pfn_is_map_memory() is false and KVM thus
prevents such memory from being mapped Normal cacheable. This patch
solves that use case.

Note when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory by linearly converting a
kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). The
cache management thus relies on memory being mapped. Moreover
ARM64_HAS_CACHE_DIC CPU cap allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. These two capabilities
are thus a requirement of the cacheable PFNMAP feature. Make use of
kvm_arch_supports_cacheable_pfnmap() to check them.

A cacheability check is made, if VM_PFNMAP is set in the VMA flags, by
consulting the VMA pgprot value. If the pgprot mapping type is
cacheable, it is safe to map it S2 cacheable: the KVM S2 will have the
same Normal memory type as the VMA has in the S1, and KVM has no
additional responsibility for safety. Checking pgprot for NORMAL is
thus a KVM sanity check.

Introduce a new variable cacheable_devmem to indicate a safely
cacheable mapping. Do not set the device variable when cacheable_devmem
is true. This has the effect of setting the stage-2 mapping as NORMAL
through kvm_pgtable_stage2_map().

Add a check to refuse COW VM_PFNMAP mappings.

No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.

CC: Oliver Upton <oliver.upton@linux.dev>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/kvm/mmu.c | 46 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 124655da02ca..c505efc4d174 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1499,6 +1499,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, mte_allowed;
 	bool device = false, vfio_allow_any_uc = false;
+	bool cacheable_devmem = false;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
@@ -1636,9 +1637,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
-	if ((vma->vm_flags & VM_PFNMAP) &&
-	    !mapping_type_noncacheable(vma->vm_page_prot))
-		return -EINVAL;
+	if (vma->vm_flags & VM_PFNMAP) {
+		/* Reject COW VM_PFNMAP */
+		if (is_cow_mapping(vma->vm_flags))
+			return -EINVAL;
+
+		/*
+		 * VM_PFNMAP is set; do a KVM sanity check to see whether
+		 * the pgprot mapping type is cacheable (MT_NORMAL), i.e.
+		 * safely cacheable device memory.
+		 */
+		if (!mapping_type_noncacheable(vma->vm_page_prot))
+			cacheable_devmem = true;
+	}
 
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
@@ -1671,10 +1682,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
 		 * and must not be upgraded.
 		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
+		 * Do not set device as the device memory is cacheable. Note
+		 * that such a mapping is safe as the KVM S2 will have the
+		 * same Normal memory type as the VMA has in the S1.
 		 */
-		device = true;
+		if (!cacheable_devmem)
+			device = true;
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -1756,6 +1770,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		prot |= KVM_PGTABLE_PROT_X;
 	}
 
+	/*
+	 * When FWB is unsupported KVM needs to do cache flushes
+	 * (via dcache_clean_inval_poc()) of the underlying memory. This is
+	 * only possible if the memory is already mapped into the kernel map.
+	 *
+	 * Outright reject as the cacheable device memory is not present in
+	 * the kernel map and not suitable for cache management.
+	 */
+	if (cacheable_devmem && !kvm_arch_supports_cacheable_pfnmap()) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
@@ -2236,8 +2263,13 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				break;
 			}
 
-			/* Cacheable PFNMAP is not allowed */
-			if (!mapping_type_noncacheable(vma->vm_page_prot)) {
+			/*
+			 * Cacheable PFNMAP is allowed only if the hardware
+			 * supports it and userspace asks for it.
+			 */
+			if (!mapping_type_noncacheable(vma->vm_page_prot) &&
+			    (!(new->flags & KVM_MEM_ENABLE_CACHEABLE_PFNMAP) ||
+			     !kvm_arch_supports_cacheable_pfnmap())) {
 				ret = -EINVAL;
 				break;
 			}
-- 
2.34.1




* [PATCH v5 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP
  2025-05-23 15:44 [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
                   ` (3 preceding siblings ...)
  2025-05-23 15:44 ` [PATCH v5 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
@ 2025-05-23 15:44 ` ankita
  4 siblings, 0 replies; 10+ messages in thread
From: ankita @ 2025-05-23 15:44 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, ddutile, tabba, qperret, seanjc,
	kvmarm, linux-kernel, linux-arm-kernel, maobibo

From: Ankit Agrawal <ankita@nvidia.com>

Introduce a new KVM capability to expose to the userspace whether
cacheable mapping of PFNMAP is supported.

The ability to safely map PFNMAP cacheable is contingent on S2FWB and
ARM64_HAS_CACHE_DIC. S2FWB allows KVM to avoid flushing the dcache,
and ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache,
turning icache_inval_pou() into a NOP. The cap is false if those
requirements are missing, which is checked by making use of
kvm_arch_supports_cacheable_pfnmap().

This capability allows userspace to discover the support. It is used
in conjunction with the KVM_MEM_ENABLE_CACHEABLE_PFNMAP memslot flag:
userspace is required to query this capability before it can set the
memslot flag.

This cap could also be used by userspace to prevent live-migration
across FWB and non-FWB hosts.

CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Oliver Upton <oliver.upton@linux.dev>
CC: David Hildenbrand <david@redhat.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 Documentation/virt/kvm/api.rst | 17 ++++++++++++++++-
 arch/arm64/kvm/arm.c           |  7 +++++++
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/kvm_main.c            |  8 ++++++++
 4 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 47c7c3f92314..ad4c5e131977 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8478,7 +8478,7 @@ ENOSYS for the others.
 When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
 type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
 
-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
 -------------------------------------
 
 :Architectures: arm64
@@ -8496,6 +8496,21 @@ aforementioned registers before the first KVM_RUN. These registers are VM
 scoped, meaning that the same set of values are presented on all vCPUs in a
 given VM.
 
+7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+-------------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: None
+
+This capability indicates to userspace whether a PFNMAP memory region
+can be safely mapped as cacheable. This relies on the presence of the
+force write back (FWB) and CACHE_DIC features on the hardware.
+
+Userspace can query this capability and subsequently set the
+KVM_MEM_ENABLE_CACHEABLE_PFNMAP memslot flag, forming a handshake to
+activate the code.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 68fec8c95fee..ea34b08237c4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -402,6 +402,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
 		r = BIT(0);
 		break;
+	case KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED:
+		if (!kvm)
+			r = -EINVAL;
+		else
+			r = kvm_arch_supports_cacheable_pfnmap();
+		break;
+
 	default:
 		r = 0;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9defefe7bdf0..fb868586d73d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -931,6 +931,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
+#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 240
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5e0532c3abc2..25af7292810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1527,11 +1527,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | \
 	 KVM_MEM_ENABLE_CACHEABLE_PFNMAP)
 
+bool __weak kvm_arch_supports_cacheable_pfnmap(void)
+{
+	return false;
+}
+
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_supports_cacheable_pfnmap())
+		valid_flags |= KVM_MEM_ENABLE_CACHEABLE_PFNMAP;
+
 	if (kvm_arch_has_private_mem(kvm))
 		valid_flags |= KVM_MEM_GUEST_MEMFD;
 
-- 
2.34.1




* Re: [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support
  2025-05-23 15:44 ` [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support ankita
@ 2025-05-23 19:30   ` Donald Dutile
  2025-05-23 19:36     ` Donald Dutile
  0 siblings, 1 reply; 10+ messages in thread
From: Donald Dutile @ 2025-05-23 19:30 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, tabba, qperret, seanjc, kvmarm,
	linux-kernel, linux-arm-kernel, maobibo



On 5/23/25 11:44 AM, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The hardware supports safely mapping PFNMAP as cacheable if it
> is capable of managing cache. This can be determined by the presence
> of FWB (Force Write Back) and CACHE_DIC feature.
> 
> When FWB is not enabled, the kernel expects to trivially do cache
> management by flushing the memory by linearly converting a kvm_pte to
> phys_addr to a KVA. The cache management thus relies on memory being
> mapped. Since the GPU device memory is not kernel mapped, exit when
> the FWB is not supported. Similarly, ARM64_HAS_CACHE_DIC allows KVM
> to avoid flushing the icache and turns icache_inval_pou() into a NOP.
> So the cacheable PFNMAP is contingent on these two hardware features.
> 
> Introduce a new function to make the check for presence of those
> features.
> 
> CC: David Hildenbrand <david@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>   arch/arm64/kvm/mmu.c     | 12 ++++++++++++
>   include/linux/kvm_host.h |  2 ++
>   2 files changed, 14 insertions(+)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 305a0e054f81..124655da02ca 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1287,6 +1287,18 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>   	kvm_nested_s2_wp(kvm);
>   }
>   
> +/**
> + * kvm_arch_supports_cacheable_pfnmap() - Determine whether hardware
> + *      supports cache management.
> + *
> + * Return: True if FWB and DIC is supported.
> + */
> +bool kvm_arch_supports_cacheable_pfnmap(void)
> +{
> +	return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
> +	       cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
> +}
> +
>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>   {
>   	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 291d49b9bf05..3750d216d456 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1231,6 +1231,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
>   /* flush memory translations pointing to 'slot' */
>   void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>   				   struct kvm_memory_slot *slot);
> +/* hardware support cache management */
> +bool kvm_arch_supports_cacheable_pfnmap(void);
>   
Won't this cause a build warning on non-ARM builds, because there is no
resolution of this function for the other arches?
Need an #ifdef or a default-return-false stub for arches that don't have this function?

>   int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn,
>   		       struct page **pages, int nr_pages);




* Re: [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support
  2025-05-23 19:30   ` Donald Dutile
@ 2025-05-23 19:36     ` Donald Dutile
  2025-05-24  1:41       ` Ankit Agrawal
  0 siblings, 1 reply; 10+ messages in thread
From: Donald Dutile @ 2025-05-23 19:36 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
	yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
	lpieralisi, david
  Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
	apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
	alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
	ardb, akpm, gshan, linux-mm, tabba, qperret, seanjc, kvmarm,
	linux-kernel, linux-arm-kernel, maobibo



On 5/23/25 3:30 PM, Donald Dutile wrote:
> 
> 
> On 5/23/25 11:44 AM, ankita@nvidia.com wrote:
>> From: Ankit Agrawal <ankita@nvidia.com>
>>
>> The hardware supports safely mapping a PFNMAP as cacheable if it
>> is capable of managing the cache. This can be determined by the
>> presence of the FWB (Force Write Back) and CACHE_DIC features.
>>
>> When FWB is not enabled, the kernel expects to do cache management
>> trivially by flushing the memory, linearly converting a kvm_pte to a
>> phys_addr and then to a KVA. Cache management thus relies on the
>> memory being kernel mapped. Since the GPU device memory is not
>> kernel mapped, exit when FWB is not supported. Similarly,
>> ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache and
>> turns icache_inval_pou() into a NOP. So cacheable PFNMAP support is
>> contingent on these two hardware features.
>>
>> Introduce a new function that checks for the presence of these
>> features.
>>
>> CC: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>> ---
>>   arch/arm64/kvm/mmu.c     | 12 ++++++++++++
>>   include/linux/kvm_host.h |  2 ++
>>   2 files changed, 14 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 305a0e054f81..124655da02ca 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1287,6 +1287,18 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>       kvm_nested_s2_wp(kvm);
>>   }
>> +/**
>> + * kvm_arch_supports_cacheable_pfnmap() - Determine whether hardware
>> + *      supports cache management.
>> + *
>> + * Return: True if FWB and DIC are supported.
>> + */
>> +bool kvm_arch_supports_cacheable_pfnmap(void)
>> +{
>> +    return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
>> +           cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
>> +}
>> +
>>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>>   {
>>       send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 291d49b9bf05..3750d216d456 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1231,6 +1231,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
>>   /* flush memory translations pointing to 'slot' */
>>   void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>>                      struct kvm_memory_slot *slot);
>> +/* hardware supports cache management */
>> +bool kvm_arch_supports_cacheable_pfnmap(void);
> Won't this cause a build warning on non-ARM builds, b/c there is no
> resolution of this function for the other arch's?
> Need #ifdef or default-rtn-0 function for arch's that don't have this function?
> 
ah, I see you have the weak function in patch 5/5.
But I think you have to move that hunk to this patch, so a bisect won't cause
a build warning (or failure, depending on how a distro sets -W in its builds).

- Don
>>   int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn,
>>                  struct page **pages, int nr_pages);




* Re: [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support
  2025-05-23 19:36     ` Donald Dutile
@ 2025-05-24  1:41       ` Ankit Agrawal
  2025-05-24  2:38         ` Donald Dutile
  0 siblings, 1 reply; 10+ messages in thread
From: Ankit Agrawal @ 2025-05-24  1:41 UTC (permalink / raw)
  To: Donald Dutile, Jason Gunthorpe, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
	shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com
  Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
	Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
	John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
	Dheeraj Nigam, alex.williamson@redhat.com,
	sebastianene@google.com, coltonlewis@google.com,
	kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
	akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
	tabba@google.com, qperret@google.com, seanjc@google.com,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn

>>> +/* hardware supports cache management */
>>> +bool kvm_arch_supports_cacheable_pfnmap(void);
>> Won't this cause a build warning on non-ARM builds, b/c there is no
>> resolution of this function for the other arch's?
>> Need #ifdef or default-rtn-0 function for arch's that don't have this function?
>>
>
> ah, I see you have the weak function in patch 5/5.
> But I think you have to move that hunk to this patch, so a bisect won't cause
> a build warning (or failure, depending on how a distro sets -W in its builds).

Thanks Donald for catching that. Fixed in v6.


* Re: [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support
  2025-05-24  1:41       ` Ankit Agrawal
@ 2025-05-24  2:38         ` Donald Dutile
  0 siblings, 0 replies; 10+ messages in thread
From: Donald Dutile @ 2025-05-24  2:38 UTC (permalink / raw)
  To: Ankit Agrawal, Jason Gunthorpe, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
	shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com
  Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
	Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
	John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
	Dheeraj Nigam, alex.williamson@redhat.com,
	sebastianene@google.com, coltonlewis@google.com,
	kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
	akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
	tabba@google.com, qperret@google.com, seanjc@google.com,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn



On 5/23/25 9:41 PM, Ankit Agrawal wrote:
>>>> +/* hardware supports cache management */
>>>> +bool kvm_arch_supports_cacheable_pfnmap(void);
>>> Won't this cause a build warning on non-ARM builds, b/c there is no
>>> resolution of this function for the other arch's?
>>> Need #ifdef or default-rtn-0 function for arch's that don't have this function?
>>>
>>
>> ah, I see you have the weak function in patch 5/5.
>> But I think you have to move that hunk to this patch, so a bisect won't cause
>> a build warning (or failure, depending on how a distro sets -W in its builds).
> 
> Thanks Donald for catching that. Fixed in v6.
> 
Thanks for fixing the nit.




end of thread, other threads:[~2025-05-24  2:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-23 15:44 [PATCH v5 0/5] KVM: arm64: Map GPU device memory as cacheable ankita
2025-05-23 15:44 ` [PATCH v5 1/5] KVM: arm64: Block cacheable PFNMAP mapping ankita
2025-05-23 15:44 ` [PATCH v5 2/5] KVM: arm64: New function to determine hardware cache management support ankita
2025-05-23 19:30   ` Donald Dutile
2025-05-23 19:36     ` Donald Dutile
2025-05-24  1:41       ` Ankit Agrawal
2025-05-24  2:38         ` Donald Dutile
2025-05-23 15:44 ` [PATCH v5 3/5] kvm: arm64: New memslot flag to indicate cacheable mapping ankita
2025-05-23 15:44 ` [PATCH v5 4/5] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-05-23 15:44 ` [PATCH v5 5/5] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
