* [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable
@ 2025-07-05 7:17 ankita
2025-07-05 7:17 ` [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable ankita
` (7 more replies)
0 siblings, 8 replies; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Grace-based platforms such as the Grace Hopper/Blackwell Superchips have
CPU-accessible cache coherent GPU memory. The GPU device memory is
essentially DDR memory and retains properties such as cacheability,
unaligned accesses, atomics and handling of executable faults. This
requires the device memory to be mapped as NORMAL in stage-2.
Today KVM forces the memory to either NORMAL or DEVICE_nGnRE depending
on whether the memory region is added to the kernel. The KVM code is
thus restrictive and prevents device memory that is not added to the
kernel from being marked as cacheable. This series aims to solve that.
A cacheability check is made by consulting the VMA pgprot value. If
the pgprot mapping type is cacheable, it is considered safe to map
the region cacheable, as the KVM S2 will have the same Normal memory type
as the VMA has in the S1 and KVM has no additional responsibility
for safety.
Note that when FWB (Force Write Back) is not enabled, the kernel expects
to trivially do cache management by flushing the memory, linearly
converting a kvm_pte to a phys_addr and then to a KVA. Cache management
thus relies on the memory being kernel-mapped. Since the GPU device
memory is not kernel mapped, bail out when FWB is not supported.
Similarly, ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. Cacheable PFNMAP support is
therefore made contingent on these two hardware features.
The ability to safely do the cacheable mapping of PFNMAP is exposed
through a KVM capability for userspace consumption.
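As a minimal userspace sketch of querying the capability (error handling
omitted; this snippet is illustrative and not part of the series), note
that the cap is per-VM, so KVM_CHECK_EXTENSION must be issued on the VM
fd rather than on /dev/kvm:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Returns nonzero if cacheable PFNMAP stage-2 mappings are supported. */
    static int cacheable_pfnmap_supported(void)
    {
            int kvm_fd = open("/dev/kvm", O_RDWR);
            int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

            /* The system-wide ioctl (!kvm) returns -EINVAL for this cap. */
            return ioctl(vm_fd, KVM_CHECK_EXTENSION,
                         KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED) > 0;
    }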
The changes are heavily influenced by the discussions among the
maintainers Marc Zyngier and Oliver Upton, as well as Jason Gunthorpe,
Catalin Marinas, David Hildenbrand and Sean Christopherson [1]. Many
thanks for their valuable suggestions. Thanks to Donald Dutile
for testing the patch series and providing Tested-by.
Applied over next-20250610 and tested on the Grace Blackwell
platform by booting up a VM, loading the NVIDIA module [2] and running
nvidia-smi in the VM.
To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories which
include all the requisite kernel [3] and QEMU [4] patches in case
one wants to try them.
v9 -> v10
1. Removed dead code in 5/6 (Jason Gunthorpe).
2. Collected Reviewed-by from Jason Gunthorpe and David Hildenbrand
on all but patch 5/6. (Thanks!)
3. Collected Tested-by from Donald Dutile (Thanks!).
v8 -> v9
1. Included MIXEDMAP to also be considered for cacheable mapping.
(Jason Gunthorpe).
2. Minor text nits (Jason Gunthorpe).
v7 -> v8
1. Renamed device variable to s2_force_noncacheable. (Jason Gunthorpe,
Catalin Marinas)
2. Updated code location that block S1 cacheable, S2 non-cacheable mapping.
(Jason Gunthorpe, Catalin Marinas)
3. Added comments in the code for COW and cacheability checks.
(Jason Gunthorpe, Catalin Marinas)
4. Reorganised the code to setup s2_force_noncacheable variable.
(Jason Gunthorpe).
5. Collected Reviewed-By on patch 4/6. (Catalin Marinas)
v6 -> v7
1. New patch to rename symbols to more accurately reflect the
CMO usage functionality (Jason Gunthorpe).
2. Updated the block cacheable PFNMAP patch to invert the cacheability
check function (Sean Christopherson).
3. Removed the memslot flag KVM_MEM_ENABLE_CACHEABLE_PFNMAP.
(Jason Gunthorpe, Sean Christopherson, Oliver Upton).
4. Commit message changes in 2/5. (Jason Gunthorpe)
v5 -> v6
1. 2/5 updated to add kvm_arch_supports_cacheable_pfnmap weak
definition to avoid build warnings. (Donald Dutile).
v4 -> v5
1. Invert the check to allow MT_DEVICE_* or NORMAL_NC instead of
disallowing MT_NORMAL in 1/5. (Catalin Marinas)
2. Removed usage of stage2_has_fwb and directly using the FWB
cap check. (Oliver Upton)
3. Introduced kvm_arch_supports_cacheable_pfnmap to check if
the prereq features are present. (David Hildenbrand)
v3 -> v4
1. Moved the fix for a security bug due to mismatched attributes between
the S1 and S2 mappings to a separate patch. Suggestion by
Jason Gunthorpe (jgg@nvidia.com).
2. New minor patch to change the scope of the FWB support indicator
function.
3. Patch to introduce a new memslot flag. Suggestion by Oliver Upton
(oliver.upton@linux.dev) and Marc Zyngier (maz@kernel.org)
4. Patch to introduce a new KVM cap to expose cacheable PFNMAP support.
Suggestion by Marc Zyngier (maz@kernel.org).
5. Added checks for ARM64_HAS_CACHE_DIC. Suggestion by Catalin Marinas
(catalin.marinas@arm.com)
v2 -> v3
1. Restricted the new changes to check for cacheability to VM_PFNMAP
based on David Hildenbrand's (david@redhat.com) suggestion.
2. Removed the MTE checks based on Jason Gunthorpe's (jgg@nvidia.com)
observation that it is already done earlier in
kvm_arch_prepare_memory_region().
3. Dropped the pfn_valid() checks based on suggestions by
Catalin Marinas (catalin.marinas@arm.com).
4. Removed the code for exec fault handling as it is not needed
anymore.
v1 -> v2
1. Removed kvm_is_device_pfn() as a determiner for device type memory.
Instead using pfn_valid().
2. Added handling for MTE.
3. Minor cleanup.
Link: https://lore.kernel.org/all/20250310103008.3471-1-ankita@nvidia.com [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [3]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [4]
v9 Link: https://lore.kernel.org/all/20250621042111.3992-1-ankita@nvidia.com/
Ankit Agrawal (6):
KVM: arm64: Rename the device variable to s2_force_noncacheable
KVM: arm64: Update the check to detect device memory
KVM: arm64: Block cacheable PFNMAP mapping
KVM: arm64: New function to determine hardware cache management
support
KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
KVM: arm64: Expose new KVM cap for cacheable PFNMAP
Documentation/virt/kvm/api.rst | 13 +++-
arch/arm64/kvm/arm.c | 7 ++
arch/arm64/kvm/mmu.c | 118 ++++++++++++++++++++++++++-------
include/linux/kvm_host.h | 2 +
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 5 ++
6 files changed, 122 insertions(+), 24 deletions(-)
--
2.34.1
* [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
@ 2025-07-05 7:17 ` ankita
2025-07-07 0:51 ` Catalin Marinas
2025-07-05 7:17 ` [PATCH v10 2/6] KVM: arm64: Update the check to detect device memory ankita
` (6 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
For cache maintenance on a region, ARM KVM relies on that
region being mapped at a kernel virtual address, as CMOs
operate on VAs.
Currently the device variable is effectively trying to set up
the S2 mapping as non-cacheable for memory regions that are
not mapped in the kernel VA. This could be either Device or
Normal_NC depending on the VM_ALLOW_ANY_UNCACHED flag in the
VMA.
Thus "device" is better renamed to s2_force_noncacheable,
which conveys that it forces the region to be mapped as
non-cacheable.
CC: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2942ec92c5a4..1601ab9527d4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1478,7 +1478,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
int ret = 0;
bool write_fault, writable, force_pte = false;
bool exec_fault, mte_allowed;
- bool device = false, vfio_allow_any_uc = false;
+ bool s2_force_noncacheable = false, vfio_allow_any_uc = false;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
struct kvm *kvm = vcpu->kvm;
@@ -1653,7 +1653,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* In both cases, we don't let transparent_hugepage_adjust()
* change things at the last minute.
*/
- device = true;
+ s2_force_noncacheable = true;
} else if (logging_active && !write_fault) {
/*
* Only actually map the page as writable if this was a write
@@ -1662,7 +1662,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
writable = false;
}
- if (exec_fault && device)
+ if (exec_fault && s2_force_noncacheable)
return -ENOEXEC;
/*
@@ -1695,7 +1695,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* If we are not forced to use page mapping, check if we are
* backed by a THP and thus use block mapping if possible.
*/
- if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
+ if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) {
if (fault_is_perm && fault_granule > PAGE_SIZE)
vma_pagesize = fault_granule;
else
@@ -1709,7 +1709,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
}
- if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
+ if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) {
/* Check the VMM hasn't introduced a new disallowed VMA */
if (mte_allowed) {
sanitise_mte_tags(kvm, pfn, vma_pagesize);
@@ -1725,7 +1725,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (exec_fault)
prot |= KVM_PGTABLE_PROT_X;
- if (device) {
+ if (s2_force_noncacheable) {
if (vfio_allow_any_uc)
prot |= KVM_PGTABLE_PROT_NORMAL_NC;
else
--
2.34.1
* [PATCH v10 2/6] KVM: arm64: Update the check to detect device memory
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
2025-07-05 7:17 ` [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable ankita
@ 2025-07-05 7:17 ` ankita
2025-07-07 0:52 ` Catalin Marinas
2025-07-05 7:17 ` [PATCH v10 3/6] KVM: arm64: Block cacheable PFNMAP mapping ankita
` (5 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Currently, kvm_is_device_pfn() detects whether the memory is kernel
mapped through pfn_is_map_memory(). It thus implies whether KVM can
use Cache Maintenance Operations (CMOs) on that PFN. It is a bit
of a misnomer as it does not necessarily detect whether a PFN
is for device memory. Moreover, the function is only used in
one place.
It would be better to call pfn_is_map_memory() directly. Moreover,
we should restrict this call to VM_PFNMAP or VM_MIXEDMAP VMAs.
Non-PFNMAP or MIXEDMAP VMAs must always contain normal pages which are
struct-page backed, have KVAs and are cacheable. So we should always
be able to go from phys to KVA to do a CMO.
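As a rough illustration of that assumption (a simplified sketch loosely
modelled on the kvm_flush_dcache_to_poc() path; the helper below is
hypothetical, not kernel code):

    /*
     * For struct-page backed memory the linear map provides a KVA, so
     * a CMO can be applied directly. PFNMAP memory may have no such
     * KVA, which is why pfn_is_map_memory() is only consulted for
     * PFNMAP/MIXEDMAP VMAs.
     */
    static void flush_guest_page(kvm_pfn_t pfn, size_t size)
    {
            void *va = page_address(pfn_to_page(pfn));

            kvm_flush_dcache_to_poc(va, size);
    }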
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1601ab9527d4..5fe24f30999d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -193,11 +193,6 @@ int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
return 0;
}
-static bool kvm_is_device_pfn(unsigned long pfn)
-{
- return !pfn_is_map_memory(pfn);
-}
-
static void *stage2_memcache_zalloc_page(void *arg)
{
struct kvm_mmu_memory_cache *mc = arg;
@@ -1492,6 +1487,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
struct page *page;
+ vm_flags_t vm_flags;
enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
if (fault_is_perm)
@@ -1619,6 +1615,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+ vm_flags = vma->vm_flags;
+
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1642,7 +1640,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (is_error_noslot_pfn(pfn))
return -EFAULT;
- if (kvm_is_device_pfn(pfn)) {
+ if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
/*
* If the page was identified as device early by looking at
* the VMA flags, vma_pagesize is already representing the
--
2.34.1
* [PATCH v10 3/6] KVM: arm64: Block cacheable PFNMAP mapping
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
2025-07-05 7:17 ` [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable ankita
2025-07-05 7:17 ` [PATCH v10 2/6] KVM: arm64: Update the check to detect device memory ankita
@ 2025-07-05 7:17 ` ankita
2025-07-07 0:54 ` Catalin Marinas
2025-07-05 7:17 ` [PATCH v10 4/6] KVM: arm64: New function to determine hardware cache management support ankita
` (4 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Fixes a security bug due to mismatched attributes between the S1 and
S2 mappings.
Currently, it is possible for a region to be cacheable in the userspace
VMA, but mapped non-cacheable in S2. This creates a potential issue where
the VMM may sanitize cacheable memory across VMs using cacheable stores,
ensuring it is zeroed. However, if KVM subsequently assigns this memory
to a VM as uncached, the VM could end up accessing stale, non-zeroed data
from a previous VM, leading to unintended data exposure. This is a
security risk.
Block such mismatched-attribute cases by returning -EINVAL when userspace
tries to map a PFNMAP region cacheable. Only allow NORMAL_NC and DEVICE_*.
CC: Oliver Upton <oliver.upton@linux.dev>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Sean Christopherson <seanjc@google.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5fe24f30999d..68c0f1c25dec 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1465,6 +1465,22 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
+{
+ switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
+ case MT_NORMAL_NC:
+ case MT_DEVICE_nGnRnE:
+ case MT_DEVICE_nGnRE:
+ return false;
+ default:
+ return true;
+ }
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1472,7 +1488,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
{
int ret = 0;
bool write_fault, writable, force_pte = false;
- bool exec_fault, mte_allowed;
+ bool exec_fault, mte_allowed, is_vma_cacheable;
bool s2_force_noncacheable = false, vfio_allow_any_uc = false;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
@@ -1617,6 +1633,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vm_flags = vma->vm_flags;
+ is_vma_cacheable = kvm_vma_is_cacheable(vma);
+
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1660,6 +1678,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
writable = false;
}
+ /*
+ * Prohibit a region from being mapped non-cacheable in S2 while marked
+ * cacheable in the userspace VMA. Such a mismatched mapping is a
+ * security risk.
+ */
+ if (is_vma_cacheable && s2_force_noncacheable)
+ return -EINVAL;
+
if (exec_fault && s2_force_noncacheable)
return -ENOEXEC;
@@ -2219,6 +2245,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
ret = -EINVAL;
break;
}
+
+ /* Cacheable PFNMAP is not allowed */
+ if (kvm_vma_is_cacheable(vma)) {
+ ret = -EINVAL;
+ break;
+ }
}
hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
--
2.34.1
* [PATCH v10 4/6] KVM: arm64: New function to determine hardware cache management support
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
` (2 preceding siblings ...)
2025-07-05 7:17 ` [PATCH v10 3/6] KVM: arm64: Block cacheable PFNMAP mapping ankita
@ 2025-07-05 7:17 ` ankita
2025-07-05 7:17 ` [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
` (3 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
VM_PFNMAP VMAs are allowed to contain PTEs which point to physical
addresses that do not have a struct page and may not be in the kernel
direct map.
However, ARM64 KVM relies on a simple conversion from physaddr to a
kernel virtual address when it does cache maintenance, as the CMO
instructions work on virtual addresses. This simple approach does not
work for physical addresses from VM_PFNMAP since those addresses may
not have a kernel virtual address, or it may be difficult to find one.
Fortunately, if the ARM64 CPU has two features, S2FWB and CACHE DIC,
then KVM no longer needs to do cache flushing and NOPs all the
CMOs. This has the effect of no longer requiring a KVA for addresses
mapped into the S2.
Add a new function, kvm_arch_supports_cacheable_pfnmap(), to report
this capability. From a core perspective it means the arch can accept
a cacheable VM_PFNMAP as a memslot. From an ARM64 perspective it means
that no KVA is required.
CC: Jason Gunthorpe <jgg@nvidia.com>
CC: David Hildenbrand <david@redhat.com>
CC: Donald Dutile <ddutile@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 23 +++++++++++++++++++++++
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 5 +++++
3 files changed, 30 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 68c0f1c25dec..d8d2eb8a409e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1282,6 +1282,29 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
kvm_nested_s2_wp(kvm);
}
+/**
+ * kvm_arch_supports_cacheable_pfnmap() - Determine whether hardware
+ * supports cache management.
+ *
+ * ARM64 KVM relies on a simple conversion from physaddr to a kernel
+ * virtual address (KVA) when it does cache maintenance as the CMO
+ * instructions work on virtual addresses. This is incompatible with
+ * VM_PFNMAP VMAs which may not have a kernel direct mapping to a
+ * virtual address.
+ *
+ * With S2FWB and CACHE DIC features, KVM need not do cache flushing
+ * and CMOs are NOP'd. This has the effect of no longer requiring a
+ * KVA for addresses mapped into the S2. The presence of these features
+ * are thus necessary to support cacheable S2 mapping of VM_PFNMAP.
+ *
+ * Return: True if FWB and DIC is supported.
+ */
+bool kvm_arch_supports_cacheable_pfnmap(void)
+{
+ return cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
+ cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
+}
+
static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
{
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bde4fb5c6aa..c91d5b5f8c39 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1235,6 +1235,8 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm);
/* flush memory translations pointing to 'slot' */
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot);
+/* hardware supports cache management */
+bool kvm_arch_supports_cacheable_pfnmap(void);
int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn,
struct page **pages, int nr_pages);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eec82775c5bf..feacfb203a70 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1583,6 +1583,11 @@ static void kvm_replace_memslot(struct kvm *kvm,
#define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+bool __weak kvm_arch_supports_cacheable_pfnmap(void)
+{
+ return false;
+}
+
static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_userspace_memory_region2 *mem)
{
--
2.34.1
* [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
` (3 preceding siblings ...)
2025-07-05 7:17 ` [PATCH v10 4/6] KVM: arm64: New function to determine hardware cache management support ankita
@ 2025-07-05 7:17 ` ankita
2025-07-07 1:00 ` Catalin Marinas
` (2 more replies)
2025-07-05 7:17 ` [PATCH v10 6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
` (2 subsequent siblings)
7 siblings, 3 replies; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory() (which tracks whether the device memory
is in the kernel map) and ignores the per-VMA flags that indicate the
memory attributes. The KVM code is thus restrictive and allows only
memory that is added to the kernel to be marked as cacheable.
The device memory such as on the Grace Hopper/Blackwell systems
is interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.
Given that the GPU device memory is not added to the kernel (but is rather
VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
from being mapped Normal cacheable. This patch aims to solve that use case.
Note that when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory, linearly converting a
kvm_pte to a phys_addr and then to a KVA; see kvm_flush_dcache_to_poc().
Cache management thus relies on the memory being mapped. Moreover, the
ARM64_HAS_CACHE_DIC CPU cap allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. These two capabilities
are thus a requirement of the cacheable PFNMAP feature. Make use of
kvm_arch_supports_cacheable_pfnmap() to check them.
A cacheability check is made by consulting the VMA pgprot value.
If the pgprot mapping type is cacheable, it is safe to be mapped S2
cacheable as the KVM S2 will have the same Normal memory type as the
VMA has in the S1 and KVM has no additional responsibility for safety.
Checking pgprot as NORMAL is thus a KVM sanity check.
No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.
CC: Oliver Upton <oliver.upton@linux.dev>
CC: Sean Christopherson <seanjc@google.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
arch/arm64/kvm/mmu.c | 61 +++++++++++++++++++++++++++++---------------
1 file changed, 40 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d8d2eb8a409e..ded8a5d11fd3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1681,18 +1681,41 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (is_error_noslot_pfn(pfn))
return -EFAULT;
+ /*
+ * Check if this is a non-struct page memory PFN which cannot support
+ * CMOs. It could potentially be unsafe to access as cacheable.
+ */
if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
- /*
- * If the page was identified as device early by looking at
- * the VMA flags, vma_pagesize is already representing the
- * largest quantity we can map. If instead it was mapped
- * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
- * and must not be upgraded.
- *
- * In both cases, we don't let transparent_hugepage_adjust()
- * change things at the last minute.
- */
- s2_force_noncacheable = true;
+ if (is_vma_cacheable) {
+ /*
+ * Whilst the VMA owner expects a cacheable mapping of this
+ * PFN, hardware also has to support the FWB and CACHE DIC
+ * features.
+ *
+ * ARM64 KVM relies on kernel VA mapping to the PFN to
+ * perform cache maintenance as the CMO instructions work on
+ * virtual addresses. VM_PFNMAP regions are not necessarily
+ * mapped to a KVA and hence the presence of hardware features
+ * S2FWB and CACHE DIC are mandatory for cache maintenance.
+ *
+ * Check if the hardware supports it before allowing the VMA
+ * owner's request for a cacheable mapping.
+ */
+ if (!kvm_arch_supports_cacheable_pfnmap())
+ return -EFAULT;
+ } else {
+ /*
+ * If the page was identified as device early by looking at
+ * the VMA flags, vma_pagesize is already representing the
+ * largest quantity we can map. If instead it was mapped
+ * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+ * and must not be upgraded.
+ *
+ * In both cases, we don't let transparent_hugepage_adjust()
+ * change things at the last minute.
+ */
+ s2_force_noncacheable = true;
+ }
} else if (logging_active && !write_fault) {
/*
* Only actually map the page as writable if this was a write
@@ -1701,14 +1724,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
writable = false;
}
- /*
- * Prohibit a region from being mapped non-cacheable in S2 while marked
- * cacheable in the userspace VMA. Such a mismatched mapping is a
- * security risk.
- */
- if (is_vma_cacheable && s2_force_noncacheable)
- return -EINVAL;
-
if (exec_fault && s2_force_noncacheable)
return -ENOEXEC;
@@ -2269,8 +2284,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
break;
}
- /* Cacheable PFNMAP is not allowed */
- if (kvm_vma_is_cacheable(vma)) {
+ /*
+ * Cacheable PFNMAP is allowed only if the hardware
+ * supports it.
+ */
+ if (kvm_vma_is_cacheable(vma) &&
+ !kvm_arch_supports_cacheable_pfnmap()) {
ret = -EINVAL;
break;
}
--
2.34.1
* [PATCH v10 6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
` (4 preceding siblings ...)
2025-07-05 7:17 ` [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
@ 2025-07-05 7:17 ` ankita
2025-07-07 1:02 ` Catalin Marinas
2025-07-07 16:39 ` [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable Ankit Agrawal
2025-07-07 23:57 ` Oliver Upton
7 siblings, 1 reply; 17+ messages in thread
From: ankita @ 2025-07-05 7:17 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, david, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
From: Ankit Agrawal <ankita@nvidia.com>
Introduce a new KVM capability to expose to userspace whether
cacheable mapping of PFNMAP is supported.
The ability to safely do the cacheable mapping of PFNMAP is contingent
on S2FWB and ARM64_HAS_CACHE_DIC. S2FWB allows KVM to avoid flushing
the D-cache, and ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache
and turns icache_inval_pou() into a NOP. The cap is false if
those requirements are missing; this is checked by making use of
kvm_arch_supports_cacheable_pfnmap().
This capability would allow userspace to discover the support.
It could for instance be used by userspace to prevent live-migration
across FWB and non-FWB hosts.
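As a hypothetical VMM-side sketch of that use (the helper name and the
migration policy are assumptions, not part of this series):

    /*
     * Only allow migration between hosts that agree on cacheable
     * PFNMAP support, so stage-2 attributes stay consistent.
     */
    static bool migration_compatible(int vm_fd, bool dst_supported)
    {
            int src = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                            KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED);

            return (src > 0) == dst_supported;
    }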
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Oliver Upton <oliver.upton@linux.dev>
CC: David Hildenbrand <david@redhat.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Donald Dutile <ddutile@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
Documentation/virt/kvm/api.rst | 13 ++++++++++++-
arch/arm64/kvm/arm.c | 7 +++++++
include/uapi/linux/kvm.h | 1 +
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1bd2d42e6424..615cdbdd505f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8528,7 +8528,7 @@ ENOSYS for the others.
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.
-7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
-------------------------------------
:Architectures: arm64
@@ -8557,6 +8557,17 @@ given VM.
When this capability is enabled, KVM resets the VCPU when setting
MP_STATE_INIT_RECEIVED through IOCTL. The original MP_STATE is preserved.
+7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+-------------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: None
+
+This capability indicates to userspace whether a PFNMAP memory region
+can be safely mapped as cacheable. This relies on the presence of
+force write back (FWB) feature support on the hardware.
+
8. Other capabilities.
======================
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index de2b4e9c9f9f..9fb8901dcd86 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -408,6 +408,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
r = BIT(0);
break;
+ case KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED:
+ if (!kvm)
+ r = -EINVAL;
+ else
+ r = kvm_arch_supports_cacheable_pfnmap();
+ break;
+
default:
r = 0;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d00b85cb168c..ed9a46875a49 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -934,6 +934,7 @@ struct kvm_enable_cap {
#define KVM_CAP_ARM_EL2 240
#define KVM_CAP_ARM_EL2_E2H0 241
#define KVM_CAP_RISCV_MP_STATE_RESET 242
+#define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
struct kvm_irq_routing_irqchip {
__u32 irqchip;
--
2.34.1
* Re: [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable
2025-07-05 7:17 ` [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable ankita
@ 2025-07-07 0:51 ` Catalin Marinas
0 siblings, 0 replies; 17+ messages in thread
From: Catalin Marinas @ 2025-07-07 0:51 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
will, ryan.roberts, shahuang, lpieralisi, david, ddutile, seanjc,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On Sat, Jul 05, 2025 at 07:17:12AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> For cache maintenance on a region, ARM KVM relies on that
> region being mapped at a kernel virtual address, as CMOs
> operate on VAs.
>
> Currently the device variable is effectively trying to set up
> the S2 mapping as non-cacheable for memory regions that are
> not mapped in the kernel VA. This could be either Device or
> Normal_NC depending on the VM_ALLOW_ANY_UNCACHED flag in the
> VMA.
>
> Thus "device" is better renamed to s2_force_noncacheable,
> which conveys that it forces the region to be mapped as
> non-cacheable.
>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v10 2/6] KVM: arm64: Update the check to detect device memory
2025-07-05 7:17 ` [PATCH v10 2/6] KVM: arm64: Update the check to detect device memory ankita
@ 2025-07-07 0:52 ` Catalin Marinas
0 siblings, 0 replies; 17+ messages in thread
From: Catalin Marinas @ 2025-07-07 0:52 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
will, ryan.roberts, shahuang, lpieralisi, david, ddutile, seanjc,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On Sat, Jul 05, 2025 at 07:17:13AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Currently, kvm_is_device_pfn() detects whether the memory is kernel
> mapped through pfn_is_map_memory(). It thus implies whether KVM can
> use Cache Maintenance Operations (CMOs) on that PFN. It is a bit
> of a misnomer as it does not necessarily detect whether a PFN
> is for device memory. Moreover, the function is only used in
> one place.
>
> It would be better to call pfn_is_map_memory() directly. Moreover,
> we should restrict this call to VM_PFNMAP or VM_MIXEDMAP VMAs.
> Non-PFNMAP or MIXEDMAP VMAs must always contain normal pages which are
> struct-page backed, have KVAs and are cacheable. So we should always
> be able to go from phys to KVA to do a CMO.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v10 3/6] KVM: arm64: Block cacheable PFNMAP mapping
2025-07-05 7:17 ` [PATCH v10 3/6] KVM: arm64: Block cacheable PFNMAP mapping ankita
@ 2025-07-07 0:54 ` Catalin Marinas
0 siblings, 0 replies; 17+ messages in thread
From: Catalin Marinas @ 2025-07-07 0:54 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
will, ryan.roberts, shahuang, lpieralisi, david, ddutile, seanjc,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On Sat, Jul 05, 2025 at 07:17:14AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Fixes a security bug due to mismatched attributes between the S1 and
> S2 mappings.
>
> Currently, it is possible for a region to be cacheable in the userspace
> VMA, but mapped non-cacheable in S2. This creates a potential issue where
> the VMM may sanitize cacheable memory across VMs using cacheable stores,
> ensuring it is zeroed. However, if KVM subsequently assigns this memory
> to a VM as uncached, the VM could end up accessing stale, non-zeroed data
> from a previous VM, leading to unintended data exposure. This is a
> security risk.
>
> Block such mismatched-attribute cases by returning -EINVAL when userspace
> tries to map a PFNMAP region cacheable. Only allow NORMAL_NC and DEVICE_*.
>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> CC: Sean Christopherson <seanjc@google.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-07-05 7:17 ` [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
@ 2025-07-07 1:00 ` Catalin Marinas
2025-07-07 7:32 ` David Hildenbrand
2025-07-07 12:27 ` Jason Gunthorpe
2 siblings, 0 replies; 17+ messages in thread
From: Catalin Marinas @ 2025-07-07 1:00 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
will, ryan.roberts, shahuang, lpieralisi, david, ddutile, seanjc,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On Sat, Jul 05, 2025 at 07:17:16AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
> based on pfn_is_map_memory() (which tracks whether the device memory
> is in the kernel map) and ignores the per-VMA flags that indicate the
> memory attributes. The KVM code is thus restrictive and allows only
> memory that is added to the kernel to be marked as cacheable.
>
> The device memory such as on the Grace Hopper/Blackwell systems
> is interchangeable with DDR memory and retains properties such as
> cacheability, unaligned accesses, atomics and handling of executable
> faults. This requires the device memory to be mapped as NORMAL in
> stage-2.
>
> Given that the GPU device memory is not added to the kernel (but is rather
> VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
> VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
> from being mapped Normal cacheable. This patch aims to solve that use case.
>
> Note that when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory, linearly converting a
> kvm_pte to a phys_addr and then to a KVA; see kvm_flush_dcache_to_poc().
> Cache management thus relies on the memory being mapped. Moreover, the
> ARM64_HAS_CACHE_DIC CPU cap allows KVM to avoid flushing the icache
> and turns icache_inval_pou() into a NOP. These two capabilities
> are thus a requirement of the cacheable PFNMAP feature. Make use of
> kvm_arch_supports_cacheable_pfnmap() to check them.
>
> A cacheability check is made by consulting the VMA pgprot value.
> If the pgprot mapping type is cacheable, it is safe to be mapped S2
> cacheable as the KVM S2 will have the same Normal memory type as the
> VMA has in the S1 and KVM has no additional responsibility for safety.
> Checking pgprot as NORMAL is thus a KVM sanity check.
>
> No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
> already tests it at an early stage during memslot creation. There would
> not even be a fault if the memslot is not created.
>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Sean Christopherson <seanjc@google.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v10 6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP
2025-07-05 7:17 ` [PATCH v10 6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
@ 2025-07-07 1:02 ` Catalin Marinas
0 siblings, 0 replies; 17+ messages in thread
From: Catalin Marinas @ 2025-07-07 1:02 UTC (permalink / raw)
To: ankita
Cc: jgg, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
will, ryan.roberts, shahuang, lpieralisi, david, ddutile, seanjc,
aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On Sat, Jul 05, 2025 at 07:17:17AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Introduce a new KVM capability to expose to userspace whether
> cacheable mapping of PFNMAP is supported.
>
> The ability to safely do the cacheable mapping of PFNMAP is contingent
> on S2FWB and ARM64_HAS_CACHE_DIC. S2FWB allows KVM to avoid flushing
> the D-cache, and ARM64_HAS_CACHE_DIC allows KVM to avoid flushing the icache
> and turns icache_inval_pou() into a NOP. The cap is false if
> those requirements are missing; this is checked by making use of
> kvm_arch_supports_cacheable_pfnmap().
>
> This capability would allow userspace to discover the support.
> It could for instance be used by userspace to prevent live-migration
> across FWB and non-FWB hosts.
>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> CC: Jason Gunthorpe <jgg@nvidia.com>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: David Hildenbrand <david@redhat.com>
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-07-05 7:17 ` [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-07-07 1:00 ` Catalin Marinas
@ 2025-07-07 7:32 ` David Hildenbrand
2025-07-07 12:27 ` Jason Gunthorpe
2 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-07 7:32 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, will, ryan.roberts, shahuang,
lpieralisi, ddutile, seanjc
Cc: aniketa, cjia, kwankhede, kjaju, targupta, vsethi, acurrid,
apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On 05.07.25 09:17, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
> based on pfn_is_map_memory() (which tracks whether the device memory
> is in the kernel map) and ignores the per-VMA flags that indicate the
> memory attributes. The KVM code is thus restrictive and allows only
> memory that is added to the kernel to be marked as cacheable.
>
> The device memory such as on the Grace Hopper/Blackwell systems
> is interchangeable with DDR memory and retains properties such as
> cacheability, unaligned accesses, atomics and handling of executable
> faults. This requires the device memory to be mapped as NORMAL in
> stage-2.
>
> Given that the GPU device memory is not added to the kernel (but is rather
> VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
> VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
> from being mapped Normal cacheable. This patch aims to solve that use case.
>
> Note that when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory, linearly converting a
> kvm_pte to a phys_addr and then to a KVA; see kvm_flush_dcache_to_poc().
> Cache management thus relies on the memory being mapped. Moreover, the
> ARM64_HAS_CACHE_DIC CPU cap allows KVM to avoid flushing the icache
> and turns icache_inval_pou() into a NOP. These two capabilities
> are thus a requirement of the cacheable PFNMAP feature. Make use of
> kvm_arch_supports_cacheable_pfnmap() to check them.
>
> A cacheability check is made by consulting the VMA pgprot value.
> If the pgprot mapping type is cacheable, it is safe to be mapped S2
> cacheable as the KVM S2 will have the same Normal memory type as the
> VMA has in the S1 and KVM has no additional responsibility for safety.
> Checking pgprot as NORMAL is thus a KVM sanity check.
>
> No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
> already tests it at an early stage during memslot creation. There would
> not even be a fault if the memslot is not created.
>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Sean Christopherson <seanjc@google.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
2025-07-05 7:17 ` [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-07-07 1:00 ` Catalin Marinas
2025-07-07 7:32 ` David Hildenbrand
@ 2025-07-07 12:27 ` Jason Gunthorpe
2 siblings, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2025-07-07 12:27 UTC (permalink / raw)
To: ankita
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, ryan.roberts, shahuang, lpieralisi, david,
ddutile, seanjc, aniketa, cjia, kwankhede, kjaju, targupta,
vsethi, acurrid, apopple, jhubbard, danw, zhiw, mochs, udhoke,
dnigam, alex.williamson, sebastianene, coltonlewis, kevin.tian,
yi.l.liu, ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm,
linux-kernel, linux-arm-kernel, maobibo
On Sat, Jul 05, 2025 at 07:17:16AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
> based on pfn_is_map_memory() (which tracks whether the device memory
> is in the kernel map) and ignores the per-VMA flags that indicate the
> memory attributes. The KVM code is thus restrictive and allows only
> memory that is added to the kernel to be marked as cacheable.
>
> The device memory such as on the Grace Hopper/Blackwell systems
> is interchangeable with DDR memory and retains properties such as
> cacheability, unaligned accesses, atomics and handling of executable
> faults. This requires the device memory to be mapped as NORMAL in
> stage-2.
>
> Given that the GPU device memory is not added to the kernel (but is rather
> VMA mapped through remap_pfn_range() in the nvgrace-gpu module, which sets
> VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
> from being mapped Normal cacheable. This patch aims to solve that use case.
>
> Note that when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory, linearly converting a
> kvm_pte to a phys_addr and then to a KVA; see kvm_flush_dcache_to_poc().
> Cache management thus relies on the memory being mapped. Moreover, the
> ARM64_HAS_CACHE_DIC CPU cap allows KVM to avoid flushing the icache
> and turns icache_inval_pou() into a NOP. These two capabilities
> are thus a requirement of the cacheable PFNMAP feature. Make use of
> kvm_arch_supports_cacheable_pfnmap() to check them.
>
> A cacheability check is made by consulting the VMA pgprot value.
> If the pgprot mapping type is cacheable, it is safe to be mapped S2
> cacheable as the KVM S2 will have the same Normal memory type as the
> VMA has in the S1 and KVM has no additional responsibility for safety.
> Checking pgprot as NORMAL is thus a KVM sanity check.
>
> No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
> already tests it at an early stage during memslot creation. There would
> not even be a fault if the memslot is not created.
>
> CC: Oliver Upton <oliver.upton@linux.dev>
> CC: Sean Christopherson <seanjc@google.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Donald Dutile <ddutile@redhat.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> arch/arm64/kvm/mmu.c | 61 +++++++++++++++++++++++++++++---------------
> 1 file changed, 40 insertions(+), 21 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
` (5 preceding siblings ...)
2025-07-05 7:17 ` [PATCH v10 6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
@ 2025-07-07 16:39 ` Ankit Agrawal
2025-07-07 23:57 ` Oliver Upton
7 siblings, 0 replies; 17+ messages in thread
From: Ankit Agrawal @ 2025-07-07 16:39 UTC (permalink / raw)
To: Jason Gunthorpe, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
ddutile@redhat.com, seanjc@google.com
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
Thank you so much Jason, Catalin and David for reviewing the patch series
and providing your Reviewed-by.
Oliver, would you be able to apply the series if everything looks fine?
Thanks
Ankit Agrawal
* Re: [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
` (6 preceding siblings ...)
2025-07-07 16:39 ` [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable Ankit Agrawal
@ 2025-07-07 23:57 ` Oliver Upton
2025-07-09 14:34 ` Ankit Agrawal
7 siblings, 1 reply; 17+ messages in thread
From: Oliver Upton @ 2025-07-07 23:57 UTC (permalink / raw)
To: maz, joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
ryan.roberts, shahuang, lpieralisi, david, ddutile, seanjc,
Jason Gunthorpe, ankita
Cc: Oliver Upton, aniketa, cjia, kwankhede, kjaju, targupta, vsethi,
acurrid, apopple, jhubbard, danw, zhiw, mochs, udhoke, dnigam,
alex.williamson, sebastianene, coltonlewis, kevin.tian, yi.l.liu,
ardb, akpm, gshan, linux-mm, tabba, qperret, kvmarm, linux-kernel,
linux-arm-kernel, maobibo
On Sat, 05 Jul 2025 07:17:11 +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Grace-based platforms such as the Grace Hopper/Blackwell Superchips have
> CPU-accessible cache coherent GPU memory. The GPU device memory is
> essentially DDR memory and retains properties such as cacheability,
> unaligned accesses, atomics and handling of executable faults. This
> requires the device memory to be mapped as NORMAL in stage-2.
>
> [...]
I've gone through one additional round of bikeshedding on the series,
primarily fixing some typos and refining changelogs/comments. Note that
I squashed the kvm_arch_supports_cacheable_pfnmap() patch into the one
that adds its caller and unwired it from arch-neutral code entirely.
Please do shout if there's an issue with any of this and thanks for
keeping up with the several rounds of review.
Applied to next, thanks!
[1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable
https://git.kernel.org/kvmarm/kvmarm/c/8cc9dc1ae4fb
[2/6] KVM: arm64: Update the check to detect device memory
https://git.kernel.org/kvmarm/kvmarm/c/216887f79d98
[3/6] KVM: arm64: Block cacheable PFNMAP mapping
https://git.kernel.org/kvmarm/kvmarm/c/2a8dfab26677
[5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
https://git.kernel.org/kvmarm/kvmarm/c/0c67288e0c8b
[6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP
https://git.kernel.org/kvmarm/kvmarm/c/f55ce5a6cd33
--
Best,
Oliver
* Re: [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable
2025-07-07 23:57 ` Oliver Upton
@ 2025-07-09 14:34 ` Ankit Agrawal
0 siblings, 0 replies; 17+ messages in thread
From: Ankit Agrawal @ 2025-07-09 14:34 UTC (permalink / raw)
To: Oliver Upton, maz@kernel.org, joey.gouly@arm.com,
suzuki.poulose@arm.com, yuzenghui@huawei.com,
catalin.marinas@arm.com, will@kernel.org, ryan.roberts@arm.com,
shahuang@redhat.com, lpieralisi@kernel.org, david@redhat.com,
ddutile@redhat.com, seanjc@google.com, Jason Gunthorpe
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Krishnakant Jaju,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Zhi Wang, Matt Ochs, Uday Dhoke,
Dheeraj Nigam, alex.williamson@redhat.com,
sebastianene@google.com, coltonlewis@google.com,
kevin.tian@intel.com, yi.l.liu@intel.com, ardb@kernel.org,
akpm@linux-foundation.org, gshan@redhat.com, linux-mm@kvack.org,
tabba@google.com, qperret@google.com, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, maobibo@loongson.cn
Thank you so much, Oliver, for fixing up the series and applying it to
the next tree, and also for your feedback during the review process!
> I've gone through one additional round of bikeshedding on the series,
> primarily fixing some typos and refining changelogs/comments. Note that
> I squashed the kvm_arch_supports_cacheable_pfnmap() into the patch that
> adds its caller and unwired it from arch-neutral code entirely.
>
> Please do shout if there's an issue with any of this and thanks for
> keeping up with the several rounds of review.
>
> Applied to next, thanks!
>
> [1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable
> https://git.kernel.org/kvmarm/kvmarm/c/8cc9dc1ae4fb
>
> [2/6] KVM: arm64: Update the check to detect device memory
> https://git.kernel.org/kvmarm/kvmarm/c/216887f79d98
>
> [3/6] KVM: arm64: Block cacheable PFNMAP mapping
> https://git.kernel.org/kvmarm/kvmarm/c/2a8dfab26677
>
> [5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags
> https://git.kernel.org/kvmarm/kvmarm/c/0c67288e0c8b
>
> [6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP
> https://git.kernel.org/kvmarm/kvmarm/c/f55ce5a6cd33
Thread overview: 17+ messages
2025-07-05 7:17 [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable ankita
2025-07-05 7:17 ` [PATCH v10 1/6] KVM: arm64: Rename the device variable to s2_force_noncacheable ankita
2025-07-07 0:51 ` Catalin Marinas
2025-07-05 7:17 ` [PATCH v10 2/6] KVM: arm64: Update the check to detect device memory ankita
2025-07-07 0:52 ` Catalin Marinas
2025-07-05 7:17 ` [PATCH v10 3/6] KVM: arm64: Block cacheable PFNMAP mapping ankita
2025-07-07 0:54 ` Catalin Marinas
2025-07-05 7:17 ` [PATCH v10 4/6] KVM: arm64: New function to determine hardware cache management support ankita
2025-07-05 7:17 ` [PATCH v10 5/6] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags ankita
2025-07-07 1:00 ` Catalin Marinas
2025-07-07 7:32 ` David Hildenbrand
2025-07-07 12:27 ` Jason Gunthorpe
2025-07-05 7:17 ` [PATCH v10 6/6] KVM: arm64: Expose new KVM cap for cacheable PFNMAP ankita
2025-07-07 1:02 ` Catalin Marinas
2025-07-07 16:39 ` [PATCH v10 0/6] KVM: arm64: Map GPU device memory as cacheable Ankit Agrawal
2025-07-07 23:57 ` Oliver Upton
2025-07-09 14:34 ` Ankit Agrawal