* [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate)
@ 2025-08-19 21:51 Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot Mostafa Saleh
` (27 more replies)
0 siblings, 28 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
This is v4 of pKVM SMMUv3 support. This version is quite different from
the previous ones, as it implements nested SMMUv3 using trap-and-emulate:
v1: Implements a full-fledged pv interface
https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro.org/
v2: Implements a full-fledged pv interface (+ more features such as evtq and stage-1)
https://lore.kernel.org/kvmarm/20241212180423.1578358-1-smostafa@google.com/
v3: Only DMA isolation (using pv)
https://lore.kernel.org/kvmarm/20250728175316.3706196-1-smostafa@google.com/
Based on the feedback on v3, having a separate driver was too complicated
to maintain. The alternatives were either to integrate the KVM
implementation into the current driver and rely on impl ops (I have a PoC for that:
https://android-kvm.googlesource.com/linux/+log/refs/heads/pkvm-smmu-implops-poc)
or to go straight for the final goal, nested translation using trap and
emulate, which is what this series implements.
Another major change is that io-pgtable-arm is no longer split into a common
file; instead, the kernel-specific code (mostly memory allocation and
selftests) was factored out, based on Robin's feedback.
Design:
=======
Assumptions:
------------
As mentioned, this is a completely different approach which traps the
SMMUv3 MMIO space and emulates some of those accesses.
An important point is that this doesn't emulate the full SMMUv3
architecture, but only the parts used by the Linux kernel; that's why
enabling this (ARM_SMMU_V3_PKVM) depends on ARM_SMMU_V3=y, so we are sure of
the driver behaviour.
Any new change in the driver that violates these assumptions will likely
trigger a WARN_ON, ending up in a panic.
Most notable assumptions:
- Changing the stream table format/size or the L2 pointers is not allowed
after initialization.
- CFGI with leaf=0 is not allowed.
- CFGI_ALL with any range value other than 31 is not allowed.
- Commands which are not used by the driver are not allowed (e.g. CMD_TLBI_NH_ALL).
- Values set in ARM_SMMU_CR1 are hardcoded and don't change.
Emulation logic mainly targets:
1) Command Queue
----------------
At boot time, the hypervisor allocates a shadow command queue (which doesn't
need to match the host size) and sets it up in HW; it then traps accesses to:
i) ARM_SMMU_CMDQ_BASE
This can only be written while the cmdq is disabled. On enable, the hypervisor
puts the host command queue in a shared state, so it can't transition into the
hypervisor or VMs; it is unshared when the cmdq is disabled.
ii) ARM_SMMU_CMDQ_PROD
Triggers the emulation code, where the hypervisor copies the commands between
cons and prod of the host queue, sanitises them (mostly WARNing if the host is
malicious and issues commands it shouldn't), then eagerly consumes them,
updating the host cons; see the sketch below.
iii) ARM_SMMU_CMDQ_CONS
Not much logic; just returns the emulated cons + error bits.
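For illustration, the PROD-write emulation boils down to something like the
following sketch (the struct layout and all helpers are hypothetical, not the
actual code in this series):

static void smmu_emulate_cmdq_prod(struct hyp_arm_smmu_v3 *smmu, u32 new_prod)
{
	u64 cmd[CMDQ_ENT_DWORDS];
	u32 i;

	/* Walk the host queue from the emulated cons to the new prod. */
	for (i = smmu->host_cmdq.cons; i != new_prod; i = queue_inc(i)) {
		host_queue_read(&smmu->host_cmdq, i, cmd);

		/* Sanitise: WARN and drop anything the host shouldn't issue. */
		if (WARN_ON(!cmd_is_allowed(cmd)))
			continue;

		/* Some commands are rewritten, e.g. CFGI_STE shadows an STE. */
		shadow_queue_add(smmu, cmd);
	}

	/* Eagerly consume: advance the cons value the host will read back. */
	smmu->host_cmdq.cons = new_prod;
}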
2) Stream table
---------------
Similar to the command queue, the first level is allocated at boot with the
maximum possible size, then the hypervisor traps accesses to:
i) ARM_SMMU_STRTAB_BASE/ARM_SMMU_STRTAB_BASE_CFG: Keep track of the stream
table and put it in a shared state.
On CFGI_STE, the hypervisor reads the STE in scope from the host copy, shadows
the L2 pointers if needed and attaches stage-2, roughly as sketched below.
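A rough sketch of the CFGI_STE path (all helper names are hypothetical):

static int smmu_handle_cfgi_ste(struct hyp_arm_smmu_v3 *smmu, u32 sid)
{
	u64 ste[STRTAB_STE_DWORDS];

	/* Read the STE from the host's shared copy of the stream table. */
	host_strtab_read_ste(smmu, sid, ste);

	/* Shadow the L2 table holding this STE at EL2, if not done yet. */
	if (shadow_l2_strtab(smmu, sid))
		return -ENOMEM;

	/* Force the hypervisor-owned stage-2, then install the STE. */
	attach_stage2(smmu, sid, ste);
	shadow_strtab_write_ste(smmu, sid, ste);
	return 0;
}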
3) GBPA
-------
The hypervisor sets GBPA to abort at boot; after that, any read of GBPA from
the host returns the value last written by the host. The host only sets
ABORT, so that is fine. Otherwise, we could always report ABORT as set even
when it isn't, which would simply look like the HW not responding to updates.
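A minimal sketch of the GBPA emulation (the shadow field is an assumption):

/* HW GBPA stays at ABORT; the host just reads back its own last write. */
static void smmu_emulate_gbpa_write(struct hyp_arm_smmu_v3 *smmu, u32 val)
{
	smmu->gbpa_shadow = val;	/* never forwarded to HW */
}

static u32 smmu_emulate_gbpa_read(struct hyp_arm_smmu_v3 *smmu)
{
	return smmu->gbpa_shadow;
}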
Bisectability:
==============
I wrote the patches so that most of them are bisectable at run time (we can
run with a prefix of the series up to MMIO emulation, cmdq emulation, STE
shadowing or full nested). That was very helpful in debugging, and I kept it
like this to make debugging easier.
Constraints:
============
1) Discovery:
-------------
Only device tree is supported at the moment.
I don't usually use ACPI, but I can look into adding that later (so as not to
make this series bigger).
2) Errata:
----------
Some HW has both stage-1 and stage-2 but can't run nested translation due to
errata, which makes the driver disable nesting for MMU_700; I believe this is
too restrictive. At the moment KVM will use nesting if it is advertised.
(Alternatively, we need another mechanism to exclude only the affected HW.)
3) Shadow page table
--------------------
The shadow page table uses page granularity (leaf entries) for memory,
because of the lack of split_block_unmap() logic. I am currently looking into
the possibility of sharing page tables; if that turns out to be complicated
(as expected), it might be worth re-adding this logic.
Boot flow:
==========
The hypervisor initialises at “module_init”.
Before that, at “core_initcall”, the SMMUv3 KVM code will:
- Register the hypervisor ops with the hypervisor.
- Parse the device tree and populate an array with the SMMUs for the hypervisor.
At “module_init”, the hypervisor init runs, where the SMMU driver will:
- Take over the SMMUs' description.
- Probe the SMMUs (from the IDRs); I tried to make most of this code common
using macros.
- Take over the SMMUs' MMIO space so it is trapped.
- Take over and set up the shadow command queue and stream table.
With “ARM_SMMU_V3_PKVM” enabled, the current SMMU driver registers at
“device_initcall_sync” so it runs after the kernel de-privileges and the
hypervisor is set up; the resulting ordering is sketched below.
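For illustration, the resulting init ordering (the first and last function
names are hypothetical placeholders; kvm_arm_init is the existing KVM entry
point):

/* core_initcall: register hyp ops, parse DT, hand the SMMU array to EL2. */
core_initcall(kvm_arm_smmu_v3_register);
/* module_init: KVM init runs, the hyp SMMU driver probes and takes over HW. */
module_init(kvm_arm_init);
/* device_initcall_sync: the kernel SMMUv3 driver probes, now fully trapped. */
device_initcall_sync(kvm_arm_smmu_v3_driver_init);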
Future work
===========
1) Sharing page tables would be an interesting optimization, but it requires
dealing with stage-2 page faults (which are handled by the kernel), BBM, and
possibly more complexity.
2) There is ongoing work to enable RPM, which may enable/disable the SMMU
frequently; we might need some optimizations to avoid re-shadowing the
CMDQ/STE unnecessarily.
3) Look into ACPI support.
Patches overview
================
The patches are split as follows:
Patches 01-03: Core hypervisor: Add donation for NC, dealing with MMIO,
and arch timer abstraction.
Patches 04-08: Refactoring of io-pgtable-arm and SMMUv3 driver
Patches 09-12: Hypervisor IOMMU core: IOMMU pagetable management, dabts…
Patches 13-28: KVM SMMUv3 code
Tested on QEMU (S1 only, S2 only and nested) and a Morello board.
Also tested with PAGE_SIZE of 4K, 16K, and 64K.
A development branch can be found at:
https://android-kvm.googlesource.com/linux/+log/refs/heads/pkvm-smmu-v4
Jean-Philippe Brucker (1):
iommu/arm-smmu-v3-kvm: Add SMMUv3 driver
Mostafa Saleh (27):
KVM: arm64: Add a new function to donate memory with prot
KVM: arm64: Donate MMIO to the hypervisor
KVM: arm64: pkvm: Add pkvm_time_get()
iommu/io-pgtable-arm: Move selftests to a separate file
iommu/io-pgtable-arm: Factor kernel specific code out
iommu/arm-smmu-v3: Split code with hyp
iommu/arm-smmu-v3: Move TLB range invalidation into a macro
iommu/arm-smmu-v3: Move IDR parsing to common functions
KVM: arm64: iommu: Introduce IOMMU driver infrastructure
KVM: arm64: iommu: Shadow host stage-2 page table
KVM: arm64: iommu: Add memory pool
KVM: arm64: iommu: Support DABT for IOMMU
iommu/arm-smmu-v3: Add KVM mode in the driver
iommu/arm-smmu-v3: Load the driver later in KVM mode
iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3
iommu/arm-smmu-v3-kvm: Take over SMMUs
iommu/arm-smmu-v3-kvm: Probe SMMU HW
iommu/arm-smmu-v3-kvm: Add MMIO emulation
iommu/arm-smmu-v3-kvm: Shadow the command queue
iommu/arm-smmu-v3-kvm: Add CMDQ functions
iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
iommu/arm-smmu-v3-kvm: Shadow stream table
iommu/arm-smmu-v3-kvm: Shadow STEs
iommu/arm-smmu-v3-kvm: Emulate GBPA
iommu/arm-smmu-v3-kvm: Support io-pgtable
iommu/arm-smmu-v3-kvm: Shadow the CPU stage-2 page table
iommu/arm-smmu-v3-kvm: Enable nesting
arch/arm64/include/asm/kvm_arm.h | 2 +
arch/arm64/include/asm/kvm_host.h | 7 +
arch/arm64/kvm/Makefile | 3 +-
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 21 +
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 3 +
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 2 +
arch/arm64/kvm/hyp/nvhe/Makefile | 10 +-
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 130 +++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 90 +-
arch/arm64/kvm/hyp/nvhe/setup.c | 17 +
arch/arm64/kvm/hyp/nvhe/timer-sr.c | 33 +
arch/arm64/kvm/hyp/pgtable.c | 9 +-
arch/arm64/kvm/iommu.c | 32 +
arch/arm64/kvm/pkvm.c | 1 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/arm/Kconfig | 9 +
drivers/iommu/arm/arm-smmu-v3/Makefile | 3 +-
.../arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c | 114 ++
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 158 +++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 342 +-----
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 220 ++++
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 1036 +++++++++++++++++
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 67 ++
.../arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c | 64 +
drivers/iommu/io-pgtable-arm-kernel.c | 305 +++++
drivers/iommu/io-pgtable-arm.c | 346 +-----
drivers/iommu/io-pgtable-arm.h | 66 ++
27 files changed, 2439 insertions(+), 653 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
create mode 100644 arch/arm64/kvm/iommu.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
create mode 100644 drivers/iommu/arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c
create mode 100644 drivers/iommu/io-pgtable-arm-kernel.c
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 13:46 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor Mostafa Saleh
` (26 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Soon, IOMMU drivers running in the hypervisor might interact with
non-coherent devices, so they need a mechanism to map memory as
non-cacheable.
Add ___pkvm_host_donate_hyp(), which accepts a new argument for prot,
so a driver can add KVM_PGTABLE_PROT_NORMAL_NC.
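For illustration, a hyp IOMMU driver for a non-coherent device could then
donate its page-table memory like this (a minimal sketch; everything around
the new helper is assumed):

	/* Map the donated range at EL2 as RW + Normal-NC. */
	ret = ___pkvm_host_donate_hyp(hyp_phys_to_pfn(phys), nr_pages,
				      PAGE_HYP | KVM_PGTABLE_PROT_NORMAL_NC);
	if (ret)
		return ret;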
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 5f9d56754e39..52d7ee91e18c 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -36,6 +36,7 @@ int __pkvm_prot_finalize(void);
int __pkvm_host_share_hyp(u64 pfn);
int __pkvm_host_unshare_hyp(u64 pfn);
int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
+int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 8957734d6183..861e448183fd 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -769,13 +769,15 @@ int __pkvm_host_unshare_hyp(u64 pfn)
return ret;
}
-int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
+int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
{
u64 phys = hyp_pfn_to_phys(pfn);
u64 size = PAGE_SIZE * nr_pages;
void *virt = __hyp_va(phys);
int ret;
+ WARN_ON(prot & KVM_PGTABLE_PROT_X);
+
host_lock_component();
hyp_lock_component();
@@ -787,7 +789,7 @@ int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
goto unlock;
__hyp_set_page_state_range(phys, size, PKVM_PAGE_OWNED);
- WARN_ON(pkvm_create_mappings_locked(virt, virt + size, PAGE_HYP));
+ WARN_ON(pkvm_create_mappings_locked(virt, virt + size, prot));
WARN_ON(host_stage2_set_owner_locked(phys, size, PKVM_ID_HYP));
unlock:
@@ -797,6 +799,11 @@ int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
return ret;
}
+int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
+{
+ return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
+}
+
int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages)
{
u64 phys = hyp_pfn_to_phys(pfn);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 14:12 ` Will Deacon
2025-09-14 20:41 ` Pranjal Shrivastava
2025-08-19 21:51 ` [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get() Mostafa Saleh
` (25 subsequent siblings)
27 siblings, 2 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Add a function to donate MMIO to the hypervisor, so IOMMU hypervisor
drivers can use it to protect the MMIO of the IOMMU.
The initial attempt at implementing this was to add a new flag to
"___pkvm_host_donate_hyp" so it would accept MMIO. However, that had many
problems: it was quite intrusive for the host/hyp to check/set the page
state to make it aware of MMIO, and to encode that state in the page table,
in paths that can be performance sensitive (FFA, VMs...).
As donating MMIO is very rare, and we don't need to encode the full state,
it's reasonable to have a separate function for this.
It will init the host stage-2 page table with an invalid leaf carrying the
owner ID, to prevent the host from mapping the page on faults.
Also, prevent kvm_pgtable_stage2_unmap() from removing the owner ID from
stage-2 PTEs, as this can be triggered from the recycle logic under memory
pressure. No code relies on this behaviour, as all ownership changes are
done via kvm_pgtable_stage2_set_owner().
For the error path in IOMMU drivers, add a function to donate MMIO back
from hyp to the host.
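For illustration, a hyp IOMMU driver taking over its MMIO region might use
the pair like this (a sketch; only the two new helpers are from this patch):

	/* Donate each MMIO page of the IOMMU to EL2, unwinding on failure. */
	for (i = 0; i < nr_pages; i++) {
		ret = __pkvm_host_donate_hyp_mmio(pfn + i);
		if (ret) {
			while (i--)
				WARN_ON(__pkvm_hyp_donate_host_mmio(pfn + i));
			return ret;
		}
	}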
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 64 +++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 9 +--
3 files changed, 68 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 52d7ee91e18c..98e173da0f9b 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -37,6 +37,8 @@ int __pkvm_host_share_hyp(u64 pfn);
int __pkvm_host_unshare_hyp(u64 pfn);
int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
+int __pkvm_host_donate_hyp_mmio(u64 pfn);
+int __pkvm_hyp_donate_host_mmio(u64 pfn);
int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 861e448183fd..c9a15ef6b18d 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
return ret;
}
+int __pkvm_host_donate_hyp_mmio(u64 pfn)
+{
+ u64 phys = hyp_pfn_to_phys(pfn);
+ void *virt = __hyp_va(phys);
+ int ret;
+ kvm_pte_t pte;
+
+ host_lock_component();
+ hyp_lock_component();
+
+ ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
+ if (ret)
+ goto unlock;
+
+ if (pte && !kvm_pte_valid(pte)) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
+ ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
+ if (ret)
+ goto unlock;
+ if (pte) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
+ if (ret)
+ goto unlock;
+ /*
+ * We set HYP as the owner of the MMIO pages in the host stage-2, for:
+ * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
+ * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
+ * kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
+ * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
+ */
+ WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
+ PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
+unlock:
+ hyp_unlock_component();
+ host_unlock_component();
+
+ return ret;
+}
+
+int __pkvm_hyp_donate_host_mmio(u64 pfn)
+{
+ u64 phys = hyp_pfn_to_phys(pfn);
+ u64 virt = (u64)__hyp_va(phys);
+ size_t size = PAGE_SIZE;
+
+ host_lock_component();
+ hyp_lock_component();
+
+ WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
+ WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
+ PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
+ hyp_unlock_component();
+ host_unlock_component();
+
+ return 0;
+}
+
int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
{
return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c351b4abd5db..ba06b0c21d5a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
kvm_pte_t *childp = NULL;
bool need_flush = false;
- if (!kvm_pte_valid(ctx->old)) {
- if (stage2_pte_is_counted(ctx->old)) {
- kvm_clear_pte(ctx->ptep);
- mm_ops->put_page(ctx->ptep);
- }
- return 0;
- }
+ if (!kvm_pte_valid(ctx->old))
+ return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
if (kvm_pte_table(ctx->old, ctx->level)) {
childp = kvm_pte_follow(ctx->old, mm_ops);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get()
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 14:16 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file Mostafa Saleh
` (24 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Add a function to return the time in us.
This can be used from IOMMU drivers while polling for conditions, e.g. the
SMMUv3 TLB invalidation code waiting for a CMD_SYNC to complete.
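For illustration, the expected usage is a bounded poll loop like this sketch
(the condition helper is hypothetical; the timeout constant is borrowed from
the kernel driver):

	u64 timeout = pkvm_time_get() + ARM_SMMU_POLL_TIMEOUT_US;

	while (!smmu_sync_done(smmu)) {
		if (pkvm_time_get() > timeout)
			return -ETIMEDOUT;
	}
	return 0;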
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 2 ++
arch/arm64/kvm/hyp/nvhe/setup.c | 4 ++++
arch/arm64/kvm/hyp/nvhe/timer-sr.c | 33 ++++++++++++++++++++++++++
3 files changed, 39 insertions(+)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
index ce31d3b73603..6c19691720cd 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
@@ -87,4 +87,6 @@ bool kvm_handle_pvm_restricted(struct kvm_vcpu *vcpu, u64 *exit_code);
void kvm_init_pvm_id_regs(struct kvm_vcpu *vcpu);
int kvm_check_pvm_sysreg_table(void);
+int pkvm_timer_init(void);
+u64 pkvm_time_get(void);
#endif /* __ARM64_KVM_NVHE_PKVM_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index a48d3f5a5afb..ee6435473204 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -304,6 +304,10 @@ void __noreturn __pkvm_init_finalise(void)
};
pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
+ ret = pkvm_timer_init();
+ if (ret)
+ goto out;
+
ret = fix_host_ownership();
if (ret)
goto out;
diff --git a/arch/arm64/kvm/hyp/nvhe/timer-sr.c b/arch/arm64/kvm/hyp/nvhe/timer-sr.c
index ff176f4ce7de..e166cd5a56b8 100644
--- a/arch/arm64/kvm/hyp/nvhe/timer-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/timer-sr.c
@@ -11,6 +11,10 @@
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
+#include <nvhe/pkvm.h>
+
+static u32 timer_freq;
+
void __kvm_timer_set_cntvoff(u64 cntvoff)
{
write_sysreg(cntvoff, cntvoff_el2);
@@ -68,3 +72,32 @@ void __timer_enable_traps(struct kvm_vcpu *vcpu)
sysreg_clear_set(cnthctl_el2, clr, set);
}
+
+static u64 pkvm_ticks_get(void)
+{
+ return __arch_counter_get_cntvct();
+}
+
+#define SEC_TO_US 1000000
+
+int pkvm_timer_init(void)
+{
+ timer_freq = read_sysreg(cntfrq_el0);
+ /*
+ * TODO: The highest privileged level is supposed to initialize this
+ * register. But on some systems (which?), this information is only
+ * contained in the device-tree, so we'll need to find it out some other
+ * way.
+ */
+ if (!timer_freq || timer_freq < SEC_TO_US)
+ return -ENODEV;
+ return 0;
+}
+
+#define pkvm_time_ticks_to_us(ticks) ((u64)(ticks) * SEC_TO_US / timer_freq)
+
+/* Return time in us. */
+u64 pkvm_time_get(void)
+{
+ return pkvm_time_ticks_to_us(pkvm_ticks_get());
+}
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (2 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get() Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-15 14:37 ` Pranjal Shrivastava
2025-09-15 16:45 ` Jason Gunthorpe
2025-08-19 21:51 ` [PATCH v4 05/28] iommu/io-pgtable-arm: Factor kernel specific code out Mostafa Saleh
` (23 subsequent siblings)
27 siblings, 2 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Soon, io-pgtable-arm.c will be compiled as part of KVM/arm64 in the
hypervisor object, which doesn't have many of the kernel APIs, such as
faux devices, printk...
We need to factor these things out of this file. This patch moves the
selftests out, which removes many kernel dependencies; the selftests are
also not needed by the hypervisor.
Create io-pgtable-arm-kernel.c for that; in the next patch the rest of
the kernel-specific code is factored out.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/Makefile | 2 +-
drivers/iommu/io-pgtable-arm-kernel.c | 216 +++++++++++++++++++++++
drivers/iommu/io-pgtable-arm.c | 245 --------------------------
drivers/iommu/io-pgtable-arm.h | 41 +++++
4 files changed, 258 insertions(+), 246 deletions(-)
create mode 100644 drivers/iommu/io-pgtable-arm-kernel.c
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 355294fa9033..d601b0e25ef5 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
-obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
+obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o io-pgtable-arm-kernel.o
obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
obj-$(CONFIG_IOMMU_IOVA) += iova.o
obj-$(CONFIG_OF_IOMMU) += of_iommu.o
diff --git a/drivers/iommu/io-pgtable-arm-kernel.c b/drivers/iommu/io-pgtable-arm-kernel.c
new file mode 100644
index 000000000000..f3b869310964
--- /dev/null
+++ b/drivers/iommu/io-pgtable-arm-kernel.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CPU-agnostic ARM page table allocator.
+ *
+ * Copyright (C) 2014 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ */
+#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
+
+#include <linux/device/faux.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include "io-pgtable-arm.h"
+
+#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
+
+static struct io_pgtable_cfg *cfg_cookie __initdata;
+
+static void __init dummy_tlb_flush_all(void *cookie)
+{
+ WARN_ON(cookie != cfg_cookie);
+}
+
+static void __init dummy_tlb_flush(unsigned long iova, size_t size,
+ size_t granule, void *cookie)
+{
+ WARN_ON(cookie != cfg_cookie);
+ WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
+}
+
+static void __init dummy_tlb_add_page(struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t granule,
+ void *cookie)
+{
+ dummy_tlb_flush(iova, granule, granule, cookie);
+}
+
+static const struct iommu_flush_ops dummy_tlb_ops __initconst = {
+ .tlb_flush_all = dummy_tlb_flush_all,
+ .tlb_flush_walk = dummy_tlb_flush,
+ .tlb_add_page = dummy_tlb_add_page,
+};
+
+static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+ pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
+ cfg->pgsize_bitmap, cfg->ias);
+ pr_err("data: %d levels, 0x%zx pgd_size, %u pg_shift, %u bits_per_level, pgd @ %p\n",
+ ARM_LPAE_MAX_LEVELS - data->start_level, ARM_LPAE_PGD_SIZE(data),
+ ilog2(ARM_LPAE_GRANULE(data)), data->bits_per_level, data->pgd);
+}
+
+#define __FAIL(ops, i) ({ \
+ WARN(1, "selftest: test failed for fmt idx %d\n", (i)); \
+ arm_lpae_dump_ops(ops); \
+ -EFAULT; \
+})
+
+static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
+{
+ static const enum io_pgtable_fmt fmts[] __initconst = {
+ ARM_64_LPAE_S1,
+ ARM_64_LPAE_S2,
+ };
+
+ int i, j;
+ unsigned long iova;
+ size_t size, mapped;
+ struct io_pgtable_ops *ops;
+
+ for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
+ cfg_cookie = cfg;
+ ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
+ if (!ops) {
+ pr_err("selftest: failed to allocate io pgtable ops\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * Initial sanity checks.
+ * Empty page tables shouldn't provide any translations.
+ */
+ if (ops->iova_to_phys(ops, 42))
+ return __FAIL(ops, i);
+
+ if (ops->iova_to_phys(ops, SZ_1G + 42))
+ return __FAIL(ops, i);
+
+ if (ops->iova_to_phys(ops, SZ_2G + 42))
+ return __FAIL(ops, i);
+
+ /*
+ * Distinct mappings of different granule sizes.
+ */
+ iova = 0;
+ for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
+ size = 1UL << j;
+
+ if (ops->map_pages(ops, iova, iova, size, 1,
+ IOMMU_READ | IOMMU_WRITE |
+ IOMMU_NOEXEC | IOMMU_CACHE,
+ GFP_KERNEL, &mapped))
+ return __FAIL(ops, i);
+
+ /* Overlapping mappings */
+ if (!ops->map_pages(ops, iova, iova + size, size, 1,
+ IOMMU_READ | IOMMU_NOEXEC,
+ GFP_KERNEL, &mapped))
+ return __FAIL(ops, i);
+
+ if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
+ return __FAIL(ops, i);
+
+ iova += SZ_1G;
+ }
+
+ /* Full unmap */
+ iova = 0;
+ for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
+ size = 1UL << j;
+
+ if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
+ return __FAIL(ops, i);
+
+ if (ops->iova_to_phys(ops, iova + 42))
+ return __FAIL(ops, i);
+
+ /* Remap full block */
+ if (ops->map_pages(ops, iova, iova, size, 1,
+ IOMMU_WRITE, GFP_KERNEL, &mapped))
+ return __FAIL(ops, i);
+
+ if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
+ return __FAIL(ops, i);
+
+ iova += SZ_1G;
+ }
+
+ /*
+ * Map/unmap the last largest supported page of the IAS, this can
+ * trigger corner cases in the concatenated page tables.
+ */
+ mapped = 0;
+ size = 1UL << __fls(cfg->pgsize_bitmap);
+ iova = (1UL << cfg->ias) - size;
+ if (ops->map_pages(ops, iova, iova, size, 1,
+ IOMMU_READ | IOMMU_WRITE |
+ IOMMU_NOEXEC | IOMMU_CACHE,
+ GFP_KERNEL, &mapped))
+ return __FAIL(ops, i);
+ if (mapped != size)
+ return __FAIL(ops, i);
+ if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
+ return __FAIL(ops, i);
+
+ free_io_pgtable_ops(ops);
+ }
+
+ return 0;
+}
+
+static int __init arm_lpae_do_selftests(void)
+{
+ static const unsigned long pgsize[] __initconst = {
+ SZ_4K | SZ_2M | SZ_1G,
+ SZ_16K | SZ_32M,
+ SZ_64K | SZ_512M,
+ };
+
+ static const unsigned int address_size[] __initconst = {
+ 32, 36, 40, 42, 44, 48,
+ };
+
+ int i, j, k, pass = 0, fail = 0;
+ struct faux_device *dev;
+ struct io_pgtable_cfg cfg = {
+ .tlb = &dummy_tlb_ops,
+ .coherent_walk = true,
+ .quirks = IO_PGTABLE_QUIRK_NO_WARN,
+ };
+
+ dev = faux_device_create("io-pgtable-test", NULL, 0);
+ if (!dev)
+ return -ENOMEM;
+
+ cfg.iommu_dev = &dev->dev;
+
+ for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
+ for (j = 0; j < ARRAY_SIZE(address_size); ++j) {
+ /* Don't use ias > oas as it is not valid for stage-2. */
+ for (k = 0; k <= j; ++k) {
+ cfg.pgsize_bitmap = pgsize[i];
+ cfg.ias = address_size[k];
+ cfg.oas = address_size[j];
+ pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u OAS %u\n",
+ pgsize[i], cfg.ias, cfg.oas);
+ if (arm_lpae_run_tests(&cfg))
+ fail++;
+ else
+ pass++;
+ }
+ }
+ }
+
+ pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
+ faux_device_destroy(dev);
+
+ return fail ? -EFAULT : 0;
+}
+subsys_initcall(arm_lpae_do_selftests);
+#endif
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 96425e92f313..791a2c4ecb83 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -7,15 +7,10 @@
* Author: Will Deacon <will.deacon@arm.com>
*/
-#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
-
#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/io-pgtable.h>
-#include <linux/kernel.h>
-#include <linux/device/faux.h>
#include <linux/sizes.h>
-#include <linux/slab.h>
#include <linux/types.h>
#include <linux/dma-mapping.h>
@@ -24,33 +19,6 @@
#include "io-pgtable-arm.h"
#include "iommu-pages.h"
-#define ARM_LPAE_MAX_ADDR_BITS 52
-#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
-#define ARM_LPAE_MAX_LEVELS 4
-
-/* Struct accessors */
-#define io_pgtable_to_data(x) \
- container_of((x), struct arm_lpae_io_pgtable, iop)
-
-#define io_pgtable_ops_to_data(x) \
- io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
-
-/*
- * Calculate the right shift amount to get to the portion describing level l
- * in a virtual address mapped by the pagetable in d.
- */
-#define ARM_LPAE_LVL_SHIFT(l,d) \
- (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
- ilog2(sizeof(arm_lpae_iopte)))
-
-#define ARM_LPAE_GRANULE(d) \
- (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
-#define ARM_LPAE_PGD_SIZE(d) \
- (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
-
-#define ARM_LPAE_PTES_PER_TABLE(d) \
- (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
-
/*
* Calculate the index at level l used to map virtual address a using the
* pagetable in d.
@@ -163,18 +131,6 @@
#define iopte_set_writeable_clean(ptep) \
set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)(ptep))
-struct arm_lpae_io_pgtable {
- struct io_pgtable iop;
-
- int pgd_bits;
- int start_level;
- int bits_per_level;
-
- void *pgd;
-};
-
-typedef u64 arm_lpae_iopte;
-
static inline bool iopte_leaf(arm_lpae_iopte pte, int lvl,
enum io_pgtable_fmt fmt)
{
@@ -1274,204 +1230,3 @@ struct io_pgtable_init_fns io_pgtable_arm_mali_lpae_init_fns = {
.alloc = arm_mali_lpae_alloc_pgtable,
.free = arm_lpae_free_pgtable,
};
-
-#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
-
-static struct io_pgtable_cfg *cfg_cookie __initdata;
-
-static void __init dummy_tlb_flush_all(void *cookie)
-{
- WARN_ON(cookie != cfg_cookie);
-}
-
-static void __init dummy_tlb_flush(unsigned long iova, size_t size,
- size_t granule, void *cookie)
-{
- WARN_ON(cookie != cfg_cookie);
- WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
-}
-
-static void __init dummy_tlb_add_page(struct iommu_iotlb_gather *gather,
- unsigned long iova, size_t granule,
- void *cookie)
-{
- dummy_tlb_flush(iova, granule, granule, cookie);
-}
-
-static const struct iommu_flush_ops dummy_tlb_ops __initconst = {
- .tlb_flush_all = dummy_tlb_flush_all,
- .tlb_flush_walk = dummy_tlb_flush,
- .tlb_add_page = dummy_tlb_add_page,
-};
-
-static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
-{
- struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
-
- pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
- cfg->pgsize_bitmap, cfg->ias);
- pr_err("data: %d levels, 0x%zx pgd_size, %u pg_shift, %u bits_per_level, pgd @ %p\n",
- ARM_LPAE_MAX_LEVELS - data->start_level, ARM_LPAE_PGD_SIZE(data),
- ilog2(ARM_LPAE_GRANULE(data)), data->bits_per_level, data->pgd);
-}
-
-#define __FAIL(ops, i) ({ \
- WARN(1, "selftest: test failed for fmt idx %d\n", (i)); \
- arm_lpae_dump_ops(ops); \
- -EFAULT; \
-})
-
-static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
-{
- static const enum io_pgtable_fmt fmts[] __initconst = {
- ARM_64_LPAE_S1,
- ARM_64_LPAE_S2,
- };
-
- int i, j;
- unsigned long iova;
- size_t size, mapped;
- struct io_pgtable_ops *ops;
-
- for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
- cfg_cookie = cfg;
- ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
- if (!ops) {
- pr_err("selftest: failed to allocate io pgtable ops\n");
- return -ENOMEM;
- }
-
- /*
- * Initial sanity checks.
- * Empty page tables shouldn't provide any translations.
- */
- if (ops->iova_to_phys(ops, 42))
- return __FAIL(ops, i);
-
- if (ops->iova_to_phys(ops, SZ_1G + 42))
- return __FAIL(ops, i);
-
- if (ops->iova_to_phys(ops, SZ_2G + 42))
- return __FAIL(ops, i);
-
- /*
- * Distinct mappings of different granule sizes.
- */
- iova = 0;
- for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
- size = 1UL << j;
-
- if (ops->map_pages(ops, iova, iova, size, 1,
- IOMMU_READ | IOMMU_WRITE |
- IOMMU_NOEXEC | IOMMU_CACHE,
- GFP_KERNEL, &mapped))
- return __FAIL(ops, i);
-
- /* Overlapping mappings */
- if (!ops->map_pages(ops, iova, iova + size, size, 1,
- IOMMU_READ | IOMMU_NOEXEC,
- GFP_KERNEL, &mapped))
- return __FAIL(ops, i);
-
- if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
- return __FAIL(ops, i);
-
- iova += SZ_1G;
- }
-
- /* Full unmap */
- iova = 0;
- for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
- size = 1UL << j;
-
- if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
- return __FAIL(ops, i);
-
- if (ops->iova_to_phys(ops, iova + 42))
- return __FAIL(ops, i);
-
- /* Remap full block */
- if (ops->map_pages(ops, iova, iova, size, 1,
- IOMMU_WRITE, GFP_KERNEL, &mapped))
- return __FAIL(ops, i);
-
- if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
- return __FAIL(ops, i);
-
- iova += SZ_1G;
- }
-
- /*
- * Map/unmap the last largest supported page of the IAS, this can
- * trigger corner cases in the concatednated page tables.
- */
- mapped = 0;
- size = 1UL << __fls(cfg->pgsize_bitmap);
- iova = (1UL << cfg->ias) - size;
- if (ops->map_pages(ops, iova, iova, size, 1,
- IOMMU_READ | IOMMU_WRITE |
- IOMMU_NOEXEC | IOMMU_CACHE,
- GFP_KERNEL, &mapped))
- return __FAIL(ops, i);
- if (mapped != size)
- return __FAIL(ops, i);
- if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
- return __FAIL(ops, i);
-
- free_io_pgtable_ops(ops);
- }
-
- return 0;
-}
-
-static int __init arm_lpae_do_selftests(void)
-{
- static const unsigned long pgsize[] __initconst = {
- SZ_4K | SZ_2M | SZ_1G,
- SZ_16K | SZ_32M,
- SZ_64K | SZ_512M,
- };
-
- static const unsigned int address_size[] __initconst = {
- 32, 36, 40, 42, 44, 48,
- };
-
- int i, j, k, pass = 0, fail = 0;
- struct faux_device *dev;
- struct io_pgtable_cfg cfg = {
- .tlb = &dummy_tlb_ops,
- .coherent_walk = true,
- .quirks = IO_PGTABLE_QUIRK_NO_WARN,
- };
-
- dev = faux_device_create("io-pgtable-test", NULL, 0);
- if (!dev)
- return -ENOMEM;
-
- cfg.iommu_dev = &dev->dev;
-
- for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
- for (j = 0; j < ARRAY_SIZE(address_size); ++j) {
- /* Don't use ias > oas as it is not valid for stage-2. */
- for (k = 0; k <= j; ++k) {
- cfg.pgsize_bitmap = pgsize[i];
- cfg.ias = address_size[k];
- cfg.oas = address_size[j];
- pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u OAS %u\n",
- pgsize[i], cfg.ias, cfg.oas);
- if (arm_lpae_run_tests(&cfg))
- fail++;
- else
- pass++;
- }
- }
- }
-
- pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
- faux_device_destroy(dev);
-
- return fail ? -EFAULT : 0;
-}
-subsys_initcall(arm_lpae_do_selftests);
-#endif
diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
index ba7cfdf7afa0..a06a23543cff 100644
--- a/drivers/iommu/io-pgtable-arm.h
+++ b/drivers/iommu/io-pgtable-arm.h
@@ -2,6 +2,8 @@
#ifndef IO_PGTABLE_ARM_H_
#define IO_PGTABLE_ARM_H_
+#include <linux/io-pgtable.h>
+
#define ARM_LPAE_TCR_TG0_4K 0
#define ARM_LPAE_TCR_TG0_64K 1
#define ARM_LPAE_TCR_TG0_16K 2
@@ -27,4 +29,43 @@
#define ARM_LPAE_TCR_PS_48_BIT 0x5ULL
#define ARM_LPAE_TCR_PS_52_BIT 0x6ULL
+/* Struct accessors */
+#define io_pgtable_to_data(x) \
+ container_of((x), struct arm_lpae_io_pgtable, iop)
+
+#define io_pgtable_ops_to_data(x) \
+ io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
+
+struct arm_lpae_io_pgtable {
+ struct io_pgtable iop;
+
+ int pgd_bits;
+ int start_level;
+ int bits_per_level;
+
+ void *pgd;
+};
+
+#define ARM_LPAE_MAX_ADDR_BITS 52
+#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
+#define ARM_LPAE_MAX_LEVELS 4
+
+/*
+ * Calculate the right shift amount to get to the portion describing level l
+ * in a virtual address mapped by the pagetable in d.
+ */
+#define ARM_LPAE_LVL_SHIFT(l,d) \
+ (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
+ ilog2(sizeof(arm_lpae_iopte)))
+
+#define ARM_LPAE_GRANULE(d) \
+ (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
+#define ARM_LPAE_PGD_SIZE(d) \
+ (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
+
+#define ARM_LPAE_PTES_PER_TABLE(d) \
+ (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
+
+typedef u64 arm_lpae_iopte;
+
#endif /* IO_PGTABLE_ARM_H_ */
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 05/28] iommu/io-pgtable-arm: Factor kernel specific code out
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (3 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp Mostafa Saleh
` (22 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Some of the currently used APIs are only part of the kernel and not
available in the hypervisor; factor those out of the common file:
- alloc/free memory
- CMOs
- virt/phys conversions
These are implemented by the kernel in io-pgtable-arm-kernel.c, and
similarly for the hypervisor later in this series.
The va/pa conversions are kept as macros, as sketched below.
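For illustration, the hypervisor side can then provide its counterparts
roughly like this (a sketch assuming the nvhe helpers below; the real hyp
implementation comes later in the series):

#ifdef __KVM_NVHE_HYPERVISOR__
#define __arm_lpae_virt_to_phys	hyp_virt_to_phys
#define __arm_lpae_phys_to_virt	hyp_phys_to_virt
#endif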
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/io-pgtable-arm-kernel.c | 89 ++++++++++++++++++++++++
drivers/iommu/io-pgtable-arm.c | 99 +++------------------------
drivers/iommu/io-pgtable-arm.h | 14 ++++
3 files changed, 113 insertions(+), 89 deletions(-)
diff --git a/drivers/iommu/io-pgtable-arm-kernel.c b/drivers/iommu/io-pgtable-arm-kernel.c
index f3b869310964..d3056487b0f6 100644
--- a/drivers/iommu/io-pgtable-arm-kernel.c
+++ b/drivers/iommu/io-pgtable-arm-kernel.c
@@ -9,10 +9,99 @@
#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
#include <linux/device/faux.h>
+#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include "io-pgtable-arm.h"
+#include "iommu-pages.h"
+
+static dma_addr_t __arm_lpae_dma_addr(void *pages)
+{
+ return (dma_addr_t)virt_to_phys(pages);
+}
+
+void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
+ struct io_pgtable_cfg *cfg,
+ void *cookie)
+{
+ struct device *dev = cfg->iommu_dev;
+ size_t alloc_size;
+ dma_addr_t dma;
+ void *pages;
+
+ /*
+ * For very small starting-level translation tables the HW requires a
+ * minimum alignment of at least 64 to cover all cases.
+ */
+ alloc_size = max(size, 64);
+ if (cfg->alloc)
+ pages = cfg->alloc(cookie, alloc_size, gfp);
+ else
+ pages = iommu_alloc_pages_node_sz(dev_to_node(dev), gfp,
+ alloc_size);
+
+ if (!pages)
+ return NULL;
+
+ if (!cfg->coherent_walk) {
+ dma = dma_map_single(dev, pages, size, DMA_TO_DEVICE);
+ if (dma_mapping_error(dev, dma))
+ goto out_free;
+ /*
+ * We depend on the IOMMU being able to work with any physical
+ * address directly, so if the DMA layer suggests otherwise by
+ * translating or truncating them, that bodes very badly...
+ */
+ if (dma != virt_to_phys(pages))
+ goto out_unmap;
+ }
+
+ return pages;
+
+out_unmap:
+ dev_err(dev, "Cannot accommodate DMA translation for IOMMU page tables\n");
+ dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
+
+out_free:
+ if (cfg->free)
+ cfg->free(cookie, pages, size);
+ else
+ iommu_free_pages(pages);
+
+ return NULL;
+}
+
+void __arm_lpae_free_pages(void *pages, size_t size,
+ struct io_pgtable_cfg *cfg,
+ void *cookie)
+{
+ if (!cfg->coherent_walk)
+ dma_unmap_single(cfg->iommu_dev, __arm_lpae_dma_addr(pages),
+ size, DMA_TO_DEVICE);
+
+ if (cfg->free)
+ cfg->free(cookie, pages, size);
+ else
+ iommu_free_pages(pages);
+}
+
+void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
+ struct io_pgtable_cfg *cfg)
+{
+ dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
+ sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
+}
+
+void *__arm_lpae_alloc_data(size_t size, gfp_t gfp)
+{
+ return kmalloc(size, gfp);
+}
+
+void __arm_lpae_free_data(void *p)
+{
+ return kfree(p);
+}
#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 791a2c4ecb83..2ca09081c3b0 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -12,12 +12,10 @@
#include <linux/io-pgtable.h>
#include <linux/sizes.h>
#include <linux/types.h>
-#include <linux/dma-mapping.h>
#include <asm/barrier.h>
#include "io-pgtable-arm.h"
-#include "iommu-pages.h"
/*
* Calculate the index at level l used to map virtual address a using the
@@ -118,7 +116,7 @@
#define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL
/* IOPTE accessors */
-#define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
+#define iopte_deref(pte,d) __arm_lpae_phys_to_virt(iopte_to_paddr(pte, d))
#define iopte_type(pte) \
(((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
@@ -208,83 +206,6 @@ static inline bool arm_lpae_concat_mandatory(struct io_pgtable_cfg *cfg,
(data->start_level == 1) && (oas == 40);
}
-static dma_addr_t __arm_lpae_dma_addr(void *pages)
-{
- return (dma_addr_t)virt_to_phys(pages);
-}
-
-static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
- struct io_pgtable_cfg *cfg,
- void *cookie)
-{
- struct device *dev = cfg->iommu_dev;
- size_t alloc_size;
- dma_addr_t dma;
- void *pages;
-
- /*
- * For very small starting-level translation tables the HW requires a
- * minimum alignment of at least 64 to cover all cases.
- */
- alloc_size = max(size, 64);
- if (cfg->alloc)
- pages = cfg->alloc(cookie, alloc_size, gfp);
- else
- pages = iommu_alloc_pages_node_sz(dev_to_node(dev), gfp,
- alloc_size);
-
- if (!pages)
- return NULL;
-
- if (!cfg->coherent_walk) {
- dma = dma_map_single(dev, pages, size, DMA_TO_DEVICE);
- if (dma_mapping_error(dev, dma))
- goto out_free;
- /*
- * We depend on the IOMMU being able to work with any physical
- * address directly, so if the DMA layer suggests otherwise by
- * translating or truncating them, that bodes very badly...
- */
- if (dma != virt_to_phys(pages))
- goto out_unmap;
- }
-
- return pages;
-
-out_unmap:
- dev_err(dev, "Cannot accommodate DMA translation for IOMMU page tables\n");
- dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
-
-out_free:
- if (cfg->free)
- cfg->free(cookie, pages, size);
- else
- iommu_free_pages(pages);
-
- return NULL;
-}
-
-static void __arm_lpae_free_pages(void *pages, size_t size,
- struct io_pgtable_cfg *cfg,
- void *cookie)
-{
- if (!cfg->coherent_walk)
- dma_unmap_single(cfg->iommu_dev, __arm_lpae_dma_addr(pages),
- size, DMA_TO_DEVICE);
-
- if (cfg->free)
- cfg->free(cookie, pages, size);
- else
- iommu_free_pages(pages);
-}
-
-static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
- struct io_pgtable_cfg *cfg)
-{
- dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
- sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
-}
-
static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
{
for (int i = 0; i < num_entries; i++)
@@ -360,7 +281,7 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
arm_lpae_iopte old, new;
struct io_pgtable_cfg *cfg = &data->iop.cfg;
- new = paddr_to_iopte(__pa(table), data) | ARM_LPAE_PTE_TYPE_TABLE;
+ new = paddr_to_iopte(__arm_lpae_virt_to_phys(table), data) | ARM_LPAE_PTE_TYPE_TABLE;
if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
new |= ARM_LPAE_PTE_NSTABLE;
@@ -581,7 +502,7 @@ static void arm_lpae_free_pgtable(struct io_pgtable *iop)
struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
__arm_lpae_free_pgtable(data, data->start_level, data->pgd);
- kfree(data);
+ __arm_lpae_free_data(data);
}
static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -895,7 +816,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
return NULL;
- data = kmalloc(sizeof(*data), GFP_KERNEL);
+ data = __arm_lpae_alloc_data(sizeof(*data), GFP_KERNEL);
if (!data)
return NULL;
@@ -1018,11 +939,11 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
wmb();
/* TTBR */
- cfg->arm_lpae_s1_cfg.ttbr = virt_to_phys(data->pgd);
+ cfg->arm_lpae_s1_cfg.ttbr = __arm_lpae_virt_to_phys(data->pgd);
return &data->iop;
out_free_data:
- kfree(data);
+ __arm_lpae_free_data(data);
return NULL;
}
@@ -1114,11 +1035,11 @@ arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
wmb();
/* VTTBR */
- cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
+ cfg->arm_lpae_s2_cfg.vttbr = __arm_lpae_virt_to_phys(data->pgd);
return &data->iop;
out_free_data:
- kfree(data);
+ __arm_lpae_free_data(data);
return NULL;
}
@@ -1188,7 +1109,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
/* Ensure the empty pgd is visible before TRANSTAB can be written */
wmb();
- cfg->arm_mali_lpae_cfg.transtab = virt_to_phys(data->pgd) |
+ cfg->arm_mali_lpae_cfg.transtab = __arm_lpae_virt_to_phys(data->pgd) |
ARM_MALI_LPAE_TTBR_READ_INNER |
ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
if (cfg->coherent_walk)
@@ -1197,7 +1118,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
return &data->iop;
out_free_data:
- kfree(data);
+ __arm_lpae_free_data(data);
return NULL;
}
diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
index a06a23543cff..7d9f0b759275 100644
--- a/drivers/iommu/io-pgtable-arm.h
+++ b/drivers/iommu/io-pgtable-arm.h
@@ -68,4 +68,18 @@ struct arm_lpae_io_pgtable {
typedef u64 arm_lpae_iopte;
+void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
+ struct io_pgtable_cfg *cfg);
+void __arm_lpae_free_pages(void *pages, size_t size,
+ struct io_pgtable_cfg *cfg,
+ void *cookie);
+void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
+ struct io_pgtable_cfg *cfg,
+ void *cookie);
+void *__arm_lpae_alloc_data(size_t size, gfp_t gfp);
+void __arm_lpae_free_data(void *p);
+#ifndef __KVM_NVHE_HYPERVISOR__
+#define __arm_lpae_virt_to_phys __pa
+#define __arm_lpae_phys_to_virt __va
+#endif /* !__KVM_NVHE_HYPERVISOR__ */
#endif /* IO_PGTABLE_ARM_H_ */
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (4 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 05/28] iommu/io-pgtable-arm: Factor kernel specific code out Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 14:23 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 07/28] iommu/arm-smmu-v3: Move TLB range invalidation into a macro Mostafa Saleh
` (21 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
The KVM SMMUv3 driver will re-use some of the cmdq code inside
the hypervisor. Move these functions to a new common C file that
is shared between the host kernel and the hypervisor.
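For illustration, either side can now build a command with the shared helper
(a usage sketch; `sid` is whatever StreamID is being invalidated):

	struct arm_smmu_cmdq_ent ent = {
		.opcode		= CMDQ_OP_CFGI_STE,
		.cfgi.sid	= sid,
		.cfgi.leaf	= true,
	};
	u64 cmd[CMDQ_ENT_DWORDS];

	WARN_ON(arm_smmu_cmdq_build_cmd(cmd, &ent));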
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/Makefile | 2 +-
.../arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c | 114 ++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 146 ------------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 44 ++++++
4 files changed, 159 insertions(+), 147 deletions(-)
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 493a659cc66b..1918b4a64cb0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
-arm_smmu_v3-y := arm-smmu-v3.o
+arm_smmu_v3-y := arm-smmu-v3.o arm-smmu-v3-common-hyp.o
arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_IOMMUFD) += arm-smmu-v3-iommufd.o
arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
new file mode 100644
index 000000000000..62744c8548a8
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2015 ARM Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ * Arm SMMUv3 driver functions shared with hypervisor.
+ */
+
+#include "arm-smmu-v3.h"
+#include <asm-generic/errno-base.h>
+
+#include <linux/string.h>
+
+int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
+{
+ memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
+ cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
+
+ switch (ent->opcode) {
+ case CMDQ_OP_TLBI_EL2_ALL:
+ case CMDQ_OP_TLBI_NSNH_ALL:
+ break;
+ case CMDQ_OP_PREFETCH_CFG:
+ cmd[0] |= FIELD_PREP(CMDQ_PREFETCH_0_SID, ent->prefetch.sid);
+ break;
+ case CMDQ_OP_CFGI_CD:
+ cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SSID, ent->cfgi.ssid);
+ fallthrough;
+ case CMDQ_OP_CFGI_STE:
+ cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
+ cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_LEAF, ent->cfgi.leaf);
+ break;
+ case CMDQ_OP_CFGI_CD_ALL:
+ cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
+ break;
+ case CMDQ_OP_CFGI_ALL:
+ /* Cover the entire SID range */
+ cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
+ break;
+ case CMDQ_OP_TLBI_NH_VA:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+ fallthrough;
+ case CMDQ_OP_TLBI_EL2_VA:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
+ cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
+ break;
+ case CMDQ_OP_TLBI_S2_IPA:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
+ cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
+ break;
+ case CMDQ_OP_TLBI_NH_ASID:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
+ fallthrough;
+ case CMDQ_OP_TLBI_NH_ALL:
+ case CMDQ_OP_TLBI_S12_VMALL:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+ break;
+ case CMDQ_OP_TLBI_EL2_ASID:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
+ break;
+ case CMDQ_OP_ATC_INV:
+ cmd[0] |= FIELD_PREP(CMDQ_0_SSV, ent->substream_valid);
+ cmd[0] |= FIELD_PREP(CMDQ_ATC_0_GLOBAL, ent->atc.global);
+ cmd[0] |= FIELD_PREP(CMDQ_ATC_0_SSID, ent->atc.ssid);
+ cmd[0] |= FIELD_PREP(CMDQ_ATC_0_SID, ent->atc.sid);
+ cmd[1] |= FIELD_PREP(CMDQ_ATC_1_SIZE, ent->atc.size);
+ cmd[1] |= ent->atc.addr & CMDQ_ATC_1_ADDR_MASK;
+ break;
+ case CMDQ_OP_PRI_RESP:
+ cmd[0] |= FIELD_PREP(CMDQ_0_SSV, ent->substream_valid);
+ cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SSID, ent->pri.ssid);
+ cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SID, ent->pri.sid);
+ cmd[1] |= FIELD_PREP(CMDQ_PRI_1_GRPID, ent->pri.grpid);
+ switch (ent->pri.resp) {
+ case PRI_RESP_DENY:
+ case PRI_RESP_FAIL:
+ case PRI_RESP_SUCC:
+ break;
+ default:
+ return -EINVAL;
+ }
+ cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
+ break;
+ case CMDQ_OP_RESUME:
+ cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_SID, ent->resume.sid);
+ cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_RESP, ent->resume.resp);
+ cmd[1] |= FIELD_PREP(CMDQ_RESUME_1_STAG, ent->resume.stag);
+ break;
+ case CMDQ_OP_CMD_SYNC:
+ if (ent->sync.msiaddr) {
+ cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ);
+ cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
+ } else {
+ cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
+ }
+ cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
+ cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
+ break;
+ default:
+ return -ENOENT;
+ }
+
+ return 0;
+}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 10cc6dc26b7b..1f765b4e36fa 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -138,18 +138,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
return space >= n;
}
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
-static bool queue_empty(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
-}
-
static bool queue_consumed(struct arm_smmu_ll_queue *q, u32 prod)
{
return ((Q_WRP(q, q->cons) == Q_WRP(q, prod)) &&
@@ -168,12 +156,6 @@ static void queue_sync_cons_out(struct arm_smmu_queue *q)
writel_relaxed(q->llq.cons, q->cons_reg);
}
-static void queue_inc_cons(struct arm_smmu_ll_queue *q)
-{
- u32 cons = (Q_WRP(q, q->cons) | Q_IDX(q, q->cons)) + 1;
- q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
-}
-
static void queue_sync_cons_ovf(struct arm_smmu_queue *q)
{
struct arm_smmu_ll_queue *llq = &q->llq;
@@ -205,12 +187,6 @@ static int queue_sync_prod_in(struct arm_smmu_queue *q)
return ret;
}
-static u32 queue_inc_prod_n(struct arm_smmu_ll_queue *q, int n)
-{
- u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + n;
- return Q_OVF(q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
-}
-
static void queue_poll_init(struct arm_smmu_device *smmu,
struct arm_smmu_queue_poll *qp)
{
@@ -238,14 +214,6 @@ static int queue_poll(struct arm_smmu_queue_poll *qp)
return 0;
}
-static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
-{
- int i;
-
- for (i = 0; i < n_dwords; ++i)
- *dst++ = cpu_to_le64(*src++);
-}
-
static void queue_read(u64 *dst, __le64 *src, size_t n_dwords)
{
int i;
@@ -266,108 +234,6 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
}
/* High-level queue accessors */
-static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
-{
- memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
- cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
-
- switch (ent->opcode) {
- case CMDQ_OP_TLBI_EL2_ALL:
- case CMDQ_OP_TLBI_NSNH_ALL:
- break;
- case CMDQ_OP_PREFETCH_CFG:
- cmd[0] |= FIELD_PREP(CMDQ_PREFETCH_0_SID, ent->prefetch.sid);
- break;
- case CMDQ_OP_CFGI_CD:
- cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SSID, ent->cfgi.ssid);
- fallthrough;
- case CMDQ_OP_CFGI_STE:
- cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
- cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_LEAF, ent->cfgi.leaf);
- break;
- case CMDQ_OP_CFGI_CD_ALL:
- cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
- break;
- case CMDQ_OP_CFGI_ALL:
- /* Cover the entire SID range */
- cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
- break;
- case CMDQ_OP_TLBI_NH_VA:
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
- fallthrough;
- case CMDQ_OP_TLBI_EL2_VA:
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
- cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
- cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
- cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
- cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
- break;
- case CMDQ_OP_TLBI_S2_IPA:
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
- cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
- cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
- cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
- cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
- break;
- case CMDQ_OP_TLBI_NH_ASID:
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
- fallthrough;
- case CMDQ_OP_TLBI_NH_ALL:
- case CMDQ_OP_TLBI_S12_VMALL:
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
- break;
- case CMDQ_OP_TLBI_EL2_ASID:
- cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
- break;
- case CMDQ_OP_ATC_INV:
- cmd[0] |= FIELD_PREP(CMDQ_0_SSV, ent->substream_valid);
- cmd[0] |= FIELD_PREP(CMDQ_ATC_0_GLOBAL, ent->atc.global);
- cmd[0] |= FIELD_PREP(CMDQ_ATC_0_SSID, ent->atc.ssid);
- cmd[0] |= FIELD_PREP(CMDQ_ATC_0_SID, ent->atc.sid);
- cmd[1] |= FIELD_PREP(CMDQ_ATC_1_SIZE, ent->atc.size);
- cmd[1] |= ent->atc.addr & CMDQ_ATC_1_ADDR_MASK;
- break;
- case CMDQ_OP_PRI_RESP:
- cmd[0] |= FIELD_PREP(CMDQ_0_SSV, ent->substream_valid);
- cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SSID, ent->pri.ssid);
- cmd[0] |= FIELD_PREP(CMDQ_PRI_0_SID, ent->pri.sid);
- cmd[1] |= FIELD_PREP(CMDQ_PRI_1_GRPID, ent->pri.grpid);
- switch (ent->pri.resp) {
- case PRI_RESP_DENY:
- case PRI_RESP_FAIL:
- case PRI_RESP_SUCC:
- break;
- default:
- return -EINVAL;
- }
- cmd[1] |= FIELD_PREP(CMDQ_PRI_1_RESP, ent->pri.resp);
- break;
- case CMDQ_OP_RESUME:
- cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_SID, ent->resume.sid);
- cmd[0] |= FIELD_PREP(CMDQ_RESUME_0_RESP, ent->resume.resp);
- cmd[1] |= FIELD_PREP(CMDQ_RESUME_1_STAG, ent->resume.stag);
- break;
- case CMDQ_OP_CMD_SYNC:
- if (ent->sync.msiaddr) {
- cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ);
- cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
- } else {
- cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
- }
- cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
- cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
- break;
- default:
- return -ENOENT;
- }
-
- return 0;
-}
-
static struct arm_smmu_cmdq *arm_smmu_get_cmdq(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_ent *ent)
{
@@ -1508,18 +1374,6 @@ static void arm_smmu_free_cd_tables(struct arm_smmu_master *master)
}
/* Stream table manipulation functions */
-static void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
- dma_addr_t l2ptr_dma)
-{
- u64 val = 0;
-
- val |= FIELD_PREP(STRTAB_L1_DESC_SPAN, STRTAB_SPLIT + 1);
- val |= l2ptr_dma & STRTAB_L1_DESC_L2PTR_MASK;
-
- /* The HW has 64 bit atomicity with stores to the L2 STE table */
- WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
-}
-
struct arm_smmu_ste_writer {
struct arm_smmu_entry_writer writer;
u32 sid;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ea41d790463e..2698438cd35c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -997,6 +997,50 @@ void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq *cmdq, u64 *cmds, int n,
bool sync);
+int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent);
+
+/* Queue functions shared between kernel and hyp. */
+static inline bool queue_full(struct arm_smmu_ll_queue *q)
+{
+ return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
+ Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
+}
+static inline bool queue_empty(struct arm_smmu_ll_queue *q)
+{
+ return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
+ Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
+}
+static inline u32 queue_inc_prod_n(struct arm_smmu_ll_queue *q, int n)
+{
+ u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + n;
+ return Q_OVF(q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
+}
+
+static inline void queue_inc_cons(struct arm_smmu_ll_queue *q)
+{
+ u32 cons = (Q_WRP(q, q->cons) | Q_IDX(q, q->cons)) + 1;
+ q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
+}
+
+static inline void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
+{
+ int i;
+ for (i = 0; i < n_dwords; ++i)
+ *dst++ = cpu_to_le64(*src++);
+}
+
+static inline void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
+ dma_addr_t l2ptr_dma)
+{
+ u64 val = 0;
+
+ val |= FIELD_PREP(STRTAB_L1_DESC_SPAN, STRTAB_SPLIT + 1);
+ val |= l2ptr_dma & STRTAB_L1_DESC_L2PTR_MASK;
+
+ /* The HW has 64 bit atomicity with stores to the L2 STE table */
+ WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
+}
#ifdef CONFIG_ARM_SMMU_V3_SVA
bool arm_smmu_sva_supported(struct arm_smmu_device *smmu);
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 07/28] iommu/arm-smmu-v3: Move TLB range invalidation into a macro
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (5 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 14:25 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 08/28] iommu/arm-smmu-v3: Move IDR parsing to common functions Mostafa Saleh
` (20 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Range TLB invalidation has a very specific algorithm; instead of
re-writing it for the hypervisor, put it in a macro so it can be
re-used from both places.
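For illustration, a hypervisor-side caller could expand the macro with a
direct-submit callback instead of the kernel's command batching. In the
sketch below, smmu_add_cmd() and the pgsize_bitmap field of struct
hyp_arm_smmu_v3_device are assumptions, not part of this patch:

  /* Sketch only: submit each command directly, so no batch (cmds == NULL) */
  static void smmu_tlb_inv_range_s2(struct hyp_arm_smmu_v3_device *smmu,
                                    u16 vmid, unsigned long iova,
                                    size_t size, size_t granule)
  {
          struct arm_smmu_cmdq_ent cmd = {
                  .opcode = CMDQ_OP_TLBI_S2_IPA,
                  .tlbi = {
                          .vmid = vmid,
                          .leaf = false,
                  },
          };

          /* add_cmd expands to smmu_add_cmd(smmu, NULL, &cmd) per chunk */
          arm_smmu_tlb_inv_build(&cmd, iova, size, granule,
                                 smmu->pgsize_bitmap, smmu,
                                 smmu_add_cmd, NULL);
  }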
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 59 +------------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 64 +++++++++++++++++++++
2 files changed, 67 insertions(+), 56 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1f765b4e36fa..41820a9180f4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2126,68 +2126,15 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
struct arm_smmu_domain *smmu_domain)
{
struct arm_smmu_device *smmu = smmu_domain->smmu;
- unsigned long end = iova + size, num_pages = 0, tg = 0;
- size_t inv_range = granule;
struct arm_smmu_cmdq_batch cmds;
if (!size)
return;
- if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
- /* Get the leaf page size */
- tg = __ffs(smmu_domain->domain.pgsize_bitmap);
-
- num_pages = size >> tg;
-
- /* Convert page size of 12,14,16 (log2) to 1,2,3 */
- cmd->tlbi.tg = (tg - 10) / 2;
-
- /*
- * Determine what level the granule is at. For non-leaf, both
- * io-pgtable and SVA pass a nominal last-level granule because
- * they don't know what level(s) actually apply, so ignore that
- * and leave TTL=0. However for various errata reasons we still
- * want to use a range command, so avoid the SVA corner case
- * where both scale and num could be 0 as well.
- */
- if (cmd->tlbi.leaf)
- cmd->tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
- else if ((num_pages & CMDQ_TLBI_RANGE_NUM_MAX) == 1)
- num_pages++;
- }
-
arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
-
- while (iova < end) {
- if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
- /*
- * On each iteration of the loop, the range is 5 bits
- * worth of the aligned size remaining.
- * The range in pages is:
- *
- * range = (num_pages & (0x1f << __ffs(num_pages)))
- */
- unsigned long scale, num;
-
- /* Determine the power of 2 multiple number of pages */
- scale = __ffs(num_pages);
- cmd->tlbi.scale = scale;
-
- /* Determine how many chunks of 2^scale size we have */
- num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX;
- cmd->tlbi.num = num - 1;
-
- /* range is num * 2^scale * pgsize */
- inv_range = num << (scale + tg);
-
- /* Clear out the lower order bits for the next iteration */
- num_pages -= num << scale;
- }
-
- cmd->tlbi.addr = iova;
- arm_smmu_cmdq_batch_add(smmu, &cmds, cmd);
- iova += inv_range;
- }
+ arm_smmu_tlb_inv_build(cmd, iova, size, granule,
+ smmu_domain->domain.pgsize_bitmap,
+ smmu, arm_smmu_cmdq_batch_add, &cmds);
arm_smmu_cmdq_batch_submit(smmu, &cmds);
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 2698438cd35c..a222fb7ef2ec 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1042,6 +1042,70 @@ static inline void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
}
+/**
+ * arm_smmu_tlb_inv_build - Create a range invalidation command
+ * @cmd: Base command initialized with OPCODE (S1, S2..), vmid and asid.
+ * @iova: Start IOVA to invalidate
+ * @size: Size of range
+ * @granule: Granule of invalidation
+ * @pgsize_bitmap: Page size bitmap of the page table.
+ * @smmu: Struct for the SMMU; must have ::features
+ * @add_cmd: Function to send/batch the invalidation command
+ * @cmds: In case of batching, the pointer to the batch
+ */
+#define arm_smmu_tlb_inv_build(cmd, iova, size, granule, pgsize_bitmap, smmu, add_cmd, cmds) \
+{ \
+ unsigned long _iova = (iova); \
+ size_t _size = (size); \
+ size_t _granule = (granule); \
+ unsigned long end = _iova + _size, num_pages = 0, tg = 0; \
+ size_t inv_range = _granule; \
+ if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { \
+ /* Get the leaf page size */ \
+ tg = __ffs(pgsize_bitmap); \
+ num_pages = _size >> tg; \
+ /* Convert page size of 12,14,16 (log2) to 1,2,3 */ \
+ cmd->tlbi.tg = (tg - 10) / 2; \
+ /*
+ * Determine what level the granule is at. For non-leaf, both
+ * io-pgtable and SVA pass a nominal last-level granule because
+ * they don't know what level(s) actually apply, so ignore that
+ * and leave TTL=0. However for various errata reasons we still
+ * want to use a range command, so avoid the SVA corner case
+ * where both scale and num could be 0 as well.
+ */ \
+ if (cmd->tlbi.leaf) \
+ cmd->tlbi.ttl = 4 - ((ilog2(_granule) - 3) / (tg - 3)); \
+ else if ((num_pages & CMDQ_TLBI_RANGE_NUM_MAX) == 1) \
+ num_pages++; \
+ } \
+ while (_iova < end) { \
+ if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { \
+ /*
+ * On each iteration of the loop, the range is 5 bits
+ * worth of the aligned size remaining.
+ * The range in pages is:
+ *
+ * range = (num_pages & (0x1f << __ffs(num_pages)))
+ */ \
+ unsigned long scale, num; \
+ /* Determine the power of 2 multiple number of pages */ \
+ scale = __ffs(num_pages); \
+ cmd->tlbi.scale = scale; \
+ /* Determine how many chunks of 2^scale size we have */ \
+ num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX; \
+ cmd->tlbi.num = num - 1; \
+ /* range is num * 2^scale * pgsize */ \
+ inv_range = num << (scale + tg); \
+ /* Clear out the lower order bits for the next iteration */ \
+ num_pages -= num << scale; \
+ } \
+ cmd->tlbi.addr = _iova; \
+ add_cmd(smmu, cmds, cmd); \
+ _iova += inv_range; \
+ } \
+} \
+
#ifdef CONFIG_ARM_SMMU_V3_SVA
bool arm_smmu_sva_supported(struct arm_smmu_device *smmu);
void arm_smmu_sva_notifier_synchronize(void);
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 08/28] iommu/arm-smmu-v3: Move IDR parsing to common functions
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (6 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 07/28] iommu/arm-smmu-v3: Move TLB range invalidation into a macro Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 09/28] KVM: arm64: iommu: Introduce IOMMU driver infrastructure Mostafa Saleh
` (19 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Move parsing of IDRs to functions so that it can be re-used
from the hypervisor.
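For illustration, a hypervisor-side probe can then derive everything from
the raw register values. In this sketch only the smmu_idr*() helpers come
from this patch; the hyp device struct, its ->base mapping and the OAS
fallback policy are assumptions:

  static void smmu_probe_features(struct hyp_arm_smmu_v3_device *smmu)
  {
          u32 idr0 = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
          u32 idr3 = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
          u32 idr5 = readl_relaxed(smmu->base + ARM_SMMU_IDR5);

          smmu->features |= smmu_idr0_features(idr0);
          smmu->features |= smmu_idr3_features(idr3);
          smmu->pgsize_bitmap = smmu_idr5_to_pgsize(idr5);
          smmu->oas = smmu_idr5_to_oas(idr5);
          if (!smmu->oas)
                  smmu->oas = 48; /* unknown OAS: truncate, as the kernel does */
  }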
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 112 +++-----------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 111 +++++++++++++++++++
2 files changed, 126 insertions(+), 97 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 41820a9180f4..10ca07c6dbe9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4112,57 +4112,17 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
/* IDR0 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
- /* 2-level structures */
- if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
- smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
-
- if (reg & IDR0_CD2L)
- smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
-
- /*
- * Translation table endianness.
- * We currently require the same endianness as the CPU, but this
- * could be changed later by adding a new IO_PGTABLE_QUIRK.
- */
- switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
- case IDR0_TTENDIAN_MIXED:
- smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
- break;
-#ifdef __BIG_ENDIAN
- case IDR0_TTENDIAN_BE:
- smmu->features |= ARM_SMMU_FEAT_TT_BE;
- break;
-#else
- case IDR0_TTENDIAN_LE:
- smmu->features |= ARM_SMMU_FEAT_TT_LE;
- break;
-#endif
- default:
+ smmu->features |= smmu_idr0_features(reg);
+ if (FIELD_GET(IDR0_TTENDIAN, reg) == IDR0_TTENDIAN_RESERVED) {
dev_err(smmu->dev, "unknown/unsupported TT endianness!\n");
return -ENXIO;
}
-
- /* Boolean feature flags */
- if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
- smmu->features |= ARM_SMMU_FEAT_PRI;
-
- if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
- smmu->features |= ARM_SMMU_FEAT_ATS;
-
- if (reg & IDR0_SEV)
- smmu->features |= ARM_SMMU_FEAT_SEV;
-
- if (reg & IDR0_MSI) {
- smmu->features |= ARM_SMMU_FEAT_MSI;
- if (coherent && !disable_msipolling)
- smmu->options |= ARM_SMMU_OPT_MSIPOLL;
- }
-
- if (reg & IDR0_HYP) {
- smmu->features |= ARM_SMMU_FEAT_HYP;
- if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
- smmu->features |= ARM_SMMU_FEAT_E2H;
- }
+ if (coherent && !disable_msipolling &&
+ smmu->features & ARM_SMMU_FEAT_MSI)
+ smmu->options |= ARM_SMMU_OPT_MSIPOLL;
+ if (smmu->features & ARM_SMMU_FEAT_HYP &&
+ cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
+ smmu->features |= ARM_SMMU_FEAT_E2H;
arm_smmu_get_httu(smmu, reg);
@@ -4174,21 +4134,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\n",
str_true_false(coherent));
- switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
- case IDR0_STALL_MODEL_FORCE:
- smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
- fallthrough;
- case IDR0_STALL_MODEL_STALL:
- smmu->features |= ARM_SMMU_FEAT_STALLS;
- }
-
- if (reg & IDR0_S1P)
- smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
-
- if (reg & IDR0_S2P)
- smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
-
- if (!(reg & (IDR0_S1P | IDR0_S2P))) {
+ if (!(smmu->features & (ARM_SMMU_FEAT_TRANS_S1 | ARM_SMMU_FEAT_TRANS_S2))) {
dev_err(smmu->dev, "no translation support!\n");
return -ENXIO;
}
@@ -4253,10 +4199,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
/* IDR3 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
- if (FIELD_GET(IDR3_RIL, reg))
- smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
- if (FIELD_GET(IDR3_FWB, reg))
- smmu->features |= ARM_SMMU_FEAT_S2FWB;
+ smmu->features |= smmu_idr3_features(reg);
/* IDR5 */
reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
@@ -4265,43 +4208,18 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
smmu->evtq.max_stalls = FIELD_GET(IDR5_STALL_MAX, reg);
/* Page sizes */
- if (reg & IDR5_GRAN64K)
- smmu->pgsize_bitmap |= SZ_64K | SZ_512M;
- if (reg & IDR5_GRAN16K)
- smmu->pgsize_bitmap |= SZ_16K | SZ_32M;
- if (reg & IDR5_GRAN4K)
- smmu->pgsize_bitmap |= SZ_4K | SZ_2M | SZ_1G;
+ smmu->pgsize_bitmap = smmu_idr5_to_pgsize(reg);
/* Input address size */
if (FIELD_GET(IDR5_VAX, reg) == IDR5_VAX_52_BIT)
smmu->features |= ARM_SMMU_FEAT_VAX;
- /* Output address size */
- switch (FIELD_GET(IDR5_OAS, reg)) {
- case IDR5_OAS_32_BIT:
- smmu->oas = 32;
- break;
- case IDR5_OAS_36_BIT:
- smmu->oas = 36;
- break;
- case IDR5_OAS_40_BIT:
- smmu->oas = 40;
- break;
- case IDR5_OAS_42_BIT:
- smmu->oas = 42;
- break;
- case IDR5_OAS_44_BIT:
- smmu->oas = 44;
- break;
- case IDR5_OAS_52_BIT:
- smmu->oas = 52;
+ smmu->oas = smmu_idr5_to_oas(reg);
+ if (smmu->oas == 52)
smmu->pgsize_bitmap |= 1ULL << 42; /* 4TB */
- break;
- default:
+ else if (!smmu->oas) {
dev_info(smmu->dev,
- "unknown output address size. Truncating to 48-bit\n");
- fallthrough;
- case IDR5_OAS_48_BIT:
+ "unknown output address size. Truncating to 48-bit\n");
smmu->oas = 48;
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index a222fb7ef2ec..8ffcc2e32474 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -26,6 +26,7 @@ struct arm_smmu_device;
#define IDR0_STALL_MODEL_FORCE 2
#define IDR0_TTENDIAN GENMASK(22, 21)
#define IDR0_TTENDIAN_MIXED 0
+#define IDR0_TTENDIAN_RESERVED 1
#define IDR0_TTENDIAN_LE 2
#define IDR0_TTENDIAN_BE 3
#define IDR0_CD2L (1 << 19)
@@ -1042,6 +1043,116 @@ static inline void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
}
+static inline u32 smmu_idr0_features(u32 reg)
+{
+ u32 features = 0;
+
+ /* 2-level structures */
+ if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
+ features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
+
+ if (reg & IDR0_CD2L)
+ features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
+
+ /*
+ * Translation table endianness.
+ * We currently require the same endianness as the CPU, but this
+ * could be changed later by adding a new IO_PGTABLE_QUIRK.
+ */
+ switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
+ case IDR0_TTENDIAN_MIXED:
+ features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
+ break;
+#ifdef __BIG_ENDIAN
+ case IDR0_TTENDIAN_BE:
+ features |= ARM_SMMU_FEAT_TT_BE;
+ break;
+#else
+ case IDR0_TTENDIAN_LE:
+ features |= ARM_SMMU_FEAT_TT_LE;
+ break;
+#endif
+ }
+
+ /* Boolean feature flags */
+ if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
+ features |= ARM_SMMU_FEAT_PRI;
+
+ if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
+ features |= ARM_SMMU_FEAT_ATS;
+
+ if (reg & IDR0_SEV)
+ features |= ARM_SMMU_FEAT_SEV;
+
+ if (reg & IDR0_MSI)
+ features |= ARM_SMMU_FEAT_MSI;
+
+ if (reg & IDR0_HYP)
+ features |= ARM_SMMU_FEAT_HYP;
+
+ switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
+ case IDR0_STALL_MODEL_FORCE:
+ features |= ARM_SMMU_FEAT_STALL_FORCE;
+ fallthrough;
+ case IDR0_STALL_MODEL_STALL:
+ features |= ARM_SMMU_FEAT_STALLS;
+ }
+
+ if (reg & IDR0_S1P)
+ features |= ARM_SMMU_FEAT_TRANS_S1;
+
+ if (reg & IDR0_S2P)
+ features |= ARM_SMMU_FEAT_TRANS_S2;
+
+ return features;
+}
+
+static inline u32 smmu_idr3_features(u32 reg)
+{
+ u32 features = 0;
+
+ if (FIELD_GET(IDR3_RIL, reg))
+ features |= ARM_SMMU_FEAT_RANGE_INV;
+ if (FIELD_GET(IDR3_FWB, reg))
+ features |= ARM_SMMU_FEAT_S2FWB;
+
+ return features;
+}
+
+static inline u32 smmu_idr5_to_oas(u32 reg)
+{
+ switch (FIELD_GET(IDR5_OAS, reg)) {
+ case IDR5_OAS_32_BIT:
+ return 32;
+ case IDR5_OAS_36_BIT:
+ return 36;
+ case IDR5_OAS_40_BIT:
+ return 40;
+ case IDR5_OAS_42_BIT:
+ return 42;
+ case IDR5_OAS_44_BIT:
+ return 44;
+ case IDR5_OAS_48_BIT:
+ return 48;
+ case IDR5_OAS_52_BIT:
+ return 52;
+ }
+ return 0;
+}
+
+static inline unsigned long smmu_idr5_to_pgsize(u32 reg)
+{
+ unsigned long pgsize_bitmap = 0;
+
+ if (reg & IDR5_GRAN64K)
+ pgsize_bitmap |= SZ_64K | SZ_512M;
+ if (reg & IDR5_GRAN16K)
+ pgsize_bitmap |= SZ_16K | SZ_32M;
+ if (reg & IDR5_GRAN4K)
+ pgsize_bitmap |= SZ_4K | SZ_2M | SZ_1G;
+ return pgsize_bitmap;
+}
+
/**
* arm_smmu_tlb_inv_build - Create a range invalidation command
* @cmd: Base command initialized with OPCODE (S1, S2..), vmid and asid.
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 09/28] KVM: arm64: iommu: Introduce IOMMU driver infrastructure
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (7 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 08/28] iommu/arm-smmu-v3: Move IDR parsing to common functions Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table Mostafa Saleh
` (18 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
To establish DMA isolation, KVM needs an IOMMU driver which provides
ops implemented at EL2.
Only one driver can be used; it is registered with
kvm_iommu_register_driver() by passing a pointer to the ops.
This must be called before module_init(), which is the point where KVM
initializes.
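For illustration, a driver registration then looks like the sketch below
(my_hyp_ops and its EL2 implementation are hypothetical; the
kern_hyp_va()/lm_alias() conversion matches how the SMMUv3 glue does it
later in this series):

  /* EL1 side: hand the EL2 ops to KVM before module_init() runs */
  extern struct kvm_iommu_ops kvm_nvhe_sym(my_hyp_ops);

  static int my_iommu_register(void)
  {
          if (!is_protected_kvm_enabled())
                  return 0;
          return kvm_iommu_register_driver(
                          kern_hyp_va(lm_alias(&kvm_nvhe_sym(my_hyp_ops))));
  }
  core_initcall(my_iommu_register);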
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/include/asm/kvm_host.h | 2 ++
arch/arm64/kvm/Makefile | 3 ++-
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 13 +++++++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 3 ++-
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 18 ++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/setup.c | 5 +++++
arch/arm64/kvm/iommu.c | 15 +++++++++++++++
7 files changed, 57 insertions(+), 2 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
create mode 100644 arch/arm64/kvm/iommu.c
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3e41a880b062..1a08066eaf7e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1674,5 +1674,7 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
void check_feature_map(void);
+struct kvm_iommu_ops;
+int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops);
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 7c329e01c557..5528704bfd72 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -23,7 +23,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-v3.o vgic/vgic-v4.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
- vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o
+ vgic/vgic-its.o vgic/vgic-debug.o vgic/vgic-v3-nested.o \
+ iommu.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
new file mode 100644
index 000000000000..1ac70cc28a9e
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARM64_KVM_NVHE_IOMMU_H__
+#define __ARM64_KVM_NVHE_IOMMU_H__
+
+#include <asm/kvm_host.h>
+
+struct kvm_iommu_ops {
+ int (*init)(void);
+};
+
+int kvm_iommu_init(void);
+
+#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index a76522d63c3e..393ff143f0be 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -24,7 +24,8 @@ CFLAGS_switch.nvhe.o += -Wno-override-init
hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \
- cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
+ cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o \
+ iommu/iommu.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
new file mode 100644
index 000000000000..a01c036c55be
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * IOMMU operations for pKVM
+ *
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+#include <nvhe/iommu.h>
+
+/* Only one set of ops supported */
+struct kvm_iommu_ops *kvm_iommu_ops;
+
+int kvm_iommu_init(void)
+{
+ if (!kvm_iommu_ops || !kvm_iommu_ops->init)
+ return -ENODEV;
+
+ return kvm_iommu_ops->init();
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index ee6435473204..bdbc77395e03 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -13,6 +13,7 @@
#include <nvhe/early_alloc.h>
#include <nvhe/ffa.h>
#include <nvhe/gfp.h>
+#include <nvhe/iommu.h>
#include <nvhe/memory.h>
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
@@ -320,6 +321,10 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;
+ ret = kvm_iommu_init();
+ if (ret)
+ goto out;
+
ret = hyp_ffa_init(ffa_proxy_pages);
if (ret)
goto out;
diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
new file mode 100644
index 000000000000..926a1a94698f
--- /dev/null
+++ b/arch/arm64/kvm/iommu.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ * Author: Mostafa Saleh <smostafa@google.com>
+ */
+
+#include <linux/kvm_host.h>
+
+extern struct kvm_iommu_ops *kvm_nvhe_sym(kvm_iommu_ops);
+
+int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops)
+{
+ kvm_nvhe_sym(kvm_iommu_ops) = hyp_ops;
+ return 0;
+}
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (8 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 09/28] KVM: arm64: iommu: Introduce IOMMU driver infrastructure Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 14:42 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 11/28] KVM: arm64: iommu: Add memory pool Mostafa Saleh
` (17 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Create a shadow page table for the IOMMU that mirrors the
host CPU stage-2 into the IOMMUs to establish DMA isolation.
An initial snapshot is created after the driver init; then,
on every permission change, a callback is invoked so that
the IOMMU driver can update its page table.
In some cases, an SMMUv3 may be able to directly share the
page table used by the host CPU stage-2.
However, that is too restrictive: it requires changes to the core
hypervisor page table code, plus it would require the hypervisor to
handle IOMMU page faults. This can be added later as an optimization
for SMMUv3.
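The driver side of the new callback then reduces to keeping an identity
mapping in sync, roughly as below (a sketch only: the my_iopt_*() helpers
are placeholders, the real SMMUv3 implementation arrives later in the
series):

  static void my_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
                                   int prot)
  {
          /* The mapping is an idmap: IOVA == PA for host DMA */
          if (prot)
                  my_iopt_map(start, start, end - start, prot);
          else
                  my_iopt_unmap(start, end - start);
  }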
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 4 ++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 83 ++++++++++++++++++++++++-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 5 ++
3 files changed, 90 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 1ac70cc28a9e..219363045b1c 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -3,11 +3,15 @@
#define __ARM64_KVM_NVHE_IOMMU_H__
#include <asm/kvm_host.h>
+#include <asm/kvm_pgtable.h>
struct kvm_iommu_ops {
int (*init)(void);
+ void (*host_stage2_idmap)(phys_addr_t start, phys_addr_t end, int prot);
};
int kvm_iommu_init(void);
+void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
+ enum kvm_pgtable_prot prot);
#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index a01c036c55be..f7d1c8feb358 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -4,15 +4,94 @@
*
* Copyright (C) 2022 Linaro Ltd.
*/
+#include <linux/iommu.h>
+
#include <nvhe/iommu.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/spinlock.h>
/* Only one set of ops supported */
struct kvm_iommu_ops *kvm_iommu_ops;
+/* Protected by host_mmu.lock */
+static bool kvm_idmap_initialized;
+
+static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
+{
+ int iommu_prot = 0;
+
+ if (prot & KVM_PGTABLE_PROT_R)
+ iommu_prot |= IOMMU_READ;
+ if (prot & KVM_PGTABLE_PROT_W)
+ iommu_prot |= IOMMU_WRITE;
+ if (prot == PKVM_HOST_MMIO_PROT)
+ iommu_prot |= IOMMU_MMIO;
+
+ /* We don't understand that, might be dangerous. */
+ WARN_ON(prot & ~PKVM_HOST_MEM_PROT);
+ return iommu_prot;
+}
+
+static int __snapshot_host_stage2(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ u64 start = ctx->addr;
+ kvm_pte_t pte = *ctx->ptep;
+ u32 level = ctx->level;
+ u64 end = start + kvm_granule_size(level);
+ int prot = IOMMU_READ | IOMMU_WRITE;
+
+ /* Keep unmapped. */
+ if (pte && !kvm_pte_valid(pte))
+ return 0;
+
+ if (kvm_pte_valid(pte))
+ prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
+ else if (!addr_is_memory(start))
+ prot |= IOMMU_MMIO;
+
+ kvm_iommu_ops->host_stage2_idmap(start, end, prot);
+ return 0;
+}
+
+static int kvm_iommu_snapshot_host_stage2(void)
+{
+ int ret;
+ struct kvm_pgtable_walker walker = {
+ .cb = __snapshot_host_stage2,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+ struct kvm_pgtable *pgt = &host_mmu.pgt;
+
+ hyp_spin_lock(&host_mmu.lock);
+ ret = kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker);
+ /* Start receiving calls to host_stage2_idmap. */
+	kvm_idmap_initialized = !ret;
+ hyp_spin_unlock(&host_mmu.lock);
+
+ return ret;
+}
+
int kvm_iommu_init(void)
{
- if (!kvm_iommu_ops || !kvm_iommu_ops->init)
+ int ret;
+
+ if (!kvm_iommu_ops || !kvm_iommu_ops->init ||
+ !kvm_iommu_ops->host_stage2_idmap)
return -ENODEV;
- return kvm_iommu_ops->init();
+ ret = kvm_iommu_ops->init();
+ if (ret)
+ return ret;
+ return kvm_iommu_snapshot_host_stage2();
+}
+
+void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
+ enum kvm_pgtable_prot prot)
+{
+ hyp_assert_lock_held(&host_mmu.lock);
+
+ if (!kvm_idmap_initialized)
+ return;
+ kvm_iommu_ops->host_stage2_idmap(start, end, pkvm_to_iommu_prot(prot));
}
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index c9a15ef6b18d..bce6690f29c0 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -15,6 +15,7 @@
#include <hyp/fault.h>
#include <nvhe/gfp.h>
+#include <nvhe/iommu.h>
#include <nvhe/memory.h>
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
@@ -529,6 +530,7 @@ static void __host_update_page_state(phys_addr_t addr, u64 size, enum pkvm_page_
int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
{
int ret;
+ enum kvm_pgtable_prot prot;
if (!range_is_memory(addr, addr + size))
return -EPERM;
@@ -538,6 +540,9 @@ int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
if (ret)
return ret;
+ prot = owner_id == PKVM_ID_HOST ? PKVM_HOST_MEM_PROT : 0;
+ kvm_iommu_host_stage2_idmap(addr, addr + size, prot);
+
/* Don't forget to update the vmemmap tracking for the host */
if (owner_id == PKVM_ID_HOST)
__host_update_page_state(addr, size, PKVM_PAGE_OWNED);
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 11/28] KVM: arm64: iommu: Add memory pool
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (9 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 12/28] KVM: arm64: iommu: Support DABT for IOMMU Mostafa Saleh
` (16 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
IOMMU drivers need to allocate memory for the shadow page
table. Similar to the host stage-2 CPU page table, this memory
is allocated early from the carveout and added to a pool from
which the IOMMU driver can allocate, and to which it can reclaim,
at run time.
At this point nr_pages is 0 as there is no driver; in the next
patches, when the SMMUv3 driver is added, it will add its own function
in kvm/iommu.c to return the number of pages it needs.
Unfortunately, this part has to leak into kvm/iommu, as it happens too
early for drivers to have run any initcalls.
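For illustration, the EL2 driver would then carve its table memory out of
this pool roughly as below (a sketch only: the helper names and the
zeroing policy are assumptions):

  static void *smmu_alloc_table_pages(u8 order)
  {
          void *p = kvm_iommu_donate_pages(order);

          if (p)
                  memset(p, 0, PAGE_SIZE << order); /* tables must start clean */
          return p;
  }

  static void smmu_free_table_pages(void *p)
  {
          kvm_iommu_reclaim_pages(p);
  }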
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_host.h | 1 +
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 5 ++++-
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 20 +++++++++++++++++++-
arch/arm64/kvm/hyp/nvhe/setup.c | 10 +++++++++-
arch/arm64/kvm/iommu.c | 11 +++++++++++
arch/arm64/kvm/pkvm.c | 1 +
6 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1a08066eaf7e..fcb4b26072f7 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1676,5 +1676,6 @@ void check_feature_map(void);
struct kvm_iommu_ops;
int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops);
+size_t kvm_iommu_pages(void);
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 219363045b1c..9f4906c6dcc9 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -10,8 +10,11 @@ struct kvm_iommu_ops {
void (*host_stage2_idmap)(phys_addr_t start, phys_addr_t end, int prot);
};
-int kvm_iommu_init(void);
+int kvm_iommu_init(void *pool_base, size_t nr_pages);
void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
enum kvm_pgtable_prot prot);
+void *kvm_iommu_donate_pages(u8 order);
+void kvm_iommu_reclaim_pages(void *ptr);
+
#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index f7d1c8feb358..1673165c7330 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -15,6 +15,7 @@ struct kvm_iommu_ops *kvm_iommu_ops;
/* Protected by host_mmu.lock */
static bool kvm_idmap_initialized;
+static struct hyp_pool iommu_pages_pool;
static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
{
@@ -72,7 +73,7 @@ static int kvm_iommu_snapshot_host_stage2(void)
return ret;
}
-int kvm_iommu_init(void)
+int kvm_iommu_init(void *pool_base, size_t nr_pages)
{
int ret;
@@ -80,6 +81,13 @@ int kvm_iommu_init(void)
!kvm_iommu_ops->host_stage2_idmap)
return -ENODEV;
+ if (nr_pages) {
+ ret = hyp_pool_init(&iommu_pages_pool, hyp_virt_to_pfn(pool_base),
+ nr_pages, 0);
+ if (ret)
+ return ret;
+ }
+
ret = kvm_iommu_ops->init();
if (ret)
return ret;
@@ -95,3 +103,13 @@ void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
return;
kvm_iommu_ops->host_stage2_idmap(start, end, pkvm_to_iommu_prot(prot));
}
+
+void *kvm_iommu_donate_pages(u8 order)
+{
+ return hyp_alloc_pages(&iommu_pages_pool, order);
+}
+
+void kvm_iommu_reclaim_pages(void *ptr)
+{
+ hyp_put_page(&iommu_pages_pool, ptr);
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index bdbc77395e03..09ecee2cd864 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -21,6 +21,7 @@
#include <nvhe/trap_handler.h>
unsigned long hyp_nr_cpus;
+size_t hyp_kvm_iommu_pages;
#define hyp_percpu_size ((unsigned long)__per_cpu_end - \
(unsigned long)__per_cpu_start)
@@ -33,6 +34,7 @@ static void *selftest_base;
static void *ffa_proxy_pages;
static struct kvm_pgtable_mm_ops pkvm_pgtable_mm_ops;
static struct hyp_pool hpool;
+static void *iommu_base;
static int divide_memory_pool(void *virt, unsigned long size)
{
@@ -70,6 +72,12 @@ static int divide_memory_pool(void *virt, unsigned long size)
if (!ffa_proxy_pages)
return -ENOMEM;
+ if (hyp_kvm_iommu_pages) {
+ iommu_base = hyp_early_alloc_contig(hyp_kvm_iommu_pages);
+ if (!iommu_base)
+ return -ENOMEM;
+ }
+
return 0;
}
@@ -321,7 +329,7 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;
- ret = kvm_iommu_init();
+ ret = kvm_iommu_init(iommu_base, hyp_kvm_iommu_pages);
if (ret)
goto out;
diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
index 926a1a94698f..5460b1bd44a6 100644
--- a/arch/arm64/kvm/iommu.c
+++ b/arch/arm64/kvm/iommu.c
@@ -7,9 +7,20 @@
#include <linux/kvm_host.h>
extern struct kvm_iommu_ops *kvm_nvhe_sym(kvm_iommu_ops);
+extern size_t kvm_nvhe_sym(hyp_kvm_iommu_pages);
int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops)
{
kvm_nvhe_sym(kvm_iommu_ops) = hyp_ops;
return 0;
}
+
+size_t kvm_iommu_pages(void)
+{
+ /*
+	 * This is called very early, during setup_arch(), before any
+	 * initcalls have run, so it has to call specific functions per
+	 * KVM driver.
+ */
+ kvm_nvhe_sym(hyp_kvm_iommu_pages) = 0;
+ return 0;
+}
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index fcd70bfe44fb..6098beda36fa 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -63,6 +63,7 @@ void __init kvm_hyp_reserve(void)
hyp_mem_pages += hyp_vmemmap_pages(STRUCT_HYP_PAGE_SIZE);
hyp_mem_pages += pkvm_selftest_pages();
hyp_mem_pages += hyp_ffa_proxy_pages();
+ hyp_mem_pages += kvm_iommu_pages();
/*
* Try to allocate a PMD-aligned region to reduce TLB pressure once
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 12/28] KVM: arm64: iommu: Support DABT for IOMMU
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (10 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 11/28] KVM: arm64: iommu: Add memory pool Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 13/28] iommu/arm-smmu-v3-kvm: Add SMMUv3 driver Mostafa Saleh
` (15 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
The SMMUv3 driver needs to trap and emulate accesses to the MMIO space
as part of the nested implementation.
Add a handler for DABTs so that IOMMU drivers are able to do so: when
the host faults on a page, check first whether the fault is part of the
IOMMU emulation.
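A driver-side handler then only has to claim its own MMIO range, roughly
as below (a sketch only: my_smmu and the emulation helper are placeholders
for what the SMMUv3 driver adds later in the series):

  static bool my_dabt_handler(struct user_pt_regs *regs, u64 esr, u64 addr)
  {
          if (addr < my_smmu.mmio_addr ||
              addr >= my_smmu.mmio_addr + my_smmu.mmio_size)
                  return false; /* not ours, fall back to stage-2 handling */

          /* Emulate the access at offset (addr - my_smmu.mmio_addr) */
          return my_emulate_mmio(&my_smmu, regs, esr,
                                 addr - my_smmu.mmio_addr);
  }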
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_arm.h | 2 ++
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 3 ++-
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 15 +++++++++++++++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 10 ++++++++++
4 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 1da290aeedce..8d63308ccd5c 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -331,6 +331,8 @@
/* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
#define HPFAR_MASK (~UL(0xf))
+
+#define FAR_MASK GENMASK_ULL(11, 0)
/*
* We have
* PAR [PA_Shift - 1 : 12] = PA [PA_Shift - 1 : 12]
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 9f4906c6dcc9..10fe4fbf7424 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -8,6 +8,7 @@
struct kvm_iommu_ops {
int (*init)(void);
void (*host_stage2_idmap)(phys_addr_t start, phys_addr_t end, int prot);
+ bool (*dabt_handler)(struct user_pt_regs *regs, u64 esr, u64 addr);
};
int kvm_iommu_init(void *pool_base, size_t nr_pages);
@@ -16,5 +17,5 @@ void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
enum kvm_pgtable_prot prot);
void *kvm_iommu_donate_pages(u8 order);
void kvm_iommu_reclaim_pages(void *ptr);
-
+bool kvm_iommu_host_dabt_handler(struct user_pt_regs *regs, u64 esr, u64 addr);
#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index 1673165c7330..376b30f557a2 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -4,6 +4,10 @@
*
* Copyright (C) 2022 Linaro Ltd.
*/
+#include <asm/kvm_hyp.h>
+
+#include <hyp/adjust_pc.h>
+
#include <linux/iommu.h>
#include <nvhe/iommu.h>
@@ -113,3 +117,14 @@ void kvm_iommu_reclaim_pages(void *ptr)
{
hyp_put_page(&iommu_pages_pool, ptr);
}
+
+bool kvm_iommu_host_dabt_handler(struct user_pt_regs *regs, u64 esr, u64 addr)
+{
+ if (kvm_iommu_ops && kvm_iommu_ops->dabt_handler &&
+ kvm_iommu_ops->dabt_handler(regs, esr, addr)) {
+ /* DABT handled by the driver, skip to next instruction. */
+ kvm_skip_host_instr();
+ return true;
+ }
+ return false;
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index bce6690f29c0..7371b2183e1e 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -595,6 +595,11 @@ static int host_stage2_idmap(u64 addr)
return ret;
}
+static bool is_dabt(u64 esr)
+{
+ return ESR_ELx_EC(esr) == ESR_ELx_EC_DABT_LOW;
+}
+
void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
{
struct kvm_vcpu_fault_info fault;
@@ -617,6 +622,11 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
*/
BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS));
addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12;
+ addr |= fault.far_el2 & FAR_MASK;
+
+ if (is_dabt(esr) && !addr_is_memory(addr) &&
+ kvm_iommu_host_dabt_handler(&host_ctxt->regs, esr, addr))
+ return;
ret = host_stage2_idmap(addr);
BUG_ON(ret && ret != -EAGAIN);
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 13/28] iommu/arm-smmu-v3-kvm: Add SMMUv3 driver
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (11 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 12/28] KVM: arm64: iommu: Support DABT for IOMMU Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver Mostafa Saleh
` (14 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Add the skeleton for an Arm SMMUv3 driver at EL2.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/Makefile | 5 ++++
drivers/iommu/arm/Kconfig | 9 ++++++
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 29 +++++++++++++++++++
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 16 ++++++++++
4 files changed, 59 insertions(+)
create mode 100644 drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 393ff143f0be..c71c96262378 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -31,6 +31,11 @@ hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
hyp-obj-y += $(lib-objs)
+HYP_SMMU_V3_DRV_PATH = ../../../../../drivers/iommu/arm/arm-smmu-v3
+
+hyp-obj-$(CONFIG_ARM_SMMU_V3_PKVM) += $(HYP_SMMU_V3_DRV_PATH)/pkvm/arm-smmu-v3.o \
+ $(HYP_SMMU_V3_DRV_PATH)/arm-smmu-v3-common-hyp.o
+
##
## Build rules for compiling nVHE hyp code
## Output of this folder is `kvm_nvhe.o`, a partially linked object
diff --git a/drivers/iommu/arm/Kconfig b/drivers/iommu/arm/Kconfig
index ef42bbe07dbe..7eeb94d2499d 100644
--- a/drivers/iommu/arm/Kconfig
+++ b/drivers/iommu/arm/Kconfig
@@ -142,3 +142,12 @@ config QCOM_IOMMU
select ARM_DMA_USE_IOMMU
help
Support for IOMMU on certain Qualcomm SoCs.
+
+config ARM_SMMU_V3_PKVM
+ bool "ARM SMMUv3 support for protected Virtual Machines"
+ depends on KVM && ARM64 && ARM_SMMU_V3=y
+ help
+ Enable a SMMUv3 driver in the KVM hypervisor, to protect VMs against
+ memory accesses from devices owned by the host.
+
+ Say Y here if you intend to enable KVM in protected mode.
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
new file mode 100644
index 000000000000..fa8b71152560
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pKVM hyp driver for the Arm SMMUv3
+ *
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+#include <asm/kvm_hyp.h>
+
+#include <nvhe/iommu.h>
+
+#include "arm_smmu_v3.h"
+
+size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
+struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
+
+static int smmu_init(void)
+{
+ return -ENOSYS;
+}
+
+static void smmu_host_stage2_idmap(phys_addr_t start, phys_addr_t end, int prot)
+{
+}
+
+/* Shared with the kernel driver in EL1 */
+struct kvm_iommu_ops smmu_ops = {
+ .init = smmu_init,
+ .host_stage2_idmap = smmu_host_stage2_idmap,
+};
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
new file mode 100644
index 000000000000..f6ad91d3fb85
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_ARM_SMMU_V3_H
+#define __KVM_ARM_SMMU_V3_H
+
+#include <asm/kvm_asm.h>
+
+struct hyp_arm_smmu_v3_device {
+};
+
+extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
+#define kvm_hyp_arm_smmu_v3_count kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count)
+
+extern struct hyp_arm_smmu_v3_device *kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_smmus);
+#define kvm_hyp_arm_smmu_v3_smmus kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_smmus)
+
+#endif /* __KVM_ARM_SMMU_V3_H */
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (12 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 13/28] iommu/arm-smmu-v3-kvm: Add SMMUv3 driver Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-12 13:52 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode Mostafa Saleh
` (13 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Add a file that is only compiled in KVM mode.
At the moment it registers the driver with KVM and adds the hook
needed for memory allocation.
Next, it will create the array of available SMMUs and their
descriptions.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_host.h | 4 +++
arch/arm64/kvm/iommu.c | 10 ++++--
drivers/iommu/arm/arm-smmu-v3/Makefile | 1 +
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 36 +++++++++++++++++++
4 files changed, 49 insertions(+), 2 deletions(-)
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index fcb4b26072f7..52212c0f2e9c 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1678,4 +1678,8 @@ struct kvm_iommu_ops;
int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops);
size_t kvm_iommu_pages(void);
+#ifdef CONFIG_ARM_SMMU_V3_PKVM
+size_t smmu_hyp_pgt_pages(void);
+#endif
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
index 5460b1bd44a6..0475f7c95c6c 100644
--- a/arch/arm64/kvm/iommu.c
+++ b/arch/arm64/kvm/iommu.c
@@ -17,10 +17,16 @@ int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops)
size_t kvm_iommu_pages(void)
{
+ size_t nr_pages = 0;
+
/*
* This is called very early, during setup_arch(), before any
* initcalls have run, so it has to call specific functions per
* KVM driver.
*/
- kvm_nvhe_sym(hyp_kvm_iommu_pages) = 0;
- return 0;
+#ifdef CONFIG_ARM_SMMU_V3_PKVM
+ nr_pages = smmu_hyp_pgt_pages();
+#endif
+
+ kvm_nvhe_sym(hyp_kvm_iommu_pages) = nr_pages;
+ return nr_pages;
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 1918b4a64cb0..284ad71c5282 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -4,5 +4,6 @@ arm_smmu_v3-y := arm-smmu-v3.o arm-smmu-v3-common-hyp.o
arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_IOMMUFD) += arm-smmu-v3-iommufd.o
arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
+arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_PKVM) += arm-smmu-v3-kvm.o
obj-$(CONFIG_ARM_SMMU_V3_KUNIT_TEST) += arm-smmu-v3-test.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
new file mode 100644
index 000000000000..ac4eac6d567f
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pKVM host driver for the Arm SMMUv3
+ *
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+#include <asm/kvm_mmu.h>
+#include <asm/kvm_pkvm.h>
+
+#include <linux/of_platform.h>
+
+#include "arm-smmu-v3.h"
+#include "pkvm/arm_smmu_v3.h"
+
+extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
+
+size_t smmu_hyp_pgt_pages(void)
+{
+ /*
+	 * SMMUv3 uses the same page-table format as stage-2 and hence has the
+	 * same memory requirements; we add an extra 500 pages for L2 STEs.
+ */
+ if (of_find_compatible_node(NULL, NULL, "arm,smmu-v3"))
+ return host_s2_pgtable_pages() + 500;
+ return 0;
+}
+
+static int kvm_arm_smmu_v3_register(void)
+{
+ if (!is_protected_kvm_enabled())
+ return 0;
+
+ return kvm_iommu_register_driver(kern_hyp_va(lm_alias(&kvm_nvhe_sym(smmu_ops))));
+};
+
+core_initcall(kvm_arm_smmu_v3_register);
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (13 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-12 13:54 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3 Mostafa Saleh
` (12 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
In KVM mode, the driver must be loaded after the hypervisor has
initialized.
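For context on the ordering: for built-in code, module_init() maps to a
plain device_initcall(), and the _sync variant of an initcall level runs
only after every plain call of that level has finished (abridged from
include/linux/module.h and include/linux/init.h):

  /* Built-in case: module_init() is just a level-6 (device) initcall */
  #define module_init(x)	__initcall(x);	/* == device_initcall(x) */

  /* Hence this runs after KVM's module_init() has set up the hypervisor: */
  device_initcall_sync(arm_smmu_driver_register);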
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 ++++++++++++++++-----
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 10ca07c6dbe9..a04730b5fe41 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4576,12 +4576,6 @@ static const struct of_device_id arm_smmu_of_match[] = {
};
MODULE_DEVICE_TABLE(of, arm_smmu_of_match);
-static void arm_smmu_driver_unregister(struct platform_driver *drv)
-{
- arm_smmu_sva_notifier_synchronize();
- platform_driver_unregister(drv);
-}
-
static struct platform_driver arm_smmu_driver = {
.driver = {
.name = "arm-smmu-v3",
@@ -4592,8 +4586,27 @@ static struct platform_driver arm_smmu_driver = {
.remove = arm_smmu_device_remove,
.shutdown = arm_smmu_device_shutdown,
};
+
+#ifndef CONFIG_ARM_SMMU_V3_PKVM
+static void arm_smmu_driver_unregister(struct platform_driver *drv)
+{
+ arm_smmu_sva_notifier_synchronize();
+ platform_driver_unregister(drv);
+}
+
module_driver(arm_smmu_driver, platform_driver_register,
arm_smmu_driver_unregister);
+#else
+/*
+ * Must be done after the hypervisor initializes at module_init()
+ * No need for unregister as this is a built in driver.
+ */
+static int arm_smmu_driver_register(void)
+{
+ return platform_driver_register(&arm_smmu_driver);
+}
+device_initcall_sync(arm_smmu_driver_register);
+#endif /* !CONFIG_ARM_SMMU_V3_PKVM */
MODULE_DESCRIPTION("IOMMU API for ARM architected SMMUv3 implementations");
MODULE_AUTHOR("Will Deacon <will@kernel.org>");
--
2.51.0.rc1.167.g924127e9c0-goog
* [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (14 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-09 18:30 ` Daniel Mentz
2025-08-19 21:51 ` [PATCH v4 17/28] iommu/arm-smmu-v3-kvm: Take over SMMUs Mostafa Saleh
` (11 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
As the hypervisor has no access to firmware tables, device discovery
is done from the kernel, which parses the firmware tables and populates
a list of devices for the hypervisor, which later takes over.
At the moment only the device tree is supported.
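For illustration (not part of the patch), this is the kind of firmware
node the discovery path picks up; the address is made up, while the
compatible string, the >= 128K register region and the coherency flag
match what the parsing code checks:

  smmu: iommu@2b400000 {
          compatible = "arm,smmu-v3";
          reg = <0x0 0x2b400000 0x0 0x20000>;
          dma-coherent;
  };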
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 93 ++++++++++++++++++-
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 13 +++
2 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index ac4eac6d567f..27ea39c0fb1f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -7,6 +7,7 @@
#include <asm/kvm_mmu.h>
#include <asm/kvm_pkvm.h>
+#include <linux/of_address.h>
#include <linux/of_platform.h>
#include "arm-smmu-v3.h"
@@ -14,6 +15,75 @@
extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
+static size_t kvm_arm_smmu_count;
+static struct hyp_arm_smmu_v3_device *kvm_arm_smmu_array;
+
+static void kvm_arm_smmu_array_free(void)
+{
+ int order;
+
+ order = get_order(kvm_arm_smmu_count * sizeof(*kvm_arm_smmu_array));
+ free_pages((unsigned long)kvm_arm_smmu_array, order);
+}
+
+/*
+ * The hypervisor has to know the basic information about the SMMUs
+ * from the firmware.
+ * This has to be done before the SMMUv3 driver probes and does anything
+ * meaningful with the hardware; otherwise it becomes harder to reason about
+ * the SMMU state, and we'd have to hand off the state to the hypervisor at
+ * some point while devices are live, which is complicated and dangerous.
+ * Instead, the hypervisor is interested in a very small part of the probe
+ * path, so just add separate logic for it.
+ */
+static int kvm_arm_smmu_array_alloc(void)
+{
+ int smmu_order;
+ struct device_node *np;
+ int ret;
+ int i = 0;
+
+ kvm_arm_smmu_count = 0;
+ for_each_compatible_node(np, NULL, "arm,smmu-v3")
+ kvm_arm_smmu_count++;
+
+ if (!kvm_arm_smmu_count)
+ return -ENODEV;
+
+ smmu_order = get_order(kvm_arm_smmu_count * sizeof(*kvm_arm_smmu_array));
+ kvm_arm_smmu_array = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, smmu_order);
+ if (!kvm_arm_smmu_array)
+ return -ENOMEM;
+
+ /* Basic device tree parsing. */
+ for_each_compatible_node(np, NULL, "arm,smmu-v3") {
+ struct resource res;
+
+ ret = of_address_to_resource(np, 0, &res);
+ if (ret)
+ goto out_err;
+ kvm_arm_smmu_array[i].mmio_addr = res.start;
+ kvm_arm_smmu_array[i].mmio_size = resource_size(&res);
+ if (kvm_arm_smmu_array[i].mmio_size < SZ_128K) {
+ pr_err("SMMUv3(%s) has unsupported size (0x%zx)\n", np->name,
+ kvm_arm_smmu_array[i].mmio_size);
+ ret = -EINVAL;
+ goto out_err;
+ }
+
+ if (of_dma_is_coherent(np))
+ kvm_arm_smmu_array[i].features |= ARM_SMMU_FEAT_COHERENCY;
+
+ i++;
+ }
+
+ return 0;
+
+out_err:
+ kvm_arm_smmu_array_free();
+ return ret;
+}
+
size_t smmu_hyp_pgt_pages(void)
{
/*
@@ -27,10 +97,31 @@ size_t smmu_hyp_pgt_pages(void)
static int kvm_arm_smmu_v3_register(void)
{
+ int ret;
+
if (!is_protected_kvm_enabled())
return 0;
- return kvm_iommu_register_driver(kern_hyp_va(lm_alias(&kvm_nvhe_sym(smmu_ops))));
+ ret = kvm_arm_smmu_array_alloc();
+ if (ret)
+ return ret;
+
+ ret = kvm_iommu_register_driver(kern_hyp_va(lm_alias(&kvm_nvhe_sym(smmu_ops))));
+ if (ret)
+ goto out_err;
+
+ /*
+ * These variables are stored in the nVHE image, and won't be accessible
+ * after KVM initialization. Ownership of kvm_arm_smmu_array will be
+ * transferred to the hypervisor as well.
+ */
+ kvm_hyp_arm_smmu_v3_smmus = kvm_arm_smmu_array;
+ kvm_hyp_arm_smmu_v3_count = kvm_arm_smmu_count;
+ return ret;
+
+out_err:
+ kvm_arm_smmu_array_free();
+ return ret;
};
core_initcall(kvm_arm_smmu_v3_register);
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
index f6ad91d3fb85..744ee2b7f0b4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -4,7 +4,20 @@
#include <asm/kvm_asm.h>
+/*
+ * Parameters from the trusted host:
+ * @mmio_addr base address of the SMMU registers
+ * @mmio_size size of the registers resource
+ * @features Features of SMMUv3, subset of the main driver
+ *
+ * Other members are filled and used at runtime by the SMMU driver.
+ * @base Virtual address of SMMU registers
+ */
struct hyp_arm_smmu_v3_device {
+ phys_addr_t mmio_addr;
+ size_t mmio_size;
+ void __iomem *base;
+ u32 features;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 17/28] iommu/arm-smmu-v3-kvm: Take over SMMUs
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (15 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3 Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 18/28] iommu/arm-smmu-v3-kvm: Probe SMMU HW Mostafa Saleh
` (10 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Donate the array with the SMMU descriptions to the hypervisor, so that
it can't be changed by the host after de-privilege.
Also, donate the SMMU MMIO resources to the hypervisor.
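The hyp-side init below boils down to the following (a condensed sketch
of the smmu_init() change in the diff; after the donation, any host
access to these pages faults into the hypervisor):

  ret = smmu_take_pages(hyp_virt_to_phys(kvm_hyp_arm_smmu_v3_smmus),
                        PAGE_ALIGN(sizeof(*kvm_hyp_arm_smmu_v3_smmus) *
                                   kvm_hyp_arm_smmu_v3_count));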
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 81 ++++++++++++++++++-
1 file changed, 80 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index fa8b71152560..b56feae81dda 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -7,15 +7,94 @@
#include <asm/kvm_hyp.h>
#include <nvhe/iommu.h>
+#include <nvhe/mem_protect.h>
#include "arm_smmu_v3.h"
size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
+#define for_each_smmu(smmu) \
+ for ((smmu) = kvm_hyp_arm_smmu_v3_smmus; \
+ (smmu) != &kvm_hyp_arm_smmu_v3_smmus[kvm_hyp_arm_smmu_v3_count]; \
+ (smmu)++)
+
+/* Transfer ownership of memory */
+static int smmu_take_pages(u64 phys, size_t size)
+{
+ WARN_ON(!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size));
+ return __pkvm_host_donate_hyp(phys >> PAGE_SHIFT, size >> PAGE_SHIFT);
+}
+
+static void smmu_reclaim_pages(u64 phys, size_t size)
+{
+ WARN_ON(!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size));
+ WARN_ON(__pkvm_hyp_donate_host(phys >> PAGE_SHIFT, size >> PAGE_SHIFT));
+}
+
+/* Put the device in a state that can be probed by the host driver. */
+static void smmu_deinit_device(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int i;
+ size_t nr_pages = smmu->mmio_size >> PAGE_SHIFT;
+
+ for (i = 0 ; i < nr_pages ; ++i) {
+ u64 pfn = (smmu->mmio_addr >> PAGE_SHIFT) + i;
+
+ WARN_ON(__pkvm_hyp_donate_host_mmio(pfn));
+ }
+}
+
+static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int i;
+ size_t nr_pages;
+
+ if (!PAGE_ALIGNED(smmu->mmio_addr | smmu->mmio_size))
+ return -EINVAL;
+
+ nr_pages = smmu->mmio_size >> PAGE_SHIFT;
+ for (i = 0 ; i < nr_pages ; ++i) {
+ u64 pfn = (smmu->mmio_addr >> PAGE_SHIFT) + i;
+
+ /*
+ * This should never happen, so it's fine to be strict to avoid
+ * complicated error handling.
+ */
+ WARN_ON(__pkvm_host_donate_hyp_mmio(pfn));
+ }
+ smmu->base = hyp_phys_to_virt(smmu->mmio_addr);
+
+ return 0;
+}
+
static int smmu_init(void)
{
- return -ENOSYS;
+ int ret;
+ struct hyp_arm_smmu_v3_device *smmu;
+ size_t smmu_arr_size = PAGE_ALIGN(sizeof(*kvm_hyp_arm_smmu_v3_smmus) *
+ kvm_hyp_arm_smmu_v3_count);
+
+ kvm_hyp_arm_smmu_v3_smmus = kern_hyp_va(kvm_hyp_arm_smmu_v3_smmus);
+ ret = smmu_take_pages(hyp_virt_to_phys(kvm_hyp_arm_smmu_v3_smmus),
+ smmu_arr_size);
+ if (ret)
+ return ret;
+
+ for_each_smmu(smmu) {
+ ret = smmu_init_device(smmu);
+ if (ret)
+ goto out_reclaim_smmu;
+ }
+
+ return 0;
+
+out_reclaim_smmu:
+ while (smmu != kvm_hyp_arm_smmu_v3_smmus)
+ smmu_deinit_device(--smmu);
+ smmu_reclaim_pages(hyp_virt_to_phys(kvm_hyp_arm_smmu_v3_smmus),
+ smmu_arr_size);
+ return ret;
}
static void smmu_host_stage2_idmap(phys_addr_t start, phys_addr_t end, int prot)
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 18/28] iommu/arm-smmu-v3-kvm: Probe SMMU HW
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (16 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 17/28] iommu/arm-smmu-v3-kvm: Take over SMMUs Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 19/28] iommu/arm-smmu-v3-kvm: Add MMIO emulation Mostafa Saleh
` (9 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Probe the SMMU features from the IDR register space; most of the logic
is shared with the kernel driver.
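For reference, the IDR5.OAS field decodes to the following output
address sizes (from the SMMUv3 architecture; the 52-bit case gets the
extra page size below, mirroring the kernel's hw probe):

  /* Index is the IDR5.OAS field value. */
  static const unsigned int smmu_oas_bits[] = {
          32, 36, 40, 42, 44, 48, 52,
  };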
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 57 ++++++++++++++++++-
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 8 +++
3 files changed, 64 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 8ffcc2e32474..f0e1feee8a49 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -48,6 +48,7 @@ struct arm_smmu_device;
#define IDR0_S2P (1 << 0)
#define ARM_SMMU_IDR1 0x4
+#define IDR1_ECMDQ (1 << 31)
#define IDR1_TABLES_PRESET (1 << 30)
#define IDR1_QUEUES_PRESET (1 << 29)
#define IDR1_REL (1 << 28)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index b56feae81dda..e45b4e50b1e4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -10,6 +10,7 @@
#include <nvhe/mem_protect.h>
#include "arm_smmu_v3.h"
+#include "../arm-smmu-v3.h"
size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
@@ -45,9 +46,56 @@ static void smmu_deinit_device(struct hyp_arm_smmu_v3_device *smmu)
}
}
+/*
+ * Mini-probe and validation for the hypervisor.
+ */
+static int smmu_probe(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u32 reg;
+
+ /* IDR0 */
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
+ smmu->features = smmu_idr0_features(reg);
+
+ /*
+ * Some MMU600 and MMU700 implementations have errata that prevent them from
+ * using nesting. It's not clear how to identify those, so it's recommended
+ * not to enable this driver on such systems; rejecting all of them here
+ * would be too restrictive.
+ */
+ if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1) ||
+ !(smmu->features & ARM_SMMU_FEAT_TRANS_S2))
+ return -ENXIO;
+
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR1);
+ if (reg & (IDR1_TABLES_PRESET | IDR1_QUEUES_PRESET | IDR1_REL | IDR1_ECMDQ))
+ return -EINVAL;
+
+ smmu->sid_bits = FIELD_GET(IDR1_SIDSIZE, reg);
+ /* Follows the kernel logic */
+ if (smmu->sid_bits <= STRTAB_SPLIT)
+ smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
+
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+ smmu->features |= smmu_idr3_features(reg);
+
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
+ smmu->pgsize_bitmap = smmu_idr5_to_pgsize(reg);
+
+ smmu->oas = smmu_idr5_to_oas(reg);
+ if (smmu->oas == 52)
+ smmu->pgsize_bitmap |= 1ULL << 42;
+ else if (!smmu->oas)
+ smmu->oas = 48;
+
+ smmu->ias = 64;
+ smmu->ias = min(smmu->ias, smmu->oas);
+ return 0;
+}
+
static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
{
- int i;
+ int i, ret;
size_t nr_pages;
if (!PAGE_ALIGNED(smmu->mmio_addr | smmu->mmio_size))
@@ -64,8 +112,13 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
WARN_ON(__pkvm_host_donate_hyp_mmio(pfn));
}
smmu->base = hyp_phys_to_virt(smmu->mmio_addr);
-
+ ret = smmu_probe(smmu);
+ if (ret)
+ goto out_ret;
return 0;
+out_ret:
+ smmu_deinit_device(smmu);
+ return ret;
}
static int smmu_init(void)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
index 744ee2b7f0b4..3550fa695539 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -12,12 +12,20 @@
*
* Other members are filled and used at runtime by the SMMU driver.
* @base Virtual address of SMMU registers
+ * @ias IPA size
+ * @oas PA size
+ * @pgsize_bitmap Supported page sizes
+ * @sid_bits Max number of SID bits supported
*/
struct hyp_arm_smmu_v3_device {
phys_addr_t mmio_addr;
size_t mmio_size;
void __iomem *base;
u32 features;
+ unsigned long ias;
+ unsigned long oas;
+ unsigned long pgsize_bitmap;
+ unsigned int sid_bits;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 19/28] iommu/arm-smmu-v3-kvm: Add MMIO emulation
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (17 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 18/28] iommu/arm-smmu-v3-kvm: Probe SMMU HW Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 20/28] iommu/arm-smmu-v3-kvm: Shadow the command queue Mostafa Saleh
` (8 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
At the moment most register accesses are just passed through; the next
patches add CMDQ/STE emulation, which inserts logic into some of these
register accesses.
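A condensed sketch of the scheme used in smmu_dabt_device() below: the
per-register mask doubles as the access check and as a bit filter. For
IDR0, a read-only register with stage-2/MSI capabilities hidden
(variables as in the function; illustrative, not the exact code):

  u64 mask = is_write ? 0 : ~0ULL;       /* read-only register */
  mask &= ~(IDR0_S2P | IDR0_VMID16 | IDR0_MSI | IDR0_HYP);
  if (!mask)
          return;                        /* no access in this direction */
  regs->regs[rd] = readl_relaxed(smmu->base + ARM_SMMU_IDR0) & mask;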
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 125 ++++++++++++++++++
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 10 ++
2 files changed, 135 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index e45b4e50b1e4..32f199aeec9e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -8,6 +8,7 @@
#include <nvhe/iommu.h>
#include <nvhe/mem_protect.h>
+#include <nvhe/trap_handler.h>
#include "arm_smmu_v3.h"
#include "../arm-smmu-v3.h"
@@ -140,6 +141,8 @@ static int smmu_init(void)
goto out_reclaim_smmu;
}
+ BUILD_BUG_ON(sizeof(hyp_spinlock_t) != sizeof(u32));
+
return 0;
out_reclaim_smmu:
@@ -150,6 +153,127 @@ static int smmu_init(void)
return ret;
}
+static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
+ struct user_pt_regs *regs,
+ u64 esr, u32 off)
+{
+ bool is_write = esr & ESR_ELx_WNR;
+ unsigned int len = BIT((esr & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
+ int rd = (esr & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
+ const u64 read_write = -1ULL;
+ const u64 no_access = 0;
+ u64 mask = no_access;
+ const u64 read_only = is_write ? no_access : read_write;
+ u64 val = regs->regs[rd];
+
+ switch (off) {
+ case ARM_SMMU_IDR0:
+ /* Clear stage-2 support, hide MSI to avoid write back to cmdq */
+ mask = read_only & ~(IDR0_S2P | IDR0_VMID16 | IDR0_MSI | IDR0_HYP);
+ WARN_ON(len != sizeof(u32));
+ break;
+ /* Pass through the register access for bisectability, handled later */
+ case ARM_SMMU_CMDQ_BASE:
+ case ARM_SMMU_CMDQ_PROD:
+ case ARM_SMMU_CMDQ_CONS:
+ case ARM_SMMU_STRTAB_BASE:
+ case ARM_SMMU_STRTAB_BASE_CFG:
+ case ARM_SMMU_GBPA:
+ mask = read_write;
+ break;
+ case ARM_SMMU_CR0:
+ mask = read_write;
+ WARN_ON(len != sizeof(u32));
+ break;
+ case ARM_SMMU_CR1: {
+ /* Based on Linux implementation */
+ u64 cr1_template = FIELD_PREP(CR1_TABLE_SH, ARM_SMMU_SH_ISH) |
+ FIELD_PREP(CR1_TABLE_OC, CR1_CACHE_WB) |
+ FIELD_PREP(CR1_TABLE_IC, CR1_CACHE_WB) |
+ FIELD_PREP(CR1_QUEUE_SH, ARM_SMMU_SH_ISH) |
+ FIELD_PREP(CR1_QUEUE_OC, CR1_CACHE_WB) |
+ FIELD_PREP(CR1_QUEUE_IC, CR1_CACHE_WB);
+ /* Don't mess with shareability/cacheability. */
+ if (is_write)
+ WARN_ON(val != cr1_template);
+ mask = read_write;
+ WARN_ON(len != sizeof(u32));
+ break;
+ }
+ /*
+ * These should be safe; just enforce RO or RW and size according to the
+ * architecture. Some other registers that are not used by Linux, such as
+ * IDR2 and IDR4, won't be allowed.
+ */
+ case ARM_SMMU_EVTQ_PROD + SZ_64K:
+ case ARM_SMMU_EVTQ_CONS + SZ_64K:
+ case ARM_SMMU_EVTQ_IRQ_CFG1:
+ case ARM_SMMU_EVTQ_IRQ_CFG2:
+ case ARM_SMMU_PRIQ_PROD + SZ_64K:
+ case ARM_SMMU_PRIQ_CONS + SZ_64K:
+ case ARM_SMMU_PRIQ_IRQ_CFG1:
+ case ARM_SMMU_PRIQ_IRQ_CFG2:
+ case ARM_SMMU_GERRORN:
+ case ARM_SMMU_GERROR_IRQ_CFG1:
+ case ARM_SMMU_GERROR_IRQ_CFG2:
+ case ARM_SMMU_IRQ_CTRLACK:
+ case ARM_SMMU_IRQ_CTRL:
+ case ARM_SMMU_CR0ACK:
+ case ARM_SMMU_CR2:
+ /* These are 32 bit registers. */
+ WARN_ON(len != sizeof(u32));
+ fallthrough;
+ case ARM_SMMU_EVTQ_BASE:
+ case ARM_SMMU_EVTQ_IRQ_CFG0:
+ case ARM_SMMU_PRIQ_BASE:
+ case ARM_SMMU_PRIQ_IRQ_CFG0:
+ case ARM_SMMU_GERROR_IRQ_CFG0:
+ mask = read_write;
+ break;
+ case ARM_SMMU_IIDR:
+ case ARM_SMMU_IDR5:
+ case ARM_SMMU_IDR3:
+ case ARM_SMMU_IDR1:
+ case ARM_SMMU_GERROR:
+ WARN_ON(len != sizeof(u32));
+ mask = read_only;
+ }
+
+ if (WARN_ON(!mask))
+ goto out_ret;
+
+ if (is_write) {
+ if (len == sizeof(u64))
+ writeq_relaxed(regs->regs[rd] & mask, smmu->base + off);
+ else
+ writel_relaxed(regs->regs[rd] & mask, smmu->base + off);
+ } else {
+ if (len == sizeof(u64))
+ regs->regs[rd] = readq_relaxed(smmu->base + off) & mask;
+ else
+ regs->regs[rd] = readl_relaxed(smmu->base + off) & mask;
+ }
+
+out_ret:
+ return true;
+}
+
+static bool smmu_dabt_handler(struct user_pt_regs *regs, u64 esr, u64 addr)
+{
+ struct hyp_arm_smmu_v3_device *smmu;
+ bool ret;
+
+ for_each_smmu(smmu) {
+ if (addr < smmu->mmio_addr || addr >= smmu->mmio_addr + smmu->mmio_size)
+ continue;
+ hyp_spin_lock(&smmu->lock);
+ ret = smmu_dabt_device(smmu, regs, esr, addr - smmu->mmio_addr);
+ hyp_spin_unlock(&smmu->lock);
+ return ret;
+ }
+ return false;
+}
+
static void smmu_host_stage2_idmap(phys_addr_t start, phys_addr_t end, int prot)
{
}
@@ -158,4 +282,5 @@ static void smmu_host_stage2_idmap(phys_addr_t start, phys_addr_t end, int prot)
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
.host_stage2_idmap = smmu_host_stage2_idmap,
+ .dabt_handler = smmu_dabt_handler,
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
index 3550fa695539..dfeaed728982 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -4,6 +4,10 @@
#include <asm/kvm_asm.h>
+#ifdef __KVM_NVHE_HYPERVISOR__
+#include <nvhe/spinlock.h>
+#endif
+
/*
* Parameters from the trusted host:
* @mmio_addr base address of the SMMU registers
@@ -16,6 +20,7 @@
* @oas PA size
* @pgsize_bitmap Supported page sizes
* @sid_bits Max number of SID bits supported
+ * @lock Lock to protect SMMU
*/
struct hyp_arm_smmu_v3_device {
phys_addr_t mmio_addr;
@@ -26,6 +31,11 @@ struct hyp_arm_smmu_v3_device {
unsigned long oas;
unsigned long pgsize_bitmap;
unsigned int sid_bits;
+#ifdef __KVM_NVHE_HYPERVISOR__
+ hyp_spinlock_t lock;
+#else
+ u32 lock;
+#endif
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 20/28] iommu/arm-smmu-v3-kvm: Shadow the command queue
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (18 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 19/28] iommu/arm-smmu-v3-kvm: Add MMIO emulation Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 21/28] iommu/arm-smmu-v3-kvm: Add CMDQ functions Mostafa Saleh
` (7 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
At boot, allocate a command queue per SMMU which is used as a shadow
by the hypervisor.
The command queue size is 64K, which is more than enough: the
hypervisor consumes all the entries on each command queue prod write,
and as commands are 16 bytes each, it can handle up to 4096 commands
at a time.
Then, the host command queue needs to be pinned in a shared state so
it can't be donated to VMs, to avoid tricking the hypervisor into
accessing protected memory. This is done each time the command queue
is enabled, and undone each time the command queue is disabled.
The hypervisor won't access the host command queue while the host has
it disabled.
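Where the numbers come from, as a standalone compile-and-run sketch
(assuming 4KB pages; the 16-byte command size is architectural):

  #include <stdio.h>

  int main(void)
  {
          const unsigned int order = 4;       /* SMMU_KVM_CMDQ_ORDER */
          const unsigned int page_shift = 12; /* PAGE_SHIFT */
          const unsigned int ent_shift = 4;   /* CMDQ_ENT_SZ_SHIFT */
          const unsigned int max_n_shift = order + page_shift - ent_shift;

          printf("queue size: %u KB\n", (1u << (order + page_shift)) / 1024); /* 64 */
          printf("entries:    %u\n", 1u << max_n_shift);                      /* 4096 */
          return 0;
  }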
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 20 ++++
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 108 +++++++++++++++++-
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 8 ++
3 files changed, 135 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 27ea39c0fb1f..86e6c68aad4e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -13,6 +13,8 @@
#include "arm-smmu-v3.h"
#include "pkvm/arm_smmu_v3.h"
+#define SMMU_KVM_CMDQ_ORDER 4
+
extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
static size_t kvm_arm_smmu_count;
@@ -58,6 +60,7 @@ static int kvm_arm_smmu_array_alloc(void)
/* Basic device tree parsing. */
for_each_compatible_node(np, NULL, "arm,smmu-v3") {
struct resource res;
+ void *cmdq_base;
ret = of_address_to_resource(np, 0, &res);
if (ret)
@@ -74,6 +77,23 @@ static int kvm_arm_smmu_array_alloc(void)
if (of_dma_is_coherent(np))
kvm_arm_smmu_array[i].features |= ARM_SMMU_FEAT_COHERENCY;
+ /*
+ * Allocate a shadow for the command queue; it doesn't have to be the same
+ * size as the host's.
+ * Only populate base_dma and llq.max_n_shift, the hypervisor will init
+ * the rest.
+ * We don't know what size the host will choose at this point; the shadow
+ * copy will be 64K, which is a reasonable size.
+ */
+ cmdq_base = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, SMMU_KVM_CMDQ_ORDER);
+ if (!cmdq_base) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+
+ kvm_arm_smmu_array[i].cmdq.base_dma = virt_to_phys(cmdq_base);
+ kvm_arm_smmu_array[i].cmdq.llq.max_n_shift = SMMU_KVM_CMDQ_ORDER + PAGE_SHIFT -
+ CMDQ_ENT_SZ_SHIFT;
i++;
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index 32f199aeec9e..d3ab4b814be4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -11,7 +11,6 @@
#include <nvhe/trap_handler.h>
#include "arm_smmu_v3.h"
-#include "../arm-smmu-v3.h"
size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
@@ -34,6 +33,35 @@ static void smmu_reclaim_pages(u64 phys, size_t size)
WARN_ON(__pkvm_hyp_donate_host(phys >> PAGE_SHIFT, size >> PAGE_SHIFT));
}
+/*
+ * The host copies of the CMDQ and stream table are accessed by the
+ * hypervisor, so we share them to:
+ * - Prevent the host from passing protected VM memory.
+ * - Have them mapped in the hyp page table.
+ */
+static int smmu_share_pages(phys_addr_t addr, size_t size)
+{
+ int i;
+ size_t nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ for (i = 0 ; i < nr_pages ; ++i)
+ WARN_ON(__pkvm_host_share_hyp((addr + i * PAGE_SIZE) >> PAGE_SHIFT));
+
+ return hyp_pin_shared_mem(hyp_phys_to_virt(addr), hyp_phys_to_virt(addr + size));
+}
+
+static int smmu_unshare_pages(phys_addr_t addr, size_t size)
+{
+ int i;
+ size_t nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ hyp_unpin_shared_mem(hyp_phys_to_virt(addr), hyp_phys_to_virt(addr + size));
+
+ for (i = 0 ; i < nr_pages ; ++i)
+ WARN_ON(__pkvm_host_unshare_hyp((addr + i * PAGE_SIZE) >> PAGE_SHIFT));
+
+ return 0;
+}
+
/* Put the device in a state that can be probed by the host driver. */
static void smmu_deinit_device(struct hyp_arm_smmu_v3_device *smmu)
{
@@ -94,6 +122,41 @@ static int smmu_probe(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+/*
+ * The kernel part of the driver will allocate the shadow cmdq,
+ * which is different from the one used by the driver.
+ * This function only donates it.
+ */
+static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
+{
+ size_t cmdq_size;
+ int ret;
+ enum kvm_pgtable_prot prot = PAGE_HYP;
+
+ cmdq_size = (1 << (smmu->cmdq.llq.max_n_shift)) *
+ CMDQ_ENT_DWORDS * 8;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+
+ ret = ___pkvm_host_donate_hyp(smmu->cmdq.base_dma >> PAGE_SHIFT,
+ PAGE_ALIGN(cmdq_size) >> PAGE_SHIFT, prot);
+ if (ret)
+ return ret;
+
+ smmu->cmdq.base = hyp_phys_to_virt(smmu->cmdq.base_dma);
+ smmu->cmdq.prod_reg = smmu->base + ARM_SMMU_CMDQ_PROD;
+ smmu->cmdq.cons_reg = smmu->base + ARM_SMMU_CMDQ_CONS;
+ smmu->cmdq.q_base = smmu->cmdq.base_dma |
+ FIELD_PREP(Q_BASE_LOG2SIZE, smmu->cmdq.llq.max_n_shift);
+ smmu->cmdq.ent_dwords = CMDQ_ENT_DWORDS;
+ memset(smmu->cmdq.base, 0, cmdq_size);
+ writel_relaxed(0, smmu->cmdq.prod_reg);
+ writel_relaxed(0, smmu->cmdq.cons_reg);
+ writeq_relaxed(smmu->cmdq.q_base, smmu->base + ARM_SMMU_CMDQ_BASE);
+ return 0;
+}
+
static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
{
int i, ret;
@@ -116,7 +179,13 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
ret = smmu_probe(smmu);
if (ret)
goto out_ret;
+
+ ret = smmu_init_cmdq(smmu);
+ if (ret)
+ goto out_ret;
+
return 0;
+
out_ret:
smmu_deinit_device(smmu);
return ret;
@@ -153,6 +222,27 @@ static int smmu_init(void)
return ret;
}
+static bool is_cmdq_enabled(struct hyp_arm_smmu_v3_device *smmu)
+{
+ return FIELD_GET(CR0_CMDQEN, smmu->cr0);
+}
+
+static void smmu_emulate_cmdq_enable(struct hyp_arm_smmu_v3_device *smmu)
+{
+ size_t cmdq_size;
+
+ smmu->cmdq_host.llq.max_n_shift = smmu->cmdq_host.q_base & Q_BASE_LOG2SIZE;
+ cmdq_size = (1 << smmu->cmdq_host.llq.max_n_shift) * CMDQ_ENT_DWORDS * 8;
+ WARN_ON(smmu_share_pages(smmu->cmdq_host.q_base & Q_BASE_ADDR_MASK, cmdq_size));
+}
+
+static void smmu_emulate_cmdq_disable(struct hyp_arm_smmu_v3_device *smmu)
+{
+ size_t cmdq_size = (1 << smmu->cmdq_host.llq.max_n_shift) * CMDQ_ENT_DWORDS * 8;
+
+ WARN_ON(smmu_unshare_pages(smmu->cmdq_host.q_base & Q_BASE_ADDR_MASK, cmdq_size));
+}
+
static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
struct user_pt_regs *regs,
u64 esr, u32 off)
@@ -174,6 +264,13 @@ static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
break;
/* Pass through the register access for bisectability, handled later */
case ARM_SMMU_CMDQ_BASE:
+
+ /* Not allowed by the architecture */
+ WARN_ON(is_cmdq_enabled(smmu));
+ if (is_write)
+ smmu->cmdq_host.q_base = val;
+ mask = read_write;
+ break;
case ARM_SMMU_CMDQ_PROD:
case ARM_SMMU_CMDQ_CONS:
case ARM_SMMU_STRTAB_BASE:
@@ -182,6 +279,15 @@ static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
mask = read_write;
break;
case ARM_SMMU_CR0:
+ if (is_write) {
+ bool last_cmdq_en = is_cmdq_enabled(smmu);
+
+ smmu->cr0 = val;
+ if (!last_cmdq_en && is_cmdq_enabled(smmu))
+ smmu_emulate_cmdq_enable(smmu);
+ else if (last_cmdq_en && !is_cmdq_enabled(smmu))
+ smmu_emulate_cmdq_disable(smmu);
+ }
mask = read_write;
WARN_ON(len != sizeof(u32));
break;
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
index dfeaed728982..330da53f80d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -8,6 +8,8 @@
#include <nvhe/spinlock.h>
#endif
+#include "../arm-smmu-v3.h"
+
/*
* Parameters from the trusted host:
* @mmio_addr base address of the SMMU registers
@@ -21,6 +23,9 @@
* @pgsize_bitmap Supported page sizes
* @sid_bits Max number of SID bits supported
* @lock Lock to protect SMMU
+ * @cmdq CMDQ as observed by HW
+ * @cmdq_host Host view of the command queue
+ * @cr0 Last value of CR0
*/
struct hyp_arm_smmu_v3_device {
phys_addr_t mmio_addr;
@@ -36,6 +41,9 @@ struct hyp_arm_smmu_v3_device {
#else
u32 lock;
#endif
+ struct arm_smmu_queue cmdq;
+ struct arm_smmu_queue cmdq_host;
+ u32 cr0;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 21/28] iommu/arm-smmu-v3-kvm: Add CMDQ functions
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (19 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 20/28] iommu/arm-smmu-v3-kvm: Shadow the command queue Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host Mostafa Saleh
` (6 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Add functions to access the command queue. There are two main usages:
- The hypervisor's own commands, such as TLB invalidations, use
functions like smmu_send_cmd(), which builds and sends a command.
- Host commands are added to the shadow command queue after being
filtered; those are inserted with smmu_add_cmd_raw().
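For example, a hypervisor-side invalidation could be built on top of
these helpers as follows (a sketch: smmu_inv_asid() is hypothetical,
while the cmdq entry layout is the existing driver's):

  static int smmu_inv_asid(struct hyp_arm_smmu_v3_device *smmu, u16 asid)
  {
          struct arm_smmu_cmdq_ent cmd = {
                  .opcode = CMDQ_OP_TLBI_NH_ASID,
                  .tlbi.asid = asid,
          };

          /* Builds the command, queues it, then issues a CMD_SYNC and
           * waits for the queue to drain. */
          return smmu_send_cmd(smmu, &cmd);
  }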
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 98 +++++++++++++++++++
1 file changed, 98 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index d3ab4b814be4..554229e466f3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -20,6 +20,33 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
(smmu) != &kvm_hyp_arm_smmu_v3_smmus[kvm_hyp_arm_smmu_v3_count]; \
(smmu)++)
+/*
+ * Wait until @cond is true.
+ * Return 0 on success, or -ETIMEDOUT
+ */
+#define smmu_wait(_cond) \
+({ \
+ int __ret = 0; \
+ u64 delay = pkvm_time_get() + ARM_SMMU_POLL_TIMEOUT_US; \
+ \
+ while (!(_cond)) { \
+ if (pkvm_time_get() >= delay) { \
+ __ret = -ETIMEDOUT; \
+ break; \
+ } \
+ } \
+ __ret; \
+})
+
+#define smmu_wait_event(_smmu, _cond) \
+({ \
+ if ((_smmu)->features & ARM_SMMU_FEAT_SEV) { \
+ while (!(_cond)) \
+ wfe(); \
+ } \
+ smmu_wait(_cond); \
+})
+
/* Transfer ownership of memory */
static int smmu_take_pages(u64 phys, size_t size)
{
@@ -62,6 +89,77 @@ static int smmu_unshare_pages(phys_addr_t addr, size_t size)
return 0;
}
+static bool smmu_cmdq_full(struct arm_smmu_queue *cmdq)
+{
+ struct arm_smmu_ll_queue *llq = &cmdq->llq;
+
+ WRITE_ONCE(llq->cons, readl_relaxed(cmdq->cons_reg));
+ return queue_full(llq);
+}
+
+static bool smmu_cmdq_empty(struct arm_smmu_queue *cmdq)
+{
+ struct arm_smmu_ll_queue *llq = &cmdq->llq;
+
+ WRITE_ONCE(llq->cons, readl_relaxed(cmdq->cons_reg));
+ return queue_empty(llq);
+}
+
+static void smmu_add_cmd_raw(struct hyp_arm_smmu_v3_device *smmu,
+ u64 *cmd)
+{
+ struct arm_smmu_queue *q = &smmu->cmdq;
+ struct arm_smmu_ll_queue *llq = &q->llq;
+
+ queue_write(Q_ENT(q, llq->prod), cmd, CMDQ_ENT_DWORDS);
+ llq->prod = queue_inc_prod_n(llq, 1);
+ writel_relaxed(llq->prod, q->prod_reg);
+}
+
+static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_ent *ent)
+{
+ int ret;
+ u64 cmd[CMDQ_ENT_DWORDS];
+
+ ret = smmu_wait_event(smmu, !smmu_cmdq_full(&smmu->cmdq));
+ if (ret)
+ return ret;
+
+ ret = arm_smmu_cmdq_build_cmd(cmd, ent);
+ if (ret)
+ return ret;
+
+ smmu_add_cmd_raw(smmu, cmd);
+ return 0;
+}
+
+static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int ret;
+ struct arm_smmu_cmdq_ent cmd = {
+ .opcode = CMDQ_OP_CMD_SYNC,
+ };
+
+ ret = smmu_add_cmd(smmu, &cmd);
+ if (ret)
+ return ret;
+
+ return smmu_wait_event(smmu, smmu_cmdq_empty(&smmu->cmdq));
+}
+
+__maybe_unused
+static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_ent *cmd)
+{
+ int ret = smmu_add_cmd(smmu, cmd);
+
+ if (ret)
+ return ret;
+
+ return smmu_sync_cmd(smmu);
+}
+
/* Put the device in a state that can be probed by the host driver. */
static void smmu_deinit_device(struct hyp_arm_smmu_v3_device *smmu)
{
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (20 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 21/28] iommu/arm-smmu-v3-kvm: Add CMDQ functions Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-09-12 14:18 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 23/28] iommu/arm-smmu-v3-kvm: Shadow stream table Mostafa Saleh
` (5 subsequent siblings)
27 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Don't allow direct access to the command queue from the host:
- ARM_SMMU_CMDQ_BASE: only allowed to be written when the CMDQ is
disabled; we use it to keep track of the host command queue base.
Reads return the saved value.
- ARM_SMMU_CMDQ_PROD: writes trigger command queue emulation, which
sanitises and filters the whole range. Reads return the host copy.
- ARM_SMMU_CMDQ_CONS: writes move the SW copy of cons, but the host
can't skip commands once submitted. Reads return the emulated value
plus the error bits from the actual cons.
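The emulation relies on the architectural PROD/CONS layout: an index
in the low LOG2SIZE bits plus a wrap bit just above it. A standalone
sketch of the empty/full checks, assuming the 4096-entry shadow queue:

  #include <stdbool.h>
  #include <stdint.h>

  #define SHIFT 12 /* log2(number of entries) */

  static uint32_t q_idx(uint32_t p)  { return p & ((1u << SHIFT) - 1); }
  static uint32_t q_wrap(uint32_t p) { return p & (1u << SHIFT); }

  static bool q_empty(uint32_t prod, uint32_t cons)
  {
          return q_idx(prod) == q_idx(cons) && q_wrap(prod) == q_wrap(cons);
  }

  static bool q_full(uint32_t prod, uint32_t cons)
  {
          return q_idx(prod) == q_idx(cons) && q_wrap(prod) != q_wrap(cons);
  }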
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 108 +++++++++++++++++-
1 file changed, 105 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index 554229e466f3..10c6461bbf12 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -325,6 +325,88 @@ static bool is_cmdq_enabled(struct hyp_arm_smmu_v3_device *smmu)
return FIELD_GET(CR0_CMDQEN, smmu->cr0);
}
+static bool smmu_filter_command(struct hyp_arm_smmu_v3_device *smmu, u64 *command)
+{
+ u64 type = FIELD_GET(CMDQ_0_OP, command[0]);
+
+ switch (type) {
+ case CMDQ_OP_CFGI_STE:
+ /* TBD: SHADOW_STE*/
+ break;
+ case CMDQ_OP_CFGI_ALL:
+ {
+ /*
+ * Linux doesn't use ranged STE invalidation, and only uses this
+ * for CFGI_ALL, which is done on reset and not when a new STE
+ * starts being used.
+ * Although this is not architectural, we rely on the current Linux
+ * implementation.
+ */
+ WARN_ON(FIELD_GET(CMDQ_CFGI_1_RANGE, command[1]) != 31);
+ break;
+ }
+ case CMDQ_OP_TLBI_NH_ASID:
+ case CMDQ_OP_TLBI_NH_VA:
+ case 0x13: /* CMD_TLBI_NH_VAA: Not used by Linux */
+ {
+ /* Only allow VMID = 0 */
+ if (FIELD_GET(CMDQ_TLBI_0_VMID, command[0]) != 0)
+ return WARN_ON(true);
+ break;
+ }
+ case 0x10: /* CMD_TLBI_NH_ALL: Not used by Linux */
+ case CMDQ_OP_TLBI_EL2_ALL:
+ case CMDQ_OP_TLBI_EL2_VA:
+ case CMDQ_OP_TLBI_EL2_ASID:
+ case CMDQ_OP_TLBI_S12_VMALL:
+ case 0x23: /* CMD_TLBI_EL2_VAA: Not used by Linux */
+ /* Malicious host */
+ return WARN_ON(true);
+ case CMDQ_OP_CMD_SYNC:
+ if (FIELD_GET(CMDQ_SYNC_0_CS, command[0]) == CMDQ_SYNC_0_CS_IRQ) {
+ /* Allow it, but let the host time out, as this should never happen. */
+ command[0] &= ~CMDQ_SYNC_0_CS;
+ command[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
+ command[1] &= ~CMDQ_SYNC_1_MSIADDR_MASK;
+ }
+ break;
+ }
+
+ return false;
+}
+
+static void smmu_emulate_cmdq_insert(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u64 *host_cmdq = hyp_phys_to_virt(smmu->cmdq_host.q_base & Q_BASE_ADDR_MASK);
+ int idx;
+ u64 cmd[CMDQ_ENT_DWORDS];
+ bool skip;
+
+ if (!is_cmdq_enabled(smmu))
+ return;
+
+ while (!queue_empty(&smmu->cmdq_host.llq)) {
+ /* Wait for the command queue to have some space. */
+ WARN_ON(smmu_wait_event(smmu, !smmu_cmdq_full(&smmu->cmdq)));
+
+ idx = Q_IDX(&smmu->cmdq_host.llq, smmu->cmdq_host.llq.cons);
+ /* Avoid TOCTOU */
+ memcpy(cmd, &host_cmdq[idx * CMDQ_ENT_DWORDS], CMDQ_ENT_DWORDS << 3);
+ skip = smmu_filter_command(smmu, cmd);
+ if (!skip)
+ smmu_add_cmd_raw(smmu, cmd);
+ queue_inc_cons(&smmu->cmdq_host.llq);
+ }
+
+ /*
+ * Wait until consumed. This can be improved a bit by returning to the host
+ * while sharing the current offset in the command queue with it; that
+ * offset would be maintained when the hyp enters a command or when the
+ * host issues another read of cons.
+ */
+ WARN_ON(smmu_wait_event(smmu, smmu_cmdq_empty(&smmu->cmdq)));
+}
+
static void smmu_emulate_cmdq_enable(struct hyp_arm_smmu_v3_device *smmu)
{
size_t cmdq_size;
@@ -360,17 +442,37 @@ static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
mask = read_only & ~(IDR0_S2P | IDR0_VMID16 | IDR0_MSI | IDR0_HYP);
WARN_ON(len != sizeof(u32));
break;
- /* Pass through the register access for bisectability, handled later */
case ARM_SMMU_CMDQ_BASE:
/* Not allowed by the architecture */
WARN_ON(is_cmdq_enabled(smmu));
if (is_write)
smmu->cmdq_host.q_base = val;
- mask = read_write;
- break;
+ else
+ regs->regs[rd] = smmu->cmdq_host.q_base;
+ goto out_ret;
case ARM_SMMU_CMDQ_PROD:
+ if (is_write) {
+ smmu->cmdq_host.llq.prod = val;
+ smmu_emulate_cmdq_insert(smmu);
+ } else {
+ regs->regs[rd] = smmu->cmdq_host.llq.prod;
+ }
+ goto out_ret;
case ARM_SMMU_CMDQ_CONS:
+ if (is_write) {
+ /* Not allowed by the architecture */
+ WARN_ON(is_cmdq_enabled(smmu));
+ smmu->cmdq_host.llq.cons = val;
+ } else {
+ /* Propagate errors back to the host. */
+ u32 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
+ u32 err = CMDQ_CONS_ERR & cons;
+
+ regs->regs[rd] = smmu->cmdq_host.llq.cons | err;
+ }
+ goto out_ret;
+ /* Pass through the register access for bisectability, handled later */
case ARM_SMMU_STRTAB_BASE:
case ARM_SMMU_STRTAB_BASE_CFG:
case ARM_SMMU_GBPA:
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 23/28] iommu/arm-smmu-v3-kvm: Shadow stream table
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (21 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 24/28] iommu/arm-smmu-v3-kvm: Shadow STEs Mostafa Saleh
` (4 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
This patch allocates the shadow stream table per SMMU.
We choose the size of that table to be 1MB, which is the max size used
by the host in the case of 2 levels.
In this patch all the host writes are still passthrough for
bisectability; that changes next, where CFGI commands will be trapped
and used to update the hypervisor's shadow copy, which is the one used
by the HW.
Similar to the command queue, the host stream table is
shared/unshared each time the SMMU is enabled/disabled.
Handling of L2 tables is also done in the next patch when the
shadowing is added.
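Where the 1MB comes from, as a quick standalone check (the entry count
and the 8-byte L1 descriptor size are the existing driver's):

  #include <stdio.h>

  int main(void)
  {
          const unsigned long l1_ents = 1UL << 17; /* STRTAB_MAX_L1_ENTRIES */
          const unsigned long desc_sz = 8;         /* sizeof(struct arm_smmu_strtab_l1) */

          printf("%lu MB\n", (l1_ents * desc_sz) >> 20); /* 1 */
          return 0;
  }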
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 13 +-
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 111 ++++++++++++++++++
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 10 ++
3 files changed, 133 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 86e6c68aad4e..821190abac5a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -14,6 +14,8 @@
#include "pkvm/arm_smmu_v3.h"
#define SMMU_KVM_CMDQ_ORDER 4
+#define SMMU_KVM_STRTAB_ORDER (get_order(STRTAB_MAX_L1_ENTRIES * \
+ sizeof(struct arm_smmu_strtab_l1)))
extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
@@ -60,7 +62,7 @@ static int kvm_arm_smmu_array_alloc(void)
/* Basic device tree parsing. */
for_each_compatible_node(np, NULL, "arm,smmu-v3") {
struct resource res;
- void *cmdq_base;
+ void *cmdq_base, *strtab;
ret = of_address_to_resource(np, 0, &res);
if (ret)
@@ -94,6 +96,15 @@ static int kvm_arm_smmu_array_alloc(void)
kvm_arm_smmu_array[i].cmdq.base_dma = virt_to_phys(cmdq_base);
kvm_arm_smmu_array[i].cmdq.llq.max_n_shift = SMMU_KVM_CMDQ_ORDER + PAGE_SHIFT -
CMDQ_ENT_SZ_SHIFT;
+
+ strtab = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, SMMU_KVM_STRTAB_ORDER);
+ if (!strtab) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ kvm_arm_smmu_array[i].strtab_dma = virt_to_phys(strtab);
+ kvm_arm_smmu_array[i].strtab_size = PAGE_SIZE << SMMU_KVM_STRTAB_ORDER;
+
i++;
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index 10c6461bbf12..d722f8ce0635 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -15,6 +15,14 @@
size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
+/* strtab accessors */
+#define strtab_log2size(smmu) (FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, (smmu)->host_ste_cfg))
+#define strtab_size(smmu) ((1 << strtab_log2size(smmu)) * STRTAB_STE_DWORDS * 8)
+#define strtab_host_base(smmu) ((smmu)->host_ste_base & STRTAB_BASE_ADDR_MASK)
+#define strtab_split(smmu) (FIELD_GET(STRTAB_BASE_CFG_SPLIT, (smmu)->host_ste_cfg))
+#define strtab_l1_size(smmu) ((1 << (strtab_log2size(smmu) - strtab_split(smmu))) * \
+ (sizeof(struct arm_smmu_strtab_l1)))
+
#define for_each_smmu(smmu) \
for ((smmu) = kvm_hyp_arm_smmu_v3_smmus; \
(smmu) != &kvm_hyp_arm_smmu_v3_smmus[kvm_hyp_arm_smmu_v3_count]; \
@@ -255,6 +263,48 @@ static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+static int smmu_init_strtab(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int ret;
+ u32 reg;
+ enum kvm_pgtable_prot prot = PAGE_HYP;
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+
+ ret = ___pkvm_host_donate_hyp(hyp_phys_to_pfn(smmu->strtab_dma),
+ smmu->strtab_size >> PAGE_SHIFT, prot);
+ if (ret)
+ return ret;
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+ unsigned int last_sid_idx =
+ arm_smmu_strtab_l1_idx((1ULL << smmu->sid_bits) - 1);
+
+ cfg->l2.l1tab = hyp_phys_to_virt(smmu->strtab_dma);
+ cfg->l2.l1_dma = smmu->strtab_dma;
+ cfg->l2.num_l1_ents = min(last_sid_idx + 1, STRTAB_MAX_L1_ENTRIES);
+
+ reg = FIELD_PREP(STRTAB_BASE_CFG_FMT,
+ STRTAB_BASE_CFG_FMT_2LVL) |
+ FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE,
+ ilog2(cfg->l2.num_l1_ents) + STRTAB_SPLIT) |
+ FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
+ } else {
+ cfg->linear.table = hyp_phys_to_virt(smmu->strtab_dma);
+ cfg->linear.ste_dma = smmu->strtab_dma;
+ cfg->linear.num_ents = 1UL << smmu->sid_bits;
+ reg = FIELD_PREP(STRTAB_BASE_CFG_FMT,
+ STRTAB_BASE_CFG_FMT_LINEAR) |
+ FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
+ }
+
+ writeq_relaxed((smmu->strtab_dma & STRTAB_BASE_ADDR_MASK) | STRTAB_BASE_RA,
+ smmu->base + ARM_SMMU_STRTAB_BASE);
+ writel_relaxed(reg, smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
+ return 0;
+}
+
static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
{
int i, ret;
@@ -282,6 +332,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
if (ret)
goto out_ret;
+ ret = smmu_init_strtab(smmu);
+ if (ret)
+ goto out_ret;
+
return 0;
out_ret:
@@ -320,6 +374,11 @@ static int smmu_init(void)
return ret;
}
+static bool is_smmu_enabled(struct hyp_arm_smmu_v3_device *smmu)
+{
+ return FIELD_GET(CR0_SMMUEN, smmu->cr0);
+}
+
static bool is_cmdq_enabled(struct hyp_arm_smmu_v3_device *smmu)
{
return FIELD_GET(CR0_CMDQEN, smmu->cr0);
@@ -407,6 +466,39 @@ static void smmu_emulate_cmdq_insert(struct hyp_arm_smmu_v3_device *smmu)
WARN_ON(smmu_wait_event(smmu, smmu_cmdq_empty(&smmu->cmdq)));
}
+static void smmu_update_ste_shadow(struct hyp_arm_smmu_v3_device *smmu, bool enabled)
+{
+ size_t strtab_size;
+ u32 fmt = FIELD_GET(STRTAB_BASE_CFG_FMT, smmu->host_ste_cfg);
+
+ /* Linux doesn't change the fmt or size of the strtab at runtime. */
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+ strtab_size = strtab_l1_size(smmu);
+ WARN_ON(fmt != STRTAB_BASE_CFG_FMT_2LVL);
+ WARN_ON(strtab_split(smmu) != STRTAB_SPLIT);
+ } else {
+ strtab_size = strtab_size(smmu);
+ WARN_ON(fmt != STRTAB_BASE_CFG_FMT_LINEAR);
+ WARN_ON(FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, smmu->host_ste_cfg) >
+ smmu->sid_bits);
+ }
+
+ if (enabled)
+ WARN_ON(smmu_share_pages(strtab_host_base(smmu), strtab_size));
+ else
+ WARN_ON(smmu_unshare_pages(strtab_host_base(smmu), strtab_size));
+}
+
+static void smmu_emulate_enable(struct hyp_arm_smmu_v3_device *smmu)
+{
+ smmu_update_ste_shadow(smmu, true);
+}
+
+static void smmu_emulate_disable(struct hyp_arm_smmu_v3_device *smmu)
+{
+ smmu_update_ste_shadow(smmu, false);
+}
+
static void smmu_emulate_cmdq_enable(struct hyp_arm_smmu_v3_device *smmu)
{
size_t cmdq_size;
@@ -474,19 +566,38 @@ static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
goto out_ret;
/* Pass through the register access for bisectability, handled later */
case ARM_SMMU_STRTAB_BASE:
+ if (is_write) {
+ /* Must only be written when SMMU_CR0.SMMUEN == 0. */
+ WARN_ON(is_smmu_enabled(smmu));
+ smmu->host_ste_base = val;
+ }
+ mask = read_write;
+ break;
case ARM_SMMU_STRTAB_BASE_CFG:
+ if (is_write) {
+ /* Must only be written when SMMU_CR0.SMMUEN == 0. */
+ WARN_ON(is_smmu_enabled(smmu));
+ smmu->host_ste_cfg = val;
+ }
+ mask = read_write;
+ break;
case ARM_SMMU_GBPA:
mask = read_write;
break;
case ARM_SMMU_CR0:
if (is_write) {
bool last_cmdq_en = is_cmdq_enabled(smmu);
+ bool last_smmu_en = is_smmu_enabled(smmu);
smmu->cr0 = val;
if (!last_cmdq_en && is_cmdq_enabled(smmu))
smmu_emulate_cmdq_enable(smmu);
else if (last_cmdq_en && !is_cmdq_enabled(smmu))
smmu_emulate_cmdq_disable(smmu);
+ if (!last_smmu_en && is_smmu_enabled(smmu))
+ smmu_emulate_enable(smmu);
+ else if (last_smmu_en && !is_smmu_enabled(smmu))
+ smmu_emulate_disable(smmu);
}
mask = read_write;
WARN_ON(len != sizeof(u32));
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
index 330da53f80d0..cf85e5efdd9e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -15,6 +15,8 @@
* @mmio_addr base address of the SMMU registers
* @mmio_size size of the registers resource
* @features Features of SMMUv3, subset of the main driver
+ * @strtab_dma Phys address of stream table
+ * @strtab_size Stream table size
*
* Other members are filled and used at runtime by the SMMU driver.
* @base Virtual address of SMMU registers
@@ -26,6 +28,9 @@
* @cmdq CMDQ as observed by HW
* @cmdq_host Host view of the command queue
* @cr0 Last value of CR0
+ * @host_ste_cfg Host stream table config
+ * @host_ste_base Host stream table base
+ * @strtab_cfg Stream table as seen by HW
*/
struct hyp_arm_smmu_v3_device {
phys_addr_t mmio_addr;
@@ -44,6 +49,11 @@ struct hyp_arm_smmu_v3_device {
struct arm_smmu_queue cmdq;
struct arm_smmu_queue cmdq_host;
u32 cr0;
+ dma_addr_t strtab_dma;
+ size_t strtab_size;
+ u64 host_ste_cfg;
+ u64 host_ste_base;
+ struct arm_smmu_strtab_cfg strtab_cfg;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 24/28] iommu/arm-smmu-v3-kvm: Shadow STEs
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (22 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 23/28] iommu/arm-smmu-v3-kvm: Shadow stream table Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 25/28] iommu/arm-smmu-v3-kvm: Emulate GBPA Mostafa Saleh
` (3 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
This patch adds STE emulation, which is done when the host sends the
CFGI_STE command.
In this patch we copy the STE as-is to the shadow owned by the
hypervisor; in the next patch, the stage-2 page table will be attached.
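The shadow lookup walks the same two-level layout as the host's; a
sketch of the SID decomposition it assumes (STRTAB_SPLIT is 8 in the
driver, i.e. 256 STEs per L2 table):

  #include <stdint.h>

  static inline uint32_t sid_l1_idx(uint32_t sid) /* arm_smmu_strtab_l1_idx() */
  {
          return sid >> 8;
  }

  static inline uint32_t sid_l2_idx(uint32_t sid) /* arm_smmu_strtab_l2_idx() */
  {
          return sid & ((1u << 8) - 1);
  }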
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 95 +++++++++++++++++--
1 file changed, 89 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index d722f8ce0635..0f890a7d8db3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -22,6 +22,9 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
#define strtab_split(smmu) (FIELD_GET(STRTAB_BASE_CFG_SPLIT, (smmu)->host_ste_cfg))
#define strtab_l1_size(smmu) ((1 << (strtab_log2size(smmu) - strtab_split(smmu))) * \
(sizeof(struct arm_smmu_strtab_l1)))
+#define strtab_hyp_base(smmu) ((smmu)->features & ARM_SMMU_FEAT_2_LVL_STRTAB ? \
+ (u64 *)(smmu)->strtab_cfg.l2.l1tab :\
+ (u64 *)(smmu)->strtab_cfg.linear.table)
#define for_each_smmu(smmu) \
for ((smmu) = kvm_hyp_arm_smmu_v3_smmus; \
@@ -263,6 +266,83 @@ static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+/* Get an STE for a stream table base. */
+static struct arm_smmu_ste *smmu_get_ste_ptr(struct hyp_arm_smmu_v3_device *smmu,
+ u32 sid, u64 *strtab)
+{
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+ struct arm_smmu_ste *table = (struct arm_smmu_ste *)strtab;
+
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+ struct arm_smmu_strtab_l1 *l1tab = (struct arm_smmu_strtab_l1 *)strtab;
+ u32 l1_idx = arm_smmu_strtab_l1_idx(sid);
+ struct arm_smmu_strtab_l2 *l2ptr;
+
+ if (WARN_ON(l1_idx >= cfg->l2.num_l1_ents) ||
+ !(l1tab[l1_idx].l2ptr & STRTAB_L1_DESC_SPAN))
+ return NULL;
+
+ l2ptr = hyp_phys_to_virt(l1tab[l1_idx].l2ptr & STRTAB_L1_DESC_L2PTR_MASK);
+ /* Two-level walk */
+ return &l2ptr->stes[arm_smmu_strtab_l2_idx(sid)];
+ }
+ if (WARN_ON(sid >= cfg->linear.num_ents))
+ return NULL;
+ return &table[sid];
+}
+
+static int smmu_shadow_l2_strtab(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
+{
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+ struct arm_smmu_strtab_l2 *l2table;
+ u32 idx = arm_smmu_strtab_l1_idx(sid);
+ u64 *host_ste_base = hyp_phys_to_virt(strtab_host_base(smmu));
+ u64 l1_desc_host = host_ste_base[idx];
+ struct arm_smmu_strtab_l1 *l1_desc = &cfg->l2.l1tab[idx];
+
+ l2table = kvm_iommu_donate_pages(get_order(sizeof(*l2table)));
+ if (!l2table)
+ return -ENOMEM;
+ arm_smmu_write_strtab_l1_desc(l1_desc, hyp_virt_to_phys(l2table));
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ kvm_flush_dcache_to_poc(l1_desc, sizeof(*l1_desc));
+
+ /*
+ * Now put the host L2 table in a shared state.
+ * As mentioned in smmu_reshadow_ste(), Linux never clears L1 ptrs,
+ * so there is no need to handle that case. Otherwise, we'd need to
+ * unshare the tables and emulate STE clears.
+ */
+ WARN_ON(smmu_share_pages(l1_desc_host & STRTAB_L1_DESC_L2PTR_MASK, sizeof(*l2table)));
+ return 0;
+}
+
+static void smmu_reshadow_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid, bool leaf)
+{
+ u64 *host_ste_base = hyp_phys_to_virt(strtab_host_base(smmu));
+ u64 *hyp_ste_base = strtab_hyp_base(smmu);
+ struct arm_smmu_ste *host_ste_ptr = smmu_get_ste_ptr(smmu, sid, host_ste_base);
+ struct arm_smmu_ste *hyp_ste_ptr = smmu_get_ste_ptr(smmu, sid, hyp_ste_base);
+ int i;
+
+ /*
+ * Linux only uses leaf = 1, when leaf is 0, we need to verify that this
+ * is a 2 level table and reshadow of l2.
+ * Also Linux never clears l1 ptr, that needs to free the old shadow.
+ */
+ if (WARN_ON(!leaf || !host_ste_ptr))
+ return;
+
+ /* If host is valid and hyp is not, means a new L1 installed. */
+ if (!hyp_ste_ptr) {
+ WARN_ON(smmu_shadow_l2_strtab(smmu, sid));
+ hyp_ste_ptr = smmu_get_ste_ptr(smmu, sid, hyp_ste_base);
+ }
+
+ for (i = 0; i < STRTAB_STE_DWORDS; i++)
+ WRITE_ONCE(hyp_ste_ptr->data[i], host_ste_ptr->data[i]);
+}
+
static int smmu_init_strtab(struct hyp_arm_smmu_v3_device *smmu)
{
int ret;
@@ -390,8 +470,13 @@ static bool smmu_filter_command(struct hyp_arm_smmu_v3_device *smmu, u64 *comman
switch (type) {
case CMDQ_OP_CFGI_STE:
- /* TBD: SHADOW_STE*/
+ {
+ u32 sid = FIELD_GET(CMDQ_CFGI_0_SID, command[0]);
+ u32 leaf = FIELD_GET(CMDQ_CFGI_1_LEAF, command[1]);
+
+ smmu_reshadow_ste(smmu, sid, leaf);
break;
+ }
case CMDQ_OP_CFGI_ALL:
{
/*
@@ -564,23 +649,21 @@ static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
regs->regs[rd] = smmu->cmdq_host.llq.cons | err;
}
goto out_ret;
- /* Pass through the register access for bisectability, handled later */
case ARM_SMMU_STRTAB_BASE:
if (is_write) {
/* Must only be written when SMMU_CR0.SMMUEN == 0. */
WARN_ON(is_smmu_enabled(smmu));
smmu->host_ste_base = val;
}
- mask = read_write;
- break;
+ goto out_ret;
case ARM_SMMU_STRTAB_BASE_CFG:
if (is_write) {
/* Must only be written when SMMU_CR0.SMMUEN == 0. */
WARN_ON(is_smmu_enabled(smmu));
smmu->host_ste_cfg = val;
}
- mask = read_write;
- break;
+ goto out_ret;
+ /* Pass through the register access for bisectability, handled later */
case ARM_SMMU_GBPA:
mask = read_write;
break;
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 25/28] iommu/arm-smmu-v3-kvm: Emulate GBPA
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (23 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 24/28] iommu/arm-smmu-v3-kvm: Shadow STEs Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 26/28] iommu/arm-smmu-v3-kvm: Support io-pgtable Mostafa Saleh
` (2 subsequent siblings)
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
The last bit of emulation is GBPA. It must always be set to ABORT, as
the host is not allowed to bypass the SMMU while the SMMU is disabled.
That is done by setting GBPA to ABORT at init time; then, when the
host:
- Writes, we ignore the write and save the value without the UPDATE
bit.
- Reads, we return the saved value.
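What the host then observes, sketched with the host driver's own update
sequence (e.g. arm_smmu_update_gbpa() writing ABORT; GBPA_UPDATE and
GBPA_ABORT are the existing driver's definitions):

  writel_relaxed(GBPA_UPDATE | GBPA_ABORT, smmu->base + ARM_SMMU_GBPA);
  /* Trapped: the hypervisor records GBPA_ABORT with UPDATE stripped;
   * the real register is left untouched (it is already ABORT). */
  reg = readl_relaxed(smmu->base + ARM_SMMU_GBPA);
  /* reg == GBPA_ABORT: UPDATE reads back as clear, so the host's
   * poll for the update to complete succeeds immediately. */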
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 21 ++++++++++++++++---
.../iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h | 2 ++
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index 0f890a7d8db3..db9d9caaca2c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -100,6 +100,13 @@ static int smmu_unshare_pages(phys_addr_t addr, size_t size)
return 0;
}
+static int smmu_abort_gbpa(struct hyp_arm_smmu_v3_device *smmu)
+{
+ writel_relaxed(GBPA_UPDATE | GBPA_ABORT, smmu->base + ARM_SMMU_GBPA);
+ /* Wait till UPDATE is cleared. */
+ return smmu_wait(readl_relaxed(smmu->base + ARM_SMMU_GBPA) == GBPA_ABORT);
+}
+
static bool smmu_cmdq_full(struct arm_smmu_queue *cmdq)
{
struct arm_smmu_ll_queue *llq = &cmdq->llq;
@@ -416,6 +423,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
if (ret)
goto out_ret;
+ ret = smmu_abort_gbpa(smmu);
+ if (ret)
+ goto out_ret;
+
return 0;
out_ret:
@@ -663,10 +674,14 @@ static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
smmu->host_ste_cfg = val;
}
goto out_ret;
- /* Pass through the register access for bisectability, handled later */
case ARM_SMMU_GBPA:
- mask = read_write;
- break;
+ if (is_write)
+ smmu->gbpa = val & ~GBPA_UPDATE;
+ else
+ regs->regs[rd] = smmu->gbpa;
+
+ WARN_ON(len != sizeof(u32));
+ goto out_ret;
case ARM_SMMU_CR0:
if (is_write) {
bool last_cmdq_en = is_cmdq_enabled(smmu);
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
index cf85e5efdd9e..aab585dd9fd8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm_smmu_v3.h
@@ -31,6 +31,7 @@
* @host_ste_cfg Host stream table config
* @host_ste_base Host stream table base
* @strtab_cfg Stream table as seen by HW
+ * @gbpa Last value of GBPA from the host
*/
struct hyp_arm_smmu_v3_device {
phys_addr_t mmio_addr;
@@ -54,6 +55,7 @@ struct hyp_arm_smmu_v3_device {
u64 host_ste_cfg;
u64 host_ste_base;
struct arm_smmu_strtab_cfg strtab_cfg;
+ u32 gbpa;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 26/28] iommu/arm-smmu-v3-kvm: Support io-pgtable
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (24 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 25/28] iommu/arm-smmu-v3-kvm: Emulate GBPA Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 27/28] iommu/arm-smmu-v3-kvm: Shadow the CPU stage-2 page table Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 28/28] iommu/arm-smmu-v3-kvm: Enable nesting Mostafa Saleh
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Add the hooks needed to support io-pgtable-arm, mostly around
memory allocation.
Also add a function to allocate a 64-bit stage-2 page table.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/Makefile | 4 +-
.../arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c | 66 +++++++++++++++++++++
drivers/iommu/io-pgtable-arm.c | 2 +-
drivers/iommu/io-pgtable-arm.h | 11 ++++
4 files changed, 81 insertions(+), 2 deletions(-)
create mode 100644 drivers/iommu/arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index c71c96262378..10090be6b067 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -34,7 +34,9 @@ hyp-obj-y += $(lib-objs)
HYP_SMMU_V3_DRV_PATH = ../../../../../drivers/iommu/arm/arm-smmu-v3
hyp-obj-$(CONFIG_ARM_SMMU_V3_PKVM) += $(HYP_SMMU_V3_DRV_PATH)/pkvm/arm-smmu-v3.o \
- $(HYP_SMMU_V3_DRV_PATH)/arm-smmu-v3-common-hyp.o
+ $(HYP_SMMU_V3_DRV_PATH)/arm-smmu-v3-common-hyp.o \
+ $(HYP_SMMU_V3_DRV_PATH)/pkvm/io-pgtable-arm-hyp.o \
+ $(HYP_SMMU_V3_DRV_PATH)/../../io-pgtable-arm.o
##
## Build rules for compiling nVHE hyp code
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c
new file mode 100644
index 000000000000..6cf9e5bb76e7
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/io-pgtable-arm-hyp.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Arm Ltd.
+ */
+#include <nvhe/iommu.h>
+
+#include "../../../io-pgtable-arm.h"
+
+struct io_pgtable_ops *kvm_alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
+ struct io_pgtable_cfg *cfg,
+ void *cookie)
+{
+ struct io_pgtable *iop;
+
+ if (fmt != ARM_64_LPAE_S2)
+ return NULL;
+
+ iop = arm_64_lpae_alloc_pgtable_s2(cfg, cookie);
+ if (!iop)
+ return NULL;
+ iop->fmt = fmt;
+ iop->cookie = cookie;
+ iop->cfg = *cfg;
+
+ return &iop->ops;
+}
+
+void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
+ struct io_pgtable_cfg *cfg, void *cookie)
+{
+ void *addr;
+
+ addr = kvm_iommu_donate_pages(get_order(size));
+
+ if (addr && !cfg->coherent_walk)
+ kvm_flush_dcache_to_poc(addr, size);
+
+ return addr;
+}
+
+void __arm_lpae_free_pages(void *addr, size_t size, struct io_pgtable_cfg *cfg,
+ void *cookie)
+{
+ if (!cfg->coherent_walk)
+ kvm_flush_dcache_to_poc(addr, size);
+
+ kvm_iommu_reclaim_pages(addr);
+}
+
+void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
+ struct io_pgtable_cfg *cfg)
+{
+ if (!cfg->coherent_walk)
+ kvm_flush_dcache_to_poc(ptep, sizeof(*ptep) * num_entries);
+}
+
+/* At the moment this is only used once, so rounding up to a page is not really a problem. */
+void *__arm_lpae_alloc_data(size_t size, gfp_t gfp)
+{
+ return kvm_iommu_donate_pages(get_order(size));
+}
+
+void __arm_lpae_free_data(void *p)
+{
+ kvm_iommu_reclaim_pages(p);
+}
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 2ca09081c3b0..211f6d54b902 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -947,7 +947,7 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
return NULL;
}
-static struct io_pgtable *
+struct io_pgtable *
arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
{
u64 sl;
diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
index 7d9f0b759275..194c3e975288 100644
--- a/drivers/iommu/io-pgtable-arm.h
+++ b/drivers/iommu/io-pgtable-arm.h
@@ -78,8 +78,19 @@ void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
void *cookie);
void *__arm_lpae_alloc_data(size_t size, gfp_t gfp);
void __arm_lpae_free_data(void *p);
+struct io_pgtable *
+arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie);
#ifndef __KVM_NVHE_HYPERVISOR__
#define __arm_lpae_virt_to_phys __pa
#define __arm_lpae_phys_to_virt __va
+#else
+#include <nvhe/memory.h>
+#define __arm_lpae_virt_to_phys hyp_virt_to_phys
+#define __arm_lpae_phys_to_virt hyp_phys_to_virt
+#undef WARN_ONCE
+#define WARN_ONCE(condition, format...) WARN_ON(1)
+struct io_pgtable_ops *kvm_alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
+ struct io_pgtable_cfg *cfg,
+ void *cookie);
#endif /* !__KVM_NVHE_HYPERVISOR__ */
#endif /* IO_PGTABLE_ARM_H_ */
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 27/28] iommu/arm-smmu-v3-kvm: Shadow the CPU stage-2 page table
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (25 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 26/28] iommu/arm-smmu-v3-kvm: Support io-pgtable Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 28/28] iommu/arm-smmu-v3-kvm: Enable nesting Mostafa Saleh
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Based on the callbacks from the hypervisor, update the SMMUv3
identity-mapped page table.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 171 +++++++++++++++++-
1 file changed, 169 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index db9d9caaca2c..2d4ff21f83f9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -11,6 +11,7 @@
#include <nvhe/trap_handler.h>
#include "arm_smmu_v3.h"
+#include "../../../io-pgtable-arm.h"
size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
@@ -58,6 +59,9 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
smmu_wait(_cond); \
})
+/* Protected by host_mmu.lock from core code. */
+static struct io_pgtable *idmap_pgtable;
+
/* Transfer ownership of memory */
static int smmu_take_pages(u64 phys, size_t size)
{
@@ -166,7 +170,6 @@ static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
return smmu_wait_event(smmu, smmu_cmdq_empty(&smmu->cmdq));
}
-__maybe_unused
static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
struct arm_smmu_cmdq_ent *cmd)
{
@@ -178,6 +181,66 @@ static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
return smmu_sync_cmd(smmu);
}
+static void __smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu, void *unused,
+ struct arm_smmu_cmdq_ent *cmd)
+{
+ WARN_ON(smmu_add_cmd(smmu, cmd));
+}
+
+static int smmu_tlb_inv_range_smmu(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size, size_t granule)
+{
+ arm_smmu_tlb_inv_build(cmd, iova, size, granule,
+ idmap_pgtable->cfg.pgsize_bitmap, smmu,
+ __smmu_add_cmd, NULL);
+ return smmu_sync_cmd(smmu);
+}
+
+static void smmu_tlb_inv_range(unsigned long iova, size_t size, size_t granule,
+ bool leaf)
+{
+ struct arm_smmu_cmdq_ent cmd = {
+ .opcode = CMDQ_OP_TLBI_S2_IPA,
+ .tlbi = {
+ .leaf = leaf,
+ .vmid = 0,
+ },
+ };
+ struct arm_smmu_cmdq_ent cmd_s1 = {
+ .opcode = CMDQ_OP_TLBI_NH_ALL,
+ .tlbi = {
+ .vmid = 0,
+ },
+ };
+ struct hyp_arm_smmu_v3_device *smmu;
+
+ for_each_smmu(smmu) {
+ hyp_spin_lock(&smmu->lock);
+ WARN_ON(smmu_tlb_inv_range_smmu(smmu, &cmd, iova, size, granule));
+ WARN_ON(smmu_send_cmd(smmu, &cmd_s1));
+ hyp_spin_unlock(&smmu->lock);
+ }
+}
+
+static void smmu_tlb_flush_walk(unsigned long iova, size_t size,
+ size_t granule, void *cookie)
+{
+ smmu_tlb_inv_range(iova, size, granule, false);
+}
+
+static void smmu_tlb_add_page(struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t granule,
+ void *cookie)
+{
+ smmu_tlb_inv_range(iova, granule, granule, true);
+}
+
+static const struct iommu_flush_ops smmu_tlb_ops = {
+ .tlb_flush_walk = smmu_tlb_flush_walk,
+ .tlb_add_page = smmu_tlb_add_page,
+};
+
/* Put the device in a state that can be probed by the host driver. */
static void smmu_deinit_device(struct hyp_arm_smmu_v3_device *smmu)
{
@@ -434,6 +497,37 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
return ret;
}
+static int smmu_init_pgt(void)
+{
+ /* Default values, overridden based on the SMMUs' common features. */
+ struct io_pgtable_cfg cfg = (struct io_pgtable_cfg) {
+ .tlb = &smmu_tlb_ops,
+ .pgsize_bitmap = -1,
+ .ias = 48,
+ .oas = 48,
+ .coherent_walk = true,
+ };
+ struct hyp_arm_smmu_v3_device *smmu;
+ struct io_pgtable_ops *ops;
+
+ for_each_smmu(smmu) {
+ cfg.ias = min(cfg.ias, smmu->ias);
+ cfg.oas = min(cfg.oas, smmu->oas);
+ cfg.pgsize_bitmap &= smmu->pgsize_bitmap;
+ cfg.coherent_walk &= !!(smmu->features & ARM_SMMU_FEAT_COHERENCY);
+ }
+
+ /* At least PAGE_SIZE must be supported by all SMMUs. */
+ if ((cfg.pgsize_bitmap & PAGE_SIZE) == 0)
+ return -EINVAL;
+
+ ops = kvm_alloc_io_pgtable_ops(ARM_64_LPAE_S2, &cfg, NULL);
+ if (!ops)
+ return -ENOMEM;
+ idmap_pgtable = io_pgtable_ops_to_pgtable(ops);
+ return 0;
+}
+
static int smmu_init(void)
{
int ret;
@@ -455,7 +549,7 @@ static int smmu_init(void)
BUILD_BUG_ON(sizeof(hyp_spinlock_t) != sizeof(u32));
- return 0;
+ return smmu_init_pgt();
out_reclaim_smmu:
while (smmu != kvm_hyp_arm_smmu_v3_smmus)
@@ -789,8 +883,81 @@ static bool smmu_dabt_handler(struct user_pt_regs *regs, u64 esr, u64 addr)
return false;
}
+static size_t smmu_pgsize_idmap(size_t size, u64 paddr, size_t pgsize_bitmap)
+{
+ size_t pgsizes;
+
+ /* Remove page sizes that are larger than the current size */
+ pgsizes = pgsize_bitmap & GENMASK_ULL(__fls(size), 0);
+
+ /* Remove page sizes that the address is not aligned to. */
+ if (likely(paddr))
+ pgsizes &= GENMASK_ULL(__ffs(paddr), 0);
+
+ WARN_ON(!pgsizes);
+
+ /* Return the largest page size that fits. */
+ return BIT(__fls(pgsizes));
+}
+
static void smmu_host_stage2_idmap(phys_addr_t start, phys_addr_t end, int prot)
{
+ size_t size = end - start;
+ size_t pgsize = PAGE_SIZE, pgcount;
+ size_t mapped, unmapped;
+ int ret;
+ struct io_pgtable *pgtable = idmap_pgtable;
+
+ end = min(end, BIT(pgtable->cfg.oas));
+ if (start >= end)
+ return;
+
+ if (prot) {
+ if (!(prot & IOMMU_MMIO))
+ prot |= IOMMU_CACHE;
+
+ while (size) {
+ mapped = 0;
+ /*
+ * We handle page sizes for memory and MMIO differently:
+ * - memory: Map everything with PAGE_SIZE. This is guaranteed to
+ * succeed, as we allocated enough pages to cover the entire
+ * memory. We do that because io-pgtable-arm no longer supports
+ * the split_blk_unmap logic, so we can't break blocks once they
+ * are mapped to tables.
+ * - MMIO: Unlike memory, pKVM only allocates 1G worth of tables for
+ * all MMIO. While the MMIO space can be large, as it is assumed
+ * to cover the whole IAS that is not memory, we have to use
+ * block mappings. That is fine for MMIO as it is never donated
+ * at the moment, so we never need to unmap MMIO at runtime,
+ * which would trigger the split-block logic.
+ */
+ if (prot & IOMMU_MMIO)
+ pgsize = smmu_pgsize_idmap(size, start, pgtable->cfg.pgsize_bitmap);
+
+ pgcount = size / pgsize;
+ ret = pgtable->ops.map_pages(&pgtable->ops, start, start,
+ pgsize, pgcount, prot, 0, &mapped);
+ size -= mapped;
+ start += mapped;
+ if (!mapped || ret)
+ return;
+ }
+ } else {
+ /* Shouldn't happen. */
+ WARN_ON(prot & IOMMU_MMIO);
+ while (size) {
+ pgcount = size / pgsize;
+ unmapped = pgtable->ops.unmap_pages(&pgtable->ops, start,
+ pgsize, pgcount, NULL);
+ size -= unmapped;
+ start += unmapped;
+ if (!unmapped)
+ return;
+ }
+ /* Some memory was not unmapped. */
+ WARN_ON(size);
+ }
}
/* Shared with the kernel driver in EL1 */
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH v4 28/28] iommu/arm-smmu-v3-kvm: Enable nesting
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
` (26 preceding siblings ...)
2025-08-19 21:51 ` [PATCH v4 27/28] iommu/arm-smmu-v3-kvm: Shadow the CPU stage-2 page table Mostafa Saleh
@ 2025-08-19 21:51 ` Mostafa Saleh
27 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-08-19 21:51 UTC (permalink / raw)
To: linux-kernel, kvmarm, linux-arm-kernel, iommu
Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, will, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland, praan, Mostafa Saleh
Now the hypervisor controls the command queue and the stream table,
and shadows the stage-2 page table.
Enable stage-2 whenever the host puts an STE in bypass or stage-1.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 72 ++++++++++++++++++-
1 file changed, 70 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
index 2d4ff21f83f9..5be44a37d581 100644
--- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
@@ -336,6 +336,46 @@ static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+static void smmu_attach_stage_2(struct hyp_arm_smmu_v3_device *smmu, struct arm_smmu_ste *ste)
+{
+ unsigned long vttbr;
+ unsigned long ts, sl, ic, oc, sh, tg, ps;
+ unsigned long cfg;
+ struct io_pgtable_cfg *pgt_cfg = &idmap_pgtable->cfg;
+
+ cfg = FIELD_GET(STRTAB_STE_0_CFG, ste->data[0]);
+ if (!FIELD_GET(STRTAB_STE_0_V, ste->data[0]) ||
+ (cfg == STRTAB_STE_0_CFG_ABORT))
+ return;
+ /* S2 is not advertised, so that should never be attempted. */
+ if (WARN_ON(cfg == STRTAB_STE_0_CFG_NESTED))
+ return;
+ vttbr = pgt_cfg->arm_lpae_s2_cfg.vttbr;
+ ps = pgt_cfg->arm_lpae_s2_cfg.vtcr.ps;
+ tg = pgt_cfg->arm_lpae_s2_cfg.vtcr.tg;
+ sh = pgt_cfg->arm_lpae_s2_cfg.vtcr.sh;
+ oc = pgt_cfg->arm_lpae_s2_cfg.vtcr.orgn;
+ ic = pgt_cfg->arm_lpae_s2_cfg.vtcr.irgn;
+ sl = pgt_cfg->arm_lpae_s2_cfg.vtcr.sl;
+ ts = pgt_cfg->arm_lpae_s2_cfg.vtcr.tsz;
+
+ ste->data[1] |= FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING);
+ /* The host shouldn't write dwords 2 and 3, overwrite them. */
+ ste->data[2] = FIELD_PREP(STRTAB_STE_2_VTCR,
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, ps) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, tg) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, sh) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, oc) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, ic) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, sl) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, ts)) |
+ FIELD_PREP(STRTAB_STE_2_S2VMID, 0) |
+ STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2R;
+ ste->data[3] = vttbr & STRTAB_STE_3_S2TTB_MASK;
+ /* Convert S1 => nested and bypass => S2 */
+ ste->data[0] |= FIELD_PREP(STRTAB_STE_0_CFG, cfg | BIT(1));
+}
+
/* Get an STE for a stream table base. */
static struct arm_smmu_ste *smmu_get_ste_ptr(struct hyp_arm_smmu_v3_device *smmu,
u32 sid, u64 *strtab)
@@ -394,6 +434,10 @@ static void smmu_reshadow_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid, bool
struct arm_smmu_ste *host_ste_ptr = smmu_get_ste_ptr(smmu, sid, host_ste_base);
struct arm_smmu_ste *hyp_ste_ptr = smmu_get_ste_ptr(smmu, sid, hyp_ste_base);
int i;
+ struct arm_smmu_ste target = {};
+ struct arm_smmu_cmdq_ent cfgi_cmd = {
+ .opcode = CMDQ_OP_CFGI_ALL,
+ };
/*
* Linux only uses leaf = 1, when leaf is 0, we need to verify that this
@@ -409,8 +453,32 @@ static void smmu_reshadow_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid, bool
hyp_ste_ptr = smmu_get_ste_ptr(smmu, sid, hyp_ste_base);
}
- for (i = 0; i < STRTAB_STE_DWORDS; i++)
- WRITE_ONCE(hyp_ste_ptr->data[i], host_ste_ptr->data[i]);
+ memcpy(target.data, host_ste_ptr->data, STRTAB_STE_DWORDS << 3);
+
+ /*
+ * Typically, an STE update is done as follows:
+ * 1- Write the last 7 dwords, while the STE is invalid
+ * 2- CFGI
+ * 3- Write the first dword, making the STE valid
+ * 4- CFGI
+ * As the SMMU is required to load 64 bits atomically, this
+ * guarantees that there is no race between writing the STE
+ * and the CFGI where the SMMU observes only parts of the STE.
+ * In the shadow we update the STE to enable nested translation,
+ * which requires updating the first 3 dwords. That is only done
+ * if the STE is valid and not in abort, which means it happens
+ * at step 4).
+ * So we also need to write the last 7 dwords and send a CFGI
+ * before writing the first dword.
+ * There is no need for a last CFGI as the host will issue one.
+ */
+ smmu_attach_stage_2(smmu, &target);
+ for (i = 1; i < STRTAB_STE_DWORDS; i++)
+ WRITE_ONCE(hyp_ste_ptr->data[i], target.data[i]);
+
+ WARN_ON(smmu_send_cmd(smmu, &cfgi_cmd));
+ WRITE_ONCE(hyp_ste_ptr->data[0], target.data[0]);
}
static int smmu_init_strtab(struct hyp_arm_smmu_v3_device *smmu)
--
2.51.0.rc1.167.g924127e9c0-goog
^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot
2025-08-19 21:51 ` [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot Mostafa Saleh
@ 2025-09-09 13:46 ` Will Deacon
2025-09-14 19:23 ` Pranjal Shrivastava
2025-09-16 11:56 ` Mostafa Saleh
0 siblings, 2 replies; 82+ messages in thread
From: Will Deacon @ 2025-09-09 13:46 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:29PM +0000, Mostafa Saleh wrote:
> Soon, IOMMU drivers running in the hypervisor might interact with
> non-coherent devices, so it needs a mechanism to map memory as
> non cacheable.
> Add ___pkvm_host_donate_hyp() which accepts a new argument for prot,
> so the driver can add KVM_PGTABLE_PROT_NORMAL_NC.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 11 +++++++++--
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 5f9d56754e39..52d7ee91e18c 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -36,6 +36,7 @@ int __pkvm_prot_finalize(void);
> int __pkvm_host_share_hyp(u64 pfn);
> int __pkvm_host_unshare_hyp(u64 pfn);
> int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 8957734d6183..861e448183fd 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -769,13 +769,15 @@ int __pkvm_host_unshare_hyp(u64 pfn)
> return ret;
> }
>
> -int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> {
> u64 phys = hyp_pfn_to_phys(pfn);
> u64 size = PAGE_SIZE * nr_pages;
> void *virt = __hyp_va(phys);
> int ret;
>
> + WARN_ON(prot & KVM_PGTABLE_PROT_X);
Should this actually just enforce that the permissions are
KVM_PGTABLE_PROT_RW:
WARN_ON((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RW);
?
Since the motivation is about the memory type rather than the
permissions, it would be best to preserve the current behaviour.
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-08-19 21:51 ` [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor Mostafa Saleh
@ 2025-09-09 14:12 ` Will Deacon
2025-09-16 13:27 ` Mostafa Saleh
2025-09-14 20:41 ` Pranjal Shrivastava
1 sibling, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-09 14:12 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> drivers can use that to protect the MMIO of IOMMU.
> The initial attempt to implement this was to have a new flag to
> "___pkvm_host_donate_hyp" to accept MMIO. However that had many problems,
> it was quite intrusive for host/hyp to check/set page state to make it
> aware of MMIO and to encode the state in the page table in that case.
> Which is called in paths that can be sensitive to performance (FFA, VMs..)
>
> As donating MMIO is very rare, and we don’t need to encode the full state,
> it’s reasonable to have a separate function to do this.
> It will init the host s2 page table with an invalid leaf with the owner ID
> to prevent the host from mapping the page on faults.
>
> Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> stage-2 PTEs, as this can be triggered from recycle logic under memory
> pressure. There is no code relying on this, as all ownership changes is
> done via kvm_pgtable_stage2_set_owner()
>
> For error path in IOMMU drivers, add a function to donate MMIO back
> from hyp to host.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 64 +++++++++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 9 +--
> 3 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 52d7ee91e18c..98e173da0f9b 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -37,6 +37,8 @@ int __pkvm_host_share_hyp(u64 pfn);
> int __pkvm_host_unshare_hyp(u64 pfn);
> int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> +int __pkvm_hyp_donate_host_mmio(u64 pfn);
> int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 861e448183fd..c9a15ef6b18d 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> return ret;
> }
>
> +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> +{
> + u64 phys = hyp_pfn_to_phys(pfn);
> + void *virt = __hyp_va(phys);
> + int ret;
> + kvm_pte_t pte;
> +
> + host_lock_component();
> + hyp_lock_component();
> +
> + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> + if (ret)
> + goto unlock;
> +
> + if (pte && !kvm_pte_valid(pte)) {
> + ret = -EPERM;
> + goto unlock;
> + }
Shouldn't we first check that the pfn is indeed MMIO? Otherwise, testing
the pte for the ownership information isn't right.
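Something like this, perhaps (assuming addr_is_memory() is usable in
this path):

	if (addr_is_memory(phys)) {
		ret = -EINVAL;
		goto unlock;
	}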
> + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> + if (ret)
> + goto unlock;
> + if (pte) {
> + ret = -EBUSY;
> + goto unlock;
> + }
> +
> + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> + if (ret)
> + goto unlock;
> + /*
> + * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> + * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> + * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> + * kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> + * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> + */
> + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> + PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> +unlock:
> + hyp_unlock_component();
> + host_unlock_component();
> +
> + return ret;
> +}
> +
> +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> +{
> + u64 phys = hyp_pfn_to_phys(pfn);
> + u64 virt = (u64)__hyp_va(phys);
> + size_t size = PAGE_SIZE;
> +
> + host_lock_component();
> + hyp_lock_component();
Shouldn't we check that:
1. pfn is mmio
2. pfn is owned by hyp
3. The host doesn't have something mapped at pfn already
?
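Rough shape of what I'd expect (untested, reusing the helpers from the
donate path):

	if (addr_is_memory(phys))
		return -EINVAL;

	ret = kvm_pgtable_get_leaf(&pkvm_pgtable, virt, &pte, NULL);
	if (ret || !kvm_pte_valid(pte))	/* not mapped at EL2, so not ours */
		goto unlock;

	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
	if (ret || kvm_pte_valid(pte))	/* host already has a mapping */
		goto unlock;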
> + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> + hyp_unlock_component();
> + host_unlock_component();
> +
> + return 0;
> +}
> +
> int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> {
> return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c351b4abd5db..ba06b0c21d5a 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> kvm_pte_t *childp = NULL;
> bool need_flush = false;
>
> - if (!kvm_pte_valid(ctx->old)) {
> - if (stage2_pte_is_counted(ctx->old)) {
> - kvm_clear_pte(ctx->ptep);
> - mm_ops->put_page(ctx->ptep);
> - }
> - return 0;
> - }
> + if (!kvm_pte_valid(ctx->old))
> + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
Can this code be reached for the guest? For example, if
pkvm_pgtable_stage2_destroy() runs into an MMIO-guarded pte on teardown?
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get()
2025-08-19 21:51 ` [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get() Mostafa Saleh
@ 2025-09-09 14:16 ` Will Deacon
2025-09-09 15:56 ` Marc Zyngier
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-09 14:16 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:31PM +0000, Mostafa Saleh wrote:
> Add a function to return time in us.
>
> This can be used from IOMMU drivers while waiting for conditions as
> for SMMUv3 TLB invalidation waiting for sync.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 2 ++
> arch/arm64/kvm/hyp/nvhe/setup.c | 4 ++++
> arch/arm64/kvm/hyp/nvhe/timer-sr.c | 33 ++++++++++++++++++++++++++
> 3 files changed, 39 insertions(+)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
> index ce31d3b73603..6c19691720cd 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
> @@ -87,4 +87,6 @@ bool kvm_handle_pvm_restricted(struct kvm_vcpu *vcpu, u64 *exit_code);
> void kvm_init_pvm_id_regs(struct kvm_vcpu *vcpu);
> int kvm_check_pvm_sysreg_table(void);
>
> +int pkvm_timer_init(void);
> +u64 pkvm_time_get(void);
> #endif /* __ARM64_KVM_NVHE_PKVM_H__ */
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index a48d3f5a5afb..ee6435473204 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -304,6 +304,10 @@ void __noreturn __pkvm_init_finalise(void)
> };
> pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
>
> + ret = pkvm_timer_init();
> + if (ret)
> + goto out;
> +
> ret = fix_host_ownership();
> if (ret)
> goto out;
> diff --git a/arch/arm64/kvm/hyp/nvhe/timer-sr.c b/arch/arm64/kvm/hyp/nvhe/timer-sr.c
> index ff176f4ce7de..e166cd5a56b8 100644
> --- a/arch/arm64/kvm/hyp/nvhe/timer-sr.c
> +++ b/arch/arm64/kvm/hyp/nvhe/timer-sr.c
> @@ -11,6 +11,10 @@
> #include <asm/kvm_hyp.h>
> #include <asm/kvm_mmu.h>
>
> +#include <nvhe/pkvm.h>
> +
> +static u32 timer_freq;
> +
> void __kvm_timer_set_cntvoff(u64 cntvoff)
> {
> write_sysreg(cntvoff, cntvoff_el2);
> @@ -68,3 +72,32 @@ void __timer_enable_traps(struct kvm_vcpu *vcpu)
>
> sysreg_clear_set(cnthctl_el2, clr, set);
> }
> +
> +static u64 pkvm_ticks_get(void)
> +{
> + return __arch_counter_get_cntvct();
> +}
> +
> +#define SEC_TO_US 1000000
> +
> +int pkvm_timer_init(void)
> +{
> + timer_freq = read_sysreg(cntfrq_el0);
> + /*
> + * TODO: The highest privileged level is supposed to initialize this
> + * register. But on some systems (which?), this information is only
> + * contained in the device-tree, so we'll need to find it out some other
> + * way.
> + */
> + if (!timer_freq || timer_freq < SEC_TO_US)
> + return -ENODEV;
> + return 0;
> +}
Right, I think the frequency should be provided by the host once the arch
timer driver has probed successfully. Relying on CNTFRQ isn't viable imo.
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp
2025-08-19 21:51 ` [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp Mostafa Saleh
@ 2025-09-09 14:23 ` Will Deacon
2025-09-16 14:10 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-09 14:23 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:34PM +0000, Mostafa Saleh wrote:
> The KVM SMMUv3 driver would re-use some of the cmdq code inside
> the hypervisor, move these functions to a new common c file that
> is shared between the host kernel and the hypervisor.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/Makefile | 2 +-
> .../arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c | 114 ++++++++++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 146 ------------------
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 44 ++++++
> 4 files changed, 159 insertions(+), 147 deletions(-)
> create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
> index 493a659cc66b..1918b4a64cb0 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/Makefile
> +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
> @@ -1,6 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
> -arm_smmu_v3-y := arm-smmu-v3.o
> +arm_smmu_v3-y := arm-smmu-v3.o arm-smmu-v3-common-hyp.o
> arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_IOMMUFD) += arm-smmu-v3-iommufd.o
> arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
> arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
> new file mode 100644
> index 000000000000..62744c8548a8
> --- /dev/null
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
> @@ -0,0 +1,114 @@
Given that this file is linked into both the kernel and the hypervisor
objects, I think I'd drop the '-hyp' part from the filename. Maybe
something like 'arm-smmu-v3-lib.c' instead?
Let the bike-shedding begin!
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 07/28] iommu/arm-smmu-v3: Move TLB range invalidation into a macro
2025-08-19 21:51 ` [PATCH v4 07/28] iommu/arm-smmu-v3: Move TLB range invalidation into a macro Mostafa Saleh
@ 2025-09-09 14:25 ` Will Deacon
0 siblings, 0 replies; 82+ messages in thread
From: Will Deacon @ 2025-09-09 14:25 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:35PM +0000, Mostafa Saleh wrote:
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 2698438cd35c..a222fb7ef2ec 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -1042,6 +1042,70 @@ static inline void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
> WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
> }
>
> +/**
> + * arm_smmu_tlb_inv_build - Create a range invalidation command
> + * @cmd: Base command initialized with OPCODE (S1, S2..), vmid and asid.
> + * @iova: Start IOVA to invalidate
> + * @size: Size of range
> + * @granule: Granule of invalidation
> + * @pgsize_bitmap: Page size bit map of the page table.
> + * @smmu: Struct for the smmu, must have ::features
> + * @add_cmd: Function to send/batch the invalidation command
> + * @cmds: Incase of batching, it includes the pointer to the batch
> + */
> +#define arm_smmu_tlb_inv_build(cmd, iova, size, granule, pgsize_bitmap, smmu, add_cmd, cmds) \
> +{ \
> + unsigned long _iova = (iova); \
> + size_t _size = (size); \
> + size_t _granule = (granule); \
> + unsigned long end = _iova + _size, num_pages = 0, tg = 0; \
> + size_t inv_range = _granule; \
This is pretty gross and I've been (very sporadically) trying to replace
the similar macro we have on the CPU side with static inline functions
instead.
Can you use an inline function here too?
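Something like the below, perhaps (untested sketch which ignores the
TG/TTL range encoding and passes the device opaquely, since the kernel
and the hypervisor use different device types):

	static inline void
	arm_smmu_tlb_inv_build(struct arm_smmu_cmdq_ent *cmd, unsigned long iova,
			       size_t size, size_t granule, void *smmu,
			       void (*add_cmd)(void *smmu, void *cookie,
					       struct arm_smmu_cmdq_ent *cmd),
			       void *cookie)
	{
		unsigned long end = iova + size;

		/* Fallback shape: one non-range invalidation per granule. */
		while (iova < end) {
			cmd->tlbi.addr = iova;
			add_cmd(smmu, cookie, cmd);
			iova += granule;
		}
	}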
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-08-19 21:51 ` [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table Mostafa Saleh
@ 2025-09-09 14:42 ` Will Deacon
2025-09-16 14:24 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-09 14:42 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:38PM +0000, Mostafa Saleh wrote:
> Create a shadow page table for the IOMMU that shadows the
> host CPU stage-2 into the IOMMUs to establish DMA isolation.
>
> An initial snapshot is created after the driver init, then
> on every permission change a callback would be called for
> the IOMMU driver to update the page table.
>
> For some cases, an SMMUv3 may be able to share the same page
> table used with the host CPU stage-2 directly.
> However, this is too strict and requires changes to the core hypervisor
> page table code, plus it would require the hypervisor to handle IOMMU
> page faults. This can be added later as an optimization for SMMUV3.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> arch/arm64/kvm/hyp/include/nvhe/iommu.h | 4 ++
> arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 83 ++++++++++++++++++++++++-
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 5 ++
> 3 files changed, 90 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
> index 1ac70cc28a9e..219363045b1c 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
> @@ -3,11 +3,15 @@
> #define __ARM64_KVM_NVHE_IOMMU_H__
>
> #include <asm/kvm_host.h>
> +#include <asm/kvm_pgtable.h>
>
> struct kvm_iommu_ops {
> int (*init)(void);
> + void (*host_stage2_idmap)(phys_addr_t start, phys_addr_t end, int prot);
> };
>
> int kvm_iommu_init(void);
>
> +void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
> + enum kvm_pgtable_prot prot);
> #endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
> diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> index a01c036c55be..f7d1c8feb358 100644
> --- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> +++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> @@ -4,15 +4,94 @@
> *
> * Copyright (C) 2022 Linaro Ltd.
> */
> +#include <linux/iommu.h>
> +
> #include <nvhe/iommu.h>
> +#include <nvhe/mem_protect.h>
> +#include <nvhe/spinlock.h>
>
> /* Only one set of ops supported */
> struct kvm_iommu_ops *kvm_iommu_ops;
>
> +/* Protected by host_mmu.lock */
> +static bool kvm_idmap_initialized;
> +
> +static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
> +{
> + int iommu_prot = 0;
> +
> + if (prot & KVM_PGTABLE_PROT_R)
> + iommu_prot |= IOMMU_READ;
> + if (prot & KVM_PGTABLE_PROT_W)
> + iommu_prot |= IOMMU_WRITE;
> + if (prot == PKVM_HOST_MMIO_PROT)
> + iommu_prot |= IOMMU_MMIO;
This looks a little odd to me.
On the CPU side, the only different between PKVM_HOST_MEM_PROT and
PKVM_HOST_MMIO_PROT is that the former has execute permission. Both are
mapped as cacheable at stage-2 because it's the job of the host to set
the more restrictive memory type at stage-1.
Carrying that over to the SMMU would suggest that we don't care about
IOMMU_MMIO at stage-2 at all, so why do we need to set it here?
> + /* We don't understand that, might be dangerous. */
> + WARN_ON(prot & ~PKVM_HOST_MEM_PROT);
> + return iommu_prot;
> +}
> +
> +static int __snapshot_host_stage2(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags visit)
> +{
> + u64 start = ctx->addr;
> + kvm_pte_t pte = *ctx->ptep;
> + u32 level = ctx->level;
> + u64 end = start + kvm_granule_size(level);
> + int prot = IOMMU_READ | IOMMU_WRITE;
> +
> + /* Keep unmapped. */
> + if (pte && !kvm_pte_valid(pte))
> + return 0;
> +
> + if (kvm_pte_valid(pte))
> + prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
> + else if (!addr_is_memory(start))
> + prot |= IOMMU_MMIO;
Why do we need to map MMIO regions pro-actively here? I'd have thought
we could just do:
if (!kvm_pte_valid(pte))
return 0;
prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
kvm_iommu_ops->host_stage2_idmap(start, end, prot);
return 0;
but I think that IOMMU_MMIO is throwing me again...
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get()
2025-09-09 14:16 ` Will Deacon
@ 2025-09-09 15:56 ` Marc Zyngier
2025-09-15 11:10 ` Pranjal Shrivastava
2025-09-16 14:04 ` Mostafa Saleh
0 siblings, 2 replies; 82+ messages in thread
From: Marc Zyngier @ 2025-09-09 15:56 UTC (permalink / raw)
To: Will Deacon
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba, jgg,
mark.rutland, praan
On Tue, 09 Sep 2025 15:16:26 +0100,
Will Deacon <will@kernel.org> wrote:
>
> On Tue, Aug 19, 2025 at 09:51:31PM +0000, Mostafa Saleh wrote:
> > Add a function to return time in us.
> >
> > This can be used from IOMMU drivers while waiting for conditions as
> > for SMMUv3 TLB invalidation waiting for sync.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 2 ++
> > arch/arm64/kvm/hyp/nvhe/setup.c | 4 ++++
> > arch/arm64/kvm/hyp/nvhe/timer-sr.c | 33 ++++++++++++++++++++++++++
> > 3 files changed, 39 insertions(+)
[...]
> > +#define SEC_TO_US 1000000
> > +
> > +int pkvm_timer_init(void)
> > +{
> > + timer_freq = read_sysreg(cntfrq_el0);
> > + /*
> > + * TODO: The highest privileged level is supposed to initialize this
> > + * register. But on some systems (which?), this information is only
> > + * contained in the device-tree, so we'll need to find it out some other
> > + * way.
> > + */
> > + if (!timer_freq || timer_freq < SEC_TO_US)
> > + return -ENODEV;
> > + return 0;
> > +}
>
> Right, I think the frequency should be provided by the host once the arch
> timer driver has probed successfully. Relying on CNTFRQ isn't viable imo.
We can always patch the value in, à la kimage_voffset. But it really
begs the question: who in their right mind doesn't set CNTFRQ_EL0 to
something sensible? Why should we care about supporting such a
contraption?
I'd be happy to simply disable KVM when CNTFRQ_EL0 is misprogrammed,
or when the device tree provides a clock frequency, because there is
no good way to support a guest in that case.
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3
2025-08-19 21:51 ` [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3 Mostafa Saleh
@ 2025-09-09 18:30 ` Daniel Mentz
2025-09-16 14:35 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Daniel Mentz @ 2025-09-09 18:30 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 2:55 PM Mostafa Saleh <smostafa@google.com> wrote:
>
> + if (kvm_arm_smmu_array[i].mmio_size < SZ_128K) {
> + pr_err("SMMUv3(%s) has unsupported size(0x%lx)\n", np->name,
> + kvm_arm_smmu_array[i].mmio_size);
Use format specifier %pOF to print device tree node.
If mmio_size is a size_t type, use format specifier %zx.
Align language of error message with kernel driver which prints "MMIO
region too small (%pr)\n".
I'm wondering if we should use kvm_err instead of pr_err.
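i.e. something along the lines of:

	kvm_err("SMMUv3(%pOF): MMIO region too small (0x%zx)\n", np,
		kvm_arm_smmu_array[i].mmio_size);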
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver
2025-08-19 21:51 ` [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver Mostafa Saleh
@ 2025-09-12 13:52 ` Will Deacon
2025-09-16 14:30 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-12 13:52 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:42PM +0000, Mostafa Saleh wrote:
> Add a file only compiled for KVM mode.
>
> At the moment it registers the driver with KVM, and add the hook
> needed for memory allocation.
>
> Next, it will create the array with available SMMUs and their
> description.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> arch/arm64/include/asm/kvm_host.h | 4 +++
> arch/arm64/kvm/iommu.c | 10 ++++--
> drivers/iommu/arm/arm-smmu-v3/Makefile | 1 +
> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 36 +++++++++++++++++++
> 4 files changed, 49 insertions(+), 2 deletions(-)
> create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index fcb4b26072f7..52212c0f2e9c 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1678,4 +1678,8 @@ struct kvm_iommu_ops;
> int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops);
> size_t kvm_iommu_pages(void);
>
> +#ifdef CONFIG_ARM_SMMU_V3_PKVM
> +size_t smmu_hyp_pgt_pages(void);
> +#endif
> +
> #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
> index 5460b1bd44a6..0475f7c95c6c 100644
> --- a/arch/arm64/kvm/iommu.c
> +++ b/arch/arm64/kvm/iommu.c
> @@ -17,10 +17,16 @@ int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops)
>
> size_t kvm_iommu_pages(void)
> {
> + size_t nr_pages = 0;
> +
> /*
> * This is called very early during setup_arch(), before any initcalls,
> * so this has to call specific functions for each KVM driver.
> */
> - kvm_nvhe_sym(hyp_kvm_iommu_pages) = 0;
> - return 0;
> +#ifdef CONFIG_ARM_SMMU_V3_PKVM
> + nr_pages = smmu_hyp_pgt_pages();
> +#endif
Rather than hard-code this here, I wonder whether it would be better to
have a default size for the IOMMU carveout and have the driver tells us
how much it needs later on when it probes. Then we could either free
any unused portion back to the host or return an error to the driver if
it wants more than we have.
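Roughly (interface names made up):

	/* setup_arch() time: reserve a fixed default carveout. */
	size_t kvm_iommu_pages(void);

	/*
	 * Probe time: the driver claims what it actually needs and the
	 * remainder is freed back to the host; returns -ENOMEM if the
	 * request exceeds the carveout.
	 */
	int kvm_iommu_claim_pages(size_t nr_pages);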
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-08-19 21:51 ` [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode Mostafa Saleh
@ 2025-09-12 13:54 ` Will Deacon
2025-09-23 14:35 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-12 13:54 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:43PM +0000, Mostafa Saleh wrote:
> While in KVM mode, the driver must be loaded after the hypervisor
> initializes.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 ++++++++++++++++-----
> 1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 10ca07c6dbe9..a04730b5fe41 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -4576,12 +4576,6 @@ static const struct of_device_id arm_smmu_of_match[] = {
> };
> MODULE_DEVICE_TABLE(of, arm_smmu_of_match);
>
> -static void arm_smmu_driver_unregister(struct platform_driver *drv)
> -{
> - arm_smmu_sva_notifier_synchronize();
> - platform_driver_unregister(drv);
> -}
> -
> static struct platform_driver arm_smmu_driver = {
> .driver = {
> .name = "arm-smmu-v3",
> @@ -4592,8 +4586,27 @@ static struct platform_driver arm_smmu_driver = {
> .remove = arm_smmu_device_remove,
> .shutdown = arm_smmu_device_shutdown,
> };
> +
> +#ifndef CONFIG_ARM_SMMU_V3_PKVM
> +static void arm_smmu_driver_unregister(struct platform_driver *drv)
> +{
> + arm_smmu_sva_notifier_synchronize();
> + platform_driver_unregister(drv);
> +}
> +
> module_driver(arm_smmu_driver, platform_driver_register,
> arm_smmu_driver_unregister);
> +#else
> +/*
> + * Must be done after the hypervisor initializes at module_init()
> + * No need for unregister as this is a built in driver.
> + */
> +static int arm_smmu_driver_register(void)
> +{
> + return platform_driver_register(&arm_smmu_driver);
> +}
> +device_initcall_sync(arm_smmu_driver_register);
> +#endif /* !CONFIG_ARM_SMMU_V3_PKVM */
I think this is a bit grotty as we now have to reason about different
initialisation ordering based on CONFIG_ARM_SMMU_V3_PKVM. Could we
instead return -EPROBE_DEFER if the driver tries to probe before the
hypervisor is up?
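Something like this early in arm_smmu_device_probe(), where
kvm_hyp_initialised() is a made-up name for whatever predicate tells us
the hypervisor has finished initialising:

	if (IS_ENABLED(CONFIG_ARM_SMMU_V3_PKVM) && !kvm_hyp_initialised())
		return -EPROBE_DEFER;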
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-08-19 21:51 ` [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host Mostafa Saleh
@ 2025-09-12 14:18 ` Will Deacon
2025-09-15 16:38 ` Jason Gunthorpe
2025-09-16 14:50 ` Mostafa Saleh
0 siblings, 2 replies; 82+ messages in thread
From: Will Deacon @ 2025-09-12 14:18 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Aug 19, 2025 at 09:51:50PM +0000, Mostafa Saleh wrote:
> Don’t allow access to the command queue from the host:
> - ARM_SMMU_CMDQ_BASE: Only allowed to be written when CMDQ is disabled, we
> use it to keep track of the host command queue base.
> Reads return the saved value.
> - ARM_SMMU_CMDQ_PROD: Writes trigger command queue emulation which sanitises
> and filters the whole range. Reads returns the host copy.
> - ARM_SMMU_CMDQ_CONS: Writes move the sw copy of the cons, but the host can’t
> skip commands once submitted. Reads return the emulated value and the error
> bits in the actual cons.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> .../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 108 +++++++++++++++++-
> 1 file changed, 105 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
> index 554229e466f3..10c6461bbf12 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
> @@ -325,6 +325,88 @@ static bool is_cmdq_enabled(struct hyp_arm_smmu_v3_device *smmu)
> return FIELD_GET(CR0_CMDQEN, smmu->cr0);
> }
>
> +static bool smmu_filter_command(struct hyp_arm_smmu_v3_device *smmu, u64 *command)
> +{
> + u64 type = FIELD_GET(CMDQ_0_OP, command[0]);
> +
> + switch (type) {
> + case CMDQ_OP_CFGI_STE:
> + /* TBD: SHADOW_STE*/
> + break;
> + case CMDQ_OP_CFGI_ALL:
> + {
> + /*
> + * Linux doesn't use range STE invalidation, and only uses this
> + * for CFGI_ALL, which is done on reset and not when a new STE
> + * starts being used.
> + * Although this is not architectural, we rely on the current Linux
> + * implementation.
> + */
> + WARN_ON((FIELD_GET(CMDQ_CFGI_1_RANGE, command[1]) != 31));
> + break;
> + }
> + case CMDQ_OP_TLBI_NH_ASID:
> + case CMDQ_OP_TLBI_NH_VA:
> + case 0x13: /* CMD_TLBI_NH_VAA: Not used by Linux */
> + {
> + /* Only allow VMID = 0 */
> + if (FIELD_GET(CMDQ_TLBI_0_VMID, command[0]) == 0)
> + break;
> + break;
> + }
> + case 0x10: /* CMD_TLBI_NH_ALL: Not used by Linux */
> + case CMDQ_OP_TLBI_EL2_ALL:
> + case CMDQ_OP_TLBI_EL2_VA:
> + case CMDQ_OP_TLBI_EL2_ASID:
> + case CMDQ_OP_TLBI_S12_VMALL:
> + case 0x23: /* CMD_TLBI_EL2_VAA: Not used by Linux */
> + /* Malicious host */
> + return WARN_ON(true);
> + case CMDQ_OP_CMD_SYNC:
> + if (FIELD_GET(CMDQ_SYNC_0_CS, command[0]) == CMDQ_SYNC_0_CS_IRQ) {
> + /* Allow it, but let the host timeout, as this should never happen. */
> + command[0] &= ~CMDQ_SYNC_0_CS;
> + command[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
> + command[1] &= ~CMDQ_SYNC_1_MSIADDR_MASK;
> + }
> + break;
> + }
> +
> + return false;
> +}
> +
> +static void smmu_emulate_cmdq_insert(struct hyp_arm_smmu_v3_device *smmu)
> +{
> + u64 *host_cmdq = hyp_phys_to_virt(smmu->cmdq_host.q_base & Q_BASE_ADDR_MASK);
> + int idx;
> + u64 cmd[CMDQ_ENT_DWORDS];
> + bool skip;
> +
> + if (!is_cmdq_enabled(smmu))
> + return;
> +
> + while (!queue_empty(&smmu->cmdq_host.llq)) {
> + /* Wait for the command queue to have some space. */
> + WARN_ON(smmu_wait_event(smmu, !smmu_cmdq_full(&smmu->cmdq)));
> +
> + idx = Q_IDX(&smmu->cmdq_host.llq, smmu->cmdq_host.llq.cons);
> + /* Avoid TOCTOU */
> + memcpy(cmd, &host_cmdq[idx * CMDQ_ENT_DWORDS], CMDQ_ENT_DWORDS << 3);
> + skip = smmu_filter_command(smmu, cmd);
> + if (!skip)
> + smmu_add_cmd_raw(smmu, cmd);
> + queue_inc_cons(&smmu->cmdq_host.llq);
> + }
Hmmm. There's something I'd not considered before here.
Ideally, the data structures that are shadowed by the hypervisor would
be mapped as normal-WB cacheable in both the host and the hypervisor so
we don't have to worry about coherency and we get the performance
benefits from the caches. Indeed, I think that's how you've mapped
'host_cmdq' above _however_ I sadly don't think we can do that if the
actual SMMU hardware isn't coherent.
We don't have a way to say things like "The STEs and CMDQ are coherent
but the CDs and Stage-1 page-tables aren't" so that means we have to
treat the shadowed structures populated by the host in the same way as
the host-owned structures that are consumed directly by the hardware.
Consequently, we should either be using non-cacheable mappings at EL2
for these structures or doing CMOs around the accesses.
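e.g. something like this around the copy (sketch, assuming
kvm_flush_dcache_to_poc() gives us the right CMO semantics here):

	/* Invalidate stale lines before reading the host-written command. */
	if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
		kvm_flush_dcache_to_poc(&host_cmdq[idx * CMDQ_ENT_DWORDS],
					CMDQ_ENT_DWORDS << 3);
	memcpy(cmd, &host_cmdq[idx * CMDQ_ENT_DWORDS], CMDQ_ENT_DWORDS << 3);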
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot
2025-09-09 13:46 ` Will Deacon
@ 2025-09-14 19:23 ` Pranjal Shrivastava
2025-09-16 11:58 ` Mostafa Saleh
2025-09-16 11:56 ` Mostafa Saleh
1 sibling, 1 reply; 82+ messages in thread
From: Pranjal Shrivastava @ 2025-09-14 19:23 UTC (permalink / raw)
To: Will Deacon
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba, jgg,
mark.rutland
On Tue, Sep 09, 2025 at 02:46:42PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:29PM +0000, Mostafa Saleh wrote:
> > Soon, IOMMU drivers running in the hypervisor might interact with
> > non-coherent devices, so it needs a mechanism to map memory as
> > non cacheable.
> > Add ___pkvm_host_donate_hyp() which accepts a new argument for prot,
> > so the driver can add KVM_PGTABLE_PROT_NORMAL_NC.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 11 +++++++++--
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > index 5f9d56754e39..52d7ee91e18c 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > @@ -36,6 +36,7 @@ int __pkvm_prot_finalize(void);
> > int __pkvm_host_share_hyp(u64 pfn);
> > int __pkvm_host_unshare_hyp(u64 pfn);
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> > +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> > int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> > int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> > int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 8957734d6183..861e448183fd 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -769,13 +769,15 @@ int __pkvm_host_unshare_hyp(u64 pfn)
> > return ret;
> > }
> >
> > -int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > {
> > u64 phys = hyp_pfn_to_phys(pfn);
> > u64 size = PAGE_SIZE * nr_pages;
> > void *virt = __hyp_va(phys);
> > int ret;
> >
> > + WARN_ON(prot & KVM_PGTABLE_PROT_X);
>
> Should this actually just enforce that the permissions are
> KVM_PGTABLE_PROT_RW:
>
> WARN_ON((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RW);
>
> ?
>
> Since the motivation is about the memory type rather than the
> permissions, it would be best to preserve the current behaviour.
+1. I believe the current `WARN_ON(prot & KVM_PGTABLE_PROT_X);` check
would potentially allow "Read-only" or "Write-only" donations to slide
through silently.
>
> Will
Thanks,
Praan
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-08-19 21:51 ` [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor Mostafa Saleh
2025-09-09 14:12 ` Will Deacon
@ 2025-09-14 20:41 ` Pranjal Shrivastava
2025-09-16 13:43 ` Mostafa Saleh
1 sibling, 1 reply; 82+ messages in thread
From: Pranjal Shrivastava @ 2025-09-14 20:41 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland
On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> drivers can use it to protect the IOMMU's MMIO.
> The initial attempt to implement this was to have a new flag to
> "___pkvm_host_donate_hyp" to accept MMIO. However, that had many problems:
> it was quite intrusive for the host/hyp to check/set page state to make it
> aware of MMIO and to encode the state in the page table, and that code is
> called in paths that can be sensitive to performance (FFA, VMs...)
>
> As donating MMIO is very rare, and we don’t need to encode the full state,
> it’s reasonable to have a separate function to do this.
> It will init the host s2 page table with an invalid leaf with the owner ID
> to prevent the host from mapping the page on faults.
>
> Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> stage-2 PTEs, as this can be triggered from recycle logic under memory
> pressure. There is no code relying on this, as all ownership changes are
> done via kvm_pgtable_stage2_set_owner().
>
> For the error path in IOMMU drivers, add a function to donate MMIO back
> from hyp to host.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 64 +++++++++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 9 +--
> 3 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 52d7ee91e18c..98e173da0f9b 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -37,6 +37,8 @@ int __pkvm_host_share_hyp(u64 pfn);
> int __pkvm_host_unshare_hyp(u64 pfn);
> int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> +int __pkvm_hyp_donate_host_mmio(u64 pfn);
> int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 861e448183fd..c9a15ef6b18d 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> return ret;
> }
>
> +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> +{
> + u64 phys = hyp_pfn_to_phys(pfn);
> + void *virt = __hyp_va(phys);
> + int ret;
> + kvm_pte_t pte;
> +
> + host_lock_component();
> + hyp_lock_component();
> +
> + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> + if (ret)
> + goto unlock;
> +
> + if (pte && !kvm_pte_valid(pte)) {
> + ret = -EPERM;
> + goto unlock;
> + }
> +
> + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> + if (ret)
> + goto unlock;
> + if (pte) {
> + ret = -EBUSY;
> + goto unlock;
> + }
I'm thinking of a situation where both of these checks might be
necessary... The first check seems to confirm that the page being donated
isn't set up to trap in the hyp (i.e. the donor/host doesn't own the
page anymore).
However, the second check seems to verify that the pfn isn't already mapped
in the hyp's space. Is this check only to catch erroneous donations of
a shared page, or is there something else?
> +
> + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> + if (ret)
> + goto unlock;
> + /*
> + * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> + * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> + * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> + * kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> + * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> + */
> + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> + PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> +unlock:
> + hyp_unlock_component();
> + host_unlock_component();
> +
> + return ret;
> +}
> +
> +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> +{
> + u64 phys = hyp_pfn_to_phys(pfn);
> + u64 virt = (u64)__hyp_va(phys);
> + size_t size = PAGE_SIZE;
> +
> + host_lock_component();
> + hyp_lock_component();
> +
> + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> + hyp_unlock_component();
> + host_unlock_component();
> +
> + return 0;
> +}
> +
> int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> {
> return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c351b4abd5db..ba06b0c21d5a 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> kvm_pte_t *childp = NULL;
> bool need_flush = false;
>
> - if (!kvm_pte_valid(ctx->old)) {
> - if (stage2_pte_is_counted(ctx->old)) {
> - kvm_clear_pte(ctx->ptep);
> - mm_ops->put_page(ctx->ptep);
> - }
> - return 0;
> - }
> + if (!kvm_pte_valid(ctx->old))
> + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
>
> if (kvm_pte_table(ctx->old, ctx->level)) {
> childp = kvm_pte_follow(ctx->old, mm_ops);
> --
Thanks
Praan
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get()
2025-09-09 15:56 ` Marc Zyngier
@ 2025-09-15 11:10 ` Pranjal Shrivastava
2025-09-16 14:04 ` Mostafa Saleh
1 sibling, 0 replies; 82+ messages in thread
From: Pranjal Shrivastava @ 2025-09-15 11:10 UTC (permalink / raw)
To: Marc Zyngier
Cc: Will Deacon, Mostafa Saleh, linux-kernel, kvmarm,
linux-arm-kernel, iommu, oliver.upton, joey.gouly, suzuki.poulose,
yuzenghui, catalin.marinas, robin.murphy, jean-philippe, qperret,
tabba, jgg, mark.rutland
On Tue, Sep 09, 2025 at 04:56:16PM +0100, Marc Zyngier wrote:
> On Tue, 09 Sep 2025 15:16:26 +0100,
> Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Aug 19, 2025 at 09:51:31PM +0000, Mostafa Saleh wrote:
> > > Add a function to return time in us.
> > >
> > > This can be used from IOMMU drivers while waiting for conditions as
> > > for SMMUv3 TLB invalidation waiting for sync.
> > >
> > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > ---
> > > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 2 ++
> > > arch/arm64/kvm/hyp/nvhe/setup.c | 4 ++++
> > > arch/arm64/kvm/hyp/nvhe/timer-sr.c | 33 ++++++++++++++++++++++++++
> > > 3 files changed, 39 insertions(+)
>
> [...]
>
> > > +#define SEC_TO_US 1000000
> > > +
> > > +int pkvm_timer_init(void)
> > > +{
> > > + timer_freq = read_sysreg(cntfrq_el0);
> > > + /*
> > > + * TODO: The highest privileged level is supposed to initialize this
> > > + * register. But on some systems (which?), this information is only
> > > + * contained in the device-tree, so we'll need to find it out some other
> > > + * way.
> > > + */
> > > + if (!timer_freq || timer_freq < SEC_TO_US)
> > > + return -ENODEV;
> > > + return 0;
> > > +}
> >
> > Right, I think the frequency should be provided by the host once the arch
> > timer driver has probed successfully. Relying on CNTFRQ isn't viable imo.
>
Are platforms using DT to change this value?
Because I think TF-A mandates [1] that this value be set; from the docs:
BL31 programs the CNTFRQ_EL0 register with the clock frequency of the
system counter, which is provided by the platform.
Even the TF-A porting guide [2] mandates it to be set.
> We can always patch the value in, à la kimage_voffset. But it really
> begs the question: who in their right mind doesn't set CNTFRQ_EL0 to
> something sensible? Why should we care about supporting such
> contraption?
>
> I'd be happy to simply disable KVM when CNTFRQ_EL0 is misprogrammed,
> or that the device tree provides a clock frequency. Because there is
> no good way to support a guest in that case.
>
And even if someone chooses not to use TF-A, IIUC, doesn't the aarch64
Linux boot protocol[3] mandate that CNTFRQ be programmed with the timer
frequency? As per the doc:
Before jumping into the kernel, the following conditions must be met:
[...]
- Architected timers
CNTFRQ must be programmed with the timer frequency and CNTVOFF must
be programmed with a consistent value on all CPUs.
Thanks,
Praan
[1] https://trustedfirmware-a.readthedocs.io/en/latest/design/firmware-design.html
[2] https://trustedfirmware-a.readthedocs.io/en/latest/porting-guide.html#function-plat-get-syscnt-freq2-mandatory
[3] https://www.kernel.org/doc/Documentation/arm64/booting.txt
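For reference, my rough sketch of what a CNTFRQ_EL0-based microsecond
helper boils down to, assuming firmware programmed the register as the
boot protocol requires (timer_freq and SEC_TO_US are from the hunk
quoted above; this is not the exact code in the patch):

    u64 pkvm_time_get(void)
    {
        u64 ticks;

        isb();  /* don't let the counter read be hoisted */
        ticks = read_sysreg(cntvct_el0);

        /* Integer math is fine for timeouts; pkvm_timer_init()
         * already rejected timer_freq < SEC_TO_US. */
        return ticks / (timer_freq / SEC_TO_US);
    }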
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file
2025-08-19 21:51 ` [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file Mostafa Saleh
@ 2025-09-15 14:37 ` Pranjal Shrivastava
2025-09-16 14:07 ` Mostafa Saleh
2025-09-15 16:45 ` Jason Gunthorpe
1 sibling, 1 reply; 82+ messages in thread
From: Pranjal Shrivastava @ 2025-09-15 14:37 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland
On Tue, Aug 19, 2025 at 09:51:32PM +0000, Mostafa Saleh wrote:
> Soon, io-pgtable-arm.c will be compiled as part of KVM/arm64, in the
> hypervisor object, which doesn't have many of the kernel APIs, such
> as faux devices, printk...
>
> We need to factor these things out of this file; this patch moves the
> selftests out, which removes many of the kernel dependencies that are
> also not needed by the hypervisor.
> Create io-pgtable-arm-kernel.c for that; in the next patch the rest of
> the code is factored out.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> drivers/iommu/Makefile | 2 +-
> drivers/iommu/io-pgtable-arm-kernel.c | 216 +++++++++++++++++++++++
> drivers/iommu/io-pgtable-arm.c | 245 --------------------------
> drivers/iommu/io-pgtable-arm.h | 41 +++++
> 4 files changed, 258 insertions(+), 246 deletions(-)
> create mode 100644 drivers/iommu/io-pgtable-arm-kernel.c
>
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 355294fa9033..d601b0e25ef5 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -11,7 +11,7 @@ obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
> obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
> obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
> -obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> +obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o io-pgtable-arm-kernel.o
> obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
> obj-$(CONFIG_IOMMU_IOVA) += iova.o
> obj-$(CONFIG_OF_IOMMU) += of_iommu.o
> diff --git a/drivers/iommu/io-pgtable-arm-kernel.c b/drivers/iommu/io-pgtable-arm-kernel.c
> new file mode 100644
> index 000000000000..f3b869310964
> --- /dev/null
> +++ b/drivers/iommu/io-pgtable-arm-kernel.c
If this file just contains the selftests, how about naming it
"io-pgtable-arm-selftests.c"?
> @@ -0,0 +1,216 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * CPU-agnostic ARM page table allocator.
> + *
> + * Copyright (C) 2014 ARM Limited
> + *
> + * Author: Will Deacon <will.deacon@arm.com>
> + */
> +#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
> +
> +#include <linux/device/faux.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +
> +#include "io-pgtable-arm.h"
> +
> +#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
> +
> +static struct io_pgtable_cfg *cfg_cookie __initdata;
> +
> +static void __init dummy_tlb_flush_all(void *cookie)
> +{
> + WARN_ON(cookie != cfg_cookie);
> +}
> +
> +static void __init dummy_tlb_flush(unsigned long iova, size_t size,
> + size_t granule, void *cookie)
> +{
> + WARN_ON(cookie != cfg_cookie);
> + WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
> +}
> +
> +static void __init dummy_tlb_add_page(struct iommu_iotlb_gather *gather,
> + unsigned long iova, size_t granule,
> + void *cookie)
> +{
> + dummy_tlb_flush(iova, granule, granule, cookie);
> +}
> +
> +static const struct iommu_flush_ops dummy_tlb_ops __initconst = {
> + .tlb_flush_all = dummy_tlb_flush_all,
> + .tlb_flush_walk = dummy_tlb_flush,
> + .tlb_add_page = dummy_tlb_add_page,
> +};
> +
> +static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
> +{
> + struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> + struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +
> + pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
> + cfg->pgsize_bitmap, cfg->ias);
> + pr_err("data: %d levels, 0x%zx pgd_size, %u pg_shift, %u bits_per_level, pgd @ %p\n",
> + ARM_LPAE_MAX_LEVELS - data->start_level, ARM_LPAE_PGD_SIZE(data),
> + ilog2(ARM_LPAE_GRANULE(data)), data->bits_per_level, data->pgd);
> +}
> +
> +#define __FAIL(ops, i) ({ \
> + WARN(1, "selftest: test failed for fmt idx %d\n", (i)); \
> + arm_lpae_dump_ops(ops); \
> + -EFAULT; \
> +})
> +
> +static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
> +{
> + static const enum io_pgtable_fmt fmts[] __initconst = {
> + ARM_64_LPAE_S1,
> + ARM_64_LPAE_S2,
> + };
> +
> + int i, j;
> + unsigned long iova;
> + size_t size, mapped;
> + struct io_pgtable_ops *ops;
> +
> + for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
> + cfg_cookie = cfg;
> + ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
> + if (!ops) {
> + pr_err("selftest: failed to allocate io pgtable ops\n");
> + return -ENOMEM;
> + }
> +
> + /*
> + * Initial sanity checks.
> + * Empty page tables shouldn't provide any translations.
> + */
> + if (ops->iova_to_phys(ops, 42))
> + return __FAIL(ops, i);
> +
> + if (ops->iova_to_phys(ops, SZ_1G + 42))
> + return __FAIL(ops, i);
> +
> + if (ops->iova_to_phys(ops, SZ_2G + 42))
> + return __FAIL(ops, i);
> +
> + /*
> + * Distinct mappings of different granule sizes.
> + */
> + iova = 0;
> + for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> + size = 1UL << j;
> +
> + if (ops->map_pages(ops, iova, iova, size, 1,
> + IOMMU_READ | IOMMU_WRITE |
> + IOMMU_NOEXEC | IOMMU_CACHE,
> + GFP_KERNEL, &mapped))
> + return __FAIL(ops, i);
> +
> + /* Overlapping mappings */
> + if (!ops->map_pages(ops, iova, iova + size, size, 1,
> + IOMMU_READ | IOMMU_NOEXEC,
> + GFP_KERNEL, &mapped))
> + return __FAIL(ops, i);
> +
> + if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> + return __FAIL(ops, i);
> +
> + iova += SZ_1G;
> + }
> +
> + /* Full unmap */
> + iova = 0;
> + for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> + size = 1UL << j;
> +
> + if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> + return __FAIL(ops, i);
> +
> + if (ops->iova_to_phys(ops, iova + 42))
> + return __FAIL(ops, i);
> +
> + /* Remap full block */
> + if (ops->map_pages(ops, iova, iova, size, 1,
> + IOMMU_WRITE, GFP_KERNEL, &mapped))
> + return __FAIL(ops, i);
> +
> + if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> + return __FAIL(ops, i);
> +
> + iova += SZ_1G;
> + }
> +
> + /*
> + * Map/unmap the last largest supported page of the IAS, this can
> +	 * trigger corner cases in the concatenated page tables.
> + */
> + mapped = 0;
> + size = 1UL << __fls(cfg->pgsize_bitmap);
> + iova = (1UL << cfg->ias) - size;
> + if (ops->map_pages(ops, iova, iova, size, 1,
> + IOMMU_READ | IOMMU_WRITE |
> + IOMMU_NOEXEC | IOMMU_CACHE,
> + GFP_KERNEL, &mapped))
> + return __FAIL(ops, i);
> + if (mapped != size)
> + return __FAIL(ops, i);
> + if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> + return __FAIL(ops, i);
> +
> + free_io_pgtable_ops(ops);
> + }
> +
> + return 0;
> +}
> +
> +static int __init arm_lpae_do_selftests(void)
> +{
> + static const unsigned long pgsize[] __initconst = {
> + SZ_4K | SZ_2M | SZ_1G,
> + SZ_16K | SZ_32M,
> + SZ_64K | SZ_512M,
> + };
> +
> + static const unsigned int address_size[] __initconst = {
> + 32, 36, 40, 42, 44, 48,
> + };
> +
> + int i, j, k, pass = 0, fail = 0;
> + struct faux_device *dev;
> + struct io_pgtable_cfg cfg = {
> + .tlb = &dummy_tlb_ops,
> + .coherent_walk = true,
> + .quirks = IO_PGTABLE_QUIRK_NO_WARN,
> + };
> +
> + dev = faux_device_create("io-pgtable-test", NULL, 0);
> + if (!dev)
> + return -ENOMEM;
> +
> + cfg.iommu_dev = &dev->dev;
> +
> + for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
> + for (j = 0; j < ARRAY_SIZE(address_size); ++j) {
> + /* Don't use ias > oas as it is not valid for stage-2. */
> + for (k = 0; k <= j; ++k) {
> + cfg.pgsize_bitmap = pgsize[i];
> + cfg.ias = address_size[k];
> + cfg.oas = address_size[j];
> + pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u OAS %u\n",
> + pgsize[i], cfg.ias, cfg.oas);
> + if (arm_lpae_run_tests(&cfg))
> + fail++;
> + else
> + pass++;
> + }
> + }
> + }
> +
> + pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
> + faux_device_destroy(dev);
> +
> + return fail ? -EFAULT : 0;
> +}
> +subsys_initcall(arm_lpae_do_selftests);
> +#endif
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 96425e92f313..791a2c4ecb83 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -7,15 +7,10 @@
> * Author: Will Deacon <will.deacon@arm.com>
> */
>
> -#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
> -
> #include <linux/atomic.h>
> #include <linux/bitops.h>
> #include <linux/io-pgtable.h>
> -#include <linux/kernel.h>
> -#include <linux/device/faux.h>
> #include <linux/sizes.h>
> -#include <linux/slab.h>
> #include <linux/types.h>
> #include <linux/dma-mapping.h>
>
> @@ -24,33 +19,6 @@
> #include "io-pgtable-arm.h"
> #include "iommu-pages.h"
>
> -#define ARM_LPAE_MAX_ADDR_BITS 52
> -#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
> -#define ARM_LPAE_MAX_LEVELS 4
> -
> -/* Struct accessors */
> -#define io_pgtable_to_data(x) \
> - container_of((x), struct arm_lpae_io_pgtable, iop)
> -
> -#define io_pgtable_ops_to_data(x) \
> - io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
> -
> -/*
> - * Calculate the right shift amount to get to the portion describing level l
> - * in a virtual address mapped by the pagetable in d.
> - */
> -#define ARM_LPAE_LVL_SHIFT(l,d) \
> - (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
> - ilog2(sizeof(arm_lpae_iopte)))
> -
> -#define ARM_LPAE_GRANULE(d) \
> - (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
> -#define ARM_LPAE_PGD_SIZE(d) \
> - (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
> -
> -#define ARM_LPAE_PTES_PER_TABLE(d) \
> - (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
> -
> /*
> * Calculate the index at level l used to map virtual address a using the
> * pagetable in d.
> @@ -163,18 +131,6 @@
> #define iopte_set_writeable_clean(ptep) \
> set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)(ptep))
>
> -struct arm_lpae_io_pgtable {
> - struct io_pgtable iop;
> -
> - int pgd_bits;
> - int start_level;
> - int bits_per_level;
> -
> - void *pgd;
> -};
> -
> -typedef u64 arm_lpae_iopte;
> -
> static inline bool iopte_leaf(arm_lpae_iopte pte, int lvl,
> enum io_pgtable_fmt fmt)
> {
> @@ -1274,204 +1230,3 @@ struct io_pgtable_init_fns io_pgtable_arm_mali_lpae_init_fns = {
> .alloc = arm_mali_lpae_alloc_pgtable,
> .free = arm_lpae_free_pgtable,
> };
> -
> -#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
> -
> -static struct io_pgtable_cfg *cfg_cookie __initdata;
> -
> -static void __init dummy_tlb_flush_all(void *cookie)
> -{
> - WARN_ON(cookie != cfg_cookie);
> -}
> -
> -static void __init dummy_tlb_flush(unsigned long iova, size_t size,
> - size_t granule, void *cookie)
> -{
> - WARN_ON(cookie != cfg_cookie);
> - WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
> -}
> -
> -static void __init dummy_tlb_add_page(struct iommu_iotlb_gather *gather,
> - unsigned long iova, size_t granule,
> - void *cookie)
> -{
> - dummy_tlb_flush(iova, granule, granule, cookie);
> -}
> -
> -static const struct iommu_flush_ops dummy_tlb_ops __initconst = {
> - .tlb_flush_all = dummy_tlb_flush_all,
> - .tlb_flush_walk = dummy_tlb_flush,
> - .tlb_add_page = dummy_tlb_add_page,
> -};
> -
> -static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
> -{
> - struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> - struct io_pgtable_cfg *cfg = &data->iop.cfg;
> -
> - pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
> - cfg->pgsize_bitmap, cfg->ias);
> - pr_err("data: %d levels, 0x%zx pgd_size, %u pg_shift, %u bits_per_level, pgd @ %p\n",
> - ARM_LPAE_MAX_LEVELS - data->start_level, ARM_LPAE_PGD_SIZE(data),
> - ilog2(ARM_LPAE_GRANULE(data)), data->bits_per_level, data->pgd);
> -}
> -
> -#define __FAIL(ops, i) ({ \
> - WARN(1, "selftest: test failed for fmt idx %d\n", (i)); \
> - arm_lpae_dump_ops(ops); \
> - -EFAULT; \
> -})
> -
> -static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
> -{
> - static const enum io_pgtable_fmt fmts[] __initconst = {
> - ARM_64_LPAE_S1,
> - ARM_64_LPAE_S2,
> - };
> -
> - int i, j;
> - unsigned long iova;
> - size_t size, mapped;
> - struct io_pgtable_ops *ops;
> -
> - for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
> - cfg_cookie = cfg;
> - ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
> - if (!ops) {
> - pr_err("selftest: failed to allocate io pgtable ops\n");
> - return -ENOMEM;
> - }
> -
> - /*
> - * Initial sanity checks.
> - * Empty page tables shouldn't provide any translations.
> - */
> - if (ops->iova_to_phys(ops, 42))
> - return __FAIL(ops, i);
> -
> - if (ops->iova_to_phys(ops, SZ_1G + 42))
> - return __FAIL(ops, i);
> -
> - if (ops->iova_to_phys(ops, SZ_2G + 42))
> - return __FAIL(ops, i);
> -
> - /*
> - * Distinct mappings of different granule sizes.
> - */
> - iova = 0;
> - for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> - size = 1UL << j;
> -
> - if (ops->map_pages(ops, iova, iova, size, 1,
> - IOMMU_READ | IOMMU_WRITE |
> - IOMMU_NOEXEC | IOMMU_CACHE,
> - GFP_KERNEL, &mapped))
> - return __FAIL(ops, i);
> -
> - /* Overlapping mappings */
> - if (!ops->map_pages(ops, iova, iova + size, size, 1,
> - IOMMU_READ | IOMMU_NOEXEC,
> - GFP_KERNEL, &mapped))
> - return __FAIL(ops, i);
> -
> - if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> - return __FAIL(ops, i);
> -
> - iova += SZ_1G;
> - }
> -
> - /* Full unmap */
> - iova = 0;
> - for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> - size = 1UL << j;
> -
> - if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> - return __FAIL(ops, i);
> -
> - if (ops->iova_to_phys(ops, iova + 42))
> - return __FAIL(ops, i);
> -
> - /* Remap full block */
> - if (ops->map_pages(ops, iova, iova, size, 1,
> - IOMMU_WRITE, GFP_KERNEL, &mapped))
> - return __FAIL(ops, i);
> -
> - if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> - return __FAIL(ops, i);
> -
> - iova += SZ_1G;
> - }
> -
> - /*
> - * Map/unmap the last largest supported page of the IAS, this can
> -	 * trigger corner cases in the concatenated page tables.
> - */
> - mapped = 0;
> - size = 1UL << __fls(cfg->pgsize_bitmap);
> - iova = (1UL << cfg->ias) - size;
> - if (ops->map_pages(ops, iova, iova, size, 1,
> - IOMMU_READ | IOMMU_WRITE |
> - IOMMU_NOEXEC | IOMMU_CACHE,
> - GFP_KERNEL, &mapped))
> - return __FAIL(ops, i);
> - if (mapped != size)
> - return __FAIL(ops, i);
> - if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> - return __FAIL(ops, i);
> -
> - free_io_pgtable_ops(ops);
> - }
> -
> - return 0;
> -}
> -
> -static int __init arm_lpae_do_selftests(void)
> -{
> - static const unsigned long pgsize[] __initconst = {
> - SZ_4K | SZ_2M | SZ_1G,
> - SZ_16K | SZ_32M,
> - SZ_64K | SZ_512M,
> - };
> -
> - static const unsigned int address_size[] __initconst = {
> - 32, 36, 40, 42, 44, 48,
> - };
> -
> - int i, j, k, pass = 0, fail = 0;
> - struct faux_device *dev;
> - struct io_pgtable_cfg cfg = {
> - .tlb = &dummy_tlb_ops,
> - .coherent_walk = true,
> - .quirks = IO_PGTABLE_QUIRK_NO_WARN,
> - };
> -
> - dev = faux_device_create("io-pgtable-test", NULL, 0);
> - if (!dev)
> - return -ENOMEM;
> -
> - cfg.iommu_dev = &dev->dev;
> -
> - for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
> - for (j = 0; j < ARRAY_SIZE(address_size); ++j) {
> - /* Don't use ias > oas as it is not valid for stage-2. */
> - for (k = 0; k <= j; ++k) {
> - cfg.pgsize_bitmap = pgsize[i];
> - cfg.ias = address_size[k];
> - cfg.oas = address_size[j];
> - pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u OAS %u\n",
> - pgsize[i], cfg.ias, cfg.oas);
> - if (arm_lpae_run_tests(&cfg))
> - fail++;
> - else
> - pass++;
> - }
> - }
> - }
> -
> - pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
> - faux_device_destroy(dev);
> -
> - return fail ? -EFAULT : 0;
> -}
> -subsys_initcall(arm_lpae_do_selftests);
> -#endif
> diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
> index ba7cfdf7afa0..a06a23543cff 100644
> --- a/drivers/iommu/io-pgtable-arm.h
> +++ b/drivers/iommu/io-pgtable-arm.h
> @@ -2,6 +2,8 @@
> #ifndef IO_PGTABLE_ARM_H_
> #define IO_PGTABLE_ARM_H_
>
> +#include <linux/io-pgtable.h>
> +
> #define ARM_LPAE_TCR_TG0_4K 0
> #define ARM_LPAE_TCR_TG0_64K 1
> #define ARM_LPAE_TCR_TG0_16K 2
> @@ -27,4 +29,43 @@
> #define ARM_LPAE_TCR_PS_48_BIT 0x5ULL
> #define ARM_LPAE_TCR_PS_52_BIT 0x6ULL
>
> +/* Struct accessors */
> +#define io_pgtable_to_data(x) \
> + container_of((x), struct arm_lpae_io_pgtable, iop)
> +
> +#define io_pgtable_ops_to_data(x) \
> + io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
> +
> +struct arm_lpae_io_pgtable {
> + struct io_pgtable iop;
> +
> + int pgd_bits;
> + int start_level;
> + int bits_per_level;
> +
> + void *pgd;
> +};
> +
> +#define ARM_LPAE_MAX_ADDR_BITS 52
> +#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
> +#define ARM_LPAE_MAX_LEVELS 4
> +
> +/*
> + * Calculate the right shift amount to get to the portion describing level l
> + * in a virtual address mapped by the pagetable in d.
> + */
> +#define ARM_LPAE_LVL_SHIFT(l,d) \
> + (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
> + ilog2(sizeof(arm_lpae_iopte)))
> +
> +#define ARM_LPAE_GRANULE(d) \
> + (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
> +#define ARM_LPAE_PGD_SIZE(d) \
> + (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
> +
> +#define ARM_LPAE_PTES_PER_TABLE(d) \
> + (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
> +
> +typedef u64 arm_lpae_iopte;
> +
> #endif /* IO_PGTABLE_ARM_H_ */
Apart from the renaming above, I was able to apply this patch alone, and
build successfully while toggling IOMMU_IO_PGTABLE_LPAE_SELFTEST across
builds.
Reviewed-by: Pranjal Shrivastava <praan@google.com>
> --
> 2.51.0.rc1.167.g924127e9c0-goog
>
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-12 14:18 ` Will Deacon
@ 2025-09-15 16:38 ` Jason Gunthorpe
2025-09-16 15:19 ` Mostafa Saleh
2025-09-16 14:50 ` Mostafa Saleh
1 sibling, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-15 16:38 UTC (permalink / raw)
To: Will Deacon
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Fri, Sep 12, 2025 at 03:18:08PM +0100, Will Deacon wrote:
> Ideally, the data structures that are shadowed by the hypervisor would
> be mapped as normal-WB cacheable in both the host and the hypervisor so
> we don't have to worry about coherency and we get the performance
> benefits from the caches. Indeed, I think that's how you've mapped
> 'host_cmdq' above _however_ I sadly don't think we can do that if the
> actual SMMU hardware isn't coherent.
That seems like the right conclusion to me; pkvm should not be mapping
as cacheable unless it knows the IORT/IDR is marked as coherent.
This is actually something I want to fix in the SMMU driver: it should
always be allocating cacheable memory and using
dma_sync_single_for_device() instead of non-cacheable DMA coherent
allocations. (Or perhaps better is to use
iommu_pages_flush_incoherent().)
I'm hearing about an interesting use case where we'd want to tell the
SMMU to walk STEs non-cacheable even if the HW is capable of cacheable
walks. Apparently on some SoCs it gives better isochronous properties
for realtime DMA.
IMHO for this series at this point pkvm should just require a coherent
SMMU until the above revisions happen.
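i.e., roughly this pattern (sketch only; 'dev' and 'q_size' are
placeholders and error handling is omitted):

    /* Cacheable allocation instead of dma_alloc_coherent()... */
    void *cmdq = kzalloc(q_size, GFP_KERNEL);
    dma_addr_t q_dma = dma_map_single(dev, cmdq, q_size, DMA_TO_DEVICE);

    /* ...the CPU builds commands through the cacheable mapping... */

    /* ...and cleans them to the point of coherency before the HW
     * consumes the queue. On coherent hardware this is ~free. */
    dma_sync_single_for_device(dev, q_dma, q_size, DMA_TO_DEVICE);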
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file
2025-08-19 21:51 ` [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file Mostafa Saleh
2025-09-15 14:37 ` Pranjal Shrivastava
@ 2025-09-15 16:45 ` Jason Gunthorpe
2025-09-16 14:09 ` Mostafa Saleh
1 sibling, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-15 16:45 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, mark.rutland, praan
On Tue, Aug 19, 2025 at 09:51:32PM +0000, Mostafa Saleh wrote:
> Soon, io-pgtable-arm.c will be compiled as part of KVM/arm64, in the
> hypervisor object, which doesn't have many of the kernel APIs, such
> as faux devices, printk...
>
> We need to factor these things out of this file; this patch moves the
> selftests out, which removes many of the kernel dependencies that are
> also not needed by the hypervisor.
> Create io-pgtable-arm-kernel.c for that; in the next patch the rest of
> the code is factored out.
Please send this as a standalone patch; it looks like a good idea.
Also please add the boilerplate to wrap the selftest into a kunit and
use the usual kunit machinery. We already have a kunit for smmuv3; it
can just add another file to that.
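Something like the below, presumably (sketch; the suite/case names are
made up, the cfg setup is elided, and arm_lpae_run_tests() would need
to be visible to the test file):

    #include <kunit/test.h>

    static void arm_lpae_pgtable_test(struct kunit *test)
    {
        struct io_pgtable_cfg cfg = {
            .tlb           = &dummy_tlb_ops,
            .coherent_walk = true,
            /* pgsize_bitmap/ias/oas as in arm_lpae_do_selftests() */
        };

        KUNIT_EXPECT_EQ(test, 0, arm_lpae_run_tests(&cfg));
    }

    static struct kunit_case arm_lpae_test_cases[] = {
        KUNIT_CASE(arm_lpae_pgtable_test),
        {}
    };

    static struct kunit_suite arm_lpae_test_suite = {
        .name       = "io-pgtable-arm",
        .test_cases = arm_lpae_test_cases,
    };
    kunit_test_suite(arm_lpae_test_suite);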
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot
2025-09-09 13:46 ` Will Deacon
2025-09-14 19:23 ` Pranjal Shrivastava
@ 2025-09-16 11:56 ` Mostafa Saleh
1 sibling, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 11:56 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 09, 2025 at 02:46:42PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:29PM +0000, Mostafa Saleh wrote:
> > Soon, IOMMU drivers running in the hypervisor might interact with
> > non-coherent devices, so they need a mechanism to map memory as
> > non-cacheable.
> > Add ___pkvm_host_donate_hyp() which accepts a new argument for prot,
> > so the driver can add KVM_PGTABLE_PROT_NORMAL_NC.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 11 +++++++++--
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > index 5f9d56754e39..52d7ee91e18c 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > @@ -36,6 +36,7 @@ int __pkvm_prot_finalize(void);
> > int __pkvm_host_share_hyp(u64 pfn);
> > int __pkvm_host_unshare_hyp(u64 pfn);
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> > +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> > int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> > int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> > int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 8957734d6183..861e448183fd 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -769,13 +769,15 @@ int __pkvm_host_unshare_hyp(u64 pfn)
> > return ret;
> > }
> >
> > -int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > {
> > u64 phys = hyp_pfn_to_phys(pfn);
> > u64 size = PAGE_SIZE * nr_pages;
> > void *virt = __hyp_va(phys);
> > int ret;
> >
> > + WARN_ON(prot & KVM_PGTABLE_PROT_X);
>
> Should this actually just enforce that the permissions are
> KVM_PGTABLE_PROT_RW:
>
> WARN_ON((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RW);
>
> ?
>
> Since the motivation is about the memory type rather than the
> permissions, it would be best to preserve the current behaviour.
Yes, this series doesn't do any permission changes; I was not sure
whether that would be needed in the future (some RO mappings), but I
will update the check as suggested to be stricter.
Thanks,
Mostafa
>
> Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot
2025-09-14 19:23 ` Pranjal Shrivastava
@ 2025-09-16 11:58 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 11:58 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba, jgg,
mark.rutland
On Sun, Sep 14, 2025 at 07:23:12PM +0000, Pranjal Shrivastava wrote:
> On Tue, Sep 09, 2025 at 02:46:42PM +0100, Will Deacon wrote:
> > On Tue, Aug 19, 2025 at 09:51:29PM +0000, Mostafa Saleh wrote:
> > > Soon, IOMMU drivers running in the hypervisor might interact with
> > > non-coherent devices, so they need a mechanism to map memory as
> > > non-cacheable.
> > > Add ___pkvm_host_donate_hyp() which accepts a new argument for prot,
> > > so the driver can add KVM_PGTABLE_PROT_NORMAL_NC.
> > >
> > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > ---
> > > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> > > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 11 +++++++++--
> > > 2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > > index 5f9d56754e39..52d7ee91e18c 100644
> > > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > > @@ -36,6 +36,7 @@ int __pkvm_prot_finalize(void);
> > > int __pkvm_host_share_hyp(u64 pfn);
> > > int __pkvm_host_unshare_hyp(u64 pfn);
> > > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> > > +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> > > int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> > > int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> > > int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > index 8957734d6183..861e448183fd 100644
> > > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > @@ -769,13 +769,15 @@ int __pkvm_host_unshare_hyp(u64 pfn)
> > > return ret;
> > > }
> > >
> > > -int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > > +int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > > {
> > > u64 phys = hyp_pfn_to_phys(pfn);
> > > u64 size = PAGE_SIZE * nr_pages;
> > > void *virt = __hyp_va(phys);
> > > int ret;
> > >
> > > + WARN_ON(prot & KVM_PGTABLE_PROT_X);
> >
> > Should this actually just enforce that the permissions are
> > KVM_PGTABLE_PROT_RW:
> >
> > WARN_ON((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RW);
> >
> > ?
> >
> > Since the motivation is about the memory type rather than the
> > permissions, it would be best to preserve the current behaviour.
>
> +1. I believe the current `WARN_ON(prot & KVM_PGTABLE_PROT_X);` check
> would potentially allow "Read-only" or "Write-only" donations to slide
> through silently.
True, though this can only be done from the hypervisor code. I will
make the check stricter as Will suggested, and if needed we can relax
it later.
Thanks,
Mostafa
>
> >
> > Will
>
> Thanks,
> Praan
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-09-09 14:12 ` Will Deacon
@ 2025-09-16 13:27 ` Mostafa Saleh
2025-09-26 14:33 ` Will Deacon
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 13:27 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 09, 2025 at 03:12:45PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> > Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> > drivers can use it to protect the IOMMU's MMIO.
> > The initial attempt to implement this was to have a new flag to
> > "___pkvm_host_donate_hyp" to accept MMIO. However, that had many problems:
> > it was quite intrusive for the host/hyp to check/set page state to make it
> > aware of MMIO and to encode the state in the page table, and that code is
> > called in paths that can be sensitive to performance (FFA, VMs...)
> >
> > As donating MMIO is very rare, and we don’t need to encode the full state,
> > it’s reasonable to have a separate function to do this.
> > It will init the host s2 page table with an invalid leaf with the owner ID
> > to prevent the host from mapping the page on faults.
> >
> > Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> > stage-2 PTEs, as this can be triggered from recycle logic under memory
> > pressure. There is no code relying on this, as all ownership changes are
> > done via kvm_pgtable_stage2_set_owner().
> >
> > For the error path in IOMMU drivers, add a function to donate MMIO back
> > from hyp to host.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 64 +++++++++++++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 9 +--
> > 3 files changed, 68 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > index 52d7ee91e18c..98e173da0f9b 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > @@ -37,6 +37,8 @@ int __pkvm_host_share_hyp(u64 pfn);
> > int __pkvm_host_unshare_hyp(u64 pfn);
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> > int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> > +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> > +int __pkvm_hyp_donate_host_mmio(u64 pfn);
> > int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> > int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> > int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 861e448183fd..c9a15ef6b18d 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > return ret;
> > }
> >
> > +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> > +{
> > + u64 phys = hyp_pfn_to_phys(pfn);
> > + void *virt = __hyp_va(phys);
> > + int ret;
> > + kvm_pte_t pte;
> > +
> > + host_lock_component();
> > + hyp_lock_component();
> > +
> > + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> > + if (ret)
> > + goto unlock;
> > +
> > + if (pte && !kvm_pte_valid(pte)) {
> > + ret = -EPERM;
> > + goto unlock;
> > + }
>
> Shouldn't we first check that the pfn is indeed MMIO? Otherwise, testing
> the pte for the ownership information isn't right.
I will add it, although the input should be trusted as it comes from the
hypervisor SMMUv3 driver.
>
> > + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> > + if (ret)
> > + goto unlock;
> > + if (pte) {
> > + ret = -EBUSY;
> > + goto unlock;
> > + }
> > +
> > + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> > + if (ret)
> > + goto unlock;
> > + /*
> > + * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> > + * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> > + * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> > + * kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> > + * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> > + */
> > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> > +unlock:
> > + hyp_unlock_component();
> > + host_unlock_component();
> > +
> > + return ret;
> > +}
> > +
> > +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> > +{
> > + u64 phys = hyp_pfn_to_phys(pfn);
> > + u64 virt = (u64)__hyp_va(phys);
> > + size_t size = PAGE_SIZE;
> > +
> > + host_lock_component();
> > + hyp_lock_component();
>
> Shouldn't we check that:
>
> 1. pfn is mmio
> 2. pfn is owned by hyp
> 3. The host doesn't have something mapped at pfn already
>
> ?
>
I thought about this initially, but as:
- this code is only called from the hypervisor with trusted
inputs (only at boot)
- it is only called on the error path
a WARN_ON in case of failure to unmap MMIO pages seemed good enough,
to avoid extra code.
But I can add the checks if you think they are necessary; we will need
to add new helpers for MMIO state, though.
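If we do add them, I'd imagine something along these lines (sketch
only, not in this series; addr_is_memory() already exists in
mem_protect.c, while check_hyp_mmio_return() is a made-up name):

    static int check_hyp_mmio_return(u64 phys, u64 virt)
    {
        kvm_pte_t pte;

        /* 1. pfn must be MMIO */
        if (addr_is_memory(phys))
            return -EINVAL;

        /* 2. hyp must own it, i.e. have a stage-1 mapping for it */
        if (kvm_pgtable_get_leaf(&pkvm_pgtable, virt, &pte, NULL) || !pte)
            return -EPERM;

        /*
         * 3. the host stage-2 leaf must still be the invalid non-zero
         * annotation left by donation; a full check would also verify
         * the owner ID is PKVM_ID_HYP, which needs a new helper.
         */
        if (kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL) ||
            kvm_pte_valid(pte) || !pte)
            return -EBUSY;

        return 0;
    }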
> > + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> > + hyp_unlock_component();
> > + host_unlock_component();
> > +
> > + return 0;
> > +}
> > +
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > {
> > return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index c351b4abd5db..ba06b0c21d5a 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > kvm_pte_t *childp = NULL;
> > bool need_flush = false;
> >
> > - if (!kvm_pte_valid(ctx->old)) {
> > - if (stage2_pte_is_counted(ctx->old)) {
> > - kvm_clear_pte(ctx->ptep);
> > - mm_ops->put_page(ctx->ptep);
> > - }
> > - return 0;
> > - }
> > + if (!kvm_pte_valid(ctx->old))
> > + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
>
> Can this code be reached for the guest? For example, if
> pkvm_pgtable_stage2_destroy() runs into an MMIO-guarded pte on teardown?
AFAICT, a VM's page table is destroyed from reclaim_pgtable_pages() =>
kvm_pgtable_stage2_destroy() => kvm_pgtable_stage2_destroy_range() ... =>
stage2_free_walker(), which doesn't interact with "stage2_unmap_walker",
so that should be fine.
Thanks,
Mostafa
>
> Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-09-14 20:41 ` Pranjal Shrivastava
@ 2025-09-16 13:43 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 13:43 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland
On Sun, Sep 14, 2025 at 08:41:04PM +0000, Pranjal Shrivastava wrote:
> On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> > Add a function to donate MMIO to the hypervisor so IOMMU hypervisor
> > drivers can use it to protect the IOMMU's MMIO.
> > The initial attempt to implement this was to have a new flag to
> > "___pkvm_host_donate_hyp" to accept MMIO. However, that had many problems:
> > it was quite intrusive for the host/hyp to check/set page state to make it
> > aware of MMIO and to encode the state in the page table, and that code is
> > called in paths that can be sensitive to performance (FFA, VMs...)
> >
> > As donating MMIO is very rare, and we don’t need to encode the full state,
> > it’s reasonable to have a separate function to do this.
> > It will init the host s2 page table with an invalid leaf with the owner ID
> > to prevent the host from mapping the page on faults.
> >
> > Also, prevent kvm_pgtable_stage2_unmap() from removing owner ID from
> > stage-2 PTEs, as this can be triggered from recycle logic under memory
> > pressure. There is no code relying on this, as all ownership changes are
> > done via kvm_pgtable_stage2_set_owner().
> >
> > For the error path in IOMMU drivers, add a function to donate MMIO back
> > from hyp to host.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 64 +++++++++++++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 9 +--
> > 3 files changed, 68 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > index 52d7ee91e18c..98e173da0f9b 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > @@ -37,6 +37,8 @@ int __pkvm_host_share_hyp(u64 pfn);
> > int __pkvm_host_unshare_hyp(u64 pfn);
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
> > int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot);
> > +int __pkvm_host_donate_hyp_mmio(u64 pfn);
> > +int __pkvm_hyp_donate_host_mmio(u64 pfn);
> > int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
> > int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
> > int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 861e448183fd..c9a15ef6b18d 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > return ret;
> > }
> >
> > +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> > +{
> > + u64 phys = hyp_pfn_to_phys(pfn);
> > + void *virt = __hyp_va(phys);
> > + int ret;
> > + kvm_pte_t pte;
> > +
> > + host_lock_component();
> > + hyp_lock_component();
> > +
> > + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> > + if (ret)
> > + goto unlock;
> > +
> > + if (pte && !kvm_pte_valid(pte)) {
> > + ret = -EPERM;
> > + goto unlock;
> > + }
> > +
> > + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL);
> > + if (ret)
> > + goto unlock;
> > + if (pte) {
> > + ret = -EBUSY;
> > + goto unlock;
> > + }
>
> I'm thinking of a situation where both of these checks might be
> necessary... The first check seems to confirm that the page being donated
> isn't set up to trap in the hyp (i.e. the donor/host doesn't own the
> page anymore).
>
> However, the second check seems to verify that the pfn isn't already mapped
> in the hyp's space. Is this check only to catch erroneous donations of
> a shared page, or is there something else?
The first check confirms that the host kernel owns the page, so it can
donate it.
The second check confirms that the hypervisor doesn't already have
something mapped at that address.
I can't find a case where this happens; I believe the second is mainly a
debug check (similar to __pkvm_host_donate/share_hyp for normal memory).
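In other words, the intended state checks are (my summary, not code
from the series):

    host stage-2 leaf for phys:
      pte == 0              -> unmapped, host owns it by default: OK
      valid pte             -> host has it mapped: OK (donation proceeds)
      invalid non-zero pte  -> annotated with another owner: -EPERM

    hyp stage-1 leaf for virt:
      pte == 0              -> nothing mapped there: OK
      pte != 0              -> hyp already has a mapping: -EBUSY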
Thanks,
Mostafa
>
> > +
> > + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, PAGE_HYP_DEVICE);
> > + if (ret)
> > + goto unlock;
> > + /*
> > + * We set HYP as the owner of the MMIO pages in the host stage-2, for:
> > + * - host aborts: host_stage2_adjust_range() would fail for invalid non zero PTEs.
> > + * - recycle under memory pressure: host_stage2_unmap_dev_all() would call
> > + * kvm_pgtable_stage2_unmap() which will not clear non zero invalid ptes (counted).
> > + * - other MMIO donation: Would fail as we check that the PTE is valid or empty.
> > + */
> > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HYP));
> > +unlock:
> > + hyp_unlock_component();
> > + host_unlock_component();
> > +
> > + return ret;
> > +}
> > +
> > +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> > +{
> > + u64 phys = hyp_pfn_to_phys(pfn);
> > + u64 virt = (u64)__hyp_va(phys);
> > + size_t size = PAGE_SIZE;
> > +
> > + host_lock_component();
> > + hyp_lock_component();
> > +
> > + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> > + hyp_unlock_component();
> > + host_unlock_component();
> > +
> > + return 0;
> > +}
> > +
> > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > {
> > return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index c351b4abd5db..ba06b0c21d5a 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > kvm_pte_t *childp = NULL;
> > bool need_flush = false;
> >
> > - if (!kvm_pte_valid(ctx->old)) {
> > - if (stage2_pte_is_counted(ctx->old)) {
> > - kvm_clear_pte(ctx->ptep);
> > - mm_ops->put_page(ctx->ptep);
> > - }
> > - return 0;
> > - }
> > + if (!kvm_pte_valid(ctx->old))
> > + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
> >
> > if (kvm_pte_table(ctx->old, ctx->level)) {
> > childp = kvm_pte_follow(ctx->old, mm_ops);
> > --
>
> Thanks
> Praan
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get()
2025-09-09 15:56 ` Marc Zyngier
2025-09-15 11:10 ` Pranjal Shrivastava
@ 2025-09-16 14:04 ` Mostafa Saleh
1 sibling, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:04 UTC (permalink / raw)
To: Marc Zyngier
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba, jgg,
mark.rutland, praan
On Tue, Sep 09, 2025 at 04:56:16PM +0100, Marc Zyngier wrote:
> On Tue, 09 Sep 2025 15:16:26 +0100,
> Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Aug 19, 2025 at 09:51:31PM +0000, Mostafa Saleh wrote:
> > > Add a function to return time in us.
> > >
> > > This can be used from IOMMU drivers while waiting for conditions as
> > > for SMMUv3 TLB invalidation waiting for sync.
> > >
> > > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > ---
> > > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 2 ++
> > > arch/arm64/kvm/hyp/nvhe/setup.c | 4 ++++
> > > arch/arm64/kvm/hyp/nvhe/timer-sr.c | 33 ++++++++++++++++++++++++++
> > > 3 files changed, 39 insertions(+)
>
> [...]
>
> > > +#define SEC_TO_US 1000000
> > > +
> > > +int pkvm_timer_init(void)
> > > +{
> > > + timer_freq = read_sysreg(cntfrq_el0);
> > > + /*
> > > + * TODO: The highest privileged level is supposed to initialize this
> > > + * register. But on some systems (which?), this information is only
> > > + * contained in the device-tree, so we'll need to find it out some other
> > > + * way.
> > > + */
> > > + if (!timer_freq || timer_freq < SEC_TO_US)
> > > + return -ENODEV;
> > > + return 0;
> > > +}
> >
> > Right, I think the frequency should be provided by the host once the arch
> > timer driver has probed successfully. Relying on CNTFRQ isn't viable imo.
>
> We can always patch the value in, à la kimage_voffset. But it really
> begs the question: who in their right mind doesn't set CNTFRQ_EL0 to
> something sensible? Why should we care about supporting such
> contraption?
>
> I'd be happy to simply disable KVM when CNTFRQ_EL0 is misprogrammed,
> or that the device tree provides a clock frequency. Because there is
> no good way to support a guest in that case.
>
I can make "arch_timer_rate" available to the hypervisor, but I'd rather
just fail in that case as Marc suggested, to avoid complexity (and due
to the lack of HW on my end to test this), even if we check this only
for protected mode.
Thanks,
Mostafa
> M.
>
> --
> Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file
2025-09-15 14:37 ` Pranjal Shrivastava
@ 2025-09-16 14:07 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:07 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland
On Mon, Sep 15, 2025 at 02:37:49PM +0000, Pranjal Shrivastava wrote:
> On Tue, Aug 19, 2025 at 09:51:32PM +0000, Mostafa Saleh wrote:
> > Soon, io-pgtable-arm.c will be compiled as part of KVM/arm64, in the
> > hypervisor object, which doesn't have many of the kernel APIs, such
> > as faux devices, printk...
> >
> > We need to factor these things out of this file; this patch moves the
> > selftests out, which removes many of the kernel dependencies that are
> > also not needed by the hypervisor.
> > Create io-pgtable-arm-kernel.c for that; in the next patch the rest of
> > the code is factored out.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > drivers/iommu/Makefile | 2 +-
> > drivers/iommu/io-pgtable-arm-kernel.c | 216 +++++++++++++++++++++++
> > drivers/iommu/io-pgtable-arm.c | 245 --------------------------
> > drivers/iommu/io-pgtable-arm.h | 41 +++++
> > 4 files changed, 258 insertions(+), 246 deletions(-)
> > create mode 100644 drivers/iommu/io-pgtable-arm-kernel.c
> >
> > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > index 355294fa9033..d601b0e25ef5 100644
> > --- a/drivers/iommu/Makefile
> > +++ b/drivers/iommu/Makefile
> > @@ -11,7 +11,7 @@ obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
> > obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
> > obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
> > obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
> > -obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
> > +obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o io-pgtable-arm-kernel.o
> > obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
> > obj-$(CONFIG_IOMMU_IOVA) += iova.o
> > obj-$(CONFIG_OF_IOMMU) += of_iommu.o
> > diff --git a/drivers/iommu/io-pgtable-arm-kernel.c b/drivers/iommu/io-pgtable-arm-kernel.c
> > new file mode 100644
> > index 000000000000..f3b869310964
> > --- /dev/null
> > +++ b/drivers/iommu/io-pgtable-arm-kernel.c
>
> If this file just contains the selftests, how about naming it
> "io-pgtable-arm-selftests.c"?
In the next patch I am adding more kernel code to it, so it's not just
selftests. But as Jason suggested, we can completely move the selftests
out; in that case "io-pgtable-arm-selftests.c" makes sense.
Thanks,
Mostafa
>
> > @@ -0,0 +1,216 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * CPU-agnostic ARM page table allocator.
> > + *
> > + * Copyright (C) 2014 ARM Limited
> > + *
> > + * Author: Will Deacon <will.deacon@arm.com>
> > + */
> > +#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
> > +
> > +#include <linux/device/faux.h>
> > +#include <linux/kernel.h>
> > +#include <linux/slab.h>
> > +
> > +#include "io-pgtable-arm.h"
> > +
> > +#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
> > +
> > +static struct io_pgtable_cfg *cfg_cookie __initdata;
> > +
> > +static void __init dummy_tlb_flush_all(void *cookie)
> > +{
> > + WARN_ON(cookie != cfg_cookie);
> > +}
> > +
> > +static void __init dummy_tlb_flush(unsigned long iova, size_t size,
> > + size_t granule, void *cookie)
> > +{
> > + WARN_ON(cookie != cfg_cookie);
> > + WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
> > +}
> > +
> > +static void __init dummy_tlb_add_page(struct iommu_iotlb_gather *gather,
> > + unsigned long iova, size_t granule,
> > + void *cookie)
> > +{
> > + dummy_tlb_flush(iova, granule, granule, cookie);
> > +}
> > +
> > +static const struct iommu_flush_ops dummy_tlb_ops __initconst = {
> > + .tlb_flush_all = dummy_tlb_flush_all,
> > + .tlb_flush_walk = dummy_tlb_flush,
> > + .tlb_add_page = dummy_tlb_add_page,
> > +};
> > +
> > +static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
> > +{
> > + struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > + struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > +
> > + pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
> > + cfg->pgsize_bitmap, cfg->ias);
> > + pr_err("data: %d levels, 0x%zx pgd_size, %u pg_shift, %u bits_per_level, pgd @ %p\n",
> > + ARM_LPAE_MAX_LEVELS - data->start_level, ARM_LPAE_PGD_SIZE(data),
> > + ilog2(ARM_LPAE_GRANULE(data)), data->bits_per_level, data->pgd);
> > +}
> > +
> > +#define __FAIL(ops, i) ({ \
> > + WARN(1, "selftest: test failed for fmt idx %d\n", (i)); \
> > + arm_lpae_dump_ops(ops); \
> > + -EFAULT; \
> > +})
> > +
> > +static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
> > +{
> > + static const enum io_pgtable_fmt fmts[] __initconst = {
> > + ARM_64_LPAE_S1,
> > + ARM_64_LPAE_S2,
> > + };
> > +
> > + int i, j;
> > + unsigned long iova;
> > + size_t size, mapped;
> > + struct io_pgtable_ops *ops;
> > +
> > + for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
> > + cfg_cookie = cfg;
> > + ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
> > + if (!ops) {
> > + pr_err("selftest: failed to allocate io pgtable ops\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /*
> > + * Initial sanity checks.
> > + * Empty page tables shouldn't provide any translations.
> > + */
> > + if (ops->iova_to_phys(ops, 42))
> > + return __FAIL(ops, i);
> > +
> > + if (ops->iova_to_phys(ops, SZ_1G + 42))
> > + return __FAIL(ops, i);
> > +
> > + if (ops->iova_to_phys(ops, SZ_2G + 42))
> > + return __FAIL(ops, i);
> > +
> > + /*
> > + * Distinct mappings of different granule sizes.
> > + */
> > + iova = 0;
> > + for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> > + size = 1UL << j;
> > +
> > + if (ops->map_pages(ops, iova, iova, size, 1,
> > + IOMMU_READ | IOMMU_WRITE |
> > + IOMMU_NOEXEC | IOMMU_CACHE,
> > + GFP_KERNEL, &mapped))
> > + return __FAIL(ops, i);
> > +
> > + /* Overlapping mappings */
> > + if (!ops->map_pages(ops, iova, iova + size, size, 1,
> > + IOMMU_READ | IOMMU_NOEXEC,
> > + GFP_KERNEL, &mapped))
> > + return __FAIL(ops, i);
> > +
> > + if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> > + return __FAIL(ops, i);
> > +
> > + iova += SZ_1G;
> > + }
> > +
> > + /* Full unmap */
> > + iova = 0;
> > + for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> > + size = 1UL << j;
> > +
> > + if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> > + return __FAIL(ops, i);
> > +
> > + if (ops->iova_to_phys(ops, iova + 42))
> > + return __FAIL(ops, i);
> > +
> > + /* Remap full block */
> > + if (ops->map_pages(ops, iova, iova, size, 1,
> > + IOMMU_WRITE, GFP_KERNEL, &mapped))
> > + return __FAIL(ops, i);
> > +
> > + if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> > + return __FAIL(ops, i);
> > +
> > + iova += SZ_1G;
> > + }
> > +
> > + /*
> > + * Map/unmap the last largest supported page of the IAS; this can
> > + * trigger corner cases in the concatenated page tables.
> > + */
> > + mapped = 0;
> > + size = 1UL << __fls(cfg->pgsize_bitmap);
> > + iova = (1UL << cfg->ias) - size;
> > + if (ops->map_pages(ops, iova, iova, size, 1,
> > + IOMMU_READ | IOMMU_WRITE |
> > + IOMMU_NOEXEC | IOMMU_CACHE,
> > + GFP_KERNEL, &mapped))
> > + return __FAIL(ops, i);
> > + if (mapped != size)
> > + return __FAIL(ops, i);
> > + if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> > + return __FAIL(ops, i);
> > +
> > + free_io_pgtable_ops(ops);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int __init arm_lpae_do_selftests(void)
> > +{
> > + static const unsigned long pgsize[] __initconst = {
> > + SZ_4K | SZ_2M | SZ_1G,
> > + SZ_16K | SZ_32M,
> > + SZ_64K | SZ_512M,
> > + };
> > +
> > + static const unsigned int address_size[] __initconst = {
> > + 32, 36, 40, 42, 44, 48,
> > + };
> > +
> > + int i, j, k, pass = 0, fail = 0;
> > + struct faux_device *dev;
> > + struct io_pgtable_cfg cfg = {
> > + .tlb = &dummy_tlb_ops,
> > + .coherent_walk = true,
> > + .quirks = IO_PGTABLE_QUIRK_NO_WARN,
> > + };
> > +
> > + dev = faux_device_create("io-pgtable-test", NULL, 0);
> > + if (!dev)
> > + return -ENOMEM;
> > +
> > + cfg.iommu_dev = &dev->dev;
> > +
> > + for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
> > + for (j = 0; j < ARRAY_SIZE(address_size); ++j) {
> > + /* Don't use ias > oas as it is not valid for stage-2. */
> > + for (k = 0; k <= j; ++k) {
> > + cfg.pgsize_bitmap = pgsize[i];
> > + cfg.ias = address_size[k];
> > + cfg.oas = address_size[j];
> > + pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u OAS %u\n",
> > + pgsize[i], cfg.ias, cfg.oas);
> > + if (arm_lpae_run_tests(&cfg))
> > + fail++;
> > + else
> > + pass++;
> > + }
> > + }
> > + }
> > +
> > + pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
> > + faux_device_destroy(dev);
> > +
> > + return fail ? -EFAULT : 0;
> > +}
> > +subsys_initcall(arm_lpae_do_selftests);
> > +#endif
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index 96425e92f313..791a2c4ecb83 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -7,15 +7,10 @@
> > * Author: Will Deacon <will.deacon@arm.com>
> > */
> >
> > -#define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt
> > -
> > #include <linux/atomic.h>
> > #include <linux/bitops.h>
> > #include <linux/io-pgtable.h>
> > -#include <linux/kernel.h>
> > -#include <linux/device/faux.h>
> > #include <linux/sizes.h>
> > -#include <linux/slab.h>
> > #include <linux/types.h>
> > #include <linux/dma-mapping.h>
> >
> > @@ -24,33 +19,6 @@
> > #include "io-pgtable-arm.h"
> > #include "iommu-pages.h"
> >
> > -#define ARM_LPAE_MAX_ADDR_BITS 52
> > -#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
> > -#define ARM_LPAE_MAX_LEVELS 4
> > -
> > -/* Struct accessors */
> > -#define io_pgtable_to_data(x) \
> > - container_of((x), struct arm_lpae_io_pgtable, iop)
> > -
> > -#define io_pgtable_ops_to_data(x) \
> > - io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
> > -
> > -/*
> > - * Calculate the right shift amount to get to the portion describing level l
> > - * in a virtual address mapped by the pagetable in d.
> > - */
> > -#define ARM_LPAE_LVL_SHIFT(l,d) \
> > - (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
> > - ilog2(sizeof(arm_lpae_iopte)))
> > -
> > -#define ARM_LPAE_GRANULE(d) \
> > - (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
> > -#define ARM_LPAE_PGD_SIZE(d) \
> > - (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
> > -
> > -#define ARM_LPAE_PTES_PER_TABLE(d) \
> > - (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
> > -
> > /*
> > * Calculate the index at level l used to map virtual address a using the
> > * pagetable in d.
> > @@ -163,18 +131,6 @@
> > #define iopte_set_writeable_clean(ptep) \
> > set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)(ptep))
> >
> > -struct arm_lpae_io_pgtable {
> > - struct io_pgtable iop;
> > -
> > - int pgd_bits;
> > - int start_level;
> > - int bits_per_level;
> > -
> > - void *pgd;
> > -};
> > -
> > -typedef u64 arm_lpae_iopte;
> > -
> > static inline bool iopte_leaf(arm_lpae_iopte pte, int lvl,
> > enum io_pgtable_fmt fmt)
> > {
> > @@ -1274,204 +1230,3 @@ struct io_pgtable_init_fns io_pgtable_arm_mali_lpae_init_fns = {
> > .alloc = arm_mali_lpae_alloc_pgtable,
> > .free = arm_lpae_free_pgtable,
> > };
> > -
> > -#ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
> > -
> > -static struct io_pgtable_cfg *cfg_cookie __initdata;
> > -
> > -static void __init dummy_tlb_flush_all(void *cookie)
> > -{
> > - WARN_ON(cookie != cfg_cookie);
> > -}
> > -
> > -static void __init dummy_tlb_flush(unsigned long iova, size_t size,
> > - size_t granule, void *cookie)
> > -{
> > - WARN_ON(cookie != cfg_cookie);
> > - WARN_ON(!(size & cfg_cookie->pgsize_bitmap));
> > -}
> > -
> > -static void __init dummy_tlb_add_page(struct iommu_iotlb_gather *gather,
> > - unsigned long iova, size_t granule,
> > - void *cookie)
> > -{
> > - dummy_tlb_flush(iova, granule, granule, cookie);
> > -}
> > -
> > -static const struct iommu_flush_ops dummy_tlb_ops __initconst = {
> > - .tlb_flush_all = dummy_tlb_flush_all,
> > - .tlb_flush_walk = dummy_tlb_flush,
> > - .tlb_add_page = dummy_tlb_add_page,
> > -};
> > -
> > -static void __init arm_lpae_dump_ops(struct io_pgtable_ops *ops)
> > -{
> > - struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > - struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > -
> > - pr_err("cfg: pgsize_bitmap 0x%lx, ias %u-bit\n",
> > - cfg->pgsize_bitmap, cfg->ias);
> > - pr_err("data: %d levels, 0x%zx pgd_size, %u pg_shift, %u bits_per_level, pgd @ %p\n",
> > - ARM_LPAE_MAX_LEVELS - data->start_level, ARM_LPAE_PGD_SIZE(data),
> > - ilog2(ARM_LPAE_GRANULE(data)), data->bits_per_level, data->pgd);
> > -}
> > -
> > -#define __FAIL(ops, i) ({ \
> > - WARN(1, "selftest: test failed for fmt idx %d\n", (i)); \
> > - arm_lpae_dump_ops(ops); \
> > - -EFAULT; \
> > -})
> > -
> > -static int __init arm_lpae_run_tests(struct io_pgtable_cfg *cfg)
> > -{
> > - static const enum io_pgtable_fmt fmts[] __initconst = {
> > - ARM_64_LPAE_S1,
> > - ARM_64_LPAE_S2,
> > - };
> > -
> > - int i, j;
> > - unsigned long iova;
> > - size_t size, mapped;
> > - struct io_pgtable_ops *ops;
> > -
> > - for (i = 0; i < ARRAY_SIZE(fmts); ++i) {
> > - cfg_cookie = cfg;
> > - ops = alloc_io_pgtable_ops(fmts[i], cfg, cfg);
> > - if (!ops) {
> > - pr_err("selftest: failed to allocate io pgtable ops\n");
> > - return -ENOMEM;
> > - }
> > -
> > - /*
> > - * Initial sanity checks.
> > - * Empty page tables shouldn't provide any translations.
> > - */
> > - if (ops->iova_to_phys(ops, 42))
> > - return __FAIL(ops, i);
> > -
> > - if (ops->iova_to_phys(ops, SZ_1G + 42))
> > - return __FAIL(ops, i);
> > -
> > - if (ops->iova_to_phys(ops, SZ_2G + 42))
> > - return __FAIL(ops, i);
> > -
> > - /*
> > - * Distinct mappings of different granule sizes.
> > - */
> > - iova = 0;
> > - for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> > - size = 1UL << j;
> > -
> > - if (ops->map_pages(ops, iova, iova, size, 1,
> > - IOMMU_READ | IOMMU_WRITE |
> > - IOMMU_NOEXEC | IOMMU_CACHE,
> > - GFP_KERNEL, &mapped))
> > - return __FAIL(ops, i);
> > -
> > - /* Overlapping mappings */
> > - if (!ops->map_pages(ops, iova, iova + size, size, 1,
> > - IOMMU_READ | IOMMU_NOEXEC,
> > - GFP_KERNEL, &mapped))
> > - return __FAIL(ops, i);
> > -
> > - if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> > - return __FAIL(ops, i);
> > -
> > - iova += SZ_1G;
> > - }
> > -
> > - /* Full unmap */
> > - iova = 0;
> > - for_each_set_bit(j, &cfg->pgsize_bitmap, BITS_PER_LONG) {
> > - size = 1UL << j;
> > -
> > - if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> > - return __FAIL(ops, i);
> > -
> > - if (ops->iova_to_phys(ops, iova + 42))
> > - return __FAIL(ops, i);
> > -
> > - /* Remap full block */
> > - if (ops->map_pages(ops, iova, iova, size, 1,
> > - IOMMU_WRITE, GFP_KERNEL, &mapped))
> > - return __FAIL(ops, i);
> > -
> > - if (ops->iova_to_phys(ops, iova + 42) != (iova + 42))
> > - return __FAIL(ops, i);
> > -
> > - iova += SZ_1G;
> > - }
> > -
> > - /*
> > - * Map/unmap the last largest supported page of the IAS; this can
> > - * trigger corner cases in the concatenated page tables.
> > - */
> > - mapped = 0;
> > - size = 1UL << __fls(cfg->pgsize_bitmap);
> > - iova = (1UL << cfg->ias) - size;
> > - if (ops->map_pages(ops, iova, iova, size, 1,
> > - IOMMU_READ | IOMMU_WRITE |
> > - IOMMU_NOEXEC | IOMMU_CACHE,
> > - GFP_KERNEL, &mapped))
> > - return __FAIL(ops, i);
> > - if (mapped != size)
> > - return __FAIL(ops, i);
> > - if (ops->unmap_pages(ops, iova, size, 1, NULL) != size)
> > - return __FAIL(ops, i);
> > -
> > - free_io_pgtable_ops(ops);
> > - }
> > -
> > - return 0;
> > -}
> > -
> > -static int __init arm_lpae_do_selftests(void)
> > -{
> > - static const unsigned long pgsize[] __initconst = {
> > - SZ_4K | SZ_2M | SZ_1G,
> > - SZ_16K | SZ_32M,
> > - SZ_64K | SZ_512M,
> > - };
> > -
> > - static const unsigned int address_size[] __initconst = {
> > - 32, 36, 40, 42, 44, 48,
> > - };
> > -
> > - int i, j, k, pass = 0, fail = 0;
> > - struct faux_device *dev;
> > - struct io_pgtable_cfg cfg = {
> > - .tlb = &dummy_tlb_ops,
> > - .coherent_walk = true,
> > - .quirks = IO_PGTABLE_QUIRK_NO_WARN,
> > - };
> > -
> > - dev = faux_device_create("io-pgtable-test", NULL, 0);
> > - if (!dev)
> > - return -ENOMEM;
> > -
> > - cfg.iommu_dev = &dev->dev;
> > -
> > - for (i = 0; i < ARRAY_SIZE(pgsize); ++i) {
> > - for (j = 0; j < ARRAY_SIZE(address_size); ++j) {
> > - /* Don't use ias > oas as it is not valid for stage-2. */
> > - for (k = 0; k <= j; ++k) {
> > - cfg.pgsize_bitmap = pgsize[i];
> > - cfg.ias = address_size[k];
> > - cfg.oas = address_size[j];
> > - pr_info("selftest: pgsize_bitmap 0x%08lx, IAS %u OAS %u\n",
> > - pgsize[i], cfg.ias, cfg.oas);
> > - if (arm_lpae_run_tests(&cfg))
> > - fail++;
> > - else
> > - pass++;
> > - }
> > - }
> > - }
> > -
> > - pr_info("selftest: completed with %d PASS %d FAIL\n", pass, fail);
> > - faux_device_destroy(dev);
> > -
> > - return fail ? -EFAULT : 0;
> > -}
> > -subsys_initcall(arm_lpae_do_selftests);
> > -#endif
> > diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
> > index ba7cfdf7afa0..a06a23543cff 100644
> > --- a/drivers/iommu/io-pgtable-arm.h
> > +++ b/drivers/iommu/io-pgtable-arm.h
> > @@ -2,6 +2,8 @@
> > #ifndef IO_PGTABLE_ARM_H_
> > #define IO_PGTABLE_ARM_H_
> >
> > +#include <linux/io-pgtable.h>
> > +
> > #define ARM_LPAE_TCR_TG0_4K 0
> > #define ARM_LPAE_TCR_TG0_64K 1
> > #define ARM_LPAE_TCR_TG0_16K 2
> > @@ -27,4 +29,43 @@
> > #define ARM_LPAE_TCR_PS_48_BIT 0x5ULL
> > #define ARM_LPAE_TCR_PS_52_BIT 0x6ULL
> >
> > +/* Struct accessors */
> > +#define io_pgtable_to_data(x) \
> > + container_of((x), struct arm_lpae_io_pgtable, iop)
> > +
> > +#define io_pgtable_ops_to_data(x) \
> > + io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
> > +
> > +struct arm_lpae_io_pgtable {
> > + struct io_pgtable iop;
> > +
> > + int pgd_bits;
> > + int start_level;
> > + int bits_per_level;
> > +
> > + void *pgd;
> > +};
> > +
> > +#define ARM_LPAE_MAX_ADDR_BITS 52
> > +#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
> > +#define ARM_LPAE_MAX_LEVELS 4
> > +
> > +/*
> > + * Calculate the right shift amount to get to the portion describing level l
> > + * in a virtual address mapped by the pagetable in d.
> > + */
> > +#define ARM_LPAE_LVL_SHIFT(l,d) \
> > + (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
> > + ilog2(sizeof(arm_lpae_iopte)))
> > +
> > +#define ARM_LPAE_GRANULE(d) \
> > + (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
> > +#define ARM_LPAE_PGD_SIZE(d) \
> > + (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
> > +
> > +#define ARM_LPAE_PTES_PER_TABLE(d) \
> > + (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
> > +
> > +typedef u64 arm_lpae_iopte;
> > +
> > #endif /* IO_PGTABLE_ARM_H_ */
>
> Apart from the renaming above, I was able to apply this patch alone and
> build successfully while toggling IOMMU_IO_PGTABLE_LPAE_SELFTEST across
> builds.
>
> Reviewed-by: Pranjal Shrivastava <praan@google.com>
>
> > --
> > 2.51.0.rc1.167.g924127e9c0-goog
> >
* Re: [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file
2025-09-15 16:45 ` Jason Gunthorpe
@ 2025-09-16 14:09 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, mark.rutland, praan
On Mon, Sep 15, 2025 at 01:45:17PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 19, 2025 at 09:51:32PM +0000, Mostafa Saleh wrote:
> > Soon, io-pgtable-arm.c will be compiled as part of KVM/arm64
> > into the hypervisor object, which doesn't have many of the kernel
> > APIs, such as faux devices, printk...
> >
> > We need to factor these things out of this file; this patch
> > moves the selftests out, which removes many of the kernel
> > dependencies that are not needed by the hypervisor anyway.
> > Create io-pgtable-arm-kernel.c for that; in the next patch
> > the rest of the code is factored out.
>
> Please send this as a stand alone patch, it looks like a good idea.
>
> Also please add the boilerplate to wrap the selftest into a kunit and
> use the usual kunit machinery. We already have a kunit for
> smmuv3; this can just add another file to that.
Makes sense, I will do that (hopefully before the weekend)
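Something along these lines, I suppose (just a sketch, not tested:
the suite/case names are made up, cfg.iommu_dev is omitted, and the
__init annotations on the selftest code would need revisiting for
kunit):

#include <kunit/test.h>

static void arm_lpae_selftest_case(struct kunit *test)
{
	/* One representative config; the real suite would iterate. */
	struct io_pgtable_cfg cfg = {
		.tlb		= &dummy_tlb_ops,
		.coherent_walk	= true,
		.quirks		= IO_PGTABLE_QUIRK_NO_WARN,
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
		.ias		= 32,
		.oas		= 40,
	};

	KUNIT_EXPECT_EQ(test, 0, arm_lpae_run_tests(&cfg));
}

static struct kunit_case arm_lpae_test_cases[] = {
	KUNIT_CASE(arm_lpae_selftest_case),
	{}
};

static struct kunit_suite arm_lpae_test_suite = {
	.name		= "io-pgtable-arm-lpae",
	.test_cases	= arm_lpae_test_cases,
};
kunit_test_suite(arm_lpae_test_suite);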
Thanks,
Mostafa
>
> Jason
* Re: [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp
2025-09-09 14:23 ` Will Deacon
@ 2025-09-16 14:10 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:10 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 09, 2025 at 03:23:14PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:34PM +0000, Mostafa Saleh wrote:
> > The KVM SMMUv3 driver would re-use some of the cmdq code inside
> > the hypervisor, move these functions to a new common c file that
> > is shared between the host kernel and the hypervisor.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/Makefile | 2 +-
> > .../arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c | 114 ++++++++++++++
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 146 ------------------
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 44 ++++++
> > 4 files changed, 159 insertions(+), 147 deletions(-)
> > create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > index 493a659cc66b..1918b4a64cb0 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/Makefile
> > +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > @@ -1,6 +1,6 @@
> > # SPDX-License-Identifier: GPL-2.0
> > obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
> > -arm_smmu_v3-y := arm-smmu-v3.o
> > +arm_smmu_v3-y := arm-smmu-v3.o arm-smmu-v3-common-hyp.o
> > arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_IOMMUFD) += arm-smmu-v3-iommufd.o
> > arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
> > arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
> > new file mode 100644
> > index 000000000000..62744c8548a8
> > --- /dev/null
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common-hyp.c
> > @@ -0,0 +1,114 @@
>
> Given that this file is linked into both the kernel and the hypervisor
> objects, I think I'd drop the '-hyp' part from the filename. Maybe
> something like 'arm-smmu-v3-lib.c' instead?
>
Yes, that makes more sense, will do that.
Thanks,
Mostafa
> Let the bike-shedding begin!
>
> Will
* Re: [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-09-09 14:42 ` Will Deacon
@ 2025-09-16 14:24 ` Mostafa Saleh
2025-09-26 14:42 ` Will Deacon
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:24 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 09, 2025 at 03:42:07PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:38PM +0000, Mostafa Saleh wrote:
> > Create a shadow page table for the IOMMU that shadows the
> > host CPU stage-2 into the IOMMUs to establish DMA isolation.
> >
> > An initial snapshot is created after the driver init, then
> > on every permission change a callback would be called for
> > the IOMMU driver to update the page table.
> >
> > For some cases, an SMMUv3 may be able to share the same page
> > table used with the host CPU stage-2 directly.
> > However, this is too strict and requires changes to the core hypervisor
> > page table code, plus it would require the hypervisor to handle IOMMU
> > page faults. This can be added later as an optimization for SMMUV3.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/kvm/hyp/include/nvhe/iommu.h | 4 ++
> > arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 83 ++++++++++++++++++++++++-
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 5 ++
> > 3 files changed, 90 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
> > index 1ac70cc28a9e..219363045b1c 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
> > @@ -3,11 +3,15 @@
> > #define __ARM64_KVM_NVHE_IOMMU_H__
> >
> > #include <asm/kvm_host.h>
> > +#include <asm/kvm_pgtable.h>
> >
> > struct kvm_iommu_ops {
> > int (*init)(void);
> > + void (*host_stage2_idmap)(phys_addr_t start, phys_addr_t end, int prot);
> > };
> >
> > int kvm_iommu_init(void);
> >
> > +void kvm_iommu_host_stage2_idmap(phys_addr_t start, phys_addr_t end,
> > + enum kvm_pgtable_prot prot);
> > #endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > index a01c036c55be..f7d1c8feb358 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > @@ -4,15 +4,94 @@
> > *
> > * Copyright (C) 2022 Linaro Ltd.
> > */
> > +#include <linux/iommu.h>
> > +
> > #include <nvhe/iommu.h>
> > +#include <nvhe/mem_protect.h>
> > +#include <nvhe/spinlock.h>
> >
> > /* Only one set of ops supported */
> > struct kvm_iommu_ops *kvm_iommu_ops;
> >
> > +/* Protected by host_mmu.lock */
> > +static bool kvm_idmap_initialized;
> > +
> > +static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
> > +{
> > + int iommu_prot = 0;
> > +
> > + if (prot & KVM_PGTABLE_PROT_R)
> > + iommu_prot |= IOMMU_READ;
> > + if (prot & KVM_PGTABLE_PROT_W)
> > + iommu_prot |= IOMMU_WRITE;
> > + if (prot == PKVM_HOST_MMIO_PROT)
> > + iommu_prot |= IOMMU_MMIO;
>
> This looks a little odd to me.
>
> On the CPU side, the only different between PKVM_HOST_MEM_PROT and
> PKVM_HOST_MMIO_PROT is that the former has execute permission. Both are
> mapped as cacheable at stage-2 because it's the job of the host to set
> the more restrictive memory type at stage-1.
>
> Carrying that over to the SMMU would suggest that we don't care about
> IOMMU_MMIO at stage-2 at all, so why do we need to set it here?
Unlike the CPU, the host can set the SMMU to bypass; in that case the
hypervisor will attach its stage-2 with no stage-1 configured. So, the
stage-2 must have the correct attrs for MMIO.
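Concretely, the stage-2 entry itself has to carry the Device vs
Normal-WB attribute in that case. A minimal sketch, reusing the
MEMATTR encodings io-pgtable-arm already defines (the helper name is
made up):

static u64 stage2_memattr_sketch(int iommu_prot)
{
	/*
	 * With stage-1 in bypass there is no stage-1 attribute to
	 * combine with, so MMIO must be Device at stage-2.
	 */
	return (iommu_prot & IOMMU_MMIO) ?
		ARM_LPAE_PTE_MEMATTR_DEV : ARM_LPAE_PTE_MEMATTR_OIWB;
}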
>
> > + /* We don't understand that, might be dangerous. */
> > + WARN_ON(prot & ~PKVM_HOST_MEM_PROT);
> > + return iommu_prot;
> > +}
> > +
> > +static int __snapshot_host_stage2(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags visit)
> > +{
> > + u64 start = ctx->addr;
> > + kvm_pte_t pte = *ctx->ptep;
> > + u32 level = ctx->level;
> > + u64 end = start + kvm_granule_size(level);
> > + int prot = IOMMU_READ | IOMMU_WRITE;
> > +
> > + /* Keep unmapped. */
> > + if (pte && !kvm_pte_valid(pte))
> > + return 0;
> > +
> > + if (kvm_pte_valid(pte))
> > + prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
> > + else if (!addr_is_memory(start))
> > + prot |= IOMMU_MMIO;
>
> Why do we need to map MMIO regions pro-actively here? I'd have thought
> we could just do:
>
> if (!kvm_pte_valid(pte))
> return 0;
>
> prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte);
> kvm_iommu_ops->host_stage2_idmap(start, end, prot);
> return 0;
>
> but I think that IOMMU_MMIO is throwing me again...
We have to map everything pro-actively as we don’t handle page faults
in the SMMUv3 driver.
This would be future work, where the CPU stage-2 page table is shared with
the SMMUv3.
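FWIW, the initial snapshot itself is just one leaf walk over the host
stage-2, roughly (a sketch; the wrapper name is made up):

static int kvm_iommu_snapshot_host_stage2(void)
{
	struct kvm_pgtable_walker walker = {
		.cb	= __snapshot_host_stage2,
		.flags	= KVM_PGTABLE_WALK_LEAF,
	};
	struct kvm_pgtable *pgt = &host_mmu.pgt;

	/* Visit every leaf once; the callback mirrors each mapping. */
	return kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker);
}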
Thanks,
Mostafa
>
> Will
* Re: [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver
2025-09-12 13:52 ` Will Deacon
@ 2025-09-16 14:30 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:30 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Fri, Sep 12, 2025 at 02:52:27PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:42PM +0000, Mostafa Saleh wrote:
> > Add a file only compiled for KVM mode.
> >
> > At the moment it registers the driver with KVM, and add the hook
> > needed for memory allocation.
> >
> > Next, it will create the array with available SMMUs and their
> > description.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 4 +++
> > arch/arm64/kvm/iommu.c | 10 ++++--
> > drivers/iommu/arm/arm-smmu-v3/Makefile | 1 +
> > .../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 36 +++++++++++++++++++
> > 4 files changed, 49 insertions(+), 2 deletions(-)
> > create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index fcb4b26072f7..52212c0f2e9c 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1678,4 +1678,8 @@ struct kvm_iommu_ops;
> > int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops);
> > size_t kvm_iommu_pages(void);
> >
> > +#ifdef CONFIG_ARM_SMMU_V3_PKVM
> > +size_t smmu_hyp_pgt_pages(void);
> > +#endif
> > +
> > #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
> > index 5460b1bd44a6..0475f7c95c6c 100644
> > --- a/arch/arm64/kvm/iommu.c
> > +++ b/arch/arm64/kvm/iommu.c
> > @@ -17,10 +17,16 @@ int kvm_iommu_register_driver(struct kvm_iommu_ops *hyp_ops)
> >
> > size_t kvm_iommu_pages(void)
> > {
> > + size_t nr_pages = 0;
> > +
> > /*
> > * This is called very early during setup_arch() where no initcalls,
> > * so this has to call specific functions per each KVM driver.
> > */
> > - kvm_nvhe_sym(hyp_kvm_iommu_pages) = 0;
> > - return 0;
> > +#ifdef CONFIG_ARM_SMMU_V3_PKVM
> > + nr_pages = smmu_hyp_pgt_pages();
> > +#endif
>
> Rather than hard-code this here, I wonder whether it would be better to
> have a default size for the IOMMU carveout and have the driver tells us
> how much it needs later on when it probes. Then we could either free
> any unused portion back to the host or return an error to the driver if
> it wants more than we have.
I can do that; we can set the default from a config option or the
cmdline (or both).
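e.g. for the cmdline side, something like (a sketch; the parameter
name and default are made up):

/* Default carveout size, overridable from the kernel command line. */
static unsigned long kvm_iommu_carveout_size = SZ_16M;

static int __init early_iommu_carveout(char *p)
{
	kvm_iommu_carveout_size = memparse(p, NULL);
	return 0;
}
early_param("kvm-arm.iommu_carveout", early_iommu_carveout);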
Thanks,
Mostafa
>
> Will
>
* Re: [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3
2025-09-09 18:30 ` Daniel Mentz
@ 2025-09-16 14:35 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:35 UTC (permalink / raw)
To: Daniel Mentz
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas, will,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 09, 2025 at 11:30:48AM -0700, Daniel Mentz wrote:
> On Tue, Aug 19, 2025 at 2:55 PM Mostafa Saleh <smostafa@google.com> wrote:
> >
> > + if (kvm_arm_smmu_array[i].mmio_size < SZ_128K) {
> > + pr_err("SMMUv3(%s) has unsupported size(0x%lx)\n", np->name,
> > + kvm_arm_smmu_array[i].mmio_size);
>
> Use format specifier %pOF to print device tree node.
> If mmio_size is a size_t type, use format specifier %zx.
> Align language of error message with kernel driver which prints "MMIO
> region too small (%pr)\n".
Thanks for catching that, I will fix it in v5.
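i.e. something like:

	pr_err("SMMUv3(%pOF): MMIO region too small (0x%zx)\n",
	       np, kvm_arm_smmu_array[i].mmio_size);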
> I'm wondering if we should use kvm_err instead of pr_err.
I am not sure, kvm_err seems to be used from core arch code only, but
I don't see why not.
Thanks,
Mostafa
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-12 14:18 ` Will Deacon
2025-09-15 16:38 ` Jason Gunthorpe
@ 2025-09-16 14:50 ` Mostafa Saleh
1 sibling, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 14:50 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Fri, Sep 12, 2025 at 03:18:08PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:50PM +0000, Mostafa Saleh wrote:
> > Don’t allow access to the command queue from the host:
> > - ARM_SMMU_CMDQ_BASE: Only allowed to be written when CMDQ is disabled; we
> > use it to keep track of the host command queue base.
> > Reads return the saved value.
> > - ARM_SMMU_CMDQ_PROD: Writes trigger command queue emulation, which sanitises
> > and filters the whole range. Reads return the host copy.
> > - ARM_SMMU_CMDQ_CONS: Writes move the sw copy of the cons, but the host can’t
> > skip commands once submitted. Reads return the emulated value and the error
> > bits in the actual cons.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > .../iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c | 108 +++++++++++++++++-
> > 1 file changed, 105 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
> > index 554229e466f3..10c6461bbf12 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c
> > @@ -325,6 +325,88 @@ static bool is_cmdq_enabled(struct hyp_arm_smmu_v3_device *smmu)
> > return FIELD_GET(CR0_CMDQEN, smmu->cr0);
> > }
> >
> > +static bool smmu_filter_command(struct hyp_arm_smmu_v3_device *smmu, u64 *command)
> > +{
> > + u64 type = FIELD_GET(CMDQ_0_OP, command[0]);
> > +
> > + switch (type) {
> > + case CMDQ_OP_CFGI_STE:
> > + /* TBD: SHADOW_STE*/
> > + break;
> > + case CMDQ_OP_CFGI_ALL:
> > + {
> > + /*
> > + * Linux doesn't use range STE invalidation, and only uses this
> > + * for CFGI_ALL, which is done on reset and not when a new STE
> > + * starts being used.
> > + * Although this is not architectural, we rely on the current
> > + * Linux implementation.
> > + */
> > + WARN_ON((FIELD_GET(CMDQ_CFGI_1_RANGE, command[1]) != 31));
> > + break;
> > + }
> > + case CMDQ_OP_TLBI_NH_ASID:
> > + case CMDQ_OP_TLBI_NH_VA:
> > + case 0x13: /* CMD_TLBI_NH_VAA: Not used by Linux */
> > + {
> > + /* Only allow VMID = 0 */
> > + if (FIELD_GET(CMDQ_TLBI_0_VMID, command[0]) != 0)
> > + return WARN_ON(true);
> > + break;
> > + }
> > + case 0x10: /* CMD_TLBI_NH_ALL: Not used by Linux */
> > + case CMDQ_OP_TLBI_EL2_ALL:
> > + case CMDQ_OP_TLBI_EL2_VA:
> > + case CMDQ_OP_TLBI_EL2_ASID:
> > + case CMDQ_OP_TLBI_S12_VMALL:
> > + case 0x23: /* CMD_TLBI_EL2_VAA: Not used by Linux */
> > + /* Malicious host */
> > + return WARN_ON(true);
> > + case CMDQ_OP_CMD_SYNC:
> > + if (FIELD_GET(CMDQ_SYNC_0_CS, command[0]) == CMDQ_SYNC_0_CS_IRQ) {
> > + /* Allow it, but let the host timeout, as this should never happen. */
> > + command[0] &= ~CMDQ_SYNC_0_CS;
> > + command[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
> > + command[1] &= ~CMDQ_SYNC_1_MSIADDR_MASK;
> > + }
> > + break;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +static void smmu_emulate_cmdq_insert(struct hyp_arm_smmu_v3_device *smmu)
> > +{
> > + u64 *host_cmdq = hyp_phys_to_virt(smmu->cmdq_host.q_base & Q_BASE_ADDR_MASK);
> > + int idx;
> > + u64 cmd[CMDQ_ENT_DWORDS];
> > + bool skip;
> > +
> > + if (!is_cmdq_enabled(smmu))
> > + return;
> > +
> > + while (!queue_empty(&smmu->cmdq_host.llq)) {
> > + /* Wait for the command queue to have some space. */
> > + WARN_ON(smmu_wait_event(smmu, !smmu_cmdq_full(&smmu->cmdq)));
> > +
> > + idx = Q_IDX(&smmu->cmdq_host.llq, smmu->cmdq_host.llq.cons);
> > + /* Avoid TOCTOU */
> > + memcpy(cmd, &host_cmdq[idx * CMDQ_ENT_DWORDS], CMDQ_ENT_DWORDS << 3);
> > + skip = smmu_filter_command(smmu, cmd);
> > + if (!skip)
> > + smmu_add_cmd_raw(smmu, cmd);
> > + queue_inc_cons(&smmu->cmdq_host.llq);
> > + }
>
> Hmmm. There's something I'd not considered before here.
>
> Ideally, the data structures that are shadowed by the hypervisor would
> be mapped as normal-WB cacheable in both the host and the hypervisor so
> we don't have to worry about coherency and we get the performance
> benefits from the caches. Indeed, I think that's how you've mapped
> 'host_cmdq' above _however_ I sadly don't think we can do that if the
> actual SMMU hardware isn't coherent.
>
> We don't have a way to say things like "The STEs and CMDQ are coherent
> but the CDs and Stage-1 page-tables aren't" so that means we have to
> treat the shadowed structures populated by the host in the same way as
> the host-owned structures that are consumed directly by the hardware.
> Consequently, we should either be using non-cacheable mappings at EL2
> for these structures or doing CMOs around the accesses.
Thanks for catching that, I missed it. I think we can keep the host share
mapped as cacheable and use CMOs when accessing it; I will have a closer look.
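Inside smmu_emulate_cmdq_insert() that would be roughly (a sketch,
assuming the hyp device records ARM_SMMU_FEAT_COHERENCY the same way
the kernel driver does):

	/*
	 * On a non-coherent SMMU, clean+invalidate the host-written
	 * entry to the PoC before reading it, so we don't consume a
	 * stale cache line.
	 */
	if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
		kvm_flush_dcache_to_poc(&host_cmdq[idx * CMDQ_ENT_DWORDS],
					CMDQ_ENT_DWORDS << 3);
	memcpy(cmd, &host_cmdq[idx * CMDQ_ENT_DWORDS], CMDQ_ENT_DWORDS << 3);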
Thanks,
Mostafa
>
> Will
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-15 16:38 ` Jason Gunthorpe
@ 2025-09-16 15:19 ` Mostafa Saleh
2025-09-17 12:36 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-16 15:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Mon, Sep 15, 2025 at 01:38:58PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 12, 2025 at 03:18:08PM +0100, Will Deacon wrote:
> > Ideally, the data structures that are shadowed by the hypervisor would
> > be mapped as normal-WB cacheable in both the host and the hypervisor so
> > we don't have to worry about coherency and we get the performance
> > benefits from the caches. Indeed, I think that's how you've mapped
> > 'host_cmdq' above _however_ I sadly don't think we can do that if the
> > actual SMMU hardware isn't coherent.
>
> That seems like the right conclusion to me, pkvm should not be mapping
> as cachable unless it knows the IORT/IDR is marked as coherent.
>
> This is actually something I want to fix in the SMMU driver, it should
> always be allocating cachable memory and using
> dma_sync_single_for_device() instead of non-cachable DMA coherent
> allocations. (Or perhaps better is to use
> iommu_pages_flush_incoherent())
>
> I'm hearing about an interesting use case where we'd want to tell the
> SMMU to walk STEs non-cachable even if the HW is capable to do
> cachable. Apparently in some SOCs it gives better isochronous
> properties for realtime DMA.
Interesting, I guess that would be more noticeable for the page table
walks than for the STEs, as Linux doesn't invalidate STEs that much.
>
>
> IMHO for this series at this point pkvm should just require a coherent
> SMMU until the above revisions happen.
I think the fix for the problem Will mentioned is to just use CMOs
before accessing the host structures, so that should be simple.
If it turns out to be more complicated, I am happy to drop the support
for non-coherent devices from this series and we can add it later.
Thanks,
Mostafa
>
> Jason
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-16 15:19 ` Mostafa Saleh
@ 2025-09-17 12:36 ` Jason Gunthorpe
2025-09-17 15:01 ` Will Deacon
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-17 12:36 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Tue, Sep 16, 2025 at 03:19:02PM +0000, Mostafa Saleh wrote:
> I think the fix for the problem Will mentioned is to just use CMOs
> before accessing the host structures, so that should be simple.
> If it turns to be more complicated, I am happy to drop the support
> for non-coherent devices from this series and we can add it later.
I feel like it is easier/better to fix the driver to use cachable
memory than to add CMOs to the pkvm side..
This way it will help qemu/etc as well.
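e.g. the publish side could look something like this (a sketch under
that assumption, not the current driver code):

#include <linux/dma-mapping.h>

/*
 * Keep the command queue in cacheable memory and publish each
 * 16-byte command with an explicit sync, instead of relying on a
 * non-cacheable dma_alloc_coherent() buffer.
 */
static void cmdq_publish_sketch(struct device *dev, void *q_base,
				dma_addr_t q_base_dma, u32 idx,
				const u64 *cmd)
{
	size_t off = idx * CMDQ_ENT_DWORDS * sizeof(u64);

	memcpy(q_base + off, cmd, CMDQ_ENT_DWORDS * sizeof(u64));
	dma_sync_single_for_device(dev, q_base_dma + off,
				   CMDQ_ENT_DWORDS * sizeof(u64),
				   DMA_TO_DEVICE);
}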
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-17 12:36 ` Jason Gunthorpe
@ 2025-09-17 15:01 ` Will Deacon
2025-09-17 15:16 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-17 15:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Sep 17, 2025 at 09:36:01AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 16, 2025 at 03:19:02PM +0000, Mostafa Saleh wrote:
>
> > I think the fix for the problem Will mentioned is to just use CMOs
> > before accessing the host structures, so that should be simple.
> > If it turns to be more complicated, I am happy to drop the support
> > for non-coherent devices from this series and we can add it later.
>
> I feel like it is easier/better to fix the driver to use cachable
> memory than to add CMOs to the pkvm side..
Hmm, but for non-coherent SMMU hardware (which sadly exists in
production), I don't think there's a way for firmware to tell the driver
that it needs to issue CMOs for the page-tables and the CDs but not the
other in-memory data structures (e.g. STEs). I suppose we could do it in
some pKVM-specific way, but then that's not really helping anybody else.
Will
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-17 15:01 ` Will Deacon
@ 2025-09-17 15:16 ` Jason Gunthorpe
2025-09-17 15:25 ` Will Deacon
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-17 15:16 UTC (permalink / raw)
To: Will Deacon
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Sep 17, 2025 at 04:01:34PM +0100, Will Deacon wrote:
> On Wed, Sep 17, 2025 at 09:36:01AM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 16, 2025 at 03:19:02PM +0000, Mostafa Saleh wrote:
> >
> > > I think the fix for the problem Will mentioned is to just use CMOs
> > > before accessing the host structures, so that should be simple.
> > > If it turns to be more complicated, I am happy to drop the support
> > > for non-coherent devices from this series and we can add it later.
> >
> > I feel like it is easier/better to fix the driver to use cachable
> > memory than to add CMOs to the pkvm side..
>
> Hmm, but for non-coherent SMMU hardware (which sadly exists in
> production), I don't think there's a way for firmware to tell the driver
> that it needs to issue CMOs for the page-tables and the CDs but not the
> other in-memory data structures (e.g. STEs). I suppose we could do it in
> some pKVM-specific way, but then that's not really helping anybody else.
Not sure I understand?
I mean to issue CMOs in the smmu driver consistently for everything:
page table, CD entry, STE, etc. Today it only does it for the page table.
Make the driver consistently use cachable memory for everything
instead of having two different ways to deal with incoherent HW.
Jason
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-17 15:16 ` Jason Gunthorpe
@ 2025-09-17 15:25 ` Will Deacon
2025-09-17 15:59 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-17 15:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Sep 17, 2025 at 12:16:12PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 17, 2025 at 04:01:34PM +0100, Will Deacon wrote:
> > On Wed, Sep 17, 2025 at 09:36:01AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 16, 2025 at 03:19:02PM +0000, Mostafa Saleh wrote:
> > >
> > > > I think the fix for the problem Will mentioned is to just use CMOs
> > > > before accessing the host structures, so that should be simple.
> > > > If it turns to be more complicated, I am happy to drop the support
> > > > for non-coherent devices from this series and we can add it later.
> > >
> > > I feel like it is easier/better to fix the driver to use cachable
> > > memory than to add CMOs to the pkvm side..
> >
> > Hmm, but for non-coherent SMMU hardware (which sadly exists in
> > production), I don't think there's a way for firmware to tell the driver
> > that it needs to issue CMOs for the page-tables and the CDs but not the
> > other in-memory data structures (e.g. STEs). I suppose we could do it in
> > some pKVM-specific way, but then that's not really helping anybody else.
>
> Not sure I understand?
>
> I mean to issue CMOs in the smmu driver consistently for everthing,
> page table, CD entry, STE, etc. Today it only does it for page table.
>
> Make the driver consistently use cachable memory for everything
> instead of having two different ways to deal with incoherent HW.
Ah right, so the driver would unnecessarily issue CMOs for the structures
that are just shared with the hypervisor. At least it's _functional_ that
way, but I'm sure people will complain!
Will
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-17 15:25 ` Will Deacon
@ 2025-09-17 15:59 ` Jason Gunthorpe
2025-09-18 10:26 ` Will Deacon
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-17 15:59 UTC (permalink / raw)
To: Will Deacon
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Sep 17, 2025 at 04:25:35PM +0100, Will Deacon wrote:
> Ah right, so the driver would unnecessarily issue CMOs for the structures
> that are just shared with the hypervisor. At least it's _functional_ that
> way, but I'm sure people will complain!
Yes, functional, why would anyone complain? STE and CD manipulation is
not fast path for anything?
Jason
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-17 15:59 ` Jason Gunthorpe
@ 2025-09-18 10:26 ` Will Deacon
2025-09-18 14:36 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-18 10:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Sep 17, 2025 at 12:59:31PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 17, 2025 at 04:25:35PM +0100, Will Deacon wrote:
>
> > Ah right, so the driver would unnecessarily issue CMOs for the structures
> > that are just shared with the hypervisor. At least it's _functional_ that
> > way, but I'm sure people will complain!
>
> Yes, functional, why would anyone complain? STE and CD manipulation is
> not fast path for anything?
Won't it also apply to cmdq insertion?
Will
* Re: [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host
2025-09-18 10:26 ` Will Deacon
@ 2025-09-18 14:36 ` Jason Gunthorpe
0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-18 14:36 UTC (permalink / raw)
To: Will Deacon
Cc: Mostafa Saleh, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Thu, Sep 18, 2025 at 11:26:50AM +0100, Will Deacon wrote:
> On Wed, Sep 17, 2025 at 12:59:31PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 17, 2025 at 04:25:35PM +0100, Will Deacon wrote:
> >
> > > Ah right, so the driver would unnecessarily issue CMOs for the structures
> > > that are just shared with the hypervisor. At least it's _functional_ that
> > > way, but I'm sure people will complain!
> >
> > Yes, functional, why would anyone complain? STE and CD manipulation is
> > not fast path for anything?
>
> Won't it also apply to cmdq insertion?
Oh, changing CMDQ wasn't on my mind..
Yeah, OK I don't know what the performance delta would be like there.
However, to get peak performance out of pkvm we really do want the
SMMU driver to write CMDQ as cachable, pkvm to read it as cachable and
then copy it to a non-cachable HW queue.
Otherwise pkvm will be issuing CMOs on fast paths :\
If we convert the slow speed stuff, STE, CD, Fault to do CMOs then we
could make a fairly small change for pkvm mode to force the guest CMDQ
to be cachable without CMO. Some special feature triggered by pkvm
detection during probe.
Jason
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-09-12 13:54 ` Will Deacon
@ 2025-09-23 14:35 ` Mostafa Saleh
2025-09-23 17:38 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-23 14:35 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Fri, Sep 12, 2025 at 02:54:11PM +0100, Will Deacon wrote:
> On Tue, Aug 19, 2025 at 09:51:43PM +0000, Mostafa Saleh wrote:
> > While in KVM mode, the driver must be loaded after the hypervisor
> > initializes.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 25 ++++++++++++++++-----
> > 1 file changed, 19 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 10ca07c6dbe9..a04730b5fe41 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -4576,12 +4576,6 @@ static const struct of_device_id arm_smmu_of_match[] = {
> > };
> > MODULE_DEVICE_TABLE(of, arm_smmu_of_match);
> >
> > -static void arm_smmu_driver_unregister(struct platform_driver *drv)
> > -{
> > - arm_smmu_sva_notifier_synchronize();
> > - platform_driver_unregister(drv);
> > -}
> > -
> > static struct platform_driver arm_smmu_driver = {
> > .driver = {
> > .name = "arm-smmu-v3",
> > @@ -4592,8 +4586,27 @@ static struct platform_driver arm_smmu_driver = {
> > .remove = arm_smmu_device_remove,
> > .shutdown = arm_smmu_device_shutdown,
> > };
> > +
> > +#ifndef CONFIG_ARM_SMMU_V3_PKVM
> > +static void arm_smmu_driver_unregister(struct platform_driver *drv)
> > +{
> > + arm_smmu_sva_notifier_synchronize();
> > + platform_driver_unregister(drv);
> > +}
> > +
> > module_driver(arm_smmu_driver, platform_driver_register,
> > arm_smmu_driver_unregister);
> > +#else
> > +/*
> > + * Must be done after the hypervisor initializes at module_init()
> > + * No need for unregister as this is a built in driver.
> > + */
> > +static int arm_smmu_driver_register(void)
> > +{
> > + return platform_driver_register(&arm_smmu_driver);
> > +}
> > +device_initcall_sync(arm_smmu_driver_register);
> > +#endif /* !CONFIG_ARM_SMMU_V3_PKVM */
>
> I think this is a bit grotty as we now have to reason about different
> initialisation ordering based on CONFIG_ARM_SMMU_V3_PKVM. Could we
> instead return -EPROBE_DEFER if the driver tries to probe before the
> hypervisor is up?
I looked a bit into this and I think the current approach would be
better because:
1- In case KVM fails to initialise or was disabled from the command line,
waiting for the hypervisor means the SMMUs may never probe.
One of the things I was careful to get right is the error path:
if KVM or the nested driver fails at any point during initialization,
the SMMUs should still be probed and the system should still be running,
even without KVM.
2- That's not as bad, but it leaks some KVM internals, as we would need to
check is_kvm_arm_initialised() or kvm_protected_mode_initialized from
driver code, as opposed to registering the driver late based on a kernel
config for the nested SMMUv3.
If we really want to avoid the current approach, we can keep deferring probe
until a check on a new flag set from “finalize_pkvm”, which is called
regardless of KVM state.
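Roughly (a sketch; the flag name is hypothetical):

/*
 * Hypothetical flag, set from finalize_pkvm() whether or not pKVM
 * actually came up, so probing can never be deferred forever.
 */
extern bool pkvm_init_attempted;

static int arm_smmu_device_probe(struct platform_device *pdev)
{
	if (IS_ENABLED(CONFIG_ARM_SMMU_V3_PKVM) &&
	    !READ_ONCE(pkvm_init_attempted))
		return -EPROBE_DEFER;

	/* ... rest of probe unchanged ... */
	return 0;
}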
Thanks,
Mostafa
>
> Will
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-09-23 14:35 ` Mostafa Saleh
@ 2025-09-23 17:38 ` Jason Gunthorpe
2025-09-29 11:10 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-23 17:38 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Tue, Sep 23, 2025 at 02:35:48PM +0000, Mostafa Saleh wrote:
> If we really want to avoid the current approach, we can keep deferring probe,
> until a check for a new flag set from “finalize_pkvm” which is called
> unconditionally of KVM state.
I still think the pkvm drivers should be bound to some special pkvm
device_driver and the driver core should handle all this special
dancing:
- Wait for pkvm to decide if it will start or not
- Claim a device for pkvm and make it visible in some generic way, e.g.
in sysfs
- Fall back to using the normal driver once we conclude pkvm won't
run.
It sounds like a pain to open code all this logic in every pkvm
driver? How many do you have?
Jason
* Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-09-16 13:27 ` Mostafa Saleh
@ 2025-09-26 14:33 ` Will Deacon
2025-09-29 10:57 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-26 14:33 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 16, 2025 at 01:27:39PM +0000, Mostafa Saleh wrote:
> On Tue, Sep 09, 2025 at 03:12:45PM +0100, Will Deacon wrote:
> > On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > index 861e448183fd..c9a15ef6b18d 100644
> > > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > > return ret;
> > > }
> > >
> > > +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> > > +{
> > > + u64 phys = hyp_pfn_to_phys(pfn);
> > > + void *virt = __hyp_va(phys);
> > > + int ret;
> > > + kvm_pte_t pte;
> > > +
> > > + host_lock_component();
> > > + hyp_lock_component();
> > > +
> > > + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> > > + if (ret)
> > > + goto unlock;
> > > +
> > > + if (pte && !kvm_pte_valid(pte)) {
> > > + ret = -EPERM;
> > > + goto unlock;
> > > + }
> >
> > Shouldn't we first check that the pfn is indeed MMIO? Otherwise, testing
> > the pte for the ownership information isn't right.
>
> I will add it, although the input should be trusted as it comes from the
> hypervisor SMMUv3 driver.
(more on this below)
> > > +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> > > +{
> > > + u64 phys = hyp_pfn_to_phys(pfn);
> > > + u64 virt = (u64)__hyp_va(phys);
> > > + size_t size = PAGE_SIZE;
> > > +
> > > + host_lock_component();
> > > + hyp_lock_component();
> >
> > Shouldn't we check that:
> >
> > 1. pfn is mmio
> > 2. pfn is owned by hyp
> > 3. The host doesn't have something mapped at pfn already
> >
> > ?
> >
>
> I thought about this initially, but as:
> - This code is only called from the hypervisor with trusted
> inputs (only at boot)
> - It is only called on the error path
>
> a WARN_ON in case of failure to unmap MMIO pages seemed good enough,
> to avoid extra code.
>
> But I can add the checks if you think they are necessary, we will need
> to add new helpers for MMIO state though.
I'd personally prefer to put the checks here so that callers don't have
to worry (or forget!) about them. That also means that the donation
function can be readily reused in the same way as the existing functions
which operate on memory pages.
How much work is it to add the MMIO helpers?
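For the "is it MMIO" part, something as simple as this after taking
the locks would do (sketch, reusing the existing helper):

	/* Reject pfns that are actually memory before anything else. */
	if (addr_is_memory(phys)) {
		ret = -EINVAL;
		goto unlock;
	}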
> > > + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> > > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> > > + hyp_unlock_component();
> > > + host_unlock_component();
> > > +
> > > + return 0;
> > > +}
> > > +
> > > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > > {
> > > return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index c351b4abd5db..ba06b0c21d5a 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > kvm_pte_t *childp = NULL;
> > > bool need_flush = false;
> > >
> > > - if (!kvm_pte_valid(ctx->old)) {
> > > - if (stage2_pte_is_counted(ctx->old)) {
> > > - kvm_clear_pte(ctx->ptep);
> > > - mm_ops->put_page(ctx->ptep);
> > > - }
> > > - return 0;
> > > - }
> > > + if (!kvm_pte_valid(ctx->old))
> > > + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
> >
> > Can this code be reached for the guest? For example, if
> > pkvm_pgtable_stage2_destroy() runs into an MMIO-guarded pte on teardown?
>
> AFAICT, a VM's page table is destroyed from reclaim_pgtable_pages() =>
> kvm_pgtable_stage2_destroy() => kvm_pgtable_stage2_destroy_range() ... =>
> stage2_free_walker()
>
> Which doesn't interact with “stage2_unmap_walker”, so that should be
> fine.
Fair enough. I feel like this might bite us later on but, with what you
have, we'll see the -EPERM and can figure out what to do then.
Will
* Re: [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-09-16 14:24 ` Mostafa Saleh
@ 2025-09-26 14:42 ` Will Deacon
2025-09-29 11:01 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Will Deacon @ 2025-09-26 14:42 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Tue, Sep 16, 2025 at 02:24:46PM +0000, Mostafa Saleh wrote:
> On Tue, Sep 09, 2025 at 03:42:07PM +0100, Will Deacon wrote:
> > On Tue, Aug 19, 2025 at 09:51:38PM +0000, Mostafa Saleh wrote:
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > > index a01c036c55be..f7d1c8feb358 100644
> > > --- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > > +++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > > @@ -4,15 +4,94 @@
> > > *
> > > * Copyright (C) 2022 Linaro Ltd.
> > > */
> > > +#include <linux/iommu.h>
> > > +
> > > #include <nvhe/iommu.h>
> > > +#include <nvhe/mem_protect.h>
> > > +#include <nvhe/spinlock.h>
> > >
> > > /* Only one set of ops supported */
> > > struct kvm_iommu_ops *kvm_iommu_ops;
> > >
> > > +/* Protected by host_mmu.lock */
> > > +static bool kvm_idmap_initialized;
> > > +
> > > +static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
> > > +{
> > > + int iommu_prot = 0;
> > > +
> > > + if (prot & KVM_PGTABLE_PROT_R)
> > > + iommu_prot |= IOMMU_READ;
> > > + if (prot & KVM_PGTABLE_PROT_W)
> > > + iommu_prot |= IOMMU_WRITE;
> > > + if (prot == PKVM_HOST_MMIO_PROT)
> > > + iommu_prot |= IOMMU_MMIO;
> >
> > This looks a little odd to me.
> >
> > On the CPU side, the only difference between PKVM_HOST_MEM_PROT and
> > PKVM_HOST_MMIO_PROT is that the former has execute permission. Both are
> > mapped as cacheable at stage-2 because it's the job of the host to set
> > the more restrictive memory type at stage-1.
> >
> > Carrying that over to the SMMU would suggest that we don't care about
> > IOMMU_MMIO at stage-2 at all, so why do we need to set it here?
>
> Unlike the CPU, the host can set the SMMU to bypass; in that case the
> hypervisor will attach its stage-2 with no stage-1 configured. So,
> stage-2 must have the correct attrs for MMIO.
I'm not sure about that.
If the SMMU is in stage-1 bypass, we still have the incoming memory
attributes from the transaction (modulo MTCFG which we shouldn't be
setting) and they should combine with the stage-2 attributes in roughly
the same way as the CPU, no?
> > > +static int __snapshot_host_stage2(const struct kvm_pgtable_visit_ctx *ctx,
> > > + enum kvm_pgtable_walk_flags visit)
> > > +{
> > > + u64 start = ctx->addr;
> > > + kvm_pte_t pte = *ctx->ptep;
> > > + u32 level = ctx->level;
> > > + u64 end = start + kvm_granule_size(level);
> > > + int prot = IOMMU_READ | IOMMU_WRITE;
> > > +
> > > + /* Keep unmapped. */
> > > + if (pte && !kvm_pte_valid(pte))
> > > + return 0;
> > > +
> > > + if (kvm_pte_valid(pte))
> > > + prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
> > > + else if (!addr_is_memory(start))
> > > + prot |= IOMMU_MMIO;
> >
> > Why do we need to map MMIO regions pro-actively here? I'd have thought
> > we could just do:
> >
> > if (!kvm_pte_valid(pte))
> > return 0;
> >
> > prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
> > kvm_iommu_ops->host_stage2_idmap(start, end, prot);
> > return 0;
> >
> > but I think that IOMMU_MMIO is throwing me again...
>
> We have to map everything pro-actively as we don’t handle page faults
> in the SMMUv3 driver.
> This would be future work, where the CPU stage-2 page table is shared with
> the SMMUv3.
Ah yes, I'd forgotten about that.
Thanks,
Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor
2025-09-26 14:33 ` Will Deacon
@ 2025-09-29 10:57 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-29 10:57 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Fri, Sep 26, 2025 at 03:33:06PM +0100, Will Deacon wrote:
> On Tue, Sep 16, 2025 at 01:27:39PM +0000, Mostafa Saleh wrote:
> > On Tue, Sep 09, 2025 at 03:12:45PM +0100, Will Deacon wrote:
> > > On Tue, Aug 19, 2025 at 09:51:30PM +0000, Mostafa Saleh wrote:
> > > > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > > index 861e448183fd..c9a15ef6b18d 100644
> > > > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > > > @@ -799,6 +799,70 @@ int ___pkvm_host_donate_hyp(u64 pfn, u64 nr_pages, enum kvm_pgtable_prot prot)
> > > > return ret;
> > > > }
> > > >
> > > > +int __pkvm_host_donate_hyp_mmio(u64 pfn)
> > > > +{
> > > > + u64 phys = hyp_pfn_to_phys(pfn);
> > > > + void *virt = __hyp_va(phys);
> > > > + int ret;
> > > > + kvm_pte_t pte;
> > > > +
> > > > + host_lock_component();
> > > > + hyp_lock_component();
> > > > +
> > > > + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
> > > > + if (ret)
> > > > + goto unlock;
> > > > +
> > > > + if (pte && !kvm_pte_valid(pte)) {
> > > > + ret = -EPERM;
> > > > + goto unlock;
> > > > + }
> > >
> > > Shouldn't we first check that the pfn is indeed MMIO? Otherwise, testing
> > > the pte for the ownership information isn't right.
> >
> > I will add it, although the input should be trusted as it comes from the
> > hypervisor SMMUv3 driver.
>
> (more on this below)
>
> > > > +int __pkvm_hyp_donate_host_mmio(u64 pfn)
> > > > +{
> > > > + u64 phys = hyp_pfn_to_phys(pfn);
> > > > + u64 virt = (u64)__hyp_va(phys);
> > > > + size_t size = PAGE_SIZE;
> > > > +
> > > > + host_lock_component();
> > > > + hyp_lock_component();
> > >
> > > Shouldn't we check that:
> > >
> > > 1. pfn is mmio
> > > 2. pfn is owned by hyp
> > > 3. The host doesn't have something mapped at pfn already
> > >
> > > ?
> > >
> >
> > I thought about this initially, but as
> > - This code is only called from the hypervisor with trusted
> > inputs (only at boot)
> > - Only called on error path
> >
> > So a WARN_ON in case of failure to unmap MMIO pages seemed good enough,
> > to avoid extra code.
> >
> > But I can add the checks if you think they are necessary, we will need
> > to add new helpers for MMIO state though.
>
> I'd personally prefer to put the checks here so that callers don't have
> to worry (or forget!) about them. That also means that the donation
> function can be readily reused in the same way as the existing functions
> which operate on memory pages.
>
> How much work is it to add the MMIO helpers?
It's not much work, I guess; I was just worried about adding new helpers
just to use in a rare error path.
I will add them for v5.
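Roughly, what I have in mind for __pkvm_hyp_donate_host_mmio() is the
sketch below; hyp_owns_mmio() is only a placeholder for one of the new
MMIO-state helpers, so don't take the names literally:

#include <nvhe/mem_protect.h>

int __pkvm_hyp_donate_host_mmio(u64 pfn)
{
	u64 phys = hyp_pfn_to_phys(pfn);
	u64 virt = (u64)__hyp_va(phys);
	size_t size = PAGE_SIZE;
	kvm_pte_t pte;
	int ret = 0;

	host_lock_component();
	hyp_lock_component();

	/* 1. pfn must be MMIO, not memory. */
	if (addr_is_memory(phys)) {
		ret = -EINVAL;
		goto unlock;
	}

	/* 2. pfn must be owned by hyp (needs the new MMIO-state helper). */
	if (!hyp_owns_mmio(phys)) {
		ret = -EPERM;
		goto unlock;
	}

	/* 3. The host must not have anything mapped at phys already. */
	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, NULL);
	if (ret)
		goto unlock;
	if (kvm_pte_valid(pte)) {
		ret = -EPERM;
		goto unlock;
	}

	WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
	WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt,
				phys, size, &host_s2_pool, PKVM_ID_HOST));
unlock:
	hyp_unlock_component();
	host_unlock_component();
	return ret;
}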
Thanks,
Mostafa
>
> > > > + WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, virt, size) != size);
> > > > + WARN_ON(host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, phys,
> > > > + PAGE_SIZE, &host_s2_pool, PKVM_ID_HOST));
> > > > + hyp_unlock_component();
> > > > + host_unlock_component();
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages)
> > > > {
> > > > return ___pkvm_host_donate_hyp(pfn, nr_pages, PAGE_HYP);
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index c351b4abd5db..ba06b0c21d5a 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1095,13 +1095,8 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > kvm_pte_t *childp = NULL;
> > > > bool need_flush = false;
> > > >
> > > > - if (!kvm_pte_valid(ctx->old)) {
> > > > - if (stage2_pte_is_counted(ctx->old)) {
> > > > - kvm_clear_pte(ctx->ptep);
> > > > - mm_ops->put_page(ctx->ptep);
> > > > - }
> > > > - return 0;
> > > > - }
> > > > + if (!kvm_pte_valid(ctx->old))
> > > > + return stage2_pte_is_counted(ctx->old) ? -EPERM : 0;
> > >
> > > Can this code be reached for the guest? For example, if
> > > pkvm_pgtable_stage2_destroy() runs into an MMIO-guarded pte on teardown?
> >
> > AFAICT, the VM's page table is destroyed from reclaim_pgtable_pages() =>
> > kvm_pgtable_stage2_destroy() => kvm_pgtable_stage2_destroy_range() ... =>
> > stage2_free_walker()
> >
> > Which doesn't interact with “stage2_unmap_walker”, so that should be
> > fine.
>
> Fair enough. I feel like this might bite us later on but, with what you
> have, we'll see the -EPERM and then we can figure out what to do then.
>
> Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-09-26 14:42 ` Will Deacon
@ 2025-09-29 11:01 ` Mostafa Saleh
2025-09-30 12:38 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-29 11:01 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, kvmarm, linux-arm-kernel, iommu, maz, oliver.upton,
joey.gouly, suzuki.poulose, yuzenghui, catalin.marinas,
robin.murphy, jean-philippe, qperret, tabba, jgg, mark.rutland,
praan
On Fri, Sep 26, 2025 at 03:42:38PM +0100, Will Deacon wrote:
> On Tue, Sep 16, 2025 at 02:24:46PM +0000, Mostafa Saleh wrote:
> > On Tue, Sep 09, 2025 at 03:42:07PM +0100, Will Deacon wrote:
> > > On Tue, Aug 19, 2025 at 09:51:38PM +0000, Mostafa Saleh wrote:
> > > > diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > > > index a01c036c55be..f7d1c8feb358 100644
> > > > --- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > > > +++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
> > > > @@ -4,15 +4,94 @@
> > > > *
> > > > * Copyright (C) 2022 Linaro Ltd.
> > > > */
> > > > +#include <linux/iommu.h>
> > > > +
> > > > #include <nvhe/iommu.h>
> > > > +#include <nvhe/mem_protect.h>
> > > > +#include <nvhe/spinlock.h>
> > > >
> > > > /* Only one set of ops supported */
> > > > struct kvm_iommu_ops *kvm_iommu_ops;
> > > >
> > > > +/* Protected by host_mmu.lock */
> > > > +static bool kvm_idmap_initialized;
> > > > +
> > > > +static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
> > > > +{
> > > > + int iommu_prot = 0;
> > > > +
> > > > + if (prot & KVM_PGTABLE_PROT_R)
> > > > + iommu_prot |= IOMMU_READ;
> > > > + if (prot & KVM_PGTABLE_PROT_W)
> > > > + iommu_prot |= IOMMU_WRITE;
> > > > + if (prot == PKVM_HOST_MMIO_PROT)
> > > > + iommu_prot |= IOMMU_MMIO;
> > >
> > > This looks a little odd to me.
> > >
> > > On the CPU side, the only difference between PKVM_HOST_MEM_PROT and
> > > PKVM_HOST_MMIO_PROT is that the former has execute permission. Both are
> > > mapped as cacheable at stage-2 because it's the job of the host to set
> > > the more restrictive memory type at stage-1.
> > >
> > > Carrying that over to the SMMU would suggest that we don't care about
> > > IOMMU_MMIO at stage-2 at all, so why do we need to set it here?
> >
> > Unlike the CPU, the host can set the SMMU to bypass; in that case the
> > hypervisor will attach its stage-2 with no stage-1 configured. So,
> > stage-2 must have the correct attrs for MMIO.
>
> I'm not sure about that.
>
> If the SMMU is in stage-1 bypass, we still have the incoming memory
> attributes from the transaction (modulo MTCFG which we shouldn't be
> setting) and they should combine with the stage-2 attributes in roughly
> the same way as the CPU, no?
Makes sense, we can remove that for now and map all stage-2 with
IOMMU_CACHE. However, that might not be true for other IOMMUs,
as they might not combine attributes the way SMMUv3 stage-2 does, but
we can ignore that for now. I will update the logic in v5.
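Roughly, the simplification I have in mind is the sketch below (just an
illustration of the direction, not the final v5 code):

static inline int pkvm_to_iommu_prot(enum kvm_pgtable_prot prot)
{
	/* Map everything cacheable and let incoming attributes combine. */
	int iommu_prot = IOMMU_CACHE;

	if (prot & KVM_PGTABLE_PROT_R)
		iommu_prot |= IOMMU_READ;
	if (prot & KVM_PGTABLE_PROT_W)
		iommu_prot |= IOMMU_WRITE;

	return iommu_prot;
}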
Thanks,
Mostafa
>
> > > > +static int __snapshot_host_stage2(const struct kvm_pgtable_visit_ctx *ctx,
> > > > + enum kvm_pgtable_walk_flags visit)
> > > > +{
> > > > + u64 start = ctx->addr;
> > > > + kvm_pte_t pte = *ctx->ptep;
> > > > + u32 level = ctx->level;
> > > > + u64 end = start + kvm_granule_size(level);
> > > > + int prot = IOMMU_READ | IOMMU_WRITE;
> > > > +
> > > > + /* Keep unmapped. */
> > > > + if (pte && !kvm_pte_valid(pte))
> > > > + return 0;
> > > > +
> > > > + if (kvm_pte_valid(pte))
> > > > + prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
> > > > + else if (!addr_is_memory(start))
> > > > + prot |= IOMMU_MMIO;
> > >
> > > Why do we need to map MMIO regions pro-actively here? I'd have thought
> > > we could just do:
> > >
> > > if (!kvm_pte_valid(pte))
> > > return 0;
> > >
> > > prot = pkvm_to_iommu_prot(kvm_pgtable_stage2_pte_prot(pte));
> > > kvm_iommu_ops->host_stage2_idmap(start, end, prot);
> > > return 0;
> > >
> > > but I think that IOMMU_MMIO is throwing me again...
> >
> > We have to map everything pro-actively as we don’t handle page faults
> > in the SMMUv3 driver.
> > This would be future work, where the CPU stage-2 page table is shared with
> > the SMMUv3.
>
> Ah yes, I'd forgotten about that.
>
> Thanks,
>
> Will
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-09-23 17:38 ` Jason Gunthorpe
@ 2025-09-29 11:10 ` Mostafa Saleh
2025-10-02 15:13 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-29 11:10 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Tue, Sep 23, 2025 at 02:38:06PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 02:35:48PM +0000, Mostafa Saleh wrote:
> > If we really want to avoid the current approach, we can keep deferring probe,
> > until a check for a new flag set from “finalize_pkvm” which is called
> > unconditionally of KVM state.
>
> I still think the pkvm drivers should be bound to some special pkvm
> device_driver and the driver core should handle all this special
> dancing:
> - Wait for pkvm to decide if it will start or not
> - Claim a device for pkvm and make it visible in some generic way, e.g.
> in sysfs
> - Fall back to using the normal driver once we conclude pkvm won't
> run.
>
> It sounds like a pain to open code all this logic in every pkvm
> driver? How many do you have?
I thought more about this; I think involving the driver core will be
useful in the future for init, as it will ensure power domains are
probed before the SMMUs when RPM is supported.
One simple way to do that is to make the KVM SMMUv3 driver bind to
the SMMUs first until KVM finishes init, then unbind them so the
main driver can bind to them. That will not require any changes
or assumptions from the main driver, but at runtime the KVM driver
can't interact with the driver model.
Another possible solution, to keep a device bound to the KVM driver,
is to probe the SMMUs from the KVM driver, then create child devices;
possibly use something like device_set_of_node_from_dev() to bind those to
the main SMMUv3 or find another way to probe the main SMMUv3 without
changes.
Then we have a clear parent/child representation in the kernel, and we
can also use sysfs/debugfs. But this might be more challenging; I will
look more into both and will update the logic in v5.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-09-29 11:01 ` Mostafa Saleh
@ 2025-09-30 12:38 ` Jason Gunthorpe
2025-09-30 12:55 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-09-30 12:38 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Mon, Sep 29, 2025 at 11:01:10AM +0000, Mostafa Saleh wrote:
> > If the SMMU is in stage-1 bypass, we still have the incoming memory
> > attributes from the transaction (modulo MTCFG which we shouldn't be
> > setting) and they should combine with the stage-2 attributes in roughly
> > the same way as the CPU, no?
>
> Makes sense, we can remove that for now and map all stage-2 with
> IOMMU_CACHE.
Robin was saying in another thread that the DMA API has to use
IOMMU_MMIO properly or it won't work.. I think what happens depends on
the SOC design.
Yes, the incoming attribute combines, but unlike the CPU which will
have per-page memory attributes in the S1, the DMA initiator will
almost always use the same memory attributes.
In other words, we cannot rely on the DMA initiator to indicate if the
underlying memory should be MMIO or CACHE like the CPU can.
I think you have to set CACHE/MMIO correctly here.
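i.e. in the snapshot path I'd expect the attribute to stay per-region,
something like (sketch):

	prot |= addr_is_memory(start) ? IOMMU_CACHE : IOMMU_MMIO;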
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table
2025-09-30 12:38 ` Jason Gunthorpe
@ 2025-09-30 12:55 ` Mostafa Saleh
0 siblings, 0 replies; 82+ messages in thread
From: Mostafa Saleh @ 2025-09-30 12:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Tue, Sep 30, 2025 at 09:38:39AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 29, 2025 at 11:01:10AM +0000, Mostafa Saleh wrote:
>
> > > If the SMMU is in stage-1 bypass, we still have the incoming memory
> > > attributes from the transaction (modulo MTCFG which we shouldn't be
> > > setting) and they should combine with the stage-2 attributes in roughly
> > > the same way as the CPU, no?
> >
> > Makes sense, we can remove that for now and map all stage-2 with
> > IOMMU_CACHE.
>
> Robin was saying in another thread that the DMA API has to use
> IOMMU_MMIO properly or it won't work.. I think what happens depends on
> the SOC design.
>
> Yes, the incoming attribute combines, but unlike the CPU which will
> have per-page memory attributes in the S1, the DMA initiator will
> almost always use the same memory attributes.
>
> In other words, we cannot rely on the DMA initiator to indicate if the
> underlying memory should be MMIO or CACHE like the CPU can.
>
> I think you have to set CACHE/MMIO correctly here.
I see, I think you mean [1]; thanks for pointing it out. I think we have
to keep things as is.
Thanks,
Mostafa
[1] https://lore.kernel.org/all/8f912671-f1d9-4f73-9c1d-e39938bfc09f@arm.com/
>
> Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-09-29 11:10 ` Mostafa Saleh
@ 2025-10-02 15:13 ` Jason Gunthorpe
2025-11-05 16:40 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-10-02 15:13 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Mon, Sep 29, 2025 at 11:10:11AM +0000, Mostafa Saleh wrote:
> > Another possible solution, to keep a device bound to the KVM driver,
> > is to probe the SMMUs from the KVM driver, then create child devices;
> > possibly use something like device_set_of_node_from_dev() to bind those to
> > the main SMMUv3 or find another way to probe the main SMMUv3 without
> > changes.
I do prefer something more like this one, I think it is nice that the
kvm specific driver will remain bound and visible so there are some
breadcrumbs about what happened to the system for debugging/etc.
Not sure how to do it, but I think it should be achievable..
Maybe even a simple faux/aux device and just pick up the of_node from
the parent..
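Something like this sketch, maybe (names are all made up, just to show
the shape):

#include <linux/auxiliary_bus.h>
#include <linux/slab.h>

static void kvm_smmu_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct auxiliary_device, dev));
}

/* Called by the pkvm driver once the hyp side of an SMMU is ready. */
static int kvm_smmu_publish(struct device *parent)
{
	struct auxiliary_device *adev;
	int ret;

	adev = kzalloc(sizeof(*adev), GFP_KERNEL);
	if (!adev)
		return -ENOMEM;

	adev->name = "nested-smmu";
	adev->dev.parent = parent;	/* the real SMMU platform device */
	adev->dev.release = kvm_smmu_adev_release;

	ret = auxiliary_device_init(adev);
	if (ret) {
		kfree(adev);
		return ret;
	}

	ret = auxiliary_device_add(adev);
	if (ret)
		auxiliary_device_uninit(adev);
	return ret;
}

The aux driver then binds to that, takes the of_node and resources from
dev->parent, and registers with the iommu core.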
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-10-02 15:13 ` Jason Gunthorpe
@ 2025-11-05 16:40 ` Mostafa Saleh
2025-11-05 17:12 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-11-05 16:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Thu, Oct 02, 2025 at 12:13:08PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 29, 2025 at 11:10:11AM +0000, Mostafa Saleh wrote:
> > Another possible solution, to keep a device bound to the KVM driver,
> > is to probe the SMMUs from the KVM driver, then create child devices;
> > possibly use something like device_set_of_node_from_dev() to bind those to
> > the main SMMUv3 or find another way to probe the main SMMUv3 without
> > changes.
>
> I do prefer something more like this one, I think it is nice that the
> kvm specific driver will remain bound and visible so there is some
> breadcrumbs about what happened to the system for debugging/etc.
>
> Not sure how to do it, but I think it should be achievable..
>
> Maybe even a simple faux/aux device and just pick up the of_node from
> the parent..
I spent some time looking into this. The approach was to create new
devices as:
pdev = platform_device_alloc(dev_name(dev), PLATFORM_DEVID_AUTO);
pdev->dev.parent = dev;
device_set_node(&pdev->dev, dev->fwnode);
platform_device_add_resources(pdev, cur_pdev->resource,
cur_pdev->num_resources);
platform_device_add(pdev);
That is done from an initcall after KVM init, where the KVM driver
probes the SMMUs, which then does
bus_rescan_devices(&platform_bus_type);
In the KVM driver probe, it had:
if (pdev->dev.parent->driver == &smmuv3_nesting_driver.driver)
return -ENODEV;
which makes the KVM driver skip the duplicates, so the main SMMU driver
probes the new devices.
However, that didn’t work because, from Linux's perspective, the
nested driver was bound to all the SMMUs, which means that any
device connected to an SMMUv3 had its dependencies met; that
caused those drivers to start probing without IOMMU ops.
Also, the approach with bind/unbind seemed not to work reliably
for the same reason.
Looking into the probe path, it roughly does:
1) Device/driver matching (driver_match_device)
2) Check suppliers before probe (device_links_check_suppliers)
3) Actual probe
I can’t see a way of adding dependencies in #1
For #2, there are two problems:
i) It’s not clear how to create links; something like fwnode_link_add()
won’t work as one of the devices won’t have an fwnode, and device_link_add()
will need the device to be already created (and I'm not sure how
to guarantee it won’t probe)
ii) Assuming we were able to create the link, it will be set to
DL_STATE_AVAILABLE once the nested driver probes, which won’t prevent
the main driver from probing till KVM initialises.
It seems device links are not the right tool to use.
So far, the requirements we need to satisfy are:
1- No driver should bind to the SMMUs before KVM initialises.
2- Back the nested driver with devices and possibly link them
The only possible solutions I see:
1- Keep patch as is
2- Check if KVM is initialised from the SMMUv3 driver,
if not, -EPROBE_DEFER (as Will suggested); that will be guarded by the
KVM driver macro and cmdline to enable protected mode.
Then if needed, we can create devices from the nested driver and link
them to the main ones in the same initcall after the devices are created.
I can look into more suggestions; otherwise, I will try #2
with the -EPROBE_DEFER.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-11-05 16:40 ` Mostafa Saleh
@ 2025-11-05 17:12 ` Jason Gunthorpe
2025-11-06 11:06 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-11-05 17:12 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Nov 05, 2025 at 04:40:26PM +0000, Mostafa Saleh wrote:
> However, that didn’t work because, from Linux's perspective, the
> nested driver was bound to all the SMMUs, which means that any
> device connected to an SMMUv3 had its dependencies met; that
> caused those drivers to start probing without IOMMU ops.
??
What code is doing this?
If a struct device gets a fwspec attached to it then it should not
permit any driver to probe until iommu_init_device() has
succeeded. This broadly needs to work to support iommu drivers as
modules that are loaded by the initrd.
So the general principle of causing devices to not progress should
already be there and work; if it doesn't, then maybe it needs some
fixing.
I expect iommu_init_device() to fail on devices up until the actual
iommu driver is loaded. iommu_fwspec_ops() should fail because
iommu_from_fwnode() will not find fwnode in the iommu_device_list
until the iommu subsystem driver is bound; the kvm driver cannot
supply this.
So where do things go wrong for you?
> It seems device links are not the right tool to use.
Yes
> So far, the requirements we need to satisfy are:
> 1- No driver should bind to the SMMUs before KVM initialises.
Using the above I'd expect a sequence where the KVM SMMU driver loads
first, it does its bit, then once KVM is happy it creates the actual
SMMU driver which registers in iommu_device_list and triggers driver
binding.
This is basically an identical sequence to loading an iommu driver
from the initrd - just the trigger for the delayed load is the kvm
creating the device, not udev running.
> 2- Check if KVM is initialised from the SMMUv3 driver,
> if not, -EPROBE_DEFER (as Will suggested); that will be guarded by the
> KVM driver macro and cmdline to enable protected mode.
SMMUv3 driver shouldn't even be bound until KVM is ready and it is an
actual working driver? Do this by not creating the struct device until
it is ready.
Also Greg will not like it if you use platform devices here; use an aux
device..
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-11-05 17:12 ` Jason Gunthorpe
@ 2025-11-06 11:06 ` Mostafa Saleh
2025-11-06 13:23 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-11-06 11:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Wed, Nov 05, 2025 at 01:12:08PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 04:40:26PM +0000, Mostafa Saleh wrote:
> > However, that didn’t work because, from Linux's perspective, the
> > nested driver was bound to all the SMMUs, which means that any
> > device connected to an SMMUv3 had its dependencies met; that
> > caused those drivers to start probing without IOMMU ops.
>
> ??
>
> What code is doing this?
>
> If a struct device gets a fwspec attached to it then it should not
> permit any driver to probe until iommu_init_device() has
> succeeded. This broadly needs to work to support iommu drivers as
> modules that are loaded by the initrd.
>
> So the general principle of causing devices to not progress should
> already be there and work; if it doesn't, then maybe it needs some
> fixing.
>
> I expect iommu_init_device() to fail on devices up until the actual
> iommu driver is loaded. iommu_fwspec_ops() should fail because
> iommu_from_fwnode() will not find fwnode in the iommu_device_list
> until the iommu subsystem driver is bound, the kvm driver cannot
> supply this.
>
> So where do things go wrong for you?
Thanks for the explanation, I had a closer look, and indeed I was
confused: iommu_init_device() was failing because of .probe_device().
Because of device_set_node(), now both devices have the same fwnode,
so bus_find_device_by_fwnode() from arm_smmu_get_by_fwnode() was returning
the wrong device.
driver_find_device_by_fwnode() seems to work, but that makes me question
the reliability of this approach.
>
> > It seems device links are not the write tool to use.
>
> Yes
>
> > So far, the requirements we need to satisfy are:
> > 1- No driver should bind to the SMMUs before KVM initialises.
>
> Using the above I'd expect a sequence where the KVM SMMU driver loads
> first, it does it's bit, then once KVM is happy it creates the actual
> SMMU driver which registers in iommu_device_list and triggers driver
> binding.
>
> This is basically an identical sequence to loading an iommu driver
> from the initrd - just the trigger for the delayed load is the kvm
> creating the device, not udev runnign.
The SMMUv3 driver as a module won't be a problem, as modules are loaded
later, after KVM initialises. The problem is mainly when the SMMUv3 driver
is built in; I don't think there is a way to delay loading of the driver,
besides this patch, which registers the driver later in the KVM case.
>
> > 2- Check if KVM is initialised from the SMMUv3 driver,
> > if not -EPROBE_DEFER (as Will suggested), that will guarded by the
> > KVM driver macro and cmdline to enable protected mode.
>
> SMMUv3 driver shouldn't even be bound until KVM is ready and it is an
> actual working driver? Do this by not creating the struct device until
> it is ready.
>
> Also Greg will not like if you use platform devices here, use an aux
> device..
>
But I am not sure if it is possible with built-in drivers to delay
the binding.
Also, I had to use platform devices for this, as the KVM driver binds
to the actual SMMUv3 nodes, and then duplicates them so the SMMUv3
driver can bind to the duplicate nodes, where the KVM devices are the
parent. But this approach seems complicated, besides the problems
mentioned above.
The other approach would be to keep deferring in case of KVM:
@@ -4454,6 +4454,10 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
struct arm_smmu_device *smmu;
struct device *dev = &pdev->dev;
+ if (IS_ENABLED(CONFIG_ARM_SMMU_V3_PKVM) && is_protected_kvm_enabled() &&
+ !static_branch_unlikely(&kvm_protected_mode_initialized))
+ return -EPROBE_DEFER;
That works for me. And if we want to back the KVM driver with a device, I
was thinking we can rely on impl_ops, which has two benefits:
1- The SMMUv3 devices can be the parent instead of KVM.
2- The KVM devices can be faux/aux as they are not coming from FW and
don't need to be on the platform bus.
And this is simpler.
Besides this approach and the one in this patch, I don't see a simple way
of achieving this without adding extra support in the driver model/platform
bus to express such dependency.
Thanks,
Mostafa
> Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-11-06 11:06 ` Mostafa Saleh
@ 2025-11-06 13:23 ` Jason Gunthorpe
2025-11-06 16:54 ` Mostafa Saleh
0 siblings, 1 reply; 82+ messages in thread
From: Jason Gunthorpe @ 2025-11-06 13:23 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Thu, Nov 06, 2025 at 11:06:11AM +0000, Mostafa Saleh wrote:
> Thanks for the explanation, I had a closer look, and indeed I was
> confused, iommu_init_device() was failing because of .probe_device().
> Because of device_set_node(), now both devices have the same fwnode,
> so bus_find_device_by_fwnode() from arm_smmu_get_by_fwnode() was returning
> the wrong device.
>
> driver_find_device_by_fwnode() seems to work, but that makes me question
> the reliability of this approach.
Yeah, this stuff is nasty. See the discussion here.
https://lore.kernel.org/linux-iommu/0d5d4d02-eb78-43dc-8784-83c0760099f7@arm.com/
riscv doesn't search, so maybe ARM should follow its technique:
static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
{
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct riscv_iommu_device *iommu;
struct riscv_iommu_info *info;
struct riscv_iommu_dc *dc;
u64 tc;
int i;
if (!fwspec || !fwspec->iommu_fwnode->dev || !fwspec->num_ids)
return ERR_PTR(-ENODEV);
iommu = dev_get_drvdata(fwspec->iommu_fwnode->dev);
if (!iommu)
return ERR_PTR(-ENODEV);
It would make it reliable..
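i.e. something like (sketch, assuming fwnode->dev points back at the
bound SMMU device with the arm_smmu_device in its drvdata):

static struct arm_smmu_device *arm_smmu_get_by_fwnode(struct fwnode_handle *fwnode)
{
	/* No bus/driver search, just follow the fwnode back. */
	if (!fwnode->dev)
		return NULL;
	return dev_get_drvdata(fwnode->dev);
}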
> > > 2- Check if KVM is initialised from the SMMUv3 driver,
> > > if not, -EPROBE_DEFER (as Will suggested); that will be guarded by the
> > > KVM driver macro and cmdline to enable protected mode.
> >
> > SMMUv3 driver shouldn't even be bound until KVM is ready and it is an
> > actual working driver? Do this by not creating the struct device until
> > it is ready.
> >
> > Also Greg will not like if you use platform devices here, use an aux
> > device..
>
> But I am not sure if it is possible with built-in drivers to delay
> the binding.
You should never be delaying binding, you should be delaying creating
the device that will be bound.
pkvm claims the platform device.
pkvm completes its initialization and then creates an aux device
smmu driver binds the aux device and grabs the real platform_device
smmu driver grabs the resources it needs from the parent, including
the of node. No duplication.
Seems straightforward to me.
> Also, I had to use platform devices for this, as the KVM driver binds
> to the actual SMMUv3 nodes, and then duplicates them so the SMMUv3
> driver can bind to the duplicate nodes, where the KVM devices are the
> parent. But this approach seems complicated, besides the problems
> mentioned above.
I don't think you need to do this; you can use an aux device, and the
fwspec things all search the iommu_device_list to find the
iommu driver. You don't need to duplicate anything.
Create the aux driver when the emulated smmu is ready to go.
> That works for me. And if we want to back the KVM driver with a device, I
> was thinking we can rely on impl_ops, which has two benefits:
> 1- The SMMUv3 devices can be the parent instead of KVM.
> 2- The KVM devices can be faux/aux as they are not coming from FW and
> don't need to be on the platform bus.
IMHO this is backwards. The kvm driver should be probing first, the
smmu driver should come later once kvm is ready to go.
> Besides this approach and the one in this patch, I don't see a simple way
> of achieving this without adding extra support in the driver model/platform
> bus to express such dependency.
You shouldn't need anything like this.
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-11-06 13:23 ` Jason Gunthorpe
@ 2025-11-06 16:54 ` Mostafa Saleh
2025-11-06 17:16 ` Jason Gunthorpe
0 siblings, 1 reply; 82+ messages in thread
From: Mostafa Saleh @ 2025-11-06 16:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Thu, Nov 06, 2025 at 09:23:31AM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 06, 2025 at 11:06:11AM +0000, Mostafa Saleh wrote:
> > Thanks for the explanation, I had a closer look, and indeed I was
> > confused, iommu_init_device() was failing because of .probe_device().
> > Because of device_set_node(), now both devices have the same fwnode,
> > so bus_find_device_by_fwnode() from arm_smmu_get_by_fwnode() was returning
> > the wrong device.
> >
> > driver_find_device_by_fwnode() seems to work, but that makes me question
> > the reliability of this approach.
>
> Yeah, this stuff is nasty. See the discussion here.
>
> https://lore.kernel.org/linux-iommu/0d5d4d02-eb78-43dc-8784-83c0760099f7@arm.com/
>
> riscv doesn't search, so maybe ARM should follow it's technique:
>
> static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> {
> struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> struct riscv_iommu_device *iommu;
> struct riscv_iommu_info *info;
> struct riscv_iommu_dc *dc;
> u64 tc;
> int i;
>
> if (!fwspec || !fwspec->iommu_fwnode->dev || !fwspec->num_ids)
> return ERR_PTR(-ENODEV);
>
> iommu = dev_get_drvdata(fwspec->iommu_fwnode->dev);
> if (!iommu)
> return ERR_PTR(-ENODEV);
>
> It would make it reliable..
That makes sense, and it will address the problem Robin was solving also:
https://lore.kernel.org/r/6d7ce1dc31873abdb75c895fb8bd2097cce098b4.1733406914.git.robin.murphy@arm.com
>
> > > > 2- Check if KVM is initialised from the SMMUv3 driver,
> > > > if not, -EPROBE_DEFER (as Will suggested); that will be guarded by the
> > > > KVM driver macro and cmdline to enable protected mode.
> > >
> > > SMMUv3 driver shouldn't even be bound until KVM is ready and it is an
> > > actual working driver? Do this by not creating the struct device until
> > > it is ready.
> > >
> > > Also Greg will not like if you use platform devices here, use an aux
> > > device..
> >
> > But I am not sure if it is possible with built-in drivers to delay
> > the binding.
>
> You should never be delaying binding, you should be delaying creating
> the device that will be bound.
>
> pkvm claims the platform device.
>
> pkvm completes its initialization and then creates an aux device
>
> smmu driver binds the aux device and grabs the real platform_device
>
> smmu driver grabs the resources it needs from the parent, including
> the of node. No duplication.
>
> Seems straightforward to me.
Maybe I am misunderstanding this, but that looks really intrusive to me;
at the moment arm-smmu-v3.c is a platform driver and relies on the
platform bus to understand the device (platform_get_resource...).
You are suggesting to change that so it can also bind to AUX devices, then
change the “arm_smmu_device_probe” function to understand that and possibly
parse info from the parent device?
One of the main benefits of choosing trap and emulate was that it
looks transparent from the kernel's point of view, so doing such radical
changes to adapt to KVM doesn't look right to me; I think the driver
should remain as is (a platform driver that thinks it's directly
talking to the HW).
The only thing we need to do is to make the SMMUs available after
KVM is up (at device_sync initcall).
>
> > Also, I had to use platform devices for this, as the KVM driver binds
> > to the actual SMMUv3 nodes, and then duplicates them so the SMMUv3
> > driver can bind to the duplicate nodes, where the KVM devices are the
> > parent, but this approach seems complicated, besides the problems
> > mentioned above.
>
> I don't think you need to do this this, you can use aux device and the
> fwspec things all search the iommu_devices_list to find the
> iommu_driver. You don't need to duplicate anything.
>
> Create the aux driver when the emulated smmu is ready to go.
See my point above.
>
> > That works for me. And if we want to back the KVM driver with device I was
> > thinking we can rely on impl_ops, that has 2 benefits:
>
> > 1- The SMMUv3 devices can be the parent instead of KVM.
> > 2- The KVM devices can be faux/aux as they are not coming from FW and
> > don't need to be on the platform bus.
>
> IMHO this is backwards. The kvm driver should be probing first, the
> smmu driver should come later once kvm is ready to go.
Agree.
>
> > Besides this approach and the one in this patch, I don't see a simple way
> > of achieving this without adding extra support in the driver model/platform
> > bus to express such dependency.
>
> You shouldn't need anything like this.
Agree.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode
2025-11-06 16:54 ` Mostafa Saleh
@ 2025-11-06 17:16 ` Jason Gunthorpe
0 siblings, 0 replies; 82+ messages in thread
From: Jason Gunthorpe @ 2025-11-06 17:16 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Will Deacon, linux-kernel, kvmarm, linux-arm-kernel, iommu, maz,
oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
catalin.marinas, robin.murphy, jean-philippe, qperret, tabba,
mark.rutland, praan
On Thu, Nov 06, 2025 at 04:54:38PM +0000, Mostafa Saleh wrote:
> Maybe I am misunderstanding this, but that looks really intrusive to me,
> at the moment arm-smmu-v3.c is a platform driver and relies on the
> platform bus to understand the device (platform_get_resource...)
>
> You are suggesting to change that so it can also bind to AUX devices, then
> change the “arm_smmu_device_probe” function to understand that and possibly
> parse info from the parent device?
Yes, it is probably only a couple lines I think. You still have a
platform device, it just comes from a different spot.
I didn't audit it closely, but basically it starts like this:
-static int arm_smmu_device_probe(struct platform_device *pdev)
+/*
+ * dev is the device that the driver is bound to
+ * pdev is the device that has the physical resources describing the smmu
+ */
+static int arm_smmu_device_probe_impl(struct device *dev,
+ struct platform_device *pdev)
{
int irq, ret;
struct resource *res;
resource_size_t ioaddr;
struct arm_smmu_device *smmu;
- struct device *dev = &pdev->dev;
smmu = devm_kzalloc(dev, sizeof(*smmu), GFP_KERNEL);
if (!smmu)
Probably needs some adjustments to switch places between pdev/dev, but
the ones I looked at were all OK already..
In the aux case dev is the aux dev, otherwise dev and pdev are the
same thing. The devm-related stuff has to use dev.
> One of the main benefits from choosing trap and emulate was that it
> looks transparent from the kernel of point view, so doing such radical
> changes to adapt to KVM doesn't look right to me, I think the driver
> should remain as is (a platform driver that thinks it's directly
> talking to the HW).
I'm not so fixed on this idea; this kvm stuff makes enough meaningful
changes that I don't think we need to sweep it all under the rug
completely transparently. If you need a couple of edits to the probe
function, that's fine in my book.
Jason
^ permalink raw reply [flat|nested] 82+ messages in thread
end of thread, other threads: [~2025-11-06 17:17 UTC | newest]
Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-19 21:51 [PATCH v4 00/28] KVM: arm64: SMMUv3 driver for pKVM (trap and emulate) Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 01/28] KVM: arm64: Add a new function to donate memory with prot Mostafa Saleh
2025-09-09 13:46 ` Will Deacon
2025-09-14 19:23 ` Pranjal Shrivastava
2025-09-16 11:58 ` Mostafa Saleh
2025-09-16 11:56 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 02/28] KVM: arm64: Donate MMIO to the hypervisor Mostafa Saleh
2025-09-09 14:12 ` Will Deacon
2025-09-16 13:27 ` Mostafa Saleh
2025-09-26 14:33 ` Will Deacon
2025-09-29 10:57 ` Mostafa Saleh
2025-09-14 20:41 ` Pranjal Shrivastava
2025-09-16 13:43 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 03/28] KVM: arm64: pkvm: Add pkvm_time_get() Mostafa Saleh
2025-09-09 14:16 ` Will Deacon
2025-09-09 15:56 ` Marc Zyngier
2025-09-15 11:10 ` Pranjal Shrivastava
2025-09-16 14:04 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 04/28] iommu/io-pgtable-arm: Move selftests to a separate file Mostafa Saleh
2025-09-15 14:37 ` Pranjal Shrivastava
2025-09-16 14:07 ` Mostafa Saleh
2025-09-15 16:45 ` Jason Gunthorpe
2025-09-16 14:09 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 05/28] iommu/io-pgtable-arm: Factor kernel specific code out Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 06/28] iommu/arm-smmu-v3: Split code with hyp Mostafa Saleh
2025-09-09 14:23 ` Will Deacon
2025-09-16 14:10 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 07/28] iommu/arm-smmu-v3: Move TLB range invalidation into a macro Mostafa Saleh
2025-09-09 14:25 ` Will Deacon
2025-08-19 21:51 ` [PATCH v4 08/28] iommu/arm-smmu-v3: Move IDR parsing to common functions Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 09/28] KVM: arm64: iommu: Introduce IOMMU driver infrastructure Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 10/28] KVM: arm64: iommu: Shadow host stage-2 page table Mostafa Saleh
2025-09-09 14:42 ` Will Deacon
2025-09-16 14:24 ` Mostafa Saleh
2025-09-26 14:42 ` Will Deacon
2025-09-29 11:01 ` Mostafa Saleh
2025-09-30 12:38 ` Jason Gunthorpe
2025-09-30 12:55 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 11/28] KVM: arm64: iommu: Add memory pool Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 12/28] KVM: arm64: iommu: Support DABT for IOMMU Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 13/28] iommu/arm-smmu-v3-kvm: Add SMMUv3 driver Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 14/28] iommu/arm-smmu-v3: Add KVM mode in the driver Mostafa Saleh
2025-09-12 13:52 ` Will Deacon
2025-09-16 14:30 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 15/28] iommu/arm-smmu-v3: Load the driver later in KVM mode Mostafa Saleh
2025-09-12 13:54 ` Will Deacon
2025-09-23 14:35 ` Mostafa Saleh
2025-09-23 17:38 ` Jason Gunthorpe
2025-09-29 11:10 ` Mostafa Saleh
2025-10-02 15:13 ` Jason Gunthorpe
2025-11-05 16:40 ` Mostafa Saleh
2025-11-05 17:12 ` Jason Gunthorpe
2025-11-06 11:06 ` Mostafa Saleh
2025-11-06 13:23 ` Jason Gunthorpe
2025-11-06 16:54 ` Mostafa Saleh
2025-11-06 17:16 ` Jason Gunthorpe
2025-08-19 21:51 ` [PATCH v4 16/28] iommu/arm-smmu-v3-kvm: Create array for hyp SMMUv3 Mostafa Saleh
2025-09-09 18:30 ` Daniel Mentz
2025-09-16 14:35 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 17/28] iommu/arm-smmu-v3-kvm: Take over SMMUs Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 18/28] iommu/arm-smmu-v3-kvm: Probe SMMU HW Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 19/28] iommu/arm-smmu-v3-kvm: Add MMIO emulation Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 20/28] iommu/arm-smmu-v3-kvm: Shadow the command queue Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 21/28] iommu/arm-smmu-v3-kvm: Add CMDQ functions Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 22/28] iommu/arm-smmu-v3-kvm: Emulate CMDQ for host Mostafa Saleh
2025-09-12 14:18 ` Will Deacon
2025-09-15 16:38 ` Jason Gunthorpe
2025-09-16 15:19 ` Mostafa Saleh
2025-09-17 12:36 ` Jason Gunthorpe
2025-09-17 15:01 ` Will Deacon
2025-09-17 15:16 ` Jason Gunthorpe
2025-09-17 15:25 ` Will Deacon
2025-09-17 15:59 ` Jason Gunthorpe
2025-09-18 10:26 ` Will Deacon
2025-09-18 14:36 ` Jason Gunthorpe
2025-09-16 14:50 ` Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 23/28] iommu/arm-smmu-v3-kvm: Shadow stream table Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 24/28] iommu/arm-smmu-v3-kvm: Shadow STEs Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 25/28] iommu/arm-smmu-v3-kvm: Emulate GBPA Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 26/28] iommu/arm-smmu-v3-kvm: Support io-pgtable Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 27/28] iommu/arm-smmu-v3-kvm: Shadow the CPU stage-2 page table Mostafa Saleh
2025-08-19 21:51 ` [PATCH v4 28/28] iommu/arm-smmu-v3-kvm: Enable nesting Mostafa Saleh