* [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
@ 2024-12-12 18:03 Mostafa Saleh
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
This is v2 of the series sent last year:
https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro.org/
pKVM overview:
=============
The pKVM hypervisor, recently introduced on arm64, provides a separation
of privileges between the host and hypervisor parts of KVM, where the
hypervisor is trusted by guests but the host is not [1][2]. The host is
initially trusted during boot, but its privileges are reduced after KVM
is initialized so that, if an adversary later gains access to the large
attack surface of the host, it cannot access guest data.
Currently with pKVM, the host can still instruct DMA-capable devices
like the GPU to access guest and hypervisor memory, which undermines
this isolation. Preventing DMA attacks requires an IOMMU, owned by the
hypervisor.
This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
host SMMU driver into nVHE isn't really an option. It is too large and
complex and requires infrastructure from all over the kernel. We add a
reduced nVHE driver that deals with populating the SMMU tables and the
command queue, and the host driver still deals with probing and some
initialization.
Some of the pKVM infrastructure that this series depends on is not
upstream yet, so this should be considered a forward-looking RFC for
how we think DMA isolation can be supported in pKVM (or in other
similar confidential computing solutions), rather than a ready-to-merge
solution.
This is discussed further in the dependencies section below.
Patches overview
================
The patches are split as follows:
Patches 1-10: Mostly about splitting the current SMMUv3 driver and the
io-pgtable-arm library, so the code can be re-used by the KVM driver,
either inside the kernel or in the hypervisor.
Most of these patches are best reviewed with git's --color-moved.
Patches 11-24: Introduce the hypervisor core code for IOMMUs, which is
not specific to SMMUv3: the hypercall handlers and the common logic in
the hypervisor.
They also introduce the key functions __pkvm_host_{un}use_dma_page,
which are used to track DMA-mapped pages; more on this in the design
section.
Patches 25-41: Add the hypervisor part of the KVM SMMUv3 driver, which
is called by the hypervisor core IOMMU code; these are para-virtualized
operations such as attach/detach, map/unmap...
Patches 42-54: Add the kernel part of the KVM SMMUv3 driver. This
probes and initialises the IOMMUs and passes the list of SMMUs to the
hypervisor; it also implements the kernel iommu_ops and registers the
IOMMUs with the kernel.
Patches 55-58: Two extra optimizations, introduced at the end to avoid
complicating the start of the series: one optimises iommu_map_sg and
the other batches TLB invalidation, which I noticed was a problem while
testing as my HW doesn’t support range invalidation.
A development branch is available at:
https://android-kvm.googlesource.com/linux/+log/refs/heads/for-upstream/pkvm-smmu
Design
======
We've explored 4 solutions so far; only the two which I believe are the
most promising, as they offer private IO spaces, are mentioned here.
The others were discussed in the cover letter of v1 of the series.
1. Paravirtual I/O page tables
This is the solution implemented in this series. The host creates
IOVA->HPA mappings with two hypercalls, map_pages() and unmap_pages(),
and the hypervisor populates the page tables. Page tables are
abstracted into IOMMU domains, which allow multiple devices to share
the same address space. Another four hypercalls, alloc_domain(),
attach_dev(), detach_dev() and free_domain(), manage the domains; the
semantics of those hypercalls are almost identical to the IOMMU ops,
which keeps the kernel driver part simple.
Some key points in the hypervisor design:
a- Tracking mapped pages: the hypervisor must prevent pages mapped in the
IOMMU from being donated to a protected guest or the hypervisor, and
must not allow a protected guest/hypervisor page to be mapped in an
IOMMU domain.
For that we rely on the vmemmap refcount: each time a page is mapped,
its ownership is checked and its refcount is incremented, and each time
it's successfully unmapped the refcount is decremented. Any memory
donation is denied for refcounted pages (see the sketch after this
list).
b- Locking: The io-pgtable-arm code is lockless under some guarantees of
how the IOMMU code behaves. However, with pKVM the kernel is not
trusted, and a malicious kernel could issue concurrent requests causing
memory corruption or UAF, so the page tables have to be locked in the
hypervisor (also illustrated in the sketch after this list).
c- Memory management: The hypervisor needs a way to allocate pages for
the paravirtual page tables. For that, an IOMMU pool is created which
can be topped up via a hypercall, and the IOMMU hypercalls return
encoded memory requests which can be fulfilled by the kernel driver.
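To make points (a) and (b) above concrete, here is a minimal
hypervisor-side sketch. It assumes a per-page refcount in the
hypervisor's vmemmap (struct hyp_page) and a per-domain lock;
hyp_spin_lock() and hyp_phys_to_page() exist in the pKVM hypervisor,
but the other helper names, the domain layout and the page-state check
are illustrative assumptions rather than the exact code in this series:

static int __pkvm_host_use_dma_page(phys_addr_t phys)
{
    struct hyp_page *p = hyp_phys_to_page(phys);

    /* Only pages currently owned by the host may be mapped for DMA. */
    if (host_get_page_state(phys) != PKVM_PAGE_OWNED)
        return -EPERM;

    /*
     * Pin the page. The donation paths refuse pages with a non-zero
     * refcount, so a DMA-mapped page can't become guest/hyp private.
     */
    p->refcount++;
    return 0;
}

static int kvm_iommu_map_page(struct kvm_hyp_iommu_domain *domain,
                              unsigned long iova, phys_addr_t paddr,
                              int prot)
{
    size_t mapped = 0;
    int ret;

    /* (b) The host is untrusted: serialise page-table updates. */
    hyp_spin_lock(&domain->lock);

    /* (a) Check ownership and take a reference before mapping. */
    ret = __pkvm_host_use_dma_page(paddr);
    if (!ret) {
        ret = domain->pgtable->ops.map_pages(&domain->pgtable->ops,
                                             iova, paddr, PAGE_SIZE, 1,
                                             prot, 0, &mapped);
        if (ret)
            __pkvm_host_unuse_dma_page(paddr); /* drop the pin */
    }

    hyp_spin_unlock(&domain->lock);
    return ret;
}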
2. Nested SMMUv3 translation (with emulation)
Another approach is to rely on nested translation support, which is
optional in SMMUv3. It requires an architecturally accurate emulation
of SMMUv3, which can be complicated, including cmdq emulation.
With this approach, we can use the same page tables as the CPU stage-2,
which adds more constraints on the HW (SMMUv3 features must match the
CPU) and on the ability of the devices to handle faults, as the CPU
part relies on lazy mapping and gives no guarantee about pages being
mapped. Alternatively, we can use a shadow IOMMU page table instead.
I have a prototype of the nested approach that is not yet ready to be
posted:
https://android-kvm.googlesource.com/linux/+log/refs/heads/smostafa/android15-6.6-smmu-nesting-wip
The trade-off between the 2 approaches can be roughly summarised as:
Paravirtualization:
- Compatible with more HW (and IOMMUs).
- Better DMA performance due to shorter table walks/less TLB pressure.
- Needs extra complexity to squeeze the last bit of optimization (around
unmap and map_sg).
Nested emulation:
- Faster map_pages (not sure about unmap, because it requires cmdq
emulation for TLB invalidation if DVM is not used).
- Needs extra complexity for architecturally emulating SMMUv3.
I believe the first approach looks more promising given this trade-off.
However, I plan to complete the nested emulation and post it with a
performance comparison against this approach, and maybe this topic can
be discussed in an upcoming conference.
Dependencies
============
This series depends on some parts of pKVM that are not upstream yet,
some of which are currently posted [3][4]. However, to avoid spamming
the list with many changes that are not relevant to IOMMU/SMMUv3, the
patches are developed on top of them.
This series also depends on another series reworking the io-pgtable
walker [5].
Performance
===========
Measured with CONFIG_DMA_MAP_BENCHMARK on a 4-core Morello board.
Numbers represent the average time (map/unmap) needed for one
dma_map/dma_unmap call in μs; lower is better.
It is compared with the kernel driver, which is not quite a fair
comparison as it doesn't fulfil the pKVM DMA isolation requirements.
However, these numbers are provided just to give a rough idea of what
the overhead looks like.

                  Kernel driver   pKVM driver
4K - 1 thread     0.1/0.7         0.3/1.3
4K - 4 threads    0.1/1.1         0.5/3.3
1M - 1 thread     0.8/21.5        2.6/27.3
1M - 4 threads    1.1/45.7        3.6/46.2
And tested as follows:
echo dma_map_benchmark > /sys/bus/pci/devices/0000\:06\:00.0/driver_override
echo 0000:06:00.0 > /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
echo 0000:06:00.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
./dma_map_benchmark -t $threads -g $nr_pages
Future work
===========
- Add IDENTITY_DOMAIN support; I already have some patches for that, but
didn’t want to complicate this series, so I can send them separately.
- Complete the comparison with the nesting support and find the most
suitable solution for upstream.
Main changes since v1
=====================
- Patches are reordered to split the introduction of the KVM IOMMU
code and the SMMUv3 driver.
- KVM EL2 code is closer to the EL1 code, where domains are decoupled
from IOMMUs.
- SMMUv3 new features (stage-1 support, IRQ and EVTQ in the kernel).
- Adaptations to the new SMMUv3 cleanups.
- Rework tracking of mapped pages to improve performance.
- Rework locking to improve performance.
- Rework unmap to improve performance.
- Add iotlb_gather to optimize unmap.
- Add new operations to optimize the map_sg operation.
- Driver registration is now done dynamically instead of being checked
statically.
- Memory allocation for page-table pages is changed to use a separate
pool and HVCs, instead of a shared memcache that required atomic
allocation.
- Support for higher order page allocation.
- Support for non-coherent SMMUs.
- Support for DABT and MMIO emulation.
[1] https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
[2] https://www.youtube.com/watch?v=9npebeVFbFw
[3] https://lore.kernel.org/kvmarm/20241203103735.2267589-1-qperret@google.com/
[4] https://lore.kernel.org/all/20241202154742.3611749-1-tabba@google.com/
[5] https://lore.kernel.org/linux-iommu/20241028213146.238941-1-robdclark@gmail.com/T/#t
Jean-Philippe Brucker (23):
iommu/io-pgtable-arm: Split the page table driver
iommu/io-pgtable-arm: Split initialization
iommu/io-pgtable: Add configure() operation
iommu/arm-smmu-v3: Move some definitions to arm64 include/
iommu/arm-smmu-v3: Extract driver-specific bits from probe function
iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c
iommu/arm-smmu-v3: Move queue and table allocation to
arm-smmu-v3-common.c
iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common
iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c
KVM: arm64: pkvm: Add pkvm_udelay()
KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
KVM: arm64: pkvm: Support SCMI power domain
KVM: arm64: iommu: Support power management
KVM: arm64: iommu: Add SMMUv3 driver
KVM: arm64: smmu-v3: Initialize registers
KVM: arm64: smmu-v3: Setup command queue
KVM: arm64: smmu-v3: Reset the device
KVM: arm64: smmu-v3: Support io-pgtable
iommu/arm-smmu-v3-kvm: Add host driver for pKVM
iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor
iommu/arm-smmu-v3-kvm: Validate device features
iommu/arm-smmu-v3-kvm: Allocate structures and reset device
iommu/arm-smmu-v3-kvm: Probe power domains
Mostafa Saleh (35):
iommu/arm-smmu-v3: Move common irq code to common file
KVM: arm64: Add __pkvm_{use, unuse}_dma()
KVM: arm64: Introduce IOMMU driver infrastructure
KVM: arm64: pkvm: Add IOMMU hypercalls
KVM: arm64: iommu: Add a memory pool for the IOMMU
KVM: arm64: iommu: Add domains
KVM: arm64: iommu: Add {attach, detach}_dev
KVM: arm64: iommu: Add map/unmap() operations
KVM: arm64: iommu: support iommu_iotlb_gather
KVM: arm64: Support power domains
KVM: arm64: iommu: Support DABT for IOMMU
KVM: arm64: smmu-v3: Setup stream table
KVM: arm64: smmu-v3: Setup event queue
KVM: arm64: smmu-v3: Add {alloc/free}_domain
KVM: arm64: smmu-v3: Add TLB ops
KVM: arm64: smmu-v3: Add context descriptor functions
KVM: arm64: smmu-v3: Add attach_dev
KVM: arm64: smmu-v3: Add detach_dev
iommu/io-pgtable: Generalize walker interface
iommu/io-pgtable-arm: Add post table walker callback
drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL
KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
KVM: arm64: smmu-v3: Add DABT handler
KVM: arm64: Add function to topup generic allocator
KVM: arm64: Add macro for SMCCC call with all returns
iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator
iommu/arm-smmu-v3-kvm: Add IOMMU ops
iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations
iommu/arm-smmu-v3-kvm: Support PASID operations
iommu/arm-smmu-v3-kvm: Add IRQs for the driver
iommu/arm-smmu-v3-kvm: Enable runtime PM
drivers/iommu: Add deferred map_sg operations
KVM: arm64: iommu: Add hypercall for map_sg
iommu/arm-smmu-v3-kvm: Implement sg operations
iommu/arm-smmu-v3-kvm: Support command queue batching
arch/arm64/include/asm/arm-smmu-v3-common.h | 592 +++++++
arch/arm64/include/asm/kvm_asm.h | 9 +
arch/arm64/include/asm/kvm_host.h | 48 +-
arch/arm64/include/asm/kvm_hyp.h | 2 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 8 +-
arch/arm64/kvm/hyp/hyp-constants.c | 1 +
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 91 ++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 3 +
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 37 +
.../arm64/kvm/hyp/include/nvhe/trap_handler.h | 2 +
arch/arm64/kvm/hyp/nvhe/Makefile | 6 +-
arch/arm64/kvm/hyp/nvhe/alloc_mgt.c | 2 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 114 ++
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 1390 +++++++++++++++++
.../arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c | 153 ++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 490 ++++++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 133 +-
arch/arm64/kvm/hyp/nvhe/mm.c | 17 +
arch/arm64/kvm/hyp/nvhe/power/hvc.c | 47 +
arch/arm64/kvm/hyp/nvhe/power/scmi.c | 231 +++
arch/arm64/kvm/hyp/nvhe/setup.c | 9 +
arch/arm64/kvm/hyp/nvhe/timer-sr.c | 42 +
arch/arm64/kvm/iommu.c | 89 ++
arch/arm64/kvm/mmu.c | 20 +
arch/arm64/kvm/pkvm.c | 20 +
drivers/gpu/drm/msm/msm_iommu.c | 5 +-
drivers/iommu/Kconfig | 9 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/Makefile | 7 +
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 824 ++++++++++
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 1093 +++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 989 +-----------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 758 +++------
drivers/iommu/io-pgtable-arm-common.c | 929 +++++++++++
drivers/iommu/io-pgtable-arm.c | 1061 +------------
drivers/iommu/io-pgtable-arm.h | 30 -
drivers/iommu/io-pgtable.c | 15 +
drivers/iommu/iommu.c | 53 +-
include/kvm/arm_smmu_v3.h | 46 +
include/kvm/iommu.h | 59 +
include/kvm/power_domain.h | 24 +
include/linux/io-pgtable-arm.h | 233 +++
include/linux/io-pgtable.h | 38 +-
include/linux/iommu.h | 43 +-
46 files changed, 7169 insertions(+), 2608 deletions(-)
create mode 100644 arch/arm64/include/asm/arm-smmu-v3-common.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/power/hvc.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/power/scmi.c
create mode 100644 arch/arm64/kvm/iommu.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
create mode 100644 drivers/iommu/io-pgtable-arm-common.c
delete mode 100644 drivers/iommu/io-pgtable-arm.h
create mode 100644 include/kvm/arm_smmu_v3.h
create mode 100644 include/kvm/iommu.h
create mode 100644 include/kvm/power_domain.h
create mode 100644 include/linux/io-pgtable-arm.h
--
2.47.0.338.g60cca15819-goog
* [RFC PATCH v2 01/58] iommu/io-pgtable-arm: Split the page table driver
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
To allow the KVM IOMMU driver to populate page tables using the
io-pgtable-arm code, move the shared bits into io-pgtable-arm-common.c.
Here we move the bulk of the common code, and a subsequent patch handles
the bits that require more care. phys_to_virt() and virt_to_phys() do
need special handling here because the hypervisor will have its own
version. It will also implement its own version of
__arm_lpae_alloc_pages(), __arm_lpae_free_pages() and
__arm_lpae_sync_pte() since the hypervisor needs some assistance for
allocating pages.
There are some other minor changes around mapping existing PTEs or
unmapping empty PTEs, as WARN_ON() is fatal in the hypervisor.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/Makefile | 2 +-
drivers/iommu/io-pgtable-arm-common.c | 625 ++++++++++++++++++++
drivers/iommu/io-pgtable-arm.c | 795 +-------------------------
drivers/iommu/io-pgtable-arm.h | 30 -
include/linux/io-pgtable-arm.h | 223 ++++++++
5 files changed, 866 insertions(+), 809 deletions(-)
create mode 100644 drivers/iommu/io-pgtable-arm-common.c
delete mode 100644 drivers/iommu/io-pgtable-arm.h
create mode 100644 include/linux/io-pgtable-arm.h
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 542760d963ec..70c5386ce298 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
-obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
+obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o io-pgtable-arm-common.o
obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
obj-$(CONFIG_IOMMU_IOVA) += iova.o
obj-$(CONFIG_OF_IOMMU) += of_iommu.o
diff --git a/drivers/iommu/io-pgtable-arm-common.c b/drivers/iommu/io-pgtable-arm-common.c
new file mode 100644
index 000000000000..ef14a1b50d32
--- /dev/null
+++ b/drivers/iommu/io-pgtable-arm-common.c
@@ -0,0 +1,625 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CPU-agnostic ARM page table allocator.
+ * A copy of this library is embedded in the KVM nVHE image.
+ *
+ * Copyright (C) 2022 Arm Limited
+ *
+ * Author: Will Deacon <will.deacon@arm.com>
+ */
+
+#include <linux/io-pgtable-arm.h>
+
+#include <linux/sizes.h>
+#include <linux/types.h>
+
+#define iopte_deref(pte, d) __arm_lpae_phys_to_virt(iopte_to_paddr(pte, d))
+
+static arm_lpae_iopte paddr_to_iopte(phys_addr_t paddr,
+ struct arm_lpae_io_pgtable *data)
+{
+ arm_lpae_iopte pte = paddr;
+
+ /* Of the bits which overlap, either 51:48 or 15:12 are always RES0 */
+ return (pte | (pte >> (48 - 12))) & ARM_LPAE_PTE_ADDR_MASK;
+}
+
+static phys_addr_t iopte_to_paddr(arm_lpae_iopte pte,
+ struct arm_lpae_io_pgtable *data)
+{
+ u64 paddr = pte & ARM_LPAE_PTE_ADDR_MASK;
+
+ if (ARM_LPAE_GRANULE(data) < SZ_64K)
+ return paddr;
+
+ /* Rotate the packed high-order bits back to the top */
+ return (paddr | (paddr << (48 - 12))) & (ARM_LPAE_PTE_ADDR_MASK << 4);
+}
+
+static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
+{
+ for (int i = 0; i < num_entries; i++)
+ ptep[i] = 0;
+
+ if (!cfg->coherent_walk && num_entries)
+ __arm_lpae_sync_pte(ptep, num_entries, cfg);
+}
+
+static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
+ struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t size, size_t pgcount,
+ int lvl, arm_lpae_iopte *ptep);
+
+static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+ phys_addr_t paddr, arm_lpae_iopte prot,
+ int lvl, int num_entries, arm_lpae_iopte *ptep)
+{
+ arm_lpae_iopte pte = prot;
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+ int i;
+
+ if (data->iop.fmt != ARM_MALI_LPAE && lvl == ARM_LPAE_MAX_LEVELS - 1)
+ pte |= ARM_LPAE_PTE_TYPE_PAGE;
+ else
+ pte |= ARM_LPAE_PTE_TYPE_BLOCK;
+
+ for (i = 0; i < num_entries; i++)
+ ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
+
+ if (!cfg->coherent_walk)
+ __arm_lpae_sync_pte(ptep, num_entries, cfg);
+}
+
+static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
+ unsigned long iova, phys_addr_t paddr,
+ arm_lpae_iopte prot, int lvl, int num_entries,
+ arm_lpae_iopte *ptep)
+{
+ int i;
+
+ for (i = 0; i < num_entries; i++)
+ if (iopte_leaf(ptep[i], lvl, data->iop.fmt)) {
+ /* We require an unmap first */
+ return arm_lpae_map_exists();
+ } else if (iopte_type(ptep[i]) == ARM_LPAE_PTE_TYPE_TABLE) {
+ /*
+ * We need to unmap and free the old table before
+ * overwriting it with a block entry.
+ */
+ arm_lpae_iopte *tblp;
+ size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+ tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
+ if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
+ lvl, tblp) != sz) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+ }
+
+ __arm_lpae_init_pte(data, paddr, prot, lvl, num_entries, ptep);
+ return 0;
+}
+
+static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
+ arm_lpae_iopte *ptep,
+ arm_lpae_iopte curr,
+ struct arm_lpae_io_pgtable *data)
+{
+ arm_lpae_iopte old, new;
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+
+ new = paddr_to_iopte(__arm_lpae_virt_to_phys(table), data) |
+ ARM_LPAE_PTE_TYPE_TABLE;
+ if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
+ new |= ARM_LPAE_PTE_NSTABLE;
+
+ /*
+ * Ensure the table itself is visible before its PTE can be.
+ * Whilst we could get away with cmpxchg64_release below, this
+ * doesn't have any ordering semantics when !CONFIG_SMP.
+ */
+ dma_wmb();
+
+ old = cmpxchg64_relaxed(ptep, curr, new);
+
+ if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
+ return old;
+
+ /* Even if it's not ours, there's no point waiting; just kick it */
+ __arm_lpae_sync_pte(ptep, 1, cfg);
+ if (old == curr)
+ WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
+
+ return old;
+}
+
+static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
+ phys_addr_t paddr, size_t size, size_t pgcount,
+ arm_lpae_iopte prot, int lvl, arm_lpae_iopte *ptep,
+ gfp_t gfp, size_t *mapped)
+{
+ arm_lpae_iopte *cptep, pte;
+ size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+ size_t tblsz = ARM_LPAE_GRANULE(data);
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ int ret = 0, num_entries, max_entries, map_idx_start;
+
+ /* Find our entry at the current level */
+ map_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
+ ptep += map_idx_start;
+
+ /* If we can install a leaf entry at this level, then do so */
+ if (size == block_size) {
+ max_entries = ARM_LPAE_PTES_PER_TABLE(data) - map_idx_start;
+ num_entries = min_t(int, pgcount, max_entries);
+ ret = arm_lpae_init_pte(data, iova, paddr, prot, lvl, num_entries, ptep);
+ if (!ret)
+ *mapped += num_entries * size;
+
+ return ret;
+ }
+
+ /* We can't allocate tables at the final level */
+ if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
+ return -EINVAL;
+
+ /* Grab a pointer to the next level */
+ pte = READ_ONCE(*ptep);
+ if (!pte) {
+ cptep = __arm_lpae_alloc_pages(tblsz, gfp, cfg, data->iop.cookie);
+ if (!cptep)
+ return -ENOMEM;
+
+ pte = arm_lpae_install_table(cptep, ptep, 0, data);
+ if (pte)
+ __arm_lpae_free_pages(cptep, tblsz, cfg, data->iop.cookie);
+ } else if (!cfg->coherent_walk && !(pte & ARM_LPAE_PTE_SW_SYNC)) {
+ __arm_lpae_sync_pte(ptep, 1, cfg);
+ }
+
+ if (pte && !iopte_leaf(pte, lvl, data->iop.fmt)) {
+ cptep = iopte_deref(pte, data);
+ } else if (pte) {
+ /* We require an unmap first */
+ return arm_lpae_unmap_empty();
+ }
+
+ /* Rinse, repeat */
+ return __arm_lpae_map(data, iova, paddr, size, pgcount, prot, lvl + 1,
+ cptep, gfp, mapped);
+}
+
+static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
+ int prot)
+{
+ arm_lpae_iopte pte;
+
+ if (data->iop.fmt == ARM_64_LPAE_S1 ||
+ data->iop.fmt == ARM_32_LPAE_S1) {
+ pte = ARM_LPAE_PTE_nG;
+ if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
+ pte |= ARM_LPAE_PTE_AP_RDONLY;
+ else if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_HD)
+ pte |= ARM_LPAE_PTE_DBM;
+ if (!(prot & IOMMU_PRIV))
+ pte |= ARM_LPAE_PTE_AP_UNPRIV;
+ } else {
+ pte = ARM_LPAE_PTE_HAP_FAULT;
+ if (prot & IOMMU_READ)
+ pte |= ARM_LPAE_PTE_HAP_READ;
+ if (prot & IOMMU_WRITE)
+ pte |= ARM_LPAE_PTE_HAP_WRITE;
+ }
+
+ /*
+ * Note that this logic is structured to accommodate Mali LPAE
+ * having stage-1-like attributes but stage-2-like permissions.
+ */
+ if (data->iop.fmt == ARM_64_LPAE_S2 ||
+ data->iop.fmt == ARM_32_LPAE_S2) {
+ if (prot & IOMMU_MMIO)
+ pte |= ARM_LPAE_PTE_MEMATTR_DEV;
+ else if (prot & IOMMU_CACHE)
+ pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
+ else
+ pte |= ARM_LPAE_PTE_MEMATTR_NC;
+ } else {
+ if (prot & IOMMU_MMIO)
+ pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
+ << ARM_LPAE_PTE_ATTRINDX_SHIFT);
+ else if (prot & IOMMU_CACHE)
+ pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
+ << ARM_LPAE_PTE_ATTRINDX_SHIFT);
+ }
+
+ /*
+ * Also Mali has its own notions of shareability wherein its Inner
+ * domain covers the cores within the GPU, and its Outer domain is
+ * "outside the GPU" (i.e. either the Inner or System domain in CPU
+ * terms, depending on coherency).
+ */
+ if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE)
+ pte |= ARM_LPAE_PTE_SH_IS;
+ else
+ pte |= ARM_LPAE_PTE_SH_OS;
+
+ if (prot & IOMMU_NOEXEC)
+ pte |= ARM_LPAE_PTE_XN;
+
+ if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_NS)
+ pte |= ARM_LPAE_PTE_NS;
+
+ if (data->iop.fmt != ARM_MALI_LPAE)
+ pte |= ARM_LPAE_PTE_AF;
+
+ return pte;
+}
+
+int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount,
+ int iommu_prot, gfp_t gfp, size_t *mapped)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ arm_lpae_iopte *ptep = data->pgd;
+ int ret, lvl = data->start_level;
+ arm_lpae_iopte prot;
+ long iaext = (s64)iova >> cfg->ias;
+
+ if (WARN_ON(!pgsize || (pgsize & cfg->pgsize_bitmap) != pgsize))
+ return -EINVAL;
+
+ if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+ iaext = ~iaext;
+ if (WARN_ON(iaext || paddr >> cfg->oas))
+ return -ERANGE;
+
+ if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
+ return -EINVAL;
+
+ prot = arm_lpae_prot_to_pte(data, iommu_prot);
+ ret = __arm_lpae_map(data, iova, paddr, pgsize, pgcount, prot, lvl,
+ ptep, gfp, mapped);
+ /*
+ * Synchronise all PTE updates for the new mapping before there's
+ * a chance for anything to kick off a table walk for the new iova.
+ */
+ wmb();
+
+ return ret;
+}
+
+void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
+ arm_lpae_iopte *ptep)
+{
+ arm_lpae_iopte *start, *end;
+ unsigned long table_size;
+
+ if (lvl == data->start_level)
+ table_size = ARM_LPAE_PGD_SIZE(data);
+ else
+ table_size = ARM_LPAE_GRANULE(data);
+
+ start = ptep;
+
+ /* Only leaf entries at the last level */
+ if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+ end = ptep;
+ else
+ end = (void *)ptep + table_size;
+
+ while (ptep != end) {
+ arm_lpae_iopte pte = *ptep++;
+
+ if (!pte || iopte_leaf(pte, lvl, data->iop.fmt))
+ continue;
+
+ __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
+ }
+
+ __arm_lpae_free_pages(start, table_size, &data->iop.cfg, data->iop.cookie);
+}
+
+static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
+ struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t size,
+ arm_lpae_iopte blk_pte, int lvl,
+ arm_lpae_iopte *ptep, size_t pgcount)
+{
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ arm_lpae_iopte pte, *tablep;
+ phys_addr_t blk_paddr;
+ size_t tablesz = ARM_LPAE_GRANULE(data);
+ size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+ int ptes_per_table = ARM_LPAE_PTES_PER_TABLE(data);
+ int i, unmap_idx_start = -1, num_entries = 0, max_entries;
+
+ if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+ return 0;
+
+ tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg, data->iop.cookie);
+ if (!tablep)
+ return 0; /* Bytes unmapped */
+
+ if (size == split_sz) {
+ unmap_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
+ max_entries = ptes_per_table - unmap_idx_start;
+ num_entries = min_t(int, pgcount, max_entries);
+ }
+
+ blk_paddr = iopte_to_paddr(blk_pte, data);
+ pte = iopte_prot(blk_pte);
+
+ for (i = 0; i < ptes_per_table; i++, blk_paddr += split_sz) {
+ /* Unmap! */
+ if (i >= unmap_idx_start && i < (unmap_idx_start + num_entries))
+ continue;
+
+ __arm_lpae_init_pte(data, blk_paddr, pte, lvl, 1, &tablep[i]);
+ }
+
+ pte = arm_lpae_install_table(tablep, ptep, blk_pte, data);
+ if (pte != blk_pte) {
+ __arm_lpae_free_pages(tablep, tablesz, cfg, data->iop.cookie);
+ /*
+ * We may race against someone unmapping another part of this
+ * block, but anything else is invalid. We can't misinterpret
+ * a page entry here since we're never at the last level.
+ */
+ if (iopte_type(pte) != ARM_LPAE_PTE_TYPE_TABLE)
+ return 0;
+
+ tablep = iopte_deref(pte, data);
+ } else if (unmap_idx_start >= 0) {
+ for (i = 0; i < num_entries; i++)
+ io_pgtable_tlb_add_page(&data->iop, gather, iova + i * size, size);
+
+ return num_entries * size;
+ }
+
+ return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl, tablep);
+}
+
+static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
+ struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t size, size_t pgcount,
+ int lvl, arm_lpae_iopte *ptep)
+{
+ arm_lpae_iopte pte;
+ struct io_pgtable *iop = &data->iop;
+ int i = 0, num_entries, max_entries, unmap_idx_start;
+
+ /* Something went horribly wrong and we ran out of page table */
+ if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+ return 0;
+
+ unmap_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
+ ptep += unmap_idx_start;
+ pte = READ_ONCE(*ptep);
+ if (WARN_ON(!pte))
+ return 0;
+
+ /* If the size matches this level, we're in the right place */
+ if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+ max_entries = ARM_LPAE_PTES_PER_TABLE(data) - unmap_idx_start;
+ num_entries = min_t(int, pgcount, max_entries);
+
+ /* Find and handle non-leaf entries */
+ for (i = 0; i < num_entries; i++) {
+ pte = READ_ONCE(ptep[i]);
+ if (WARN_ON(!pte))
+ break;
+
+ if (!iopte_leaf(pte, lvl, iop->fmt)) {
+ __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
+
+ /* Also flush any partial walks */
+ io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
+ ARM_LPAE_GRANULE(data));
+ __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
+ }
+ }
+
+ /* Clear the remaining entries */
+ __arm_lpae_clear_pte(ptep, &iop->cfg, i);
+
+ if (gather && !iommu_iotlb_gather_queued(gather))
+ for (int j = 0; j < i; j++)
+ io_pgtable_tlb_add_page(iop, gather, iova + j * size, size);
+
+ return i * size;
+ } else if (iopte_leaf(pte, lvl, iop->fmt)) {
+ /*
+ * Insert a table at the next level to map the old region,
+ * minus the part we want to unmap
+ */
+ return arm_lpae_split_blk_unmap(data, gather, iova, size, pte,
+ lvl + 1, ptep, pgcount);
+ }
+
+ /* Keep on walkin' */
+ ptep = iopte_deref(pte, data);
+ return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
+}
+
+size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ arm_lpae_iopte *ptep = data->pgd;
+ long iaext = (s64)iova >> cfg->ias;
+
+ if (WARN_ON(!pgsize || (pgsize & cfg->pgsize_bitmap) != pgsize || !pgcount))
+ return 0;
+
+ if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
+ iaext = ~iaext;
+ if (WARN_ON(iaext))
+ return 0;
+
+ return __arm_lpae_unmap(data, gather, iova, pgsize, pgcount,
+ data->start_level, ptep);
+}
+
+static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
+ struct io_pgtable_walk_data *walk_data,
+ arm_lpae_iopte *ptep,
+ int lvl);
+
+struct iova_to_phys_data {
+ arm_lpae_iopte pte;
+ int lvl;
+};
+
+static int visit_iova_to_phys(struct io_pgtable_walk_data *walk_data, int lvl,
+ arm_lpae_iopte *ptep, size_t size)
+{
+ struct iova_to_phys_data *data = walk_data->data;
+ data->pte = *ptep;
+ data->lvl = lvl;
+ return 0;
+}
+
+phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
+ unsigned long iova)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+ struct iova_to_phys_data d;
+ struct io_pgtable_walk_data walk_data = {
+ .data = &d,
+ .visit = visit_iova_to_phys,
+ .addr = iova,
+ .end = iova + 1,
+ };
+ int ret;
+
+ ret = __arm_lpae_iopte_walk(data, &walk_data, data->pgd, data->start_level);
+ if (ret)
+ return 0;
+
+ iova &= (ARM_LPAE_BLOCK_SIZE(d.lvl, data) - 1);
+ return iopte_to_paddr(d.pte, data) | iova;
+}
+
+static int visit_pgtable_walk(struct io_pgtable_walk_data *walk_data, int lvl,
+ arm_lpae_iopte *ptep, size_t size)
+{
+ struct arm_lpae_io_pgtable_walk_data *data = walk_data->data;
+ data->ptes[data->level++] = *ptep;
+ return 0;
+}
+
+int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova, void *wd)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+ struct io_pgtable_walk_data walk_data = {
+ .data = wd,
+ .visit = visit_pgtable_walk,
+ .addr = iova,
+ .end = iova + 1,
+ };
+
+ ((struct arm_lpae_io_pgtable_walk_data *)wd)->level = 0;
+
+ return __arm_lpae_iopte_walk(data, &walk_data, data->pgd, data->start_level);
+}
+
+static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
+ struct io_pgtable_walk_data *walk_data,
+ arm_lpae_iopte *ptep, int lvl)
+{
+ struct io_pgtable *iop = &data->iop;
+ arm_lpae_iopte pte = READ_ONCE(*ptep);
+
+ size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+ int ret = walk_data->visit(walk_data, lvl, ptep, size);
+ if (ret)
+ return ret;
+
+ if (iopte_leaf(pte, lvl, iop->fmt)) {
+ walk_data->addr += size;
+ return 0;
+ }
+
+ if (!iopte_table(pte, lvl)) {
+ return -EINVAL;
+ }
+
+ ptep = iopte_deref(pte, data);
+ return __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
+}
+
+static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
+ struct io_pgtable_walk_data *walk_data,
+ arm_lpae_iopte *ptep,
+ int lvl)
+{
+ u32 idx;
+ int max_entries, ret;
+
+ if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+ return -EINVAL;
+
+ if (lvl == data->start_level)
+ max_entries = ARM_LPAE_PGD_SIZE(data) / sizeof(arm_lpae_iopte);
+ else
+ max_entries = ARM_LPAE_PTES_PER_TABLE(data);
+
+ for (idx = ARM_LPAE_LVL_IDX(walk_data->addr, lvl, data);
+ (idx < max_entries) && (walk_data->addr < walk_data->end); ++idx) {
+ ret = io_pgtable_visit(data, walk_data, ptep + idx, lvl);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int visit_dirty(struct io_pgtable_walk_data *walk_data, int lvl,
+ arm_lpae_iopte *ptep, size_t size)
+{
+ struct iommu_dirty_bitmap *dirty = walk_data->data;
+
+ if (!iopte_leaf(*ptep, lvl, walk_data->iop->fmt))
+ return 0;
+
+ if (iopte_writeable_dirty(*ptep)) {
+ iommu_dirty_bitmap_record(dirty, walk_data->addr, size);
+ if (!(walk_data->flags & IOMMU_DIRTY_NO_CLEAR))
+ iopte_set_writeable_clean(ptep);
+ }
+
+ return 0;
+}
+
+int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
+ unsigned long iova, size_t size,
+ unsigned long flags,
+ struct iommu_dirty_bitmap *dirty)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+ struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ struct io_pgtable_walk_data walk_data = {
+ .iop = &data->iop,
+ .data = dirty,
+ .visit = visit_dirty,
+ .flags = flags,
+ .addr = iova,
+ .end = iova + size,
+ };
+ arm_lpae_iopte *ptep = data->pgd;
+ int lvl = data->start_level;
+
+ if (WARN_ON(!size))
+ return -EINVAL;
+ if (WARN_ON((iova + size - 1) & ~(BIT(cfg->ias) - 1)))
+ return -EINVAL;
+ if (data->iop.fmt != ARM_64_LPAE_S1)
+ return -EINVAL;
+
+ return __arm_lpae_iopte_walk(data, &walk_data, ptep, lvl);
+}
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6739e1fa54ec..cb4eb513adbf 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* CPU-agnostic ARM page table allocator.
+ * Host-specific functions. The rest is in io-pgtable-arm-common.c.
*
* Copyright (C) 2014 ARM Limited
*
@@ -11,7 +12,7 @@
#include <linux/atomic.h>
#include <linux/bitops.h>
-#include <linux/io-pgtable.h>
+#include <linux/io-pgtable-arm.h>
#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/slab.h>
@@ -20,195 +21,33 @@
#include <asm/barrier.h>
-#include "io-pgtable-arm.h"
#include "iommu-pages.h"
#define ARM_LPAE_MAX_ADDR_BITS 52
#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
-#define ARM_LPAE_MAX_LEVELS 4
-/* Struct accessors */
-#define io_pgtable_to_data(x) \
- container_of((x), struct arm_lpae_io_pgtable, iop)
-
-#define io_pgtable_ops_to_data(x) \
- io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
-
-/*
- * Calculate the right shift amount to get to the portion describing level l
- * in a virtual address mapped by the pagetable in d.
- */
-#define ARM_LPAE_LVL_SHIFT(l,d) \
- (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
- ilog2(sizeof(arm_lpae_iopte)))
-
-#define ARM_LPAE_GRANULE(d) \
- (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
-#define ARM_LPAE_PGD_SIZE(d) \
- (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
-
-#define ARM_LPAE_PTES_PER_TABLE(d) \
- (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
-
-/*
- * Calculate the index at level l used to map virtual address a using the
- * pagetable in d.
- */
-#define ARM_LPAE_PGD_IDX(l,d) \
- ((l) == (d)->start_level ? (d)->pgd_bits - (d)->bits_per_level : 0)
-
-#define ARM_LPAE_LVL_IDX(a,l,d) \
- (((u64)(a) >> ARM_LPAE_LVL_SHIFT(l,d)) & \
- ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
-
-/* Calculate the block/page mapping size at level l for pagetable in d. */
-#define ARM_LPAE_BLOCK_SIZE(l,d) (1ULL << ARM_LPAE_LVL_SHIFT(l,d))
-
-/* Page table bits */
-#define ARM_LPAE_PTE_TYPE_SHIFT 0
-#define ARM_LPAE_PTE_TYPE_MASK 0x3
-
-#define ARM_LPAE_PTE_TYPE_BLOCK 1
-#define ARM_LPAE_PTE_TYPE_TABLE 3
-#define ARM_LPAE_PTE_TYPE_PAGE 3
-
-#define ARM_LPAE_PTE_ADDR_MASK GENMASK_ULL(47,12)
-
-#define ARM_LPAE_PTE_NSTABLE (((arm_lpae_iopte)1) << 63)
-#define ARM_LPAE_PTE_XN (((arm_lpae_iopte)3) << 53)
-#define ARM_LPAE_PTE_DBM (((arm_lpae_iopte)1) << 51)
-#define ARM_LPAE_PTE_AF (((arm_lpae_iopte)1) << 10)
-#define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
-#define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
-#define ARM_LPAE_PTE_SH_IS (((arm_lpae_iopte)3) << 8)
-#define ARM_LPAE_PTE_NS (((arm_lpae_iopte)1) << 5)
-#define ARM_LPAE_PTE_VALID (((arm_lpae_iopte)1) << 0)
-
-#define ARM_LPAE_PTE_ATTR_LO_MASK (((arm_lpae_iopte)0x3ff) << 2)
-/* Ignore the contiguous bit for block splitting */
-#define ARM_LPAE_PTE_ATTR_HI_MASK (ARM_LPAE_PTE_XN | ARM_LPAE_PTE_DBM)
-#define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK | \
- ARM_LPAE_PTE_ATTR_HI_MASK)
-/* Software bit for solving coherency races */
-#define ARM_LPAE_PTE_SW_SYNC (((arm_lpae_iopte)1) << 55)
-
-/* Stage-1 PTE */
-#define ARM_LPAE_PTE_AP_UNPRIV (((arm_lpae_iopte)1) << 6)
-#define ARM_LPAE_PTE_AP_RDONLY_BIT 7
-#define ARM_LPAE_PTE_AP_RDONLY (((arm_lpae_iopte)1) << \
- ARM_LPAE_PTE_AP_RDONLY_BIT)
-#define ARM_LPAE_PTE_AP_WR_CLEAN_MASK (ARM_LPAE_PTE_AP_RDONLY | \
- ARM_LPAE_PTE_DBM)
-#define ARM_LPAE_PTE_ATTRINDX_SHIFT 2
-#define ARM_LPAE_PTE_nG (((arm_lpae_iopte)1) << 11)
-
-/* Stage-2 PTE */
-#define ARM_LPAE_PTE_HAP_FAULT (((arm_lpae_iopte)0) << 6)
-#define ARM_LPAE_PTE_HAP_READ (((arm_lpae_iopte)1) << 6)
-#define ARM_LPAE_PTE_HAP_WRITE (((arm_lpae_iopte)2) << 6)
-#define ARM_LPAE_PTE_MEMATTR_OIWB (((arm_lpae_iopte)0xf) << 2)
-#define ARM_LPAE_PTE_MEMATTR_NC (((arm_lpae_iopte)0x5) << 2)
-#define ARM_LPAE_PTE_MEMATTR_DEV (((arm_lpae_iopte)0x1) << 2)
-
-/* Register bits */
-#define ARM_LPAE_VTCR_SL0_MASK 0x3
-
-#define ARM_LPAE_TCR_T0SZ_SHIFT 0
-
-#define ARM_LPAE_VTCR_PS_SHIFT 16
-#define ARM_LPAE_VTCR_PS_MASK 0x7
-
-#define ARM_LPAE_MAIR_ATTR_SHIFT(n) ((n) << 3)
-#define ARM_LPAE_MAIR_ATTR_MASK 0xff
-#define ARM_LPAE_MAIR_ATTR_DEVICE 0x04
-#define ARM_LPAE_MAIR_ATTR_NC 0x44
-#define ARM_LPAE_MAIR_ATTR_INC_OWBRWA 0xf4
-#define ARM_LPAE_MAIR_ATTR_WBRWA 0xff
-#define ARM_LPAE_MAIR_ATTR_IDX_NC 0
-#define ARM_LPAE_MAIR_ATTR_IDX_CACHE 1
-#define ARM_LPAE_MAIR_ATTR_IDX_DEV 2
-#define ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE 3
-
-#define ARM_MALI_LPAE_TTBR_ADRMODE_TABLE (3u << 0)
-#define ARM_MALI_LPAE_TTBR_READ_INNER BIT(2)
-#define ARM_MALI_LPAE_TTBR_SHARE_OUTER BIT(4)
-
-#define ARM_MALI_LPAE_MEMATTR_IMP_DEF 0x88ULL
-#define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL
-
-/* IOPTE accessors */
-#define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
-
-#define iopte_type(pte) \
- (((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
-
-#define iopte_prot(pte) ((pte) & ARM_LPAE_PTE_ATTR_MASK)
-
-#define iopte_writeable_dirty(pte) \
- (((pte) & ARM_LPAE_PTE_AP_WR_CLEAN_MASK) == ARM_LPAE_PTE_DBM)
-
-#define iopte_set_writeable_clean(ptep) \
- set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)(ptep))
-
-struct arm_lpae_io_pgtable {
- struct io_pgtable iop;
-
- int pgd_bits;
- int start_level;
- int bits_per_level;
-
- void *pgd;
-};
-
-typedef u64 arm_lpae_iopte;
-
-static inline bool iopte_leaf(arm_lpae_iopte pte, int lvl,
- enum io_pgtable_fmt fmt)
-{
- if (lvl == (ARM_LPAE_MAX_LEVELS - 1) && fmt != ARM_MALI_LPAE)
- return iopte_type(pte) == ARM_LPAE_PTE_TYPE_PAGE;
-
- return iopte_type(pte) == ARM_LPAE_PTE_TYPE_BLOCK;
-}
+static bool selftest_running = false;
-static inline bool iopte_table(arm_lpae_iopte pte, int lvl)
+int arm_lpae_map_exists(void)
{
- if (lvl == (ARM_LPAE_MAX_LEVELS - 1))
- return false;
- return iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE;
+ WARN_ON(!selftest_running);
+ return -EEXIST;
}
-static arm_lpae_iopte paddr_to_iopte(phys_addr_t paddr,
- struct arm_lpae_io_pgtable *data)
+int arm_lpae_unmap_empty(void)
{
- arm_lpae_iopte pte = paddr;
-
- /* Of the bits which overlap, either 51:48 or 15:12 are always RES0 */
- return (pte | (pte >> (48 - 12))) & ARM_LPAE_PTE_ADDR_MASK;
+ WARN_ON(!selftest_running);
+ return -EEXIST;
}
-static phys_addr_t iopte_to_paddr(arm_lpae_iopte pte,
- struct arm_lpae_io_pgtable *data)
-{
- u64 paddr = pte & ARM_LPAE_PTE_ADDR_MASK;
-
- if (ARM_LPAE_GRANULE(data) < SZ_64K)
- return paddr;
-
- /* Rotate the packed high-order bits back to the top */
- return (paddr | (paddr << (48 - 12))) & (ARM_LPAE_PTE_ADDR_MASK << 4);
-}
-
-static bool selftest_running = false;
-
static dma_addr_t __arm_lpae_dma_addr(void *pages)
{
return (dma_addr_t)virt_to_phys(pages);
}
-static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
- struct io_pgtable_cfg *cfg,
- void *cookie)
+void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
+ struct io_pgtable_cfg *cfg,
+ void *cookie)
{
struct device *dev = cfg->iommu_dev;
int order = get_order(size);
@@ -253,9 +92,9 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
return NULL;
}
-static void __arm_lpae_free_pages(void *pages, size_t size,
- struct io_pgtable_cfg *cfg,
- void *cookie)
+void __arm_lpae_free_pages(void *pages, size_t size,
+ struct io_pgtable_cfg *cfg,
+ void *cookie)
{
if (!cfg->coherent_walk)
dma_unmap_single(cfg->iommu_dev, __arm_lpae_dma_addr(pages),
@@ -267,300 +106,13 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
iommu_free_pages(pages, get_order(size));
}
-static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
- struct io_pgtable_cfg *cfg)
+void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
+ struct io_pgtable_cfg *cfg)
{
dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
sizeof(*ptep) * num_entries, DMA_TO_DEVICE);
}
-static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
-{
- for (int i = 0; i < num_entries; i++)
- ptep[i] = 0;
-
- if (!cfg->coherent_walk && num_entries)
- __arm_lpae_sync_pte(ptep, num_entries, cfg);
-}
-
-static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
- struct iommu_iotlb_gather *gather,
- unsigned long iova, size_t size, size_t pgcount,
- int lvl, arm_lpae_iopte *ptep);
-
-static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
- phys_addr_t paddr, arm_lpae_iopte prot,
- int lvl, int num_entries, arm_lpae_iopte *ptep)
-{
- arm_lpae_iopte pte = prot;
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
- size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
- int i;
-
- if (data->iop.fmt != ARM_MALI_LPAE && lvl == ARM_LPAE_MAX_LEVELS - 1)
- pte |= ARM_LPAE_PTE_TYPE_PAGE;
- else
- pte |= ARM_LPAE_PTE_TYPE_BLOCK;
-
- for (i = 0; i < num_entries; i++)
- ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
-
- if (!cfg->coherent_walk)
- __arm_lpae_sync_pte(ptep, num_entries, cfg);
-}
-
-static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
- unsigned long iova, phys_addr_t paddr,
- arm_lpae_iopte prot, int lvl, int num_entries,
- arm_lpae_iopte *ptep)
-{
- int i;
-
- for (i = 0; i < num_entries; i++)
- if (iopte_leaf(ptep[i], lvl, data->iop.fmt)) {
- /* We require an unmap first */
- WARN_ON(!selftest_running);
- return -EEXIST;
- } else if (iopte_type(ptep[i]) == ARM_LPAE_PTE_TYPE_TABLE) {
- /*
- * We need to unmap and free the old table before
- * overwriting it with a block entry.
- */
- arm_lpae_iopte *tblp;
- size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
-
- tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
- if (__arm_lpae_unmap(data, NULL, iova + i * sz, sz, 1,
- lvl, tblp) != sz) {
- WARN_ON(1);
- return -EINVAL;
- }
- }
-
- __arm_lpae_init_pte(data, paddr, prot, lvl, num_entries, ptep);
- return 0;
-}
-
-static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
- arm_lpae_iopte *ptep,
- arm_lpae_iopte curr,
- struct arm_lpae_io_pgtable *data)
-{
- arm_lpae_iopte old, new;
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
-
- new = paddr_to_iopte(__pa(table), data) | ARM_LPAE_PTE_TYPE_TABLE;
- if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
- new |= ARM_LPAE_PTE_NSTABLE;
-
- /*
- * Ensure the table itself is visible before its PTE can be.
- * Whilst we could get away with cmpxchg64_release below, this
- * doesn't have any ordering semantics when !CONFIG_SMP.
- */
- dma_wmb();
-
- old = cmpxchg64_relaxed(ptep, curr, new);
-
- if (cfg->coherent_walk || (old & ARM_LPAE_PTE_SW_SYNC))
- return old;
-
- /* Even if it's not ours, there's no point waiting; just kick it */
- __arm_lpae_sync_pte(ptep, 1, cfg);
- if (old == curr)
- WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
-
- return old;
-}
-
-static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
- phys_addr_t paddr, size_t size, size_t pgcount,
- arm_lpae_iopte prot, int lvl, arm_lpae_iopte *ptep,
- gfp_t gfp, size_t *mapped)
-{
- arm_lpae_iopte *cptep, pte;
- size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
- size_t tblsz = ARM_LPAE_GRANULE(data);
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
- int ret = 0, num_entries, max_entries, map_idx_start;
-
- /* Find our entry at the current level */
- map_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
- ptep += map_idx_start;
-
- /* If we can install a leaf entry at this level, then do so */
- if (size == block_size) {
- max_entries = ARM_LPAE_PTES_PER_TABLE(data) - map_idx_start;
- num_entries = min_t(int, pgcount, max_entries);
- ret = arm_lpae_init_pte(data, iova, paddr, prot, lvl, num_entries, ptep);
- if (!ret)
- *mapped += num_entries * size;
-
- return ret;
- }
-
- /* We can't allocate tables at the final level */
- if (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1))
- return -EINVAL;
-
- /* Grab a pointer to the next level */
- pte = READ_ONCE(*ptep);
- if (!pte) {
- cptep = __arm_lpae_alloc_pages(tblsz, gfp, cfg, data->iop.cookie);
- if (!cptep)
- return -ENOMEM;
-
- pte = arm_lpae_install_table(cptep, ptep, 0, data);
- if (pte)
- __arm_lpae_free_pages(cptep, tblsz, cfg, data->iop.cookie);
- } else if (!cfg->coherent_walk && !(pte & ARM_LPAE_PTE_SW_SYNC)) {
- __arm_lpae_sync_pte(ptep, 1, cfg);
- }
-
- if (pte && !iopte_leaf(pte, lvl, data->iop.fmt)) {
- cptep = iopte_deref(pte, data);
- } else if (pte) {
- /* We require an unmap first */
- WARN_ON(!selftest_running);
- return -EEXIST;
- }
-
- /* Rinse, repeat */
- return __arm_lpae_map(data, iova, paddr, size, pgcount, prot, lvl + 1,
- cptep, gfp, mapped);
-}
-
-static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
- int prot)
-{
- arm_lpae_iopte pte;
-
- if (data->iop.fmt == ARM_64_LPAE_S1 ||
- data->iop.fmt == ARM_32_LPAE_S1) {
- pte = ARM_LPAE_PTE_nG;
- if (!(prot & IOMMU_WRITE) && (prot & IOMMU_READ))
- pte |= ARM_LPAE_PTE_AP_RDONLY;
- else if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_HD)
- pte |= ARM_LPAE_PTE_DBM;
- if (!(prot & IOMMU_PRIV))
- pte |= ARM_LPAE_PTE_AP_UNPRIV;
- } else {
- pte = ARM_LPAE_PTE_HAP_FAULT;
- if (prot & IOMMU_READ)
- pte |= ARM_LPAE_PTE_HAP_READ;
- if (prot & IOMMU_WRITE)
- pte |= ARM_LPAE_PTE_HAP_WRITE;
- }
-
- /*
- * Note that this logic is structured to accommodate Mali LPAE
- * having stage-1-like attributes but stage-2-like permissions.
- */
- if (data->iop.fmt == ARM_64_LPAE_S2 ||
- data->iop.fmt == ARM_32_LPAE_S2) {
- if (prot & IOMMU_MMIO)
- pte |= ARM_LPAE_PTE_MEMATTR_DEV;
- else if (prot & IOMMU_CACHE)
- pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
- else
- pte |= ARM_LPAE_PTE_MEMATTR_NC;
- } else {
- if (prot & IOMMU_MMIO)
- pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
- << ARM_LPAE_PTE_ATTRINDX_SHIFT);
- else if (prot & IOMMU_CACHE)
- pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
- << ARM_LPAE_PTE_ATTRINDX_SHIFT);
- }
-
- /*
- * Also Mali has its own notions of shareability wherein its Inner
- * domain covers the cores within the GPU, and its Outer domain is
- * "outside the GPU" (i.e. either the Inner or System domain in CPU
- * terms, depending on coherency).
- */
- if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE)
- pte |= ARM_LPAE_PTE_SH_IS;
- else
- pte |= ARM_LPAE_PTE_SH_OS;
-
- if (prot & IOMMU_NOEXEC)
- pte |= ARM_LPAE_PTE_XN;
-
- if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_NS)
- pte |= ARM_LPAE_PTE_NS;
-
- if (data->iop.fmt != ARM_MALI_LPAE)
- pte |= ARM_LPAE_PTE_AF;
-
- return pte;
-}
-
-static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
- phys_addr_t paddr, size_t pgsize, size_t pgcount,
- int iommu_prot, gfp_t gfp, size_t *mapped)
-{
- struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
- arm_lpae_iopte *ptep = data->pgd;
- int ret, lvl = data->start_level;
- arm_lpae_iopte prot;
- long iaext = (s64)iova >> cfg->ias;
-
- if (WARN_ON(!pgsize || (pgsize & cfg->pgsize_bitmap) != pgsize))
- return -EINVAL;
-
- if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
- iaext = ~iaext;
- if (WARN_ON(iaext || paddr >> cfg->oas))
- return -ERANGE;
-
- if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))
- return -EINVAL;
-
- prot = arm_lpae_prot_to_pte(data, iommu_prot);
- ret = __arm_lpae_map(data, iova, paddr, pgsize, pgcount, prot, lvl,
- ptep, gfp, mapped);
- /*
- * Synchronise all PTE updates for the new mapping before there's
- * a chance for anything to kick off a table walk for the new iova.
- */
- wmb();
-
- return ret;
-}
-
-static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
- arm_lpae_iopte *ptep)
-{
- arm_lpae_iopte *start, *end;
- unsigned long table_size;
-
- if (lvl == data->start_level)
- table_size = ARM_LPAE_PGD_SIZE(data);
- else
- table_size = ARM_LPAE_GRANULE(data);
-
- start = ptep;
-
- /* Only leaf entries at the last level */
- if (lvl == ARM_LPAE_MAX_LEVELS - 1)
- end = ptep;
- else
- end = (void *)ptep + table_size;
-
- while (ptep != end) {
- arm_lpae_iopte pte = *ptep++;
-
- if (!pte || iopte_leaf(pte, lvl, data->iop.fmt))
- continue;
-
- __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
- }
-
- __arm_lpae_free_pages(start, table_size, &data->iop.cfg, data->iop.cookie);
-}
-
static void arm_lpae_free_pgtable(struct io_pgtable *iop)
{
struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iop);
@@ -569,319 +121,6 @@ static void arm_lpae_free_pgtable(struct io_pgtable *iop)
kfree(data);
}
-static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
- struct iommu_iotlb_gather *gather,
- unsigned long iova, size_t size,
- arm_lpae_iopte blk_pte, int lvl,
- arm_lpae_iopte *ptep, size_t pgcount)
-{
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
- arm_lpae_iopte pte, *tablep;
- phys_addr_t blk_paddr;
- size_t tablesz = ARM_LPAE_GRANULE(data);
- size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
- int ptes_per_table = ARM_LPAE_PTES_PER_TABLE(data);
- int i, unmap_idx_start = -1, num_entries = 0, max_entries;
-
- if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
- return 0;
-
- tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg, data->iop.cookie);
- if (!tablep)
- return 0; /* Bytes unmapped */
-
- if (size == split_sz) {
- unmap_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
- max_entries = ptes_per_table - unmap_idx_start;
- num_entries = min_t(int, pgcount, max_entries);
- }
-
- blk_paddr = iopte_to_paddr(blk_pte, data);
- pte = iopte_prot(blk_pte);
-
- for (i = 0; i < ptes_per_table; i++, blk_paddr += split_sz) {
- /* Unmap! */
- if (i >= unmap_idx_start && i < (unmap_idx_start + num_entries))
- continue;
-
- __arm_lpae_init_pte(data, blk_paddr, pte, lvl, 1, &tablep[i]);
- }
-
- pte = arm_lpae_install_table(tablep, ptep, blk_pte, data);
- if (pte != blk_pte) {
- __arm_lpae_free_pages(tablep, tablesz, cfg, data->iop.cookie);
- /*
- * We may race against someone unmapping another part of this
- * block, but anything else is invalid. We can't misinterpret
- * a page entry here since we're never at the last level.
- */
- if (iopte_type(pte) != ARM_LPAE_PTE_TYPE_TABLE)
- return 0;
-
- tablep = iopte_deref(pte, data);
- } else if (unmap_idx_start >= 0) {
- for (i = 0; i < num_entries; i++)
- io_pgtable_tlb_add_page(&data->iop, gather, iova + i * size, size);
-
- return num_entries * size;
- }
-
- return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl, tablep);
-}
-
-static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
- struct iommu_iotlb_gather *gather,
- unsigned long iova, size_t size, size_t pgcount,
- int lvl, arm_lpae_iopte *ptep)
-{
- arm_lpae_iopte pte;
- struct io_pgtable *iop = &data->iop;
- int i = 0, num_entries, max_entries, unmap_idx_start;
-
- /* Something went horribly wrong and we ran out of page table */
- if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
- return 0;
-
- unmap_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
- ptep += unmap_idx_start;
- pte = READ_ONCE(*ptep);
- if (WARN_ON(!pte))
- return 0;
-
- /* If the size matches this level, we're in the right place */
- if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
- max_entries = ARM_LPAE_PTES_PER_TABLE(data) - unmap_idx_start;
- num_entries = min_t(int, pgcount, max_entries);
-
- /* Find and handle non-leaf entries */
- for (i = 0; i < num_entries; i++) {
- pte = READ_ONCE(ptep[i]);
- if (WARN_ON(!pte))
- break;
-
- if (!iopte_leaf(pte, lvl, iop->fmt)) {
- __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
-
- /* Also flush any partial walks */
- io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
- ARM_LPAE_GRANULE(data));
- __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
- }
- }
-
- /* Clear the remaining entries */
- __arm_lpae_clear_pte(ptep, &iop->cfg, i);
-
- if (gather && !iommu_iotlb_gather_queued(gather))
- for (int j = 0; j < i; j++)
- io_pgtable_tlb_add_page(iop, gather, iova + j * size, size);
-
- return i * size;
- } else if (iopte_leaf(pte, lvl, iop->fmt)) {
- /*
- * Insert a table at the next level to map the old region,
- * minus the part we want to unmap
- */
- return arm_lpae_split_blk_unmap(data, gather, iova, size, pte,
- lvl + 1, ptep, pgcount);
- }
-
- /* Keep on walkin' */
- ptep = iopte_deref(pte, data);
- return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
-}
-
-static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
- size_t pgsize, size_t pgcount,
- struct iommu_iotlb_gather *gather)
-{
- struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
- arm_lpae_iopte *ptep = data->pgd;
- long iaext = (s64)iova >> cfg->ias;
-
- if (WARN_ON(!pgsize || (pgsize & cfg->pgsize_bitmap) != pgsize || !pgcount))
- return 0;
-
- if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1)
- iaext = ~iaext;
- if (WARN_ON(iaext))
- return 0;
-
- return __arm_lpae_unmap(data, gather, iova, pgsize, pgcount,
- data->start_level, ptep);
-}
-
-struct io_pgtable_walk_data {
- struct io_pgtable *iop;
- void *data;
- int (*visit)(struct io_pgtable_walk_data *walk_data, int lvl,
- arm_lpae_iopte *ptep, size_t size);
- unsigned long flags;
- u64 addr;
- const u64 end;
-};
-
-static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
- struct io_pgtable_walk_data *walk_data,
- arm_lpae_iopte *ptep,
- int lvl);
-
-struct iova_to_phys_data {
- arm_lpae_iopte pte;
- int lvl;
-};
-
-static int visit_iova_to_phys(struct io_pgtable_walk_data *walk_data, int lvl,
- arm_lpae_iopte *ptep, size_t size)
-{
- struct iova_to_phys_data *data = walk_data->data;
- data->pte = *ptep;
- data->lvl = lvl;
- return 0;
-}
-
-static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
- unsigned long iova)
-{
- struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
- struct iova_to_phys_data d;
- struct io_pgtable_walk_data walk_data = {
- .data = &d,
- .visit = visit_iova_to_phys,
- .addr = iova,
- .end = iova + 1,
- };
- int ret;
-
- ret = __arm_lpae_iopte_walk(data, &walk_data, data->pgd, data->start_level);
- if (ret)
- return 0;
-
- iova &= (ARM_LPAE_BLOCK_SIZE(d.lvl, data) - 1);
- return iopte_to_paddr(d.pte, data) | iova;
-}
-
-static int visit_pgtable_walk(struct io_pgtable_walk_data *walk_data, int lvl,
- arm_lpae_iopte *ptep, size_t size)
-{
- struct arm_lpae_io_pgtable_walk_data *data = walk_data->data;
- data->ptes[data->level++] = *ptep;
- return 0;
-}
-
-static int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova,
- void *wd)
-{
- struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
- struct io_pgtable_walk_data walk_data = {
- .data = wd,
- .visit = visit_pgtable_walk,
- .addr = iova,
- .end = iova + 1,
- };
-
- ((struct arm_lpae_io_pgtable_walk_data *)wd)->level = 0;
-
- return __arm_lpae_iopte_walk(data, &walk_data, data->pgd, data->start_level);
-}
-
-static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
- struct io_pgtable_walk_data *walk_data,
- arm_lpae_iopte *ptep, int lvl)
-{
- struct io_pgtable *iop = &data->iop;
- arm_lpae_iopte pte = READ_ONCE(*ptep);
-
- size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);
- int ret = walk_data->visit(walk_data, lvl, ptep, size);
- if (ret)
- return ret;
-
- if (iopte_leaf(pte, lvl, iop->fmt)) {
- walk_data->addr += size;
- return 0;
- }
-
- if (!iopte_table(pte, lvl)) {
- return -EINVAL;
- }
-
- ptep = iopte_deref(pte, data);
- return __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
-}
-
-static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
- struct io_pgtable_walk_data *walk_data,
- arm_lpae_iopte *ptep,
- int lvl)
-{
- u32 idx;
- int max_entries, ret;
-
- if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
- return -EINVAL;
-
- if (lvl == data->start_level)
- max_entries = ARM_LPAE_PGD_SIZE(data) / sizeof(arm_lpae_iopte);
- else
- max_entries = ARM_LPAE_PTES_PER_TABLE(data);
-
- for (idx = ARM_LPAE_LVL_IDX(walk_data->addr, lvl, data);
- (idx < max_entries) && (walk_data->addr < walk_data->end); ++idx) {
- ret = io_pgtable_visit(data, walk_data, ptep + idx, lvl);
- if (ret)
- return ret;
- }
-
- return 0;
-}
-
-static int visit_dirty(struct io_pgtable_walk_data *walk_data, int lvl,
- arm_lpae_iopte *ptep, size_t size)
-{
- struct iommu_dirty_bitmap *dirty = walk_data->data;
-
- if (!iopte_leaf(*ptep, lvl, walk_data->iop->fmt))
- return 0;
-
- if (iopte_writeable_dirty(*ptep)) {
- iommu_dirty_bitmap_record(dirty, walk_data->addr, size);
- if (!(walk_data->flags & IOMMU_DIRTY_NO_CLEAR))
- iopte_set_writeable_clean(ptep);
- }
-
- return 0;
-}
-
-static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
- unsigned long iova, size_t size,
- unsigned long flags,
- struct iommu_dirty_bitmap *dirty)
-{
- struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
- struct io_pgtable_cfg *cfg = &data->iop.cfg;
- struct io_pgtable_walk_data walk_data = {
- .iop = &data->iop,
- .data = dirty,
- .visit = visit_dirty,
- .flags = flags,
- .addr = iova,
- .end = iova + size,
- };
- arm_lpae_iopte *ptep = data->pgd;
- int lvl = data->start_level;
-
- if (WARN_ON(!size))
- return -EINVAL;
- if (WARN_ON((iova + size - 1) & ~(BIT(cfg->ias) - 1)))
- return -EINVAL;
- if (data->iop.fmt != ARM_64_LPAE_S1)
- return -EINVAL;
-
- return __arm_lpae_iopte_walk(data, &walk_data, ptep, lvl);
-}
-
static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
{
unsigned long granule, page_sizes;
diff --git a/drivers/iommu/io-pgtable-arm.h b/drivers/iommu/io-pgtable-arm.h
deleted file mode 100644
index ba7cfdf7afa0..000000000000
--- a/drivers/iommu/io-pgtable-arm.h
+++ /dev/null
@@ -1,30 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-#ifndef IO_PGTABLE_ARM_H_
-#define IO_PGTABLE_ARM_H_
-
-#define ARM_LPAE_TCR_TG0_4K 0
-#define ARM_LPAE_TCR_TG0_64K 1
-#define ARM_LPAE_TCR_TG0_16K 2
-
-#define ARM_LPAE_TCR_TG1_16K 1
-#define ARM_LPAE_TCR_TG1_4K 2
-#define ARM_LPAE_TCR_TG1_64K 3
-
-#define ARM_LPAE_TCR_SH_NS 0
-#define ARM_LPAE_TCR_SH_OS 2
-#define ARM_LPAE_TCR_SH_IS 3
-
-#define ARM_LPAE_TCR_RGN_NC 0
-#define ARM_LPAE_TCR_RGN_WBWA 1
-#define ARM_LPAE_TCR_RGN_WT 2
-#define ARM_LPAE_TCR_RGN_WB 3
-
-#define ARM_LPAE_TCR_PS_32_BIT 0x0ULL
-#define ARM_LPAE_TCR_PS_36_BIT 0x1ULL
-#define ARM_LPAE_TCR_PS_40_BIT 0x2ULL
-#define ARM_LPAE_TCR_PS_42_BIT 0x3ULL
-#define ARM_LPAE_TCR_PS_44_BIT 0x4ULL
-#define ARM_LPAE_TCR_PS_48_BIT 0x5ULL
-#define ARM_LPAE_TCR_PS_52_BIT 0x6ULL
-
-#endif /* IO_PGTABLE_ARM_H_ */
diff --git a/include/linux/io-pgtable-arm.h b/include/linux/io-pgtable-arm.h
new file mode 100644
index 000000000000..1f56dabca18c
--- /dev/null
+++ b/include/linux/io-pgtable-arm.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef IO_PGTABLE_ARM_H_
+#define IO_PGTABLE_ARM_H_
+
+#include <linux/io-pgtable.h>
+
+typedef u64 arm_lpae_iopte;
+
+struct arm_lpae_io_pgtable {
+ struct io_pgtable iop;
+
+ int pgd_bits;
+ int start_level;
+ int bits_per_level;
+
+ void *pgd;
+};
+
+struct io_pgtable_walk_data {
+ struct io_pgtable *iop;
+ void *data;
+ int (*visit)(struct io_pgtable_walk_data *walk_data, int lvl,
+ arm_lpae_iopte *ptep, size_t size);
+ unsigned long flags;
+ u64 addr;
+ const u64 end;
+};
+
+/* Struct accessors */
+#define io_pgtable_to_data(x) \
+ container_of((x), struct arm_lpae_io_pgtable, iop)
+
+#define io_pgtable_ops_to_data(x) \
+ io_pgtable_to_data(io_pgtable_ops_to_pgtable(x))
+
+/*
+ * Calculate the right shift amount to get to the portion describing level l
+ * in a virtual address mapped by the pagetable in d.
+ */
+#define ARM_LPAE_LVL_SHIFT(l,d) \
+ (((ARM_LPAE_MAX_LEVELS - (l)) * (d)->bits_per_level) + \
+ ilog2(sizeof(arm_lpae_iopte)))
+
+#define ARM_LPAE_GRANULE(d) \
+ (sizeof(arm_lpae_iopte) << (d)->bits_per_level)
+#define ARM_LPAE_PGD_SIZE(d) \
+ (sizeof(arm_lpae_iopte) << (d)->pgd_bits)
+
+#define ARM_LPAE_PTES_PER_TABLE(d) \
+ (ARM_LPAE_GRANULE(d) >> ilog2(sizeof(arm_lpae_iopte)))
+
+/*
+ * Calculate the index at level l used to map virtual address a using the
+ * pagetable in d.
+ */
+#define ARM_LPAE_PGD_IDX(l,d) \
+ ((l) == (d)->start_level ? (d)->pgd_bits - (d)->bits_per_level : 0)
+
+#define ARM_LPAE_LVL_IDX(a,l,d) \
+ (((u64)(a) >> ARM_LPAE_LVL_SHIFT(l,d)) & \
+ ((1 << ((d)->bits_per_level + ARM_LPAE_PGD_IDX(l,d))) - 1))
+
+/* Calculate the block/page mapping size at level l for pagetable in d. */
+#define ARM_LPAE_BLOCK_SIZE(l,d) (1ULL << ARM_LPAE_LVL_SHIFT(l,d))
+
+/* Page table bits */
+#define ARM_LPAE_PTE_TYPE_SHIFT 0
+#define ARM_LPAE_PTE_TYPE_MASK 0x3
+
+#define ARM_LPAE_PTE_TYPE_BLOCK 1
+#define ARM_LPAE_PTE_TYPE_TABLE 3
+#define ARM_LPAE_PTE_TYPE_PAGE 3
+
+#define ARM_LPAE_PTE_ADDR_MASK GENMASK_ULL(47,12)
+
+#define ARM_LPAE_PTE_NSTABLE (((arm_lpae_iopte)1) << 63)
+#define ARM_LPAE_PTE_XN (((arm_lpae_iopte)3) << 53)
+#define ARM_LPAE_PTE_DBM (((arm_lpae_iopte)1) << 51)
+#define ARM_LPAE_PTE_AF (((arm_lpae_iopte)1) << 10)
+#define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
+#define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
+#define ARM_LPAE_PTE_SH_IS (((arm_lpae_iopte)3) << 8)
+#define ARM_LPAE_PTE_NS (((arm_lpae_iopte)1) << 5)
+#define ARM_LPAE_PTE_VALID (((arm_lpae_iopte)1) << 0)
+
+#define ARM_LPAE_PTE_ATTR_LO_MASK (((arm_lpae_iopte)0x3ff) << 2)
+/* Ignore the contiguous bit for block splitting */
+#define ARM_LPAE_PTE_ATTR_HI_MASK (ARM_LPAE_PTE_XN | ARM_LPAE_PTE_DBM)
+#define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK | \
+ ARM_LPAE_PTE_ATTR_HI_MASK)
+/* Software bit for solving coherency races */
+#define ARM_LPAE_PTE_SW_SYNC (((arm_lpae_iopte)1) << 55)
+
+/* Stage-1 PTE */
+#define ARM_LPAE_PTE_AP_UNPRIV (((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_AP_RDONLY_BIT 7
+#define ARM_LPAE_PTE_AP_RDONLY (((arm_lpae_iopte)1) << \
+ ARM_LPAE_PTE_AP_RDONLY_BIT)
+#define ARM_LPAE_PTE_AP_WR_CLEAN_MASK (ARM_LPAE_PTE_AP_RDONLY | \
+ ARM_LPAE_PTE_DBM)
+#define ARM_LPAE_PTE_ATTRINDX_SHIFT 2
+#define ARM_LPAE_PTE_nG (((arm_lpae_iopte)1) << 11)
+
+/* Stage-2 PTE */
+#define ARM_LPAE_PTE_HAP_FAULT (((arm_lpae_iopte)0) << 6)
+#define ARM_LPAE_PTE_HAP_READ (((arm_lpae_iopte)1) << 6)
+#define ARM_LPAE_PTE_HAP_WRITE (((arm_lpae_iopte)2) << 6)
+#define ARM_LPAE_PTE_MEMATTR_OIWB (((arm_lpae_iopte)0xf) << 2)
+#define ARM_LPAE_PTE_MEMATTR_NC (((arm_lpae_iopte)0x5) << 2)
+#define ARM_LPAE_PTE_MEMATTR_DEV (((arm_lpae_iopte)0x1) << 2)
+
+/* Register bits */
+#define ARM_LPAE_VTCR_SL0_MASK 0x3
+
+#define ARM_LPAE_TCR_T0SZ_SHIFT 0
+
+#define ARM_LPAE_VTCR_PS_SHIFT 16
+#define ARM_LPAE_VTCR_PS_MASK 0x7
+
+#define ARM_LPAE_MAIR_ATTR_SHIFT(n) ((n) << 3)
+#define ARM_LPAE_MAIR_ATTR_MASK 0xff
+#define ARM_LPAE_MAIR_ATTR_DEVICE 0x04
+#define ARM_LPAE_MAIR_ATTR_NC 0x44
+#define ARM_LPAE_MAIR_ATTR_INC_OWBRWA 0xf4
+#define ARM_LPAE_MAIR_ATTR_WBRWA 0xff
+#define ARM_LPAE_MAIR_ATTR_IDX_NC 0
+#define ARM_LPAE_MAIR_ATTR_IDX_CACHE 1
+#define ARM_LPAE_MAIR_ATTR_IDX_DEV 2
+#define ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE 3
+
+#define ARM_MALI_LPAE_TTBR_ADRMODE_TABLE (3u << 0)
+#define ARM_MALI_LPAE_TTBR_READ_INNER BIT(2)
+#define ARM_MALI_LPAE_TTBR_SHARE_OUTER BIT(4)
+
+#define ARM_MALI_LPAE_MEMATTR_IMP_DEF 0x88ULL
+#define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL
+
+#define ARM_LPAE_MAX_LEVELS 4
+
+#define ARM_LPAE_TCR_TG0_4K 0
+#define ARM_LPAE_TCR_TG0_64K 1
+#define ARM_LPAE_TCR_TG0_16K 2
+
+#define ARM_LPAE_TCR_TG1_16K 1
+#define ARM_LPAE_TCR_TG1_4K 2
+#define ARM_LPAE_TCR_TG1_64K 3
+
+#define ARM_LPAE_TCR_SH_NS 0
+#define ARM_LPAE_TCR_SH_OS 2
+#define ARM_LPAE_TCR_SH_IS 3
+
+#define ARM_LPAE_TCR_RGN_NC 0
+#define ARM_LPAE_TCR_RGN_WBWA 1
+#define ARM_LPAE_TCR_RGN_WT 2
+#define ARM_LPAE_TCR_RGN_WB 3
+
+#define ARM_LPAE_TCR_PS_32_BIT 0x0ULL
+#define ARM_LPAE_TCR_PS_36_BIT 0x1ULL
+#define ARM_LPAE_TCR_PS_40_BIT 0x2ULL
+#define ARM_LPAE_TCR_PS_42_BIT 0x3ULL
+#define ARM_LPAE_TCR_PS_44_BIT 0x4ULL
+#define ARM_LPAE_TCR_PS_48_BIT 0x5ULL
+#define ARM_LPAE_TCR_PS_52_BIT 0x6ULL
+
+/* IOPTE accessors */
+#define iopte_type(pte) \
+ (((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK)
+
+#define iopte_prot(pte) ((pte) & ARM_LPAE_PTE_ATTR_MASK)
+
+#define iopte_writeable_dirty(pte) \
+ (((pte) & ARM_LPAE_PTE_AP_WR_CLEAN_MASK) == ARM_LPAE_PTE_DBM)
+
+#define iopte_set_writeable_clean(ptep) \
+ set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)(ptep))
+
+
+static inline bool iopte_leaf(arm_lpae_iopte pte, int lvl,
+ enum io_pgtable_fmt fmt)
+{
+ if (lvl == (ARM_LPAE_MAX_LEVELS - 1) && fmt != ARM_MALI_LPAE)
+ return iopte_type(pte) == ARM_LPAE_PTE_TYPE_PAGE;
+
+ return iopte_type(pte) == ARM_LPAE_PTE_TYPE_BLOCK;
+}
+
+static inline bool iopte_table(arm_lpae_iopte pte, int lvl)
+{
+ if (lvl == (ARM_LPAE_MAX_LEVELS - 1))
+ return false;
+ return iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE;
+}
+
+#define __arm_lpae_virt_to_phys __pa
+#define __arm_lpae_phys_to_virt __va
+
+/* Generic functions */
+int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount,
+ int iommu_prot, gfp_t gfp, size_t *mapped);
+size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather);
+phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
+ unsigned long iova);
+void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
+ arm_lpae_iopte *ptep);
+
+int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
+ unsigned long iova, size_t size,
+ unsigned long flags,
+ struct iommu_dirty_bitmap *dirty);
+
+int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova, void *wd);
+
+/* Host/hyp-specific functions */
+void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp, struct io_pgtable_cfg *cfg, void *cookie);
+void __arm_lpae_free_pages(void *pages, size_t size, struct io_pgtable_cfg *cfg, void *cookie);
+void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
+ struct io_pgtable_cfg *cfg);
+int arm_lpae_map_exists(void);
+int arm_lpae_unmap_empty(void);
+#endif /* IO_PGTABLE_ARM_H_ */
--
2.47.0.338.g60cca15819-goog
* [RFC PATCH v2 02/58] iommu/io-pgtable-arm: Split initialization
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 01/58] iommu/io-pgtable-arm: Split the page table driver Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 03/58] iommu/io-pgtable: Add configure() operation Mostafa Saleh
` (56 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Extract the configuration part from io-pgtable-arm.c and move it to
io-pgtable-arm-common.c.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
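For reference, a minimal sketch of how an allocator is expected to use the
split initialization after this patch (it mirrors the new
arm_64_lpae_alloc_pgtable_s1() in the diff below; error paths and unrelated
setup are trimmed):
	struct arm_lpae_io_pgtable *data;
	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return NULL;
	/* Compute TCR/MAIR and the walk geometry; no table allocation yet */
	if (arm_lpae_init_pgtable_s1(cfg, data))
		goto out_free_data;
	/* Only now allocate the pgd, once the config is known-good */
	data->pgd = __arm_lpae_alloc_pages(ARM_LPAE_PGD_SIZE(data),
					   GFP_KERNEL, cfg, cookie);
The caller now owns both the allocation and the failure path, which keeps the
common init code free of kzalloc()/kfree().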
drivers/iommu/io-pgtable-arm-common.c | 284 ++++++++++++++++++++++++--
drivers/iommu/io-pgtable-arm.c | 250 +----------------------
include/linux/io-pgtable-arm.h | 20 +-
3 files changed, 286 insertions(+), 268 deletions(-)
diff --git a/drivers/iommu/io-pgtable-arm-common.c b/drivers/iommu/io-pgtable-arm-common.c
index ef14a1b50d32..21ee8ff7c881 100644
--- a/drivers/iommu/io-pgtable-arm-common.c
+++ b/drivers/iommu/io-pgtable-arm-common.c
@@ -15,6 +15,9 @@
#define iopte_deref(pte, d) __arm_lpae_phys_to_virt(iopte_to_paddr(pte, d))
+#define ARM_LPAE_MAX_ADDR_BITS 52
+#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
+
static arm_lpae_iopte paddr_to_iopte(phys_addr_t paddr,
struct arm_lpae_io_pgtable *data)
{
@@ -257,9 +260,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
return pte;
}
-int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
- phys_addr_t paddr, size_t pgsize, size_t pgcount,
- int iommu_prot, gfp_t gfp, size_t *mapped)
+static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount,
+ int iommu_prot, gfp_t gfp, size_t *mapped)
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct io_pgtable_cfg *cfg = &data->iop.cfg;
@@ -444,9 +447,9 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
return __arm_lpae_unmap(data, gather, iova, size, pgcount, lvl + 1, ptep);
}
-size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
- size_t pgsize, size_t pgcount,
- struct iommu_iotlb_gather *gather)
+static size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather)
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct io_pgtable_cfg *cfg = &data->iop.cfg;
@@ -484,8 +487,8 @@ static int visit_iova_to_phys(struct io_pgtable_walk_data *walk_data, int lvl,
return 0;
}
-phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
- unsigned long iova)
+static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
+ unsigned long iova)
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct iova_to_phys_data d;
@@ -513,7 +516,7 @@ static int visit_pgtable_walk(struct io_pgtable_walk_data *walk_data, int lvl,
return 0;
}
-int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova, void *wd)
+static int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova, void *wd)
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct io_pgtable_walk_data walk_data = {
@@ -596,10 +599,10 @@ static int visit_dirty(struct io_pgtable_walk_data *walk_data, int lvl,
return 0;
}
-int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
- unsigned long iova, size_t size,
- unsigned long flags,
- struct iommu_dirty_bitmap *dirty)
+static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
+ unsigned long iova, size_t size,
+ unsigned long flags,
+ struct iommu_dirty_bitmap *dirty)
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct io_pgtable_cfg *cfg = &data->iop.cfg;
@@ -623,3 +626,258 @@ int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
return __arm_lpae_iopte_walk(data, &walk_data, ptep, lvl);
}
+
+static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
+{
+ unsigned long granule, page_sizes;
+ unsigned int max_addr_bits = 48;
+
+ /*
+ * We need to restrict the supported page sizes to match the
+ * translation regime for a particular granule. Aim to match
+ * the CPU page size if possible, otherwise prefer smaller sizes.
+ * While we're at it, restrict the block sizes to match the
+ * chosen granule.
+ */
+ if (cfg->pgsize_bitmap & PAGE_SIZE)
+ granule = PAGE_SIZE;
+ else if (cfg->pgsize_bitmap & ~PAGE_MASK)
+ granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
+ else if (cfg->pgsize_bitmap & PAGE_MASK)
+ granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
+ else
+ granule = 0;
+
+ switch (granule) {
+ case SZ_4K:
+ page_sizes = (SZ_4K | SZ_2M | SZ_1G);
+ break;
+ case SZ_16K:
+ page_sizes = (SZ_16K | SZ_32M);
+ break;
+ case SZ_64K:
+ max_addr_bits = 52;
+ page_sizes = (SZ_64K | SZ_512M);
+ if (cfg->oas > 48)
+ page_sizes |= 1ULL << 42; /* 4TB */
+ break;
+ default:
+ page_sizes = 0;
+ }
+
+ cfg->pgsize_bitmap &= page_sizes;
+ cfg->ias = min(cfg->ias, max_addr_bits);
+ cfg->oas = min(cfg->oas, max_addr_bits);
+}
+
+int arm_lpae_init_pgtable(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data)
+{
+ int levels, va_bits, pg_shift;
+
+ arm_lpae_restrict_pgsizes(cfg);
+
+ if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
+ return -EINVAL;
+
+ if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
+ return -E2BIG;
+
+ if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
+ return -E2BIG;
+
+ pg_shift = __ffs(cfg->pgsize_bitmap);
+ data->bits_per_level = pg_shift - ilog2(sizeof(arm_lpae_iopte));
+
+ va_bits = cfg->ias - pg_shift;
+ levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
+ data->start_level = ARM_LPAE_MAX_LEVELS - levels;
+
+ /* Calculate the actual size of our pgd (without concatenation) */
+ data->pgd_bits = va_bits - (data->bits_per_level * (levels - 1));
+
+ data->iop.ops = (struct io_pgtable_ops) {
+ .map_pages = arm_lpae_map_pages,
+ .unmap_pages = arm_lpae_unmap_pages,
+ .iova_to_phys = arm_lpae_iova_to_phys,
+ .read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
+ .pgtable_walk = arm_lpae_pgtable_walk,
+ };
+
+ return 0;
+}
+
+int arm_lpae_init_pgtable_s1(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data)
+{
+ u64 reg;
+ int ret;
+ typeof(&cfg->arm_lpae_s1_cfg.tcr) tcr = &cfg->arm_lpae_s1_cfg.tcr;
+ bool tg1;
+
+ if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
+ IO_PGTABLE_QUIRK_ARM_TTBR1 |
+ IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
+ IO_PGTABLE_QUIRK_ARM_HD))
+ return -EINVAL;
+
+ ret = arm_lpae_init_pgtable(cfg, data);
+ if (ret)
+ return ret;
+
+ /* TCR */
+ if (cfg->coherent_walk) {
+ tcr->sh = ARM_LPAE_TCR_SH_IS;
+ tcr->irgn = ARM_LPAE_TCR_RGN_WBWA;
+ tcr->orgn = ARM_LPAE_TCR_RGN_WBWA;
+ if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_OUTER_WBWA)
+ return -EINVAL;
+ } else {
+ tcr->sh = ARM_LPAE_TCR_SH_OS;
+ tcr->irgn = ARM_LPAE_TCR_RGN_NC;
+ if (!(cfg->quirks & IO_PGTABLE_QUIRK_ARM_OUTER_WBWA))
+ tcr->orgn = ARM_LPAE_TCR_RGN_NC;
+ else
+ tcr->orgn = ARM_LPAE_TCR_RGN_WBWA;
+ }
+
+ tg1 = cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1;
+ switch (ARM_LPAE_GRANULE(data)) {
+ case SZ_4K:
+ tcr->tg = tg1 ? ARM_LPAE_TCR_TG1_4K : ARM_LPAE_TCR_TG0_4K;
+ break;
+ case SZ_16K:
+ tcr->tg = tg1 ? ARM_LPAE_TCR_TG1_16K : ARM_LPAE_TCR_TG0_16K;
+ break;
+ case SZ_64K:
+ tcr->tg = tg1 ? ARM_LPAE_TCR_TG1_64K : ARM_LPAE_TCR_TG0_64K;
+ break;
+ }
+
+ switch (cfg->oas) {
+ case 32:
+ tcr->ips = ARM_LPAE_TCR_PS_32_BIT;
+ break;
+ case 36:
+ tcr->ips = ARM_LPAE_TCR_PS_36_BIT;
+ break;
+ case 40:
+ tcr->ips = ARM_LPAE_TCR_PS_40_BIT;
+ break;
+ case 42:
+ tcr->ips = ARM_LPAE_TCR_PS_42_BIT;
+ break;
+ case 44:
+ tcr->ips = ARM_LPAE_TCR_PS_44_BIT;
+ break;
+ case 48:
+ tcr->ips = ARM_LPAE_TCR_PS_48_BIT;
+ break;
+ case 52:
+ tcr->ips = ARM_LPAE_TCR_PS_52_BIT;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ tcr->tsz = 64ULL - cfg->ias;
+
+ /* MAIRs */
+ reg = (ARM_LPAE_MAIR_ATTR_NC
+ << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
+ (ARM_LPAE_MAIR_ATTR_WBRWA
+ << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
+ (ARM_LPAE_MAIR_ATTR_DEVICE
+ << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV)) |
+ (ARM_LPAE_MAIR_ATTR_INC_OWBRWA
+ << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE));
+
+ cfg->arm_lpae_s1_cfg.mair = reg;
+ return 0;
+}
+
+int arm_lpae_init_pgtable_s2(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data)
+{
+ u64 sl;
+ int ret;
+ typeof(&cfg->arm_lpae_s2_cfg.vtcr) vtcr = &cfg->arm_lpae_s2_cfg.vtcr;
+
+ /* The NS quirk doesn't apply at stage 2 */
+ if (cfg->quirks)
+ return -EINVAL;
+
+ ret = arm_lpae_init_pgtable(cfg, data);
+ if (ret)
+ return ret;
+
+ /*
+ * Concatenate PGDs at level 1 if possible in order to reduce
+ * the depth of the stage-2 walk.
+ */
+ if (data->start_level == 0) {
+ unsigned long pgd_pages;
+
+ pgd_pages = ARM_LPAE_PGD_SIZE(data) / sizeof(arm_lpae_iopte);
+ if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
+ data->pgd_bits += data->bits_per_level;
+ data->start_level++;
+ }
+ }
+
+ /* VTCR */
+ if (cfg->coherent_walk) {
+ vtcr->sh = ARM_LPAE_TCR_SH_IS;
+ vtcr->irgn = ARM_LPAE_TCR_RGN_WBWA;
+ vtcr->orgn = ARM_LPAE_TCR_RGN_WBWA;
+ } else {
+ vtcr->sh = ARM_LPAE_TCR_SH_OS;
+ vtcr->irgn = ARM_LPAE_TCR_RGN_NC;
+ vtcr->orgn = ARM_LPAE_TCR_RGN_NC;
+ }
+
+ sl = data->start_level;
+
+ switch (ARM_LPAE_GRANULE(data)) {
+ case SZ_4K:
+ vtcr->tg = ARM_LPAE_TCR_TG0_4K;
+ sl++; /* SL0 format is different for 4K granule size */
+ break;
+ case SZ_16K:
+ vtcr->tg = ARM_LPAE_TCR_TG0_16K;
+ break;
+ case SZ_64K:
+ vtcr->tg = ARM_LPAE_TCR_TG0_64K;
+ break;
+ }
+
+ switch (cfg->oas) {
+ case 32:
+ vtcr->ps = ARM_LPAE_TCR_PS_32_BIT;
+ break;
+ case 36:
+ vtcr->ps = ARM_LPAE_TCR_PS_36_BIT;
+ break;
+ case 40:
+ vtcr->ps = ARM_LPAE_TCR_PS_40_BIT;
+ break;
+ case 42:
+ vtcr->ps = ARM_LPAE_TCR_PS_42_BIT;
+ break;
+ case 44:
+ vtcr->ps = ARM_LPAE_TCR_PS_44_BIT;
+ break;
+ case 48:
+ vtcr->ps = ARM_LPAE_TCR_PS_48_BIT;
+ break;
+ case 52:
+ vtcr->ps = ARM_LPAE_TCR_PS_52_BIT;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ vtcr->tsz = 64ULL - cfg->ias;
+ vtcr->sl = ~sl & ARM_LPAE_VTCR_SL0_MASK;
+ return 0;
+}
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index cb4eb513adbf..8d435a5bcd9a 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -19,12 +19,9 @@
#include <linux/types.h>
#include <linux/dma-mapping.h>
-#include <asm/barrier.h>
-
#include "iommu-pages.h"
-#define ARM_LPAE_MAX_ADDR_BITS 52
-#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
+#include <asm/barrier.h>
static bool selftest_running = false;
@@ -121,177 +118,17 @@ static void arm_lpae_free_pgtable(struct io_pgtable *iop)
kfree(data);
}
-static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
-{
- unsigned long granule, page_sizes;
- unsigned int max_addr_bits = 48;
-
- /*
- * We need to restrict the supported page sizes to match the
- * translation regime for a particular granule. Aim to match
- * the CPU page size if possible, otherwise prefer smaller sizes.
- * While we're at it, restrict the block sizes to match the
- * chosen granule.
- */
- if (cfg->pgsize_bitmap & PAGE_SIZE)
- granule = PAGE_SIZE;
- else if (cfg->pgsize_bitmap & ~PAGE_MASK)
- granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
- else if (cfg->pgsize_bitmap & PAGE_MASK)
- granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
- else
- granule = 0;
-
- switch (granule) {
- case SZ_4K:
- page_sizes = (SZ_4K | SZ_2M | SZ_1G);
- break;
- case SZ_16K:
- page_sizes = (SZ_16K | SZ_32M);
- break;
- case SZ_64K:
- max_addr_bits = 52;
- page_sizes = (SZ_64K | SZ_512M);
- if (cfg->oas > 48)
- page_sizes |= 1ULL << 42; /* 4TB */
- break;
- default:
- page_sizes = 0;
- }
-
- cfg->pgsize_bitmap &= page_sizes;
- cfg->ias = min(cfg->ias, max_addr_bits);
- cfg->oas = min(cfg->oas, max_addr_bits);
-}
-
-static struct arm_lpae_io_pgtable *
-arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
-{
- struct arm_lpae_io_pgtable *data;
- int levels, va_bits, pg_shift;
-
- arm_lpae_restrict_pgsizes(cfg);
-
- if (!(cfg->pgsize_bitmap & (SZ_4K | SZ_16K | SZ_64K)))
- return NULL;
-
- if (cfg->ias > ARM_LPAE_MAX_ADDR_BITS)
- return NULL;
-
- if (cfg->oas > ARM_LPAE_MAX_ADDR_BITS)
- return NULL;
-
- data = kmalloc(sizeof(*data), GFP_KERNEL);
- if (!data)
- return NULL;
-
- pg_shift = __ffs(cfg->pgsize_bitmap);
- data->bits_per_level = pg_shift - ilog2(sizeof(arm_lpae_iopte));
-
- va_bits = cfg->ias - pg_shift;
- levels = DIV_ROUND_UP(va_bits, data->bits_per_level);
- data->start_level = ARM_LPAE_MAX_LEVELS - levels;
-
- /* Calculate the actual size of our pgd (without concatenation) */
- data->pgd_bits = va_bits - (data->bits_per_level * (levels - 1));
-
- data->iop.ops = (struct io_pgtable_ops) {
- .map_pages = arm_lpae_map_pages,
- .unmap_pages = arm_lpae_unmap_pages,
- .iova_to_phys = arm_lpae_iova_to_phys,
- .read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
- .pgtable_walk = arm_lpae_pgtable_walk,
- };
-
- return data;
-}
-
static struct io_pgtable *
arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
{
- u64 reg;
struct arm_lpae_io_pgtable *data;
- typeof(&cfg->arm_lpae_s1_cfg.tcr) tcr = &cfg->arm_lpae_s1_cfg.tcr;
- bool tg1;
-
- if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
- IO_PGTABLE_QUIRK_ARM_TTBR1 |
- IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
- IO_PGTABLE_QUIRK_ARM_HD))
- return NULL;
- data = arm_lpae_alloc_pgtable(cfg);
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
return NULL;
- /* TCR */
- if (cfg->coherent_walk) {
- tcr->sh = ARM_LPAE_TCR_SH_IS;
- tcr->irgn = ARM_LPAE_TCR_RGN_WBWA;
- tcr->orgn = ARM_LPAE_TCR_RGN_WBWA;
- if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_OUTER_WBWA)
- goto out_free_data;
- } else {
- tcr->sh = ARM_LPAE_TCR_SH_OS;
- tcr->irgn = ARM_LPAE_TCR_RGN_NC;
- if (!(cfg->quirks & IO_PGTABLE_QUIRK_ARM_OUTER_WBWA))
- tcr->orgn = ARM_LPAE_TCR_RGN_NC;
- else
- tcr->orgn = ARM_LPAE_TCR_RGN_WBWA;
- }
-
- tg1 = cfg->quirks & IO_PGTABLE_QUIRK_ARM_TTBR1;
- switch (ARM_LPAE_GRANULE(data)) {
- case SZ_4K:
- tcr->tg = tg1 ? ARM_LPAE_TCR_TG1_4K : ARM_LPAE_TCR_TG0_4K;
- break;
- case SZ_16K:
- tcr->tg = tg1 ? ARM_LPAE_TCR_TG1_16K : ARM_LPAE_TCR_TG0_16K;
- break;
- case SZ_64K:
- tcr->tg = tg1 ? ARM_LPAE_TCR_TG1_64K : ARM_LPAE_TCR_TG0_64K;
- break;
- }
-
- switch (cfg->oas) {
- case 32:
- tcr->ips = ARM_LPAE_TCR_PS_32_BIT;
- break;
- case 36:
- tcr->ips = ARM_LPAE_TCR_PS_36_BIT;
- break;
- case 40:
- tcr->ips = ARM_LPAE_TCR_PS_40_BIT;
- break;
- case 42:
- tcr->ips = ARM_LPAE_TCR_PS_42_BIT;
- break;
- case 44:
- tcr->ips = ARM_LPAE_TCR_PS_44_BIT;
- break;
- case 48:
- tcr->ips = ARM_LPAE_TCR_PS_48_BIT;
- break;
- case 52:
- tcr->ips = ARM_LPAE_TCR_PS_52_BIT;
- break;
- default:
+ if (arm_lpae_init_pgtable_s1(cfg, data))
goto out_free_data;
- }
-
- tcr->tsz = 64ULL - cfg->ias;
-
- /* MAIRs */
- reg = (ARM_LPAE_MAIR_ATTR_NC
- << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
- (ARM_LPAE_MAIR_ATTR_WBRWA
- << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
- (ARM_LPAE_MAIR_ATTR_DEVICE
- << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV)) |
- (ARM_LPAE_MAIR_ATTR_INC_OWBRWA
- << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE));
-
- cfg->arm_lpae_s1_cfg.mair = reg;
/* Looking good; allocate a pgd */
data->pgd = __arm_lpae_alloc_pages(ARM_LPAE_PGD_SIZE(data),
@@ -314,86 +151,14 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
static struct io_pgtable *
arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
{
- u64 sl;
struct arm_lpae_io_pgtable *data;
- typeof(&cfg->arm_lpae_s2_cfg.vtcr) vtcr = &cfg->arm_lpae_s2_cfg.vtcr;
-
- /* The NS quirk doesn't apply at stage 2 */
- if (cfg->quirks)
- return NULL;
- data = arm_lpae_alloc_pgtable(cfg);
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
return NULL;
- /*
- * Concatenate PGDs at level 1 if possible in order to reduce
- * the depth of the stage-2 walk.
- */
- if (data->start_level == 0) {
- unsigned long pgd_pages;
-
- pgd_pages = ARM_LPAE_PGD_SIZE(data) / sizeof(arm_lpae_iopte);
- if (pgd_pages <= ARM_LPAE_S2_MAX_CONCAT_PAGES) {
- data->pgd_bits += data->bits_per_level;
- data->start_level++;
- }
- }
-
- /* VTCR */
- if (cfg->coherent_walk) {
- vtcr->sh = ARM_LPAE_TCR_SH_IS;
- vtcr->irgn = ARM_LPAE_TCR_RGN_WBWA;
- vtcr->orgn = ARM_LPAE_TCR_RGN_WBWA;
- } else {
- vtcr->sh = ARM_LPAE_TCR_SH_OS;
- vtcr->irgn = ARM_LPAE_TCR_RGN_NC;
- vtcr->orgn = ARM_LPAE_TCR_RGN_NC;
- }
-
- sl = data->start_level;
-
- switch (ARM_LPAE_GRANULE(data)) {
- case SZ_4K:
- vtcr->tg = ARM_LPAE_TCR_TG0_4K;
- sl++; /* SL0 format is different for 4K granule size */
- break;
- case SZ_16K:
- vtcr->tg = ARM_LPAE_TCR_TG0_16K;
- break;
- case SZ_64K:
- vtcr->tg = ARM_LPAE_TCR_TG0_64K;
- break;
- }
-
- switch (cfg->oas) {
- case 32:
- vtcr->ps = ARM_LPAE_TCR_PS_32_BIT;
- break;
- case 36:
- vtcr->ps = ARM_LPAE_TCR_PS_36_BIT;
- break;
- case 40:
- vtcr->ps = ARM_LPAE_TCR_PS_40_BIT;
- break;
- case 42:
- vtcr->ps = ARM_LPAE_TCR_PS_42_BIT;
- break;
- case 44:
- vtcr->ps = ARM_LPAE_TCR_PS_44_BIT;
- break;
- case 48:
- vtcr->ps = ARM_LPAE_TCR_PS_48_BIT;
- break;
- case 52:
- vtcr->ps = ARM_LPAE_TCR_PS_52_BIT;
- break;
- default:
+ if (arm_lpae_init_pgtable_s2(cfg, data))
goto out_free_data;
- }
-
- vtcr->tsz = 64ULL - cfg->ias;
- vtcr->sl = ~sl & ARM_LPAE_VTCR_SL0_MASK;
/* Allocate pgd pages */
data->pgd = __arm_lpae_alloc_pages(ARM_LPAE_PGD_SIZE(data),
@@ -447,10 +212,13 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
- data = arm_lpae_alloc_pgtable(cfg);
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
return NULL;
+ if (arm_lpae_init_pgtable(cfg, data))
+ return NULL;
+
/* Mali seems to need a full 4-level table regardless of IAS */
if (data->start_level > 0) {
data->start_level = 0;
diff --git a/include/linux/io-pgtable-arm.h b/include/linux/io-pgtable-arm.h
index 1f56dabca18c..337e9254fdbd 100644
--- a/include/linux/io-pgtable-arm.h
+++ b/include/linux/io-pgtable-arm.h
@@ -195,23 +195,15 @@ static inline bool iopte_table(arm_lpae_iopte pte, int lvl)
#define __arm_lpae_phys_to_virt __va
/* Generic functions */
-int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
- phys_addr_t paddr, size_t pgsize, size_t pgcount,
- int iommu_prot, gfp_t gfp, size_t *mapped);
-size_t arm_lpae_unmap_pages(struct io_pgtable_ops *ops, unsigned long iova,
- size_t pgsize, size_t pgcount,
- struct iommu_iotlb_gather *gather);
-phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
- unsigned long iova);
void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
arm_lpae_iopte *ptep);
-int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
- unsigned long iova, size_t size,
- unsigned long flags,
- struct iommu_dirty_bitmap *dirty);
-
-int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova, void *wd);
+int arm_lpae_init_pgtable(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data);
+int arm_lpae_init_pgtable_s1(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data);
+int arm_lpae_init_pgtable_s2(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data);
/* Host/hyp-specific functions */
void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp, struct io_pgtable_cfg *cfg, void *cookie);
--
2.47.0.338.g60cca15819-goog
* [RFC PATCH v2 03/58] iommu/io-pgtable: Add configure() operation
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 01/58] iommu/io-pgtable-arm: Split the page table driver Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 02/58] iommu/io-pgtable-arm: Split initialization Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 04/58] iommu/arm-smmu-v3: Move some definitions to arm64 include/ Mostafa Saleh
` (55 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Allow IOMMU drivers to create the io-pgtable configuration without
allocating any tables. This will be used by the SMMUv3-KVM driver to
initialize a config and pass it to KVM.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
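A minimal usage sketch of the new operation (illustrative only; the field
values below are made up): the caller fills an io_pgtable_cfg, including the
new fmt member, and gets the translation register values back without any
page table being allocated.
	struct io_pgtable_cfg cfg = {
		.fmt		= ARM_64_LPAE_S2,
		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
		.ias		= 40,
		.oas		= 40,
		.coherent_walk	= true,
	};
	int ret;
	ret = io_pgtable_configure(&cfg);
	if (ret)	/* -EOPNOTSUPP if the format has no configure() */
		return ret;
	/* cfg.arm_lpae_s2_cfg.vtcr now holds the values to program */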
drivers/iommu/io-pgtable-arm.c | 16 ++++++++++++++++
drivers/iommu/io-pgtable.c | 15 +++++++++++++++
include/linux/io-pgtable.h | 15 +++++++++++++++
3 files changed, 46 insertions(+)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 8d435a5bcd9a..e85866c90290 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -148,6 +148,13 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
return NULL;
}
+static int arm_64_lpae_configure_s1(struct io_pgtable_cfg *cfg)
+{
+ struct arm_lpae_io_pgtable data = {};
+
+ return arm_lpae_init_pgtable_s1(cfg, &data);
+}
+
static struct io_pgtable *
arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
{
@@ -178,6 +185,13 @@ arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
return NULL;
}
+static int arm_64_lpae_configure_s2(struct io_pgtable_cfg *cfg)
+{
+ struct arm_lpae_io_pgtable data = {};
+
+ return arm_lpae_init_pgtable_s2(cfg, &data);
+}
+
static struct io_pgtable *
arm_32_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
{
@@ -264,12 +278,14 @@ struct io_pgtable_init_fns io_pgtable_arm_64_lpae_s1_init_fns = {
.caps = IO_PGTABLE_CAP_CUSTOM_ALLOCATOR,
.alloc = arm_64_lpae_alloc_pgtable_s1,
.free = arm_lpae_free_pgtable,
+ .configure = arm_64_lpae_configure_s1,
};
struct io_pgtable_init_fns io_pgtable_arm_64_lpae_s2_init_fns = {
.caps = IO_PGTABLE_CAP_CUSTOM_ALLOCATOR,
.alloc = arm_64_lpae_alloc_pgtable_s2,
.free = arm_lpae_free_pgtable,
+ .configure = arm_64_lpae_configure_s2,
};
struct io_pgtable_init_fns io_pgtable_arm_32_lpae_s1_init_fns = {
diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 8841c1487f00..be65f70ec2a6 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -99,3 +99,18 @@ void free_io_pgtable_ops(struct io_pgtable_ops *ops)
io_pgtable_init_table[iop->fmt]->free(iop);
}
EXPORT_SYMBOL_GPL(free_io_pgtable_ops);
+
+int io_pgtable_configure(struct io_pgtable_cfg *cfg)
+{
+ const struct io_pgtable_init_fns *fns;
+
+ if (cfg->fmt >= IO_PGTABLE_NUM_FMTS)
+ return -EINVAL;
+
+ fns = io_pgtable_init_table[cfg->fmt];
+ if (!fns || !fns->configure)
+ return -EOPNOTSUPP;
+
+ return fns->configure(cfg);
+}
+EXPORT_SYMBOL_GPL(io_pgtable_configure);
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index d7bfbf351975..f789234c703b 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -49,6 +49,7 @@ struct iommu_flush_ops {
/**
* struct io_pgtable_cfg - Configuration data for a set of page tables.
*
+ * @fmt: Format used for these page tables
* @quirks: A bitmap of hardware quirks that require some special
* action by the low-level page table allocator.
* @pgsize_bitmap: A bitmap of page sizes supported by this set of page
@@ -62,6 +63,7 @@ struct iommu_flush_ops {
* page table walker.
*/
struct io_pgtable_cfg {
+ enum io_pgtable_fmt fmt;
/*
* IO_PGTABLE_QUIRK_ARM_NS: (ARM formats) Set NS and NSTABLE bits in
* stage 1 PTEs, for hardware which insists on validating them
@@ -241,6 +243,17 @@ struct io_pgtable_ops *alloc_io_pgtable_ops(enum io_pgtable_fmt fmt,
*/
void free_io_pgtable_ops(struct io_pgtable_ops *ops);
+/**
+ * io_pgtable_configure - Create page table config
+ *
+ * @cfg: The page table configuration.
+ *
+ * Initialize @cfg in the same way as alloc_io_pgtable_ops(), without allocating
+ * anything.
+ *
+ * Not all io_pgtable drivers implement this operation.
+ */
+int io_pgtable_configure(struct io_pgtable_cfg *cfg);
/*
* Internal structures for page table allocator implementations.
@@ -301,11 +314,13 @@ enum io_pgtable_caps {
*
* @alloc: Allocate a set of page tables described by cfg.
* @free: Free the page tables associated with iop.
+ * @configure: Create the configuration without allocating anything. Optional.
* @caps: Combination of @io_pgtable_caps flags encoding the backend capabilities.
*/
struct io_pgtable_init_fns {
struct io_pgtable *(*alloc)(struct io_pgtable_cfg *cfg, void *cookie);
void (*free)(struct io_pgtable *iop);
+ int (*configure)(struct io_pgtable_cfg *cfg);
u32 caps;
};
--
2.47.0.338.g60cca15819-goog
* [RFC PATCH v2 04/58] iommu/arm-smmu-v3: Move some definitions to arm64 include/
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (2 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 03/58] iommu/io-pgtable: Add configure() operation Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 05/58] iommu/arm-smmu-v3: Extract driver-specific bits from probe function Mostafa Saleh
` (54 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Move the architectural definitions, command structures and feature bits to
arch/arm64/include/asm/, so that the KVM SMMUv3 driver can re-use them.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
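As a small illustration of what the shared header provides (not part of this
patch; smmu_base and features are placeholder variables), both the kernel and
the hypervisor driver can decode the ID registers with the usual bitfield
helpers once <asm/arm-smmu-v3-common.h> is included:
	u32 reg = readl_relaxed(smmu_base + ARM_SMMU_IDR0);
	/* Two-level stream table supported? */
	if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
		features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
	/* Stage-2 translation supported? */
	if (reg & IDR0_S2P)
		features |= ARM_SMMU_FEAT_TRANS_S2;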
arch/arm64/include/asm/arm-smmu-v3-common.h | 547 ++++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 540 +------------------
2 files changed, 549 insertions(+), 538 deletions(-)
create mode 100644 arch/arm64/include/asm/arm-smmu-v3-common.h
diff --git a/arch/arm64/include/asm/arm-smmu-v3-common.h b/arch/arm64/include/asm/arm-smmu-v3-common.h
new file mode 100644
index 000000000000..e6e339248816
--- /dev/null
+++ b/arch/arm64/include/asm/arm-smmu-v3-common.h
@@ -0,0 +1,547 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ARM_SMMU_V3_COMMON_H
+#define _ARM_SMMU_V3_COMMON_H
+
+#include <linux/bitfield.h>
+
+/* MMIO registers */
+#define ARM_SMMU_IDR0 0x0
+#define IDR0_ST_LVL GENMASK(28, 27)
+#define IDR0_ST_LVL_2LVL 1
+#define IDR0_STALL_MODEL GENMASK(25, 24)
+#define IDR0_STALL_MODEL_STALL 0
+#define IDR0_STALL_MODEL_FORCE 2
+#define IDR0_TTENDIAN GENMASK(22, 21)
+#define IDR0_TTENDIAN_MIXED 0
+#define IDR0_TTENDIAN_LE 2
+#define IDR0_TTENDIAN_BE 3
+#define IDR0_CD2L (1 << 19)
+#define IDR0_VMID16 (1 << 18)
+#define IDR0_PRI (1 << 16)
+#define IDR0_SEV (1 << 14)
+#define IDR0_MSI (1 << 13)
+#define IDR0_ASID16 (1 << 12)
+#define IDR0_ATS (1 << 10)
+#define IDR0_HYP (1 << 9)
+#define IDR0_HTTU GENMASK(7, 6)
+#define IDR0_HTTU_ACCESS 1
+#define IDR0_HTTU_ACCESS_DIRTY 2
+#define IDR0_COHACC (1 << 4)
+#define IDR0_TTF GENMASK(3, 2)
+#define IDR0_TTF_AARCH64 2
+#define IDR0_TTF_AARCH32_64 3
+#define IDR0_S1P (1 << 1)
+#define IDR0_S2P (1 << 0)
+
+#define ARM_SMMU_IDR1 0x4
+#define IDR1_TABLES_PRESET (1 << 30)
+#define IDR1_QUEUES_PRESET (1 << 29)
+#define IDR1_REL (1 << 28)
+#define IDR1_ATTR_TYPES_OVR (1 << 27)
+#define IDR1_CMDQS GENMASK(25, 21)
+#define IDR1_EVTQS GENMASK(20, 16)
+#define IDR1_PRIQS GENMASK(15, 11)
+#define IDR1_SSIDSIZE GENMASK(10, 6)
+#define IDR1_SIDSIZE GENMASK(5, 0)
+
+#define ARM_SMMU_IDR3 0xc
+#define IDR3_RIL (1 << 10)
+
+#define ARM_SMMU_IDR5 0x14
+#define IDR5_STALL_MAX GENMASK(31, 16)
+#define IDR5_GRAN64K (1 << 6)
+#define IDR5_GRAN16K (1 << 5)
+#define IDR5_GRAN4K (1 << 4)
+#define IDR5_OAS GENMASK(2, 0)
+#define IDR5_OAS_32_BIT 0
+#define IDR5_OAS_36_BIT 1
+#define IDR5_OAS_40_BIT 2
+#define IDR5_OAS_42_BIT 3
+#define IDR5_OAS_44_BIT 4
+#define IDR5_OAS_48_BIT 5
+#define IDR5_OAS_52_BIT 6
+#define IDR5_VAX GENMASK(11, 10)
+#define IDR5_VAX_52_BIT 1
+
+#define ARM_SMMU_IIDR 0x18
+#define IIDR_PRODUCTID GENMASK(31, 20)
+#define IIDR_VARIANT GENMASK(19, 16)
+#define IIDR_REVISION GENMASK(15, 12)
+#define IIDR_IMPLEMENTER GENMASK(11, 0)
+
+#define ARM_SMMU_CR0 0x20
+#define CR0_ATSCHK (1 << 4)
+#define CR0_CMDQEN (1 << 3)
+#define CR0_EVTQEN (1 << 2)
+#define CR0_PRIQEN (1 << 1)
+#define CR0_SMMUEN (1 << 0)
+
+#define ARM_SMMU_CR0ACK 0x24
+
+#define ARM_SMMU_CR1 0x28
+#define CR1_TABLE_SH GENMASK(11, 10)
+#define CR1_TABLE_OC GENMASK(9, 8)
+#define CR1_TABLE_IC GENMASK(7, 6)
+#define CR1_QUEUE_SH GENMASK(5, 4)
+#define CR1_QUEUE_OC GENMASK(3, 2)
+#define CR1_QUEUE_IC GENMASK(1, 0)
+/* CR1 cacheability fields don't quite follow the usual TCR-style encoding */
+#define CR1_CACHE_NC 0
+#define CR1_CACHE_WB 1
+#define CR1_CACHE_WT 2
+
+#define ARM_SMMU_CR2 0x2c
+#define CR2_PTM (1 << 2)
+#define CR2_RECINVSID (1 << 1)
+#define CR2_E2H (1 << 0)
+
+#define ARM_SMMU_GBPA 0x44
+#define GBPA_UPDATE (1 << 31)
+#define GBPA_ABORT (1 << 20)
+
+#define ARM_SMMU_IRQ_CTRL 0x50
+#define IRQ_CTRL_EVTQ_IRQEN (1 << 2)
+#define IRQ_CTRL_PRIQ_IRQEN (1 << 1)
+#define IRQ_CTRL_GERROR_IRQEN (1 << 0)
+
+#define ARM_SMMU_IRQ_CTRLACK 0x54
+
+#define ARM_SMMU_GERROR 0x60
+#define GERROR_SFM_ERR (1 << 8)
+#define GERROR_MSI_GERROR_ABT_ERR (1 << 7)
+#define GERROR_MSI_PRIQ_ABT_ERR (1 << 6)
+#define GERROR_MSI_EVTQ_ABT_ERR (1 << 5)
+#define GERROR_MSI_CMDQ_ABT_ERR (1 << 4)
+#define GERROR_PRIQ_ABT_ERR (1 << 3)
+#define GERROR_EVTQ_ABT_ERR (1 << 2)
+#define GERROR_CMDQ_ERR (1 << 0)
+#define GERROR_ERR_MASK 0x1fd
+
+#define ARM_SMMU_GERRORN 0x64
+
+#define ARM_SMMU_GERROR_IRQ_CFG0 0x68
+#define ARM_SMMU_GERROR_IRQ_CFG1 0x70
+#define ARM_SMMU_GERROR_IRQ_CFG2 0x74
+
+#define ARM_SMMU_STRTAB_BASE 0x80
+#define STRTAB_BASE_RA (1UL << 62)
+#define STRTAB_BASE_ADDR_MASK GENMASK_ULL(51, 6)
+
+#define ARM_SMMU_STRTAB_BASE_CFG 0x88
+#define STRTAB_BASE_CFG_FMT GENMASK(17, 16)
+#define STRTAB_BASE_CFG_FMT_LINEAR 0
+#define STRTAB_BASE_CFG_FMT_2LVL 1
+#define STRTAB_BASE_CFG_SPLIT GENMASK(10, 6)
+#define STRTAB_BASE_CFG_LOG2SIZE GENMASK(5, 0)
+
+#define ARM_SMMU_CMDQ_BASE 0x90
+#define ARM_SMMU_CMDQ_PROD 0x98
+#define ARM_SMMU_CMDQ_CONS 0x9c
+
+#define ARM_SMMU_EVTQ_BASE 0xa0
+#define ARM_SMMU_EVTQ_PROD 0xa8
+#define ARM_SMMU_EVTQ_CONS 0xac
+#define ARM_SMMU_EVTQ_IRQ_CFG0 0xb0
+#define ARM_SMMU_EVTQ_IRQ_CFG1 0xb8
+#define ARM_SMMU_EVTQ_IRQ_CFG2 0xbc
+
+#define ARM_SMMU_PRIQ_BASE 0xc0
+#define ARM_SMMU_PRIQ_PROD 0xc8
+#define ARM_SMMU_PRIQ_CONS 0xcc
+#define ARM_SMMU_PRIQ_IRQ_CFG0 0xd0
+#define ARM_SMMU_PRIQ_IRQ_CFG1 0xd8
+#define ARM_SMMU_PRIQ_IRQ_CFG2 0xdc
+
+#define ARM_SMMU_REG_SZ 0xe00
+
+/* Common MSI config fields */
+#define MSI_CFG0_ADDR_MASK GENMASK_ULL(51, 2)
+#define MSI_CFG2_SH GENMASK(5, 4)
+#define MSI_CFG2_MEMATTR GENMASK(3, 0)
+
+/* Common memory attribute values */
+#define ARM_SMMU_SH_NSH 0
+#define ARM_SMMU_SH_OSH 2
+#define ARM_SMMU_SH_ISH 3
+#define ARM_SMMU_MEMATTR_DEVICE_nGnRE 0x1
+#define ARM_SMMU_MEMATTR_OIWB 0xf
+
+#define Q_BASE_RWA (1UL << 62)
+#define Q_BASE_ADDR_MASK GENMASK_ULL(51, 5)
+#define Q_BASE_LOG2SIZE GENMASK(4, 0)
+
+/*
+ * Stream table.
+ *
+ * Linear: Enough to cover 1 << IDR1.SIDSIZE entries
+ * 2lvl: 128k L1 entries,
+ * 256 lazy entries per table (each table covers a PCI bus)
+ */
+#define STRTAB_SPLIT 8
+
+#define STRTAB_L1_DESC_SPAN GENMASK_ULL(4, 0)
+#define STRTAB_L1_DESC_L2PTR_MASK GENMASK_ULL(51, 6)
+
+#define STRTAB_STE_DWORDS 8
+
+struct arm_smmu_ste {
+ __le64 data[STRTAB_STE_DWORDS];
+};
+
+#define STRTAB_NUM_L2_STES (1 << STRTAB_SPLIT)
+struct arm_smmu_strtab_l2 {
+ struct arm_smmu_ste stes[STRTAB_NUM_L2_STES];
+};
+
+struct arm_smmu_strtab_l1 {
+ __le64 l2ptr;
+};
+#define STRTAB_MAX_L1_ENTRIES (1 << 17)
+
+static inline u32 arm_smmu_strtab_l1_idx(u32 sid)
+{
+ return sid / STRTAB_NUM_L2_STES;
+}
+
+static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
+{
+ return sid % STRTAB_NUM_L2_STES;
+}
+
+#define STRTAB_STE_0_V (1UL << 0)
+#define STRTAB_STE_0_CFG GENMASK_ULL(3, 1)
+#define STRTAB_STE_0_CFG_ABORT 0
+#define STRTAB_STE_0_CFG_BYPASS 4
+#define STRTAB_STE_0_CFG_S1_TRANS 5
+#define STRTAB_STE_0_CFG_S2_TRANS 6
+
+#define STRTAB_STE_0_S1FMT GENMASK_ULL(5, 4)
+#define STRTAB_STE_0_S1FMT_LINEAR 0
+#define STRTAB_STE_0_S1FMT_64K_L2 2
+#define STRTAB_STE_0_S1CTXPTR_MASK GENMASK_ULL(51, 6)
+#define STRTAB_STE_0_S1CDMAX GENMASK_ULL(63, 59)
+
+#define STRTAB_STE_1_S1DSS GENMASK_ULL(1, 0)
+#define STRTAB_STE_1_S1DSS_TERMINATE 0x0
+#define STRTAB_STE_1_S1DSS_BYPASS 0x1
+#define STRTAB_STE_1_S1DSS_SSID0 0x2
+
+#define STRTAB_STE_1_S1C_CACHE_NC 0UL
+#define STRTAB_STE_1_S1C_CACHE_WBRA 1UL
+#define STRTAB_STE_1_S1C_CACHE_WT 2UL
+#define STRTAB_STE_1_S1C_CACHE_WB 3UL
+#define STRTAB_STE_1_S1CIR GENMASK_ULL(3, 2)
+#define STRTAB_STE_1_S1COR GENMASK_ULL(5, 4)
+#define STRTAB_STE_1_S1CSH GENMASK_ULL(7, 6)
+
+#define STRTAB_STE_1_S1STALLD (1UL << 27)
+
+#define STRTAB_STE_1_EATS GENMASK_ULL(29, 28)
+#define STRTAB_STE_1_EATS_ABT 0UL
+#define STRTAB_STE_1_EATS_TRANS 1UL
+#define STRTAB_STE_1_EATS_S1CHK 2UL
+
+#define STRTAB_STE_1_STRW GENMASK_ULL(31, 30)
+#define STRTAB_STE_1_STRW_NSEL1 0UL
+#define STRTAB_STE_1_STRW_EL2 2UL
+
+#define STRTAB_STE_1_SHCFG GENMASK_ULL(45, 44)
+#define STRTAB_STE_1_SHCFG_INCOMING 1UL
+
+#define STRTAB_STE_2_S2VMID GENMASK_ULL(15, 0)
+#define STRTAB_STE_2_VTCR GENMASK_ULL(50, 32)
+#define STRTAB_STE_2_VTCR_S2T0SZ GENMASK_ULL(5, 0)
+#define STRTAB_STE_2_VTCR_S2SL0 GENMASK_ULL(7, 6)
+#define STRTAB_STE_2_VTCR_S2IR0 GENMASK_ULL(9, 8)
+#define STRTAB_STE_2_VTCR_S2OR0 GENMASK_ULL(11, 10)
+#define STRTAB_STE_2_VTCR_S2SH0 GENMASK_ULL(13, 12)
+#define STRTAB_STE_2_VTCR_S2TG GENMASK_ULL(15, 14)
+#define STRTAB_STE_2_VTCR_S2PS GENMASK_ULL(18, 16)
+#define STRTAB_STE_2_S2AA64 (1UL << 51)
+#define STRTAB_STE_2_S2ENDI (1UL << 52)
+#define STRTAB_STE_2_S2PTW (1UL << 54)
+#define STRTAB_STE_2_S2S (1UL << 57)
+#define STRTAB_STE_2_S2R (1UL << 58)
+
+#define STRTAB_STE_3_S2TTB_MASK GENMASK_ULL(51, 4)
+
+/*
+ * Context descriptors.
+ *
+ * Linear: when less than 1024 SSIDs are supported
+ * 2lvl: at most 1024 L1 entries,
+ * 1024 lazy entries per table.
+ */
+#define CTXDESC_L2_ENTRIES 1024
+
+#define CTXDESC_L1_DESC_V (1UL << 0)
+#define CTXDESC_L1_DESC_L2PTR_MASK GENMASK_ULL(51, 12)
+
+#define CTXDESC_CD_DWORDS 8
+
+struct arm_smmu_cd {
+ __le64 data[CTXDESC_CD_DWORDS];
+};
+
+struct arm_smmu_cdtab_l2 {
+ struct arm_smmu_cd cds[CTXDESC_L2_ENTRIES];
+};
+
+struct arm_smmu_cdtab_l1 {
+ __le64 l2ptr;
+};
+
+static inline unsigned int arm_smmu_cdtab_l1_idx(unsigned int ssid)
+{
+ return ssid / CTXDESC_L2_ENTRIES;
+}
+
+static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
+{
+ return ssid % CTXDESC_L2_ENTRIES;
+}
+
+#define CTXDESC_CD_0_TCR_T0SZ GENMASK_ULL(5, 0)
+#define CTXDESC_CD_0_TCR_TG0 GENMASK_ULL(7, 6)
+#define CTXDESC_CD_0_TCR_IRGN0 GENMASK_ULL(9, 8)
+#define CTXDESC_CD_0_TCR_ORGN0 GENMASK_ULL(11, 10)
+#define CTXDESC_CD_0_TCR_SH0 GENMASK_ULL(13, 12)
+#define CTXDESC_CD_0_TCR_EPD0 (1ULL << 14)
+#define CTXDESC_CD_0_TCR_EPD1 (1ULL << 30)
+
+#define CTXDESC_CD_0_ENDI (1UL << 15)
+#define CTXDESC_CD_0_V (1UL << 31)
+
+#define CTXDESC_CD_0_TCR_IPS GENMASK_ULL(34, 32)
+#define CTXDESC_CD_0_TCR_TBI0 (1ULL << 38)
+
+#define CTXDESC_CD_0_TCR_HA (1UL << 43)
+#define CTXDESC_CD_0_TCR_HD (1UL << 42)
+
+#define CTXDESC_CD_0_AA64 (1UL << 41)
+#define CTXDESC_CD_0_S (1UL << 44)
+#define CTXDESC_CD_0_R (1UL << 45)
+#define CTXDESC_CD_0_A (1UL << 46)
+#define CTXDESC_CD_0_ASET (1UL << 47)
+#define CTXDESC_CD_0_ASID GENMASK_ULL(63, 48)
+
+#define CTXDESC_CD_1_TTB0_MASK GENMASK_ULL(51, 4)
+
+/*
+ * When the SMMU only supports linear context descriptor tables, pick a
+ * reasonable size limit (64kB).
+ */
+#define CTXDESC_LINEAR_CDMAX ilog2(SZ_64K / sizeof(struct arm_smmu_cd))
+
+/* Command queue */
+#define CMDQ_ENT_SZ_SHIFT 4
+#define CMDQ_ENT_DWORDS ((1 << CMDQ_ENT_SZ_SHIFT) >> 3)
+#define CMDQ_MAX_SZ_SHIFT (Q_MAX_SZ_SHIFT - CMDQ_ENT_SZ_SHIFT)
+
+#define CMDQ_CONS_ERR GENMASK(30, 24)
+#define CMDQ_ERR_CERROR_NONE_IDX 0
+#define CMDQ_ERR_CERROR_ILL_IDX 1
+#define CMDQ_ERR_CERROR_ABT_IDX 2
+#define CMDQ_ERR_CERROR_ATC_INV_IDX 3
+
+#define CMDQ_0_OP GENMASK_ULL(7, 0)
+#define CMDQ_0_SSV (1UL << 11)
+
+#define CMDQ_PREFETCH_0_SID GENMASK_ULL(63, 32)
+#define CMDQ_PREFETCH_1_SIZE GENMASK_ULL(4, 0)
+#define CMDQ_PREFETCH_1_ADDR_MASK GENMASK_ULL(63, 12)
+
+#define CMDQ_CFGI_0_SSID GENMASK_ULL(31, 12)
+#define CMDQ_CFGI_0_SID GENMASK_ULL(63, 32)
+#define CMDQ_CFGI_1_LEAF (1UL << 0)
+#define CMDQ_CFGI_1_RANGE GENMASK_ULL(4, 0)
+
+#define CMDQ_TLBI_0_NUM GENMASK_ULL(16, 12)
+#define CMDQ_TLBI_RANGE_NUM_MAX 31
+#define CMDQ_TLBI_0_SCALE GENMASK_ULL(24, 20)
+#define CMDQ_TLBI_0_VMID GENMASK_ULL(47, 32)
+#define CMDQ_TLBI_0_ASID GENMASK_ULL(63, 48)
+#define CMDQ_TLBI_1_LEAF (1UL << 0)
+#define CMDQ_TLBI_1_TTL GENMASK_ULL(9, 8)
+#define CMDQ_TLBI_1_TG GENMASK_ULL(11, 10)
+#define CMDQ_TLBI_1_VA_MASK GENMASK_ULL(63, 12)
+#define CMDQ_TLBI_1_IPA_MASK GENMASK_ULL(51, 12)
+
+#define CMDQ_ATC_0_SSID GENMASK_ULL(31, 12)
+#define CMDQ_ATC_0_SID GENMASK_ULL(63, 32)
+#define CMDQ_ATC_0_GLOBAL (1UL << 9)
+#define CMDQ_ATC_1_SIZE GENMASK_ULL(5, 0)
+#define CMDQ_ATC_1_ADDR_MASK GENMASK_ULL(63, 12)
+
+#define CMDQ_PRI_0_SSID GENMASK_ULL(31, 12)
+#define CMDQ_PRI_0_SID GENMASK_ULL(63, 32)
+#define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0)
+#define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12)
+
+#define CMDQ_RESUME_0_RESP_TERM 0UL
+#define CMDQ_RESUME_0_RESP_RETRY 1UL
+#define CMDQ_RESUME_0_RESP_ABORT 2UL
+#define CMDQ_RESUME_0_RESP GENMASK_ULL(13, 12)
+#define CMDQ_RESUME_0_SID GENMASK_ULL(63, 32)
+#define CMDQ_RESUME_1_STAG GENMASK_ULL(15, 0)
+
+#define CMDQ_SYNC_0_CS GENMASK_ULL(13, 12)
+#define CMDQ_SYNC_0_CS_NONE 0
+#define CMDQ_SYNC_0_CS_IRQ 1
+#define CMDQ_SYNC_0_CS_SEV 2
+#define CMDQ_SYNC_0_MSH GENMASK_ULL(23, 22)
+#define CMDQ_SYNC_0_MSIATTR GENMASK_ULL(27, 24)
+#define CMDQ_SYNC_0_MSIDATA GENMASK_ULL(63, 32)
+#define CMDQ_SYNC_1_MSIADDR_MASK GENMASK_ULL(51, 2)
+
+/* Event queue */
+#define EVTQ_ENT_SZ_SHIFT 5
+#define EVTQ_ENT_DWORDS ((1 << EVTQ_ENT_SZ_SHIFT) >> 3)
+#define EVTQ_MAX_SZ_SHIFT (Q_MAX_SZ_SHIFT - EVTQ_ENT_SZ_SHIFT)
+
+#define EVTQ_0_ID GENMASK_ULL(7, 0)
+
+#define EVT_ID_TRANSLATION_FAULT 0x10
+#define EVT_ID_ADDR_SIZE_FAULT 0x11
+#define EVT_ID_ACCESS_FAULT 0x12
+#define EVT_ID_PERMISSION_FAULT 0x13
+
+#define EVTQ_0_SSV (1UL << 11)
+#define EVTQ_0_SSID GENMASK_ULL(31, 12)
+#define EVTQ_0_SID GENMASK_ULL(63, 32)
+#define EVTQ_1_STAG GENMASK_ULL(15, 0)
+#define EVTQ_1_STALL (1UL << 31)
+#define EVTQ_1_PnU (1UL << 33)
+#define EVTQ_1_InD (1UL << 34)
+#define EVTQ_1_RnW (1UL << 35)
+#define EVTQ_1_S2 (1UL << 39)
+#define EVTQ_1_CLASS GENMASK_ULL(41, 40)
+#define EVTQ_1_TT_READ (1UL << 44)
+#define EVTQ_2_ADDR GENMASK_ULL(63, 0)
+#define EVTQ_3_IPA GENMASK_ULL(51, 12)
+
+/* PRI queue */
+#define PRIQ_ENT_SZ_SHIFT 4
+#define PRIQ_ENT_DWORDS ((1 << PRIQ_ENT_SZ_SHIFT) >> 3)
+#define PRIQ_MAX_SZ_SHIFT (Q_MAX_SZ_SHIFT - PRIQ_ENT_SZ_SHIFT)
+
+#define PRIQ_0_SID GENMASK_ULL(31, 0)
+#define PRIQ_0_SSID GENMASK_ULL(51, 32)
+#define PRIQ_0_PERM_PRIV (1UL << 58)
+#define PRIQ_0_PERM_EXEC (1UL << 59)
+#define PRIQ_0_PERM_READ (1UL << 60)
+#define PRIQ_0_PERM_WRITE (1UL << 61)
+#define PRIQ_0_PRG_LAST (1UL << 62)
+#define PRIQ_0_SSID_V (1UL << 63)
+
+#define PRIQ_1_PRG_IDX GENMASK_ULL(8, 0)
+#define PRIQ_1_ADDR_MASK GENMASK_ULL(63, 12)
+
+/* Synthesized features */
+#define ARM_SMMU_FEAT_2_LVL_STRTAB (1 << 0)
+#define ARM_SMMU_FEAT_2_LVL_CDTAB (1 << 1)
+#define ARM_SMMU_FEAT_TT_LE (1 << 2)
+#define ARM_SMMU_FEAT_TT_BE (1 << 3)
+#define ARM_SMMU_FEAT_PRI (1 << 4)
+#define ARM_SMMU_FEAT_ATS (1 << 5)
+#define ARM_SMMU_FEAT_SEV (1 << 6)
+#define ARM_SMMU_FEAT_MSI (1 << 7)
+#define ARM_SMMU_FEAT_COHERENCY (1 << 8)
+#define ARM_SMMU_FEAT_TRANS_S1 (1 << 9)
+#define ARM_SMMU_FEAT_TRANS_S2 (1 << 10)
+#define ARM_SMMU_FEAT_STALLS (1 << 11)
+#define ARM_SMMU_FEAT_HYP (1 << 12)
+#define ARM_SMMU_FEAT_STALL_FORCE (1 << 13)
+#define ARM_SMMU_FEAT_VAX (1 << 14)
+#define ARM_SMMU_FEAT_RANGE_INV (1 << 15)
+#define ARM_SMMU_FEAT_BTM (1 << 16)
+#define ARM_SMMU_FEAT_SVA (1 << 17)
+#define ARM_SMMU_FEAT_E2H (1 << 18)
+#define ARM_SMMU_FEAT_NESTING (1 << 19)
+#define ARM_SMMU_FEAT_ATTR_TYPES_OVR (1 << 20)
+#define ARM_SMMU_FEAT_HA (1 << 21)
+#define ARM_SMMU_FEAT_HD (1 << 22)
+
+enum pri_resp {
+ PRI_RESP_DENY = 0,
+ PRI_RESP_FAIL = 1,
+ PRI_RESP_SUCC = 2,
+};
+
+struct arm_smmu_cmdq_ent {
+ /* Common fields */
+ u8 opcode;
+ bool substream_valid;
+
+ /* Command-specific fields */
+ union {
+ #define CMDQ_OP_PREFETCH_CFG 0x1
+ struct {
+ u32 sid;
+ } prefetch;
+
+ #define CMDQ_OP_CFGI_STE 0x3
+ #define CMDQ_OP_CFGI_ALL 0x4
+ #define CMDQ_OP_CFGI_CD 0x5
+ #define CMDQ_OP_CFGI_CD_ALL 0x6
+ struct {
+ u32 sid;
+ u32 ssid;
+ union {
+ bool leaf;
+ u8 span;
+ };
+ } cfgi;
+
+ #define CMDQ_OP_TLBI_NH_ASID 0x11
+ #define CMDQ_OP_TLBI_NH_VA 0x12
+ #define CMDQ_OP_TLBI_EL2_ALL 0x20
+ #define CMDQ_OP_TLBI_EL2_ASID 0x21
+ #define CMDQ_OP_TLBI_EL2_VA 0x22
+ #define CMDQ_OP_TLBI_S12_VMALL 0x28
+ #define CMDQ_OP_TLBI_S2_IPA 0x2a
+ #define CMDQ_OP_TLBI_NSNH_ALL 0x30
+ struct {
+ u8 num;
+ u8 scale;
+ u16 asid;
+ u16 vmid;
+ bool leaf;
+ u8 ttl;
+ u8 tg;
+ u64 addr;
+ } tlbi;
+
+ #define CMDQ_OP_ATC_INV 0x40
+ #define ATC_INV_SIZE_ALL 52
+ struct {
+ u32 sid;
+ u32 ssid;
+ u64 addr;
+ u8 size;
+ bool global;
+ } atc;
+
+ #define CMDQ_OP_PRI_RESP 0x41
+ struct {
+ u32 sid;
+ u32 ssid;
+ u16 grpid;
+ enum pri_resp resp;
+ } pri;
+
+ #define CMDQ_OP_RESUME 0x44
+ struct {
+ u32 sid;
+ u16 stag;
+ u8 resp;
+ } resume;
+
+ #define CMDQ_OP_CMD_SYNC 0x46
+ struct {
+ u64 msiaddr;
+ } sync;
+ };
+};
+
+#endif /* _ARM_SMMU_V3_COMMON_H */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1e9952ca989f..fc1b8c2af2a2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -8,7 +8,6 @@
#ifndef _ARM_SMMU_V3_H
#define _ARM_SMMU_V3_H
-#include <linux/bitfield.h>
#include <linux/iommu.h>
#include <linux/kernel.h>
#include <linux/mmzone.h>
@@ -16,167 +15,7 @@
struct arm_smmu_device;
-/* MMIO registers */
-#define ARM_SMMU_IDR0 0x0
-#define IDR0_ST_LVL GENMASK(28, 27)
-#define IDR0_ST_LVL_2LVL 1
-#define IDR0_STALL_MODEL GENMASK(25, 24)
-#define IDR0_STALL_MODEL_STALL 0
-#define IDR0_STALL_MODEL_FORCE 2
-#define IDR0_TTENDIAN GENMASK(22, 21)
-#define IDR0_TTENDIAN_MIXED 0
-#define IDR0_TTENDIAN_LE 2
-#define IDR0_TTENDIAN_BE 3
-#define IDR0_CD2L (1 << 19)
-#define IDR0_VMID16 (1 << 18)
-#define IDR0_PRI (1 << 16)
-#define IDR0_SEV (1 << 14)
-#define IDR0_MSI (1 << 13)
-#define IDR0_ASID16 (1 << 12)
-#define IDR0_ATS (1 << 10)
-#define IDR0_HYP (1 << 9)
-#define IDR0_HTTU GENMASK(7, 6)
-#define IDR0_HTTU_ACCESS 1
-#define IDR0_HTTU_ACCESS_DIRTY 2
-#define IDR0_COHACC (1 << 4)
-#define IDR0_TTF GENMASK(3, 2)
-#define IDR0_TTF_AARCH64 2
-#define IDR0_TTF_AARCH32_64 3
-#define IDR0_S1P (1 << 1)
-#define IDR0_S2P (1 << 0)
-
-#define ARM_SMMU_IDR1 0x4
-#define IDR1_TABLES_PRESET (1 << 30)
-#define IDR1_QUEUES_PRESET (1 << 29)
-#define IDR1_REL (1 << 28)
-#define IDR1_ATTR_TYPES_OVR (1 << 27)
-#define IDR1_CMDQS GENMASK(25, 21)
-#define IDR1_EVTQS GENMASK(20, 16)
-#define IDR1_PRIQS GENMASK(15, 11)
-#define IDR1_SSIDSIZE GENMASK(10, 6)
-#define IDR1_SIDSIZE GENMASK(5, 0)
-
-#define ARM_SMMU_IDR3 0xc
-#define IDR3_RIL (1 << 10)
-
-#define ARM_SMMU_IDR5 0x14
-#define IDR5_STALL_MAX GENMASK(31, 16)
-#define IDR5_GRAN64K (1 << 6)
-#define IDR5_GRAN16K (1 << 5)
-#define IDR5_GRAN4K (1 << 4)
-#define IDR5_OAS GENMASK(2, 0)
-#define IDR5_OAS_32_BIT 0
-#define IDR5_OAS_36_BIT 1
-#define IDR5_OAS_40_BIT 2
-#define IDR5_OAS_42_BIT 3
-#define IDR5_OAS_44_BIT 4
-#define IDR5_OAS_48_BIT 5
-#define IDR5_OAS_52_BIT 6
-#define IDR5_VAX GENMASK(11, 10)
-#define IDR5_VAX_52_BIT 1
-
-#define ARM_SMMU_IIDR 0x18
-#define IIDR_PRODUCTID GENMASK(31, 20)
-#define IIDR_VARIANT GENMASK(19, 16)
-#define IIDR_REVISION GENMASK(15, 12)
-#define IIDR_IMPLEMENTER GENMASK(11, 0)
-
-#define ARM_SMMU_CR0 0x20
-#define CR0_ATSCHK (1 << 4)
-#define CR0_CMDQEN (1 << 3)
-#define CR0_EVTQEN (1 << 2)
-#define CR0_PRIQEN (1 << 1)
-#define CR0_SMMUEN (1 << 0)
-
-#define ARM_SMMU_CR0ACK 0x24
-
-#define ARM_SMMU_CR1 0x28
-#define CR1_TABLE_SH GENMASK(11, 10)
-#define CR1_TABLE_OC GENMASK(9, 8)
-#define CR1_TABLE_IC GENMASK(7, 6)
-#define CR1_QUEUE_SH GENMASK(5, 4)
-#define CR1_QUEUE_OC GENMASK(3, 2)
-#define CR1_QUEUE_IC GENMASK(1, 0)
-/* CR1 cacheability fields don't quite follow the usual TCR-style encoding */
-#define CR1_CACHE_NC 0
-#define CR1_CACHE_WB 1
-#define CR1_CACHE_WT 2
-
-#define ARM_SMMU_CR2 0x2c
-#define CR2_PTM (1 << 2)
-#define CR2_RECINVSID (1 << 1)
-#define CR2_E2H (1 << 0)
-
-#define ARM_SMMU_GBPA 0x44
-#define GBPA_UPDATE (1 << 31)
-#define GBPA_ABORT (1 << 20)
-
-#define ARM_SMMU_IRQ_CTRL 0x50
-#define IRQ_CTRL_EVTQ_IRQEN (1 << 2)
-#define IRQ_CTRL_PRIQ_IRQEN (1 << 1)
-#define IRQ_CTRL_GERROR_IRQEN (1 << 0)
-
-#define ARM_SMMU_IRQ_CTRLACK 0x54
-
-#define ARM_SMMU_GERROR 0x60
-#define GERROR_SFM_ERR (1 << 8)
-#define GERROR_MSI_GERROR_ABT_ERR (1 << 7)
-#define GERROR_MSI_PRIQ_ABT_ERR (1 << 6)
-#define GERROR_MSI_EVTQ_ABT_ERR (1 << 5)
-#define GERROR_MSI_CMDQ_ABT_ERR (1 << 4)
-#define GERROR_PRIQ_ABT_ERR (1 << 3)
-#define GERROR_EVTQ_ABT_ERR (1 << 2)
-#define GERROR_CMDQ_ERR (1 << 0)
-#define GERROR_ERR_MASK 0x1fd
-
-#define ARM_SMMU_GERRORN 0x64
-
-#define ARM_SMMU_GERROR_IRQ_CFG0 0x68
-#define ARM_SMMU_GERROR_IRQ_CFG1 0x70
-#define ARM_SMMU_GERROR_IRQ_CFG2 0x74
-
-#define ARM_SMMU_STRTAB_BASE 0x80
-#define STRTAB_BASE_RA (1UL << 62)
-#define STRTAB_BASE_ADDR_MASK GENMASK_ULL(51, 6)
-
-#define ARM_SMMU_STRTAB_BASE_CFG 0x88
-#define STRTAB_BASE_CFG_FMT GENMASK(17, 16)
-#define STRTAB_BASE_CFG_FMT_LINEAR 0
-#define STRTAB_BASE_CFG_FMT_2LVL 1
-#define STRTAB_BASE_CFG_SPLIT GENMASK(10, 6)
-#define STRTAB_BASE_CFG_LOG2SIZE GENMASK(5, 0)
-
-#define ARM_SMMU_CMDQ_BASE 0x90
-#define ARM_SMMU_CMDQ_PROD 0x98
-#define ARM_SMMU_CMDQ_CONS 0x9c
-
-#define ARM_SMMU_EVTQ_BASE 0xa0
-#define ARM_SMMU_EVTQ_PROD 0xa8
-#define ARM_SMMU_EVTQ_CONS 0xac
-#define ARM_SMMU_EVTQ_IRQ_CFG0 0xb0
-#define ARM_SMMU_EVTQ_IRQ_CFG1 0xb8
-#define ARM_SMMU_EVTQ_IRQ_CFG2 0xbc
-
-#define ARM_SMMU_PRIQ_BASE 0xc0
-#define ARM_SMMU_PRIQ_PROD 0xc8
-#define ARM_SMMU_PRIQ_CONS 0xcc
-#define ARM_SMMU_PRIQ_IRQ_CFG0 0xd0
-#define ARM_SMMU_PRIQ_IRQ_CFG1 0xd8
-#define ARM_SMMU_PRIQ_IRQ_CFG2 0xdc
-
-#define ARM_SMMU_REG_SZ 0xe00
-
-/* Common MSI config fields */
-#define MSI_CFG0_ADDR_MASK GENMASK_ULL(51, 2)
-#define MSI_CFG2_SH GENMASK(5, 4)
-#define MSI_CFG2_MEMATTR GENMASK(3, 0)
-
-/* Common memory attribute values */
-#define ARM_SMMU_SH_NSH 0
-#define ARM_SMMU_SH_OSH 2
-#define ARM_SMMU_SH_ISH 3
-#define ARM_SMMU_MEMATTR_DEVICE_nGnRE 0x1
-#define ARM_SMMU_MEMATTR_OIWB 0xf
+#include <asm/arm-smmu-v3-common.h>
#define Q_IDX(llq, p) ((p) & ((1 << (llq)->max_n_shift) - 1))
#define Q_WRP(llq, p) ((p) & (1 << (llq)->max_n_shift))
@@ -186,10 +25,6 @@ struct arm_smmu_device;
Q_IDX(&((q)->llq), p) * \
(q)->ent_dwords)
-#define Q_BASE_RWA (1UL << 62)
-#define Q_BASE_ADDR_MASK GENMASK_ULL(51, 5)
-#define Q_BASE_LOG2SIZE GENMASK(4, 0)
-
/* Ensure DMA allocations are naturally aligned */
#ifdef CONFIG_CMA_ALIGNMENT
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
@@ -197,180 +32,6 @@ struct arm_smmu_device;
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_PAGE_ORDER)
#endif
-/*
- * Stream table.
- *
- * Linear: Enough to cover 1 << IDR1.SIDSIZE entries
- * 2lvl: 128k L1 entries,
- * 256 lazy entries per table (each table covers a PCI bus)
- */
-#define STRTAB_SPLIT 8
-
-#define STRTAB_L1_DESC_SPAN GENMASK_ULL(4, 0)
-#define STRTAB_L1_DESC_L2PTR_MASK GENMASK_ULL(51, 6)
-
-#define STRTAB_STE_DWORDS 8
-
-struct arm_smmu_ste {
- __le64 data[STRTAB_STE_DWORDS];
-};
-
-#define STRTAB_NUM_L2_STES (1 << STRTAB_SPLIT)
-struct arm_smmu_strtab_l2 {
- struct arm_smmu_ste stes[STRTAB_NUM_L2_STES];
-};
-
-struct arm_smmu_strtab_l1 {
- __le64 l2ptr;
-};
-#define STRTAB_MAX_L1_ENTRIES (1 << 17)
-
-static inline u32 arm_smmu_strtab_l1_idx(u32 sid)
-{
- return sid / STRTAB_NUM_L2_STES;
-}
-
-static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
-{
- return sid % STRTAB_NUM_L2_STES;
-}
-
-#define STRTAB_STE_0_V (1UL << 0)
-#define STRTAB_STE_0_CFG GENMASK_ULL(3, 1)
-#define STRTAB_STE_0_CFG_ABORT 0
-#define STRTAB_STE_0_CFG_BYPASS 4
-#define STRTAB_STE_0_CFG_S1_TRANS 5
-#define STRTAB_STE_0_CFG_S2_TRANS 6
-
-#define STRTAB_STE_0_S1FMT GENMASK_ULL(5, 4)
-#define STRTAB_STE_0_S1FMT_LINEAR 0
-#define STRTAB_STE_0_S1FMT_64K_L2 2
-#define STRTAB_STE_0_S1CTXPTR_MASK GENMASK_ULL(51, 6)
-#define STRTAB_STE_0_S1CDMAX GENMASK_ULL(63, 59)
-
-#define STRTAB_STE_1_S1DSS GENMASK_ULL(1, 0)
-#define STRTAB_STE_1_S1DSS_TERMINATE 0x0
-#define STRTAB_STE_1_S1DSS_BYPASS 0x1
-#define STRTAB_STE_1_S1DSS_SSID0 0x2
-
-#define STRTAB_STE_1_S1C_CACHE_NC 0UL
-#define STRTAB_STE_1_S1C_CACHE_WBRA 1UL
-#define STRTAB_STE_1_S1C_CACHE_WT 2UL
-#define STRTAB_STE_1_S1C_CACHE_WB 3UL
-#define STRTAB_STE_1_S1CIR GENMASK_ULL(3, 2)
-#define STRTAB_STE_1_S1COR GENMASK_ULL(5, 4)
-#define STRTAB_STE_1_S1CSH GENMASK_ULL(7, 6)
-
-#define STRTAB_STE_1_S1STALLD (1UL << 27)
-
-#define STRTAB_STE_1_EATS GENMASK_ULL(29, 28)
-#define STRTAB_STE_1_EATS_ABT 0UL
-#define STRTAB_STE_1_EATS_TRANS 1UL
-#define STRTAB_STE_1_EATS_S1CHK 2UL
-
-#define STRTAB_STE_1_STRW GENMASK_ULL(31, 30)
-#define STRTAB_STE_1_STRW_NSEL1 0UL
-#define STRTAB_STE_1_STRW_EL2 2UL
-
-#define STRTAB_STE_1_SHCFG GENMASK_ULL(45, 44)
-#define STRTAB_STE_1_SHCFG_INCOMING 1UL
-
-#define STRTAB_STE_2_S2VMID GENMASK_ULL(15, 0)
-#define STRTAB_STE_2_VTCR GENMASK_ULL(50, 32)
-#define STRTAB_STE_2_VTCR_S2T0SZ GENMASK_ULL(5, 0)
-#define STRTAB_STE_2_VTCR_S2SL0 GENMASK_ULL(7, 6)
-#define STRTAB_STE_2_VTCR_S2IR0 GENMASK_ULL(9, 8)
-#define STRTAB_STE_2_VTCR_S2OR0 GENMASK_ULL(11, 10)
-#define STRTAB_STE_2_VTCR_S2SH0 GENMASK_ULL(13, 12)
-#define STRTAB_STE_2_VTCR_S2TG GENMASK_ULL(15, 14)
-#define STRTAB_STE_2_VTCR_S2PS GENMASK_ULL(18, 16)
-#define STRTAB_STE_2_S2AA64 (1UL << 51)
-#define STRTAB_STE_2_S2ENDI (1UL << 52)
-#define STRTAB_STE_2_S2PTW (1UL << 54)
-#define STRTAB_STE_2_S2S (1UL << 57)
-#define STRTAB_STE_2_S2R (1UL << 58)
-
-#define STRTAB_STE_3_S2TTB_MASK GENMASK_ULL(51, 4)
-
-/*
- * Context descriptors.
- *
- * Linear: when less than 1024 SSIDs are supported
- * 2lvl: at most 1024 L1 entries,
- * 1024 lazy entries per table.
- */
-#define CTXDESC_L2_ENTRIES 1024
-
-#define CTXDESC_L1_DESC_V (1UL << 0)
-#define CTXDESC_L1_DESC_L2PTR_MASK GENMASK_ULL(51, 12)
-
-#define CTXDESC_CD_DWORDS 8
-
-struct arm_smmu_cd {
- __le64 data[CTXDESC_CD_DWORDS];
-};
-
-struct arm_smmu_cdtab_l2 {
- struct arm_smmu_cd cds[CTXDESC_L2_ENTRIES];
-};
-
-struct arm_smmu_cdtab_l1 {
- __le64 l2ptr;
-};
-
-static inline unsigned int arm_smmu_cdtab_l1_idx(unsigned int ssid)
-{
- return ssid / CTXDESC_L2_ENTRIES;
-}
-
-static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
-{
- return ssid % CTXDESC_L2_ENTRIES;
-}
-
-#define CTXDESC_CD_0_TCR_T0SZ GENMASK_ULL(5, 0)
-#define CTXDESC_CD_0_TCR_TG0 GENMASK_ULL(7, 6)
-#define CTXDESC_CD_0_TCR_IRGN0 GENMASK_ULL(9, 8)
-#define CTXDESC_CD_0_TCR_ORGN0 GENMASK_ULL(11, 10)
-#define CTXDESC_CD_0_TCR_SH0 GENMASK_ULL(13, 12)
-#define CTXDESC_CD_0_TCR_EPD0 (1ULL << 14)
-#define CTXDESC_CD_0_TCR_EPD1 (1ULL << 30)
-
-#define CTXDESC_CD_0_ENDI (1UL << 15)
-#define CTXDESC_CD_0_V (1UL << 31)
-
-#define CTXDESC_CD_0_TCR_IPS GENMASK_ULL(34, 32)
-#define CTXDESC_CD_0_TCR_TBI0 (1ULL << 38)
-
-#define CTXDESC_CD_0_TCR_HA (1UL << 43)
-#define CTXDESC_CD_0_TCR_HD (1UL << 42)
-
-#define CTXDESC_CD_0_AA64 (1UL << 41)
-#define CTXDESC_CD_0_S (1UL << 44)
-#define CTXDESC_CD_0_R (1UL << 45)
-#define CTXDESC_CD_0_A (1UL << 46)
-#define CTXDESC_CD_0_ASET (1UL << 47)
-#define CTXDESC_CD_0_ASID GENMASK_ULL(63, 48)
-
-#define CTXDESC_CD_1_TTB0_MASK GENMASK_ULL(51, 4)
-
-/*
- * When the SMMU only supports linear context descriptor tables, pick a
- * reasonable size limit (64kB).
- */
-#define CTXDESC_LINEAR_CDMAX ilog2(SZ_64K / sizeof(struct arm_smmu_cd))
-
-/* Command queue */
-#define CMDQ_ENT_SZ_SHIFT 4
-#define CMDQ_ENT_DWORDS ((1 << CMDQ_ENT_SZ_SHIFT) >> 3)
-#define CMDQ_MAX_SZ_SHIFT (Q_MAX_SZ_SHIFT - CMDQ_ENT_SZ_SHIFT)
-
-#define CMDQ_CONS_ERR GENMASK(30, 24)
-#define CMDQ_ERR_CERROR_NONE_IDX 0
-#define CMDQ_ERR_CERROR_ILL_IDX 1
-#define CMDQ_ERR_CERROR_ABT_IDX 2
-#define CMDQ_ERR_CERROR_ATC_INV_IDX 3
-
#define CMDQ_PROD_OWNED_FLAG Q_OVERFLOW_FLAG
/*
@@ -380,99 +41,6 @@ static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
*/
#define CMDQ_BATCH_ENTRIES BITS_PER_LONG
-#define CMDQ_0_OP GENMASK_ULL(7, 0)
-#define CMDQ_0_SSV (1UL << 11)
-
-#define CMDQ_PREFETCH_0_SID GENMASK_ULL(63, 32)
-#define CMDQ_PREFETCH_1_SIZE GENMASK_ULL(4, 0)
-#define CMDQ_PREFETCH_1_ADDR_MASK GENMASK_ULL(63, 12)
-
-#define CMDQ_CFGI_0_SSID GENMASK_ULL(31, 12)
-#define CMDQ_CFGI_0_SID GENMASK_ULL(63, 32)
-#define CMDQ_CFGI_1_LEAF (1UL << 0)
-#define CMDQ_CFGI_1_RANGE GENMASK_ULL(4, 0)
-
-#define CMDQ_TLBI_0_NUM GENMASK_ULL(16, 12)
-#define CMDQ_TLBI_RANGE_NUM_MAX 31
-#define CMDQ_TLBI_0_SCALE GENMASK_ULL(24, 20)
-#define CMDQ_TLBI_0_VMID GENMASK_ULL(47, 32)
-#define CMDQ_TLBI_0_ASID GENMASK_ULL(63, 48)
-#define CMDQ_TLBI_1_LEAF (1UL << 0)
-#define CMDQ_TLBI_1_TTL GENMASK_ULL(9, 8)
-#define CMDQ_TLBI_1_TG GENMASK_ULL(11, 10)
-#define CMDQ_TLBI_1_VA_MASK GENMASK_ULL(63, 12)
-#define CMDQ_TLBI_1_IPA_MASK GENMASK_ULL(51, 12)
-
-#define CMDQ_ATC_0_SSID GENMASK_ULL(31, 12)
-#define CMDQ_ATC_0_SID GENMASK_ULL(63, 32)
-#define CMDQ_ATC_0_GLOBAL (1UL << 9)
-#define CMDQ_ATC_1_SIZE GENMASK_ULL(5, 0)
-#define CMDQ_ATC_1_ADDR_MASK GENMASK_ULL(63, 12)
-
-#define CMDQ_PRI_0_SSID GENMASK_ULL(31, 12)
-#define CMDQ_PRI_0_SID GENMASK_ULL(63, 32)
-#define CMDQ_PRI_1_GRPID GENMASK_ULL(8, 0)
-#define CMDQ_PRI_1_RESP GENMASK_ULL(13, 12)
-
-#define CMDQ_RESUME_0_RESP_TERM 0UL
-#define CMDQ_RESUME_0_RESP_RETRY 1UL
-#define CMDQ_RESUME_0_RESP_ABORT 2UL
-#define CMDQ_RESUME_0_RESP GENMASK_ULL(13, 12)
-#define CMDQ_RESUME_0_SID GENMASK_ULL(63, 32)
-#define CMDQ_RESUME_1_STAG GENMASK_ULL(15, 0)
-
-#define CMDQ_SYNC_0_CS GENMASK_ULL(13, 12)
-#define CMDQ_SYNC_0_CS_NONE 0
-#define CMDQ_SYNC_0_CS_IRQ 1
-#define CMDQ_SYNC_0_CS_SEV 2
-#define CMDQ_SYNC_0_MSH GENMASK_ULL(23, 22)
-#define CMDQ_SYNC_0_MSIATTR GENMASK_ULL(27, 24)
-#define CMDQ_SYNC_0_MSIDATA GENMASK_ULL(63, 32)
-#define CMDQ_SYNC_1_MSIADDR_MASK GENMASK_ULL(51, 2)
-
-/* Event queue */
-#define EVTQ_ENT_SZ_SHIFT 5
-#define EVTQ_ENT_DWORDS ((1 << EVTQ_ENT_SZ_SHIFT) >> 3)
-#define EVTQ_MAX_SZ_SHIFT (Q_MAX_SZ_SHIFT - EVTQ_ENT_SZ_SHIFT)
-
-#define EVTQ_0_ID GENMASK_ULL(7, 0)
-
-#define EVT_ID_TRANSLATION_FAULT 0x10
-#define EVT_ID_ADDR_SIZE_FAULT 0x11
-#define EVT_ID_ACCESS_FAULT 0x12
-#define EVT_ID_PERMISSION_FAULT 0x13
-
-#define EVTQ_0_SSV (1UL << 11)
-#define EVTQ_0_SSID GENMASK_ULL(31, 12)
-#define EVTQ_0_SID GENMASK_ULL(63, 32)
-#define EVTQ_1_STAG GENMASK_ULL(15, 0)
-#define EVTQ_1_STALL (1UL << 31)
-#define EVTQ_1_PnU (1UL << 33)
-#define EVTQ_1_InD (1UL << 34)
-#define EVTQ_1_RnW (1UL << 35)
-#define EVTQ_1_S2 (1UL << 39)
-#define EVTQ_1_CLASS GENMASK_ULL(41, 40)
-#define EVTQ_1_TT_READ (1UL << 44)
-#define EVTQ_2_ADDR GENMASK_ULL(63, 0)
-#define EVTQ_3_IPA GENMASK_ULL(51, 12)
-
-/* PRI queue */
-#define PRIQ_ENT_SZ_SHIFT 4
-#define PRIQ_ENT_DWORDS ((1 << PRIQ_ENT_SZ_SHIFT) >> 3)
-#define PRIQ_MAX_SZ_SHIFT (Q_MAX_SZ_SHIFT - PRIQ_ENT_SZ_SHIFT)
-
-#define PRIQ_0_SID GENMASK_ULL(31, 0)
-#define PRIQ_0_SSID GENMASK_ULL(51, 32)
-#define PRIQ_0_PERM_PRIV (1UL << 58)
-#define PRIQ_0_PERM_EXEC (1UL << 59)
-#define PRIQ_0_PERM_READ (1UL << 60)
-#define PRIQ_0_PERM_WRITE (1UL << 61)
-#define PRIQ_0_PRG_LAST (1UL << 62)
-#define PRIQ_0_SSID_V (1UL << 63)
-
-#define PRIQ_1_PRG_IDX GENMASK_ULL(8, 0)
-#define PRIQ_1_ADDR_MASK GENMASK_ULL(63, 12)
-
/* High-level queue structures */
#define ARM_SMMU_POLL_TIMEOUT_US 1000000 /* 1s! */
#define ARM_SMMU_POLL_SPIN_COUNT 10
@@ -480,88 +48,6 @@ static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
#define MSI_IOVA_BASE 0x8000000
#define MSI_IOVA_LENGTH 0x100000
-enum pri_resp {
- PRI_RESP_DENY = 0,
- PRI_RESP_FAIL = 1,
- PRI_RESP_SUCC = 2,
-};
-
-struct arm_smmu_cmdq_ent {
- /* Common fields */
- u8 opcode;
- bool substream_valid;
-
- /* Command-specific fields */
- union {
- #define CMDQ_OP_PREFETCH_CFG 0x1
- struct {
- u32 sid;
- } prefetch;
-
- #define CMDQ_OP_CFGI_STE 0x3
- #define CMDQ_OP_CFGI_ALL 0x4
- #define CMDQ_OP_CFGI_CD 0x5
- #define CMDQ_OP_CFGI_CD_ALL 0x6
- struct {
- u32 sid;
- u32 ssid;
- union {
- bool leaf;
- u8 span;
- };
- } cfgi;
-
- #define CMDQ_OP_TLBI_NH_ASID 0x11
- #define CMDQ_OP_TLBI_NH_VA 0x12
- #define CMDQ_OP_TLBI_EL2_ALL 0x20
- #define CMDQ_OP_TLBI_EL2_ASID 0x21
- #define CMDQ_OP_TLBI_EL2_VA 0x22
- #define CMDQ_OP_TLBI_S12_VMALL 0x28
- #define CMDQ_OP_TLBI_S2_IPA 0x2a
- #define CMDQ_OP_TLBI_NSNH_ALL 0x30
- struct {
- u8 num;
- u8 scale;
- u16 asid;
- u16 vmid;
- bool leaf;
- u8 ttl;
- u8 tg;
- u64 addr;
- } tlbi;
-
- #define CMDQ_OP_ATC_INV 0x40
- #define ATC_INV_SIZE_ALL 52
- struct {
- u32 sid;
- u32 ssid;
- u64 addr;
- u8 size;
- bool global;
- } atc;
-
- #define CMDQ_OP_PRI_RESP 0x41
- struct {
- u32 sid;
- u32 ssid;
- u16 grpid;
- enum pri_resp resp;
- } pri;
-
- #define CMDQ_OP_RESUME 0x44
- struct {
- u32 sid;
- u16 stag;
- u8 resp;
- } resume;
-
- #define CMDQ_OP_CMD_SYNC 0x46
- struct {
- u64 msiaddr;
- } sync;
- };
-};
-
struct arm_smmu_ll_queue {
union {
u64 val;
@@ -703,29 +189,7 @@ struct arm_smmu_device {
void __iomem *base;
void __iomem *page1;
-#define ARM_SMMU_FEAT_2_LVL_STRTAB (1 << 0)
-#define ARM_SMMU_FEAT_2_LVL_CDTAB (1 << 1)
-#define ARM_SMMU_FEAT_TT_LE (1 << 2)
-#define ARM_SMMU_FEAT_TT_BE (1 << 3)
-#define ARM_SMMU_FEAT_PRI (1 << 4)
-#define ARM_SMMU_FEAT_ATS (1 << 5)
-#define ARM_SMMU_FEAT_SEV (1 << 6)
-#define ARM_SMMU_FEAT_MSI (1 << 7)
-#define ARM_SMMU_FEAT_COHERENCY (1 << 8)
-#define ARM_SMMU_FEAT_TRANS_S1 (1 << 9)
-#define ARM_SMMU_FEAT_TRANS_S2 (1 << 10)
-#define ARM_SMMU_FEAT_STALLS (1 << 11)
-#define ARM_SMMU_FEAT_HYP (1 << 12)
-#define ARM_SMMU_FEAT_STALL_FORCE (1 << 13)
-#define ARM_SMMU_FEAT_VAX (1 << 14)
-#define ARM_SMMU_FEAT_RANGE_INV (1 << 15)
-#define ARM_SMMU_FEAT_BTM (1 << 16)
-#define ARM_SMMU_FEAT_SVA (1 << 17)
-#define ARM_SMMU_FEAT_E2H (1 << 18)
-#define ARM_SMMU_FEAT_NESTING (1 << 19)
-#define ARM_SMMU_FEAT_ATTR_TYPES_OVR (1 << 20)
-#define ARM_SMMU_FEAT_HA (1 << 21)
-#define ARM_SMMU_FEAT_HD (1 << 22)
+ /* See arm-smmu-v3-common.h*/
u32 features;
#define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
--
2.47.0.338.g60cca15819-goog
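For context, a minimal sketch of how the struct arm_smmu_cmdq_ent definitions
moved into the common header above are typically consumed. This mirrors the
existing kernel driver's STE sync path and is shown for illustration only, not
as part of the patch:

/*
 * Invalidate the cached configuration for a single STE after updating
 * it, then wait for completion. Assumes an initialised smmu; the
 * submission helper is the kernel driver's existing
 * arm_smmu_cmdq_issue_cmd_with_sync().
 */
static void example_sync_ste(struct arm_smmu_device *smmu, u32 sid)
{
	struct arm_smmu_cmdq_ent cmd = {
		.opcode	= CMDQ_OP_CFGI_STE,
		.cfgi	= {
			.sid	= sid,
			.leaf	= true,
		},
	};

	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}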
* [RFC PATCH v2 05/58] iommu/arm-smmu-v3: Extract driver-specific bits from probe function
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (3 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 04/58] iommu/arm-smmu-v3: Move some definitions to arm64 include/ Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 06/58] iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c Mostafa Saleh
` (53 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
As we're about to share the arm_smmu_device_hw_probe() function with the
KVM driver, extract bits that are specific to the normal driver.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 737c5b882355..702863c94f91 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4167,7 +4167,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
if (reg & IDR0_MSI) {
smmu->features |= ARM_SMMU_FEAT_MSI;
- if (coherent && !disable_msipolling)
+ if (coherent)
smmu->options |= ARM_SMMU_OPT_MSIPOLL;
}
@@ -4316,11 +4316,6 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
smmu->oas = 48;
}
- if (arm_smmu_ops.pgsize_bitmap == -1UL)
- arm_smmu_ops.pgsize_bitmap = smmu->pgsize_bitmap;
- else
- arm_smmu_ops.pgsize_bitmap |= smmu->pgsize_bitmap;
-
/* Set the DMA mask for our table walker */
if (dma_set_mask_and_coherent(smmu->dev, DMA_BIT_MASK(smmu->oas)))
dev_warn(smmu->dev,
@@ -4334,9 +4329,6 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
arm_smmu_device_iidr_probe(smmu);
- if (arm_smmu_sva_supported(smmu))
- smmu->features |= ARM_SMMU_FEAT_SVA;
-
dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n",
smmu->ias, smmu->oas, smmu->features);
return 0;
@@ -4606,6 +4598,17 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
if (ret)
return ret;
+ if (arm_smmu_sva_supported(smmu))
+ smmu->features |= ARM_SMMU_FEAT_SVA;
+
+ if (disable_msipolling)
+ smmu->options &= ~ARM_SMMU_OPT_MSIPOLL;
+
+ if (arm_smmu_ops.pgsize_bitmap == -1UL)
+ arm_smmu_ops.pgsize_bitmap = smmu->pgsize_bitmap;
+ else
+ arm_smmu_ops.pgsize_bitmap |= smmu->pgsize_bitmap;
+
/* Initialise in-memory data structures */
ret = arm_smmu_init_structures(smmu);
if (ret)
--
2.47.0.338.g60cca15819-goog
* [RFC PATCH v2 06/58] iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (4 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 05/58] iommu/arm-smmu-v3: Extract driver-specific bits from probe function Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 07/58] iommu/arm-smmu-v3: Move queue and table allocation " Mostafa Saleh
` (52 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Move functions that can be shared between normal and KVM drivers to
arm-smmu-v3-common.c
Only straightforward moves here. More subtle factoring will be done in
the next patches.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/Makefile | 1 +
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 365 ++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 363 -----------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 11 +
4 files changed, 377 insertions(+), 363 deletions(-)
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
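A usage note, not part of the patch: the register helpers moved here follow
the SMMUv3 update protocol of writing a control register and polling its
*_ACK counterpart until the new value is observed. A minimal sketch of the
enable sequence (illustrative only, error handling and the rest of the reset
path trimmed):

/*
 * Enable the command queue, then the SMMU itself, waiting for CR0ACK
 * to reflect each write. The real reset path enables more queues and
 * IRQs; this only illustrates arm_smmu_write_reg_sync().
 */
static int example_enable(struct arm_smmu_device *smmu)
{
	u32 cr0 = CR0_CMDQEN;
	int ret;

	ret = arm_smmu_write_reg_sync(smmu, cr0, ARM_SMMU_CR0, ARM_SMMU_CR0ACK);
	if (ret)
		return ret;

	cr0 |= CR0_SMMUEN;
	return arm_smmu_write_reg_sync(smmu, cr0, ARM_SMMU_CR0, ARM_SMMU_CR0ACK);
}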
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index dc98c88b48c8..515a84f14783 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
arm_smmu_v3-y := arm-smmu-v3.o
+arm_smmu_v3-y += arm-smmu-v3-common.o
arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
new file mode 100644
index 000000000000..cfd5ba69e67e
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
@@ -0,0 +1,365 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/dma-mapping.h>
+#include <linux/iopoll.h>
+#include <linux/pci.h>
+
+#include "arm-smmu-v3.h"
+#include "../../dma-iommu.h"
+
+#define IIDR_IMPLEMENTER_ARM 0x43b
+#define IIDR_PRODUCTID_ARM_MMU_600 0x483
+#define IIDR_PRODUCTID_ARM_MMU_700 0x487
+
+static void arm_smmu_device_iidr_probe(struct arm_smmu_device *smmu)
+{
+ u32 reg;
+ unsigned int implementer, productid, variant, revision;
+
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IIDR);
+ implementer = FIELD_GET(IIDR_IMPLEMENTER, reg);
+ productid = FIELD_GET(IIDR_PRODUCTID, reg);
+ variant = FIELD_GET(IIDR_VARIANT, reg);
+ revision = FIELD_GET(IIDR_REVISION, reg);
+
+ switch (implementer) {
+ case IIDR_IMPLEMENTER_ARM:
+ switch (productid) {
+ case IIDR_PRODUCTID_ARM_MMU_600:
+ /* Arm erratum 1076982 */
+ if (variant == 0 && revision <= 2)
+ smmu->features &= ~ARM_SMMU_FEAT_SEV;
+ /* Arm erratum 1209401 */
+ if (variant < 2)
+ smmu->features &= ~ARM_SMMU_FEAT_NESTING;
+ break;
+ case IIDR_PRODUCTID_ARM_MMU_700:
+ /* Arm erratum 2812531 */
+ smmu->features &= ~ARM_SMMU_FEAT_BTM;
+ smmu->options |= ARM_SMMU_OPT_CMDQ_FORCE_SYNC;
+ /* Arm errata 2268618, 2812531 */
+ smmu->features &= ~ARM_SMMU_FEAT_NESTING;
+ break;
+ }
+ break;
+ }
+}
+
+static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
+{
+ u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
+ u32 hw_features = 0;
+
+ switch (FIELD_GET(IDR0_HTTU, reg)) {
+ case IDR0_HTTU_ACCESS_DIRTY:
+ hw_features |= ARM_SMMU_FEAT_HD;
+ fallthrough;
+ case IDR0_HTTU_ACCESS:
+ hw_features |= ARM_SMMU_FEAT_HA;
+ }
+
+ if (smmu->dev->of_node)
+ smmu->features |= hw_features;
+ else if (hw_features != fw_features)
+ /* ACPI IORT sets the HTTU bits */
+ dev_warn(smmu->dev,
+ "IDR0.HTTU features(0x%x) overridden by FW configuration (0x%x)\n",
+ hw_features, fw_features);
+}
+
+int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
+{
+ u32 reg;
+ bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
+
+ /* IDR0 */
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
+
+ /* 2-level structures */
+ if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
+ smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
+
+ if (reg & IDR0_CD2L)
+ smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
+
+ /*
+ * Translation table endianness.
+ * We currently require the same endianness as the CPU, but this
+ * could be changed later by adding a new IO_PGTABLE_QUIRK.
+ */
+ switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
+ case IDR0_TTENDIAN_MIXED:
+ smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
+ break;
+#ifdef __BIG_ENDIAN
+ case IDR0_TTENDIAN_BE:
+ smmu->features |= ARM_SMMU_FEAT_TT_BE;
+ break;
+#else
+ case IDR0_TTENDIAN_LE:
+ smmu->features |= ARM_SMMU_FEAT_TT_LE;
+ break;
+#endif
+ default:
+ dev_err(smmu->dev, "unknown/unsupported TT endianness!\n");
+ return -ENXIO;
+ }
+
+ /* Boolean feature flags */
+ if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
+ smmu->features |= ARM_SMMU_FEAT_PRI;
+
+ if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
+ smmu->features |= ARM_SMMU_FEAT_ATS;
+
+ if (reg & IDR0_SEV)
+ smmu->features |= ARM_SMMU_FEAT_SEV;
+
+ if (reg & IDR0_MSI) {
+ smmu->features |= ARM_SMMU_FEAT_MSI;
+ if (coherent)
+ smmu->options |= ARM_SMMU_OPT_MSIPOLL;
+ }
+
+ if (reg & IDR0_HYP) {
+ smmu->features |= ARM_SMMU_FEAT_HYP;
+ if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
+ smmu->features |= ARM_SMMU_FEAT_E2H;
+ }
+
+ arm_smmu_get_httu(smmu, reg);
+
+ /*
+ * The coherency feature as set by FW is used in preference to the ID
+ * register, but warn on mismatch.
+ */
+ if (!!(reg & IDR0_COHACC) != coherent)
+ dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\n",
+ coherent ? "true" : "false");
+
+ switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
+ case IDR0_STALL_MODEL_FORCE:
+ smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
+ fallthrough;
+ case IDR0_STALL_MODEL_STALL:
+ smmu->features |= ARM_SMMU_FEAT_STALLS;
+ }
+
+ if (reg & IDR0_S1P)
+ smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
+
+ if (reg & IDR0_S2P)
+ smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
+
+ if (!(reg & (IDR0_S1P | IDR0_S2P))) {
+ dev_err(smmu->dev, "no translation support!\n");
+ return -ENXIO;
+ }
+
+ /* We only support the AArch64 table format at present */
+ switch (FIELD_GET(IDR0_TTF, reg)) {
+ case IDR0_TTF_AARCH32_64:
+ smmu->ias = 40;
+ fallthrough;
+ case IDR0_TTF_AARCH64:
+ break;
+ default:
+ dev_err(smmu->dev, "AArch64 table format not supported!\n");
+ return -ENXIO;
+ }
+
+ /* ASID/VMID sizes */
+ smmu->asid_bits = reg & IDR0_ASID16 ? 16 : 8;
+ smmu->vmid_bits = reg & IDR0_VMID16 ? 16 : 8;
+
+ /* IDR1 */
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR1);
+ if (reg & (IDR1_TABLES_PRESET | IDR1_QUEUES_PRESET | IDR1_REL)) {
+ dev_err(smmu->dev, "embedded implementation not supported\n");
+ return -ENXIO;
+ }
+
+ if (reg & IDR1_ATTR_TYPES_OVR)
+ smmu->features |= ARM_SMMU_FEAT_ATTR_TYPES_OVR;
+
+ /* Queue sizes, capped to ensure natural alignment */
+ smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
+ FIELD_GET(IDR1_CMDQS, reg));
+ if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
+ /*
+ * We don't support splitting up batches, so one batch of
+ * commands plus an extra sync needs to fit inside the command
+ * queue. There's also no way we can handle the weird alignment
+ * restrictions on the base pointer for a unit-length queue.
+ */
+ dev_err(smmu->dev, "command queue size <= %d entries not supported\n",
+ CMDQ_BATCH_ENTRIES);
+ return -ENXIO;
+ }
+
+ smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
+ FIELD_GET(IDR1_EVTQS, reg));
+ smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
+ FIELD_GET(IDR1_PRIQS, reg));
+
+ /* SID/SSID sizes */
+ smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
+ smmu->sid_bits = FIELD_GET(IDR1_SIDSIZE, reg);
+ smmu->iommu.max_pasids = 1UL << smmu->ssid_bits;
+
+ /*
+ * If the SMMU supports fewer bits than would fill a single L2 stream
+ * table, use a linear table instead.
+ */
+ if (smmu->sid_bits <= STRTAB_SPLIT)
+ smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
+
+ /* IDR3 */
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+ if (FIELD_GET(IDR3_RIL, reg))
+ smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
+
+ /* IDR5 */
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
+
+ /* Maximum number of outstanding stalls */
+ smmu->evtq.max_stalls = FIELD_GET(IDR5_STALL_MAX, reg);
+
+ /* Page sizes */
+ if (reg & IDR5_GRAN64K)
+ smmu->pgsize_bitmap |= SZ_64K | SZ_512M;
+ if (reg & IDR5_GRAN16K)
+ smmu->pgsize_bitmap |= SZ_16K | SZ_32M;
+ if (reg & IDR5_GRAN4K)
+ smmu->pgsize_bitmap |= SZ_4K | SZ_2M | SZ_1G;
+
+ /* Input address size */
+ if (FIELD_GET(IDR5_VAX, reg) == IDR5_VAX_52_BIT)
+ smmu->features |= ARM_SMMU_FEAT_VAX;
+
+ /* Output address size */
+ switch (FIELD_GET(IDR5_OAS, reg)) {
+ case IDR5_OAS_32_BIT:
+ smmu->oas = 32;
+ break;
+ case IDR5_OAS_36_BIT:
+ smmu->oas = 36;
+ break;
+ case IDR5_OAS_40_BIT:
+ smmu->oas = 40;
+ break;
+ case IDR5_OAS_42_BIT:
+ smmu->oas = 42;
+ break;
+ case IDR5_OAS_44_BIT:
+ smmu->oas = 44;
+ break;
+ case IDR5_OAS_52_BIT:
+ smmu->oas = 52;
+ smmu->pgsize_bitmap |= 1ULL << 42; /* 4TB */
+ break;
+ default:
+ dev_info(smmu->dev,
+ "unknown output address size. Truncating to 48-bit\n");
+ fallthrough;
+ case IDR5_OAS_48_BIT:
+ smmu->oas = 48;
+ }
+
+ /* Set the DMA mask for our table walker */
+ if (dma_set_mask_and_coherent(smmu->dev, DMA_BIT_MASK(smmu->oas)))
+ dev_warn(smmu->dev,
+ "failed to set DMA mask for table walker\n");
+
+ smmu->ias = max(smmu->ias, smmu->oas);
+
+ if ((smmu->features & ARM_SMMU_FEAT_TRANS_S1) &&
+ (smmu->features & ARM_SMMU_FEAT_TRANS_S2))
+ smmu->features |= ARM_SMMU_FEAT_NESTING;
+
+ arm_smmu_device_iidr_probe(smmu);
+
+ dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n",
+ smmu->ias, smmu->oas, smmu->features);
+ return 0;
+}
+
+int arm_smmu_write_reg_sync(struct arm_smmu_device *smmu, u32 val,
+ unsigned int reg_off, unsigned int ack_off)
+{
+ u32 reg;
+
+ writel_relaxed(val, smmu->base + reg_off);
+ return readl_relaxed_poll_timeout(smmu->base + ack_off, reg, reg == val,
+ 1, ARM_SMMU_POLL_TIMEOUT_US);
+}
+
+/* GBPA is "special" */
+int arm_smmu_update_gbpa(struct arm_smmu_device *smmu, u32 set, u32 clr)
+{
+ int ret;
+ u32 reg, __iomem *gbpa = smmu->base + ARM_SMMU_GBPA;
+
+ ret = readl_relaxed_poll_timeout(gbpa, reg, !(reg & GBPA_UPDATE),
+ 1, ARM_SMMU_POLL_TIMEOUT_US);
+ if (ret)
+ return ret;
+
+ reg &= ~clr;
+ reg |= set;
+ writel_relaxed(reg | GBPA_UPDATE, gbpa);
+ ret = readl_relaxed_poll_timeout(gbpa, reg, !(reg & GBPA_UPDATE),
+ 1, ARM_SMMU_POLL_TIMEOUT_US);
+
+ if (ret)
+ dev_err(smmu->dev, "GBPA not responding to update\n");
+ return ret;
+}
+
+int arm_smmu_device_disable(struct arm_smmu_device *smmu)
+{
+ int ret;
+
+ ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_CR0, ARM_SMMU_CR0ACK);
+ if (ret)
+ dev_err(smmu->dev, "failed to clear cr0\n");
+
+ return ret;
+}
+
+struct iommu_group *arm_smmu_device_group(struct device *dev)
+{
+ struct iommu_group *group;
+
+ /*
+ * We don't support devices sharing stream IDs other than PCI RID
+	 * aliases, since the necessary ID-to-device lookup becomes rather
+ * impractical given a potential sparse 32-bit stream ID space.
+ */
+ if (dev_is_pci(dev))
+ group = pci_device_group(dev);
+ else
+ group = generic_device_group(dev);
+
+ return group;
+}
+
+int arm_smmu_of_xlate(struct device *dev, const struct of_phandle_args *args)
+{
+ return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+void arm_smmu_get_resv_regions(struct device *dev,
+ struct list_head *head)
+{
+ struct iommu_resv_region *region;
+ int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+ region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
+ prot, IOMMU_RESV_SW_MSI, GFP_KERNEL);
+ if (!region)
+ return;
+
+	list_add_tail(&region->list, head);
+
+ iommu_dma_get_resv_regions(dev, head);
+}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 702863c94f91..8741b8f57a8d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -17,13 +17,11 @@
#include <linux/err.h>
#include <linux/interrupt.h>
#include <linux/io-pgtable.h>
-#include <linux/iopoll.h>
#include <linux/module.h>
#include <linux/msi.h>
#include <linux/of.h>
#include <linux/of_address.h>
#include <linux/of_platform.h>
-#include <linux/pci.h>
#include <linux/pci-ats.h>
#include <linux/platform_device.h>
#include <kunit/visibility.h>
@@ -1914,8 +1912,6 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
return IRQ_HANDLED;
}
-static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
-
static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
{
u32 gerror, gerrorn, active;
@@ -3361,23 +3357,6 @@ static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
return 0;
}
-static struct iommu_group *arm_smmu_device_group(struct device *dev)
-{
- struct iommu_group *group;
-
- /*
- * We don't support devices sharing stream IDs other than PCI RID
- * aliases, since the necessary ID-to-device lookup becomes rather
- * impractical given a potential sparse 32-bit stream ID space.
- */
- if (dev_is_pci(dev))
- group = pci_device_group(dev);
- else
- group = generic_device_group(dev);
-
- return group;
-}
-
static int arm_smmu_enable_nesting(struct iommu_domain *domain)
{
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -3393,28 +3372,6 @@ static int arm_smmu_enable_nesting(struct iommu_domain *domain)
return ret;
}
-static int arm_smmu_of_xlate(struct device *dev,
- const struct of_phandle_args *args)
-{
- return iommu_fwspec_add_ids(dev, args->args, 1);
-}
-
-static void arm_smmu_get_resv_regions(struct device *dev,
- struct list_head *head)
-{
- struct iommu_resv_region *region;
- int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
-
- region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
- prot, IOMMU_RESV_SW_MSI, GFP_KERNEL);
- if (!region)
- return;
-
-	list_add_tail(&region->list, head);
-
- iommu_dma_get_resv_regions(dev, head);
-}
-
static int arm_smmu_dev_enable_feature(struct device *dev,
enum iommu_dev_features feat)
{
@@ -3711,38 +3668,6 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
return 0;
}
-static int arm_smmu_write_reg_sync(struct arm_smmu_device *smmu, u32 val,
- unsigned int reg_off, unsigned int ack_off)
-{
- u32 reg;
-
- writel_relaxed(val, smmu->base + reg_off);
- return readl_relaxed_poll_timeout(smmu->base + ack_off, reg, reg == val,
- 1, ARM_SMMU_POLL_TIMEOUT_US);
-}
-
-/* GBPA is "special" */
-static int arm_smmu_update_gbpa(struct arm_smmu_device *smmu, u32 set, u32 clr)
-{
- int ret;
- u32 reg, __iomem *gbpa = smmu->base + ARM_SMMU_GBPA;
-
- ret = readl_relaxed_poll_timeout(gbpa, reg, !(reg & GBPA_UPDATE),
- 1, ARM_SMMU_POLL_TIMEOUT_US);
- if (ret)
- return ret;
-
- reg &= ~clr;
- reg |= set;
- writel_relaxed(reg | GBPA_UPDATE, gbpa);
- ret = readl_relaxed_poll_timeout(gbpa, reg, !(reg & GBPA_UPDATE),
- 1, ARM_SMMU_POLL_TIMEOUT_US);
-
- if (ret)
- dev_err(smmu->dev, "GBPA not responding to update\n");
- return ret;
-}
-
static void arm_smmu_free_msis(void *data)
{
struct device *dev = data;
@@ -3889,17 +3814,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
return 0;
}
-static int arm_smmu_device_disable(struct arm_smmu_device *smmu)
-{
- int ret;
-
- ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_CR0, ARM_SMMU_CR0ACK);
- if (ret)
- dev_err(smmu->dev, "failed to clear cr0\n");
-
- return ret;
-}
-
static void arm_smmu_write_strtab(struct arm_smmu_device *smmu)
{
struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
@@ -4057,283 +3971,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
return 0;
}
-#define IIDR_IMPLEMENTER_ARM 0x43b
-#define IIDR_PRODUCTID_ARM_MMU_600 0x483
-#define IIDR_PRODUCTID_ARM_MMU_700 0x487
-
-static void arm_smmu_device_iidr_probe(struct arm_smmu_device *smmu)
-{
- u32 reg;
- unsigned int implementer, productid, variant, revision;
-
- reg = readl_relaxed(smmu->base + ARM_SMMU_IIDR);
- implementer = FIELD_GET(IIDR_IMPLEMENTER, reg);
- productid = FIELD_GET(IIDR_PRODUCTID, reg);
- variant = FIELD_GET(IIDR_VARIANT, reg);
- revision = FIELD_GET(IIDR_REVISION, reg);
-
- switch (implementer) {
- case IIDR_IMPLEMENTER_ARM:
- switch (productid) {
- case IIDR_PRODUCTID_ARM_MMU_600:
- /* Arm erratum 1076982 */
- if (variant == 0 && revision <= 2)
- smmu->features &= ~ARM_SMMU_FEAT_SEV;
- /* Arm erratum 1209401 */
- if (variant < 2)
- smmu->features &= ~ARM_SMMU_FEAT_NESTING;
- break;
- case IIDR_PRODUCTID_ARM_MMU_700:
- /* Arm erratum 2812531 */
- smmu->features &= ~ARM_SMMU_FEAT_BTM;
- smmu->options |= ARM_SMMU_OPT_CMDQ_FORCE_SYNC;
- /* Arm errata 2268618, 2812531 */
- smmu->features &= ~ARM_SMMU_FEAT_NESTING;
- break;
- }
- break;
- }
-}
-
-static void arm_smmu_get_httu(struct arm_smmu_device *smmu, u32 reg)
-{
- u32 fw_features = smmu->features & (ARM_SMMU_FEAT_HA | ARM_SMMU_FEAT_HD);
- u32 hw_features = 0;
-
- switch (FIELD_GET(IDR0_HTTU, reg)) {
- case IDR0_HTTU_ACCESS_DIRTY:
- hw_features |= ARM_SMMU_FEAT_HD;
- fallthrough;
- case IDR0_HTTU_ACCESS:
- hw_features |= ARM_SMMU_FEAT_HA;
- }
-
- if (smmu->dev->of_node)
- smmu->features |= hw_features;
- else if (hw_features != fw_features)
- /* ACPI IORT sets the HTTU bits */
- dev_warn(smmu->dev,
- "IDR0.HTTU features(0x%x) overridden by FW configuration (0x%x)\n",
- hw_features, fw_features);
-}
-
-static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
-{
- u32 reg;
- bool coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
-
- /* IDR0 */
- reg = readl_relaxed(smmu->base + ARM_SMMU_IDR0);
-
- /* 2-level structures */
- if (FIELD_GET(IDR0_ST_LVL, reg) == IDR0_ST_LVL_2LVL)
- smmu->features |= ARM_SMMU_FEAT_2_LVL_STRTAB;
-
- if (reg & IDR0_CD2L)
- smmu->features |= ARM_SMMU_FEAT_2_LVL_CDTAB;
-
- /*
- * Translation table endianness.
- * We currently require the same endianness as the CPU, but this
- * could be changed later by adding a new IO_PGTABLE_QUIRK.
- */
- switch (FIELD_GET(IDR0_TTENDIAN, reg)) {
- case IDR0_TTENDIAN_MIXED:
- smmu->features |= ARM_SMMU_FEAT_TT_LE | ARM_SMMU_FEAT_TT_BE;
- break;
-#ifdef __BIG_ENDIAN
- case IDR0_TTENDIAN_BE:
- smmu->features |= ARM_SMMU_FEAT_TT_BE;
- break;
-#else
- case IDR0_TTENDIAN_LE:
- smmu->features |= ARM_SMMU_FEAT_TT_LE;
- break;
-#endif
- default:
- dev_err(smmu->dev, "unknown/unsupported TT endianness!\n");
- return -ENXIO;
- }
-
- /* Boolean feature flags */
- if (IS_ENABLED(CONFIG_PCI_PRI) && reg & IDR0_PRI)
- smmu->features |= ARM_SMMU_FEAT_PRI;
-
- if (IS_ENABLED(CONFIG_PCI_ATS) && reg & IDR0_ATS)
- smmu->features |= ARM_SMMU_FEAT_ATS;
-
- if (reg & IDR0_SEV)
- smmu->features |= ARM_SMMU_FEAT_SEV;
-
- if (reg & IDR0_MSI) {
- smmu->features |= ARM_SMMU_FEAT_MSI;
- if (coherent)
- smmu->options |= ARM_SMMU_OPT_MSIPOLL;
- }
-
- if (reg & IDR0_HYP) {
- smmu->features |= ARM_SMMU_FEAT_HYP;
- if (cpus_have_cap(ARM64_HAS_VIRT_HOST_EXTN))
- smmu->features |= ARM_SMMU_FEAT_E2H;
- }
-
- arm_smmu_get_httu(smmu, reg);
-
- /*
- * The coherency feature as set by FW is used in preference to the ID
- * register, but warn on mismatch.
- */
- if (!!(reg & IDR0_COHACC) != coherent)
- dev_warn(smmu->dev, "IDR0.COHACC overridden by FW configuration (%s)\n",
- coherent ? "true" : "false");
-
- switch (FIELD_GET(IDR0_STALL_MODEL, reg)) {
- case IDR0_STALL_MODEL_FORCE:
- smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
- fallthrough;
- case IDR0_STALL_MODEL_STALL:
- smmu->features |= ARM_SMMU_FEAT_STALLS;
- }
-
- if (reg & IDR0_S1P)
- smmu->features |= ARM_SMMU_FEAT_TRANS_S1;
-
- if (reg & IDR0_S2P)
- smmu->features |= ARM_SMMU_FEAT_TRANS_S2;
-
- if (!(reg & (IDR0_S1P | IDR0_S2P))) {
- dev_err(smmu->dev, "no translation support!\n");
- return -ENXIO;
- }
-
- /* We only support the AArch64 table format at present */
- switch (FIELD_GET(IDR0_TTF, reg)) {
- case IDR0_TTF_AARCH32_64:
- smmu->ias = 40;
- fallthrough;
- case IDR0_TTF_AARCH64:
- break;
- default:
- dev_err(smmu->dev, "AArch64 table format not supported!\n");
- return -ENXIO;
- }
-
- /* ASID/VMID sizes */
- smmu->asid_bits = reg & IDR0_ASID16 ? 16 : 8;
- smmu->vmid_bits = reg & IDR0_VMID16 ? 16 : 8;
-
- /* IDR1 */
- reg = readl_relaxed(smmu->base + ARM_SMMU_IDR1);
- if (reg & (IDR1_TABLES_PRESET | IDR1_QUEUES_PRESET | IDR1_REL)) {
- dev_err(smmu->dev, "embedded implementation not supported\n");
- return -ENXIO;
- }
-
- if (reg & IDR1_ATTR_TYPES_OVR)
- smmu->features |= ARM_SMMU_FEAT_ATTR_TYPES_OVR;
-
- /* Queue sizes, capped to ensure natural alignment */
- smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_CMDQS, reg));
- if (smmu->cmdq.q.llq.max_n_shift <= ilog2(CMDQ_BATCH_ENTRIES)) {
- /*
- * We don't support splitting up batches, so one batch of
- * commands plus an extra sync needs to fit inside the command
- * queue. There's also no way we can handle the weird alignment
- * restrictions on the base pointer for a unit-length queue.
- */
- dev_err(smmu->dev, "command queue size <= %d entries not supported\n",
- CMDQ_BATCH_ENTRIES);
- return -ENXIO;
- }
-
- smmu->evtq.q.llq.max_n_shift = min_t(u32, EVTQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_EVTQS, reg));
- smmu->priq.q.llq.max_n_shift = min_t(u32, PRIQ_MAX_SZ_SHIFT,
- FIELD_GET(IDR1_PRIQS, reg));
-
- /* SID/SSID sizes */
- smmu->ssid_bits = FIELD_GET(IDR1_SSIDSIZE, reg);
- smmu->sid_bits = FIELD_GET(IDR1_SIDSIZE, reg);
- smmu->iommu.max_pasids = 1UL << smmu->ssid_bits;
-
- /*
- * If the SMMU supports fewer bits than would fill a single L2 stream
- * table, use a linear table instead.
- */
- if (smmu->sid_bits <= STRTAB_SPLIT)
- smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB;
-
- /* IDR3 */
- reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
- if (FIELD_GET(IDR3_RIL, reg))
- smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
-
- /* IDR5 */
- reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
-
- /* Maximum number of outstanding stalls */
- smmu->evtq.max_stalls = FIELD_GET(IDR5_STALL_MAX, reg);
-
- /* Page sizes */
- if (reg & IDR5_GRAN64K)
- smmu->pgsize_bitmap |= SZ_64K | SZ_512M;
- if (reg & IDR5_GRAN16K)
- smmu->pgsize_bitmap |= SZ_16K | SZ_32M;
- if (reg & IDR5_GRAN4K)
- smmu->pgsize_bitmap |= SZ_4K | SZ_2M | SZ_1G;
-
- /* Input address size */
- if (FIELD_GET(IDR5_VAX, reg) == IDR5_VAX_52_BIT)
- smmu->features |= ARM_SMMU_FEAT_VAX;
-
- /* Output address size */
- switch (FIELD_GET(IDR5_OAS, reg)) {
- case IDR5_OAS_32_BIT:
- smmu->oas = 32;
- break;
- case IDR5_OAS_36_BIT:
- smmu->oas = 36;
- break;
- case IDR5_OAS_40_BIT:
- smmu->oas = 40;
- break;
- case IDR5_OAS_42_BIT:
- smmu->oas = 42;
- break;
- case IDR5_OAS_44_BIT:
- smmu->oas = 44;
- break;
- case IDR5_OAS_52_BIT:
- smmu->oas = 52;
- smmu->pgsize_bitmap |= 1ULL << 42; /* 4TB */
- break;
- default:
- dev_info(smmu->dev,
- "unknown output address size. Truncating to 48-bit\n");
- fallthrough;
- case IDR5_OAS_48_BIT:
- smmu->oas = 48;
- }
-
- /* Set the DMA mask for our table walker */
- if (dma_set_mask_and_coherent(smmu->dev, DMA_BIT_MASK(smmu->oas)))
- dev_warn(smmu->dev,
- "failed to set DMA mask for table walker\n");
-
- smmu->ias = max(smmu->ias, smmu->oas);
-
- if ((smmu->features & ARM_SMMU_FEAT_TRANS_S1) &&
- (smmu->features & ARM_SMMU_FEAT_TRANS_S2))
- smmu->features |= ARM_SMMU_FEAT_NESTING;
-
- arm_smmu_device_iidr_probe(smmu);
-
- dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n",
- smmu->ias, smmu->oas, smmu->features);
- return 0;
-}
-
#ifdef CONFIG_ACPI
#ifdef CONFIG_TEGRA241_CMDQV
static void acpi_smmu_dsdt_probe_tegra241_cmdqv(struct acpi_iort_node *node,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index fc1b8c2af2a2..1ffc8320b846 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -341,6 +341,17 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
struct arm_smmu_cd *cd);
+int arm_smmu_write_reg_sync(struct arm_smmu_device *smmu, u32 val,
+ unsigned int reg_off, unsigned int ack_off);
+int arm_smmu_update_gbpa(struct arm_smmu_device *smmu, u32 set, u32 clr);
+int arm_smmu_device_disable(struct arm_smmu_device *smmu);
+struct iommu_group *arm_smmu_device_group(struct device *dev);
+int arm_smmu_of_xlate(struct device *dev, const struct of_phandle_args *args);
+void arm_smmu_get_resv_regions(struct device *dev,
+ struct list_head *head);
+
+int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu);
+
void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid);
void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
size_t granule, bool leaf,
--
2.47.0.338.g60cca15819-goog
* [RFC PATCH v2 07/58] iommu/arm-smmu-v3: Move queue and table allocation to arm-smmu-v3-common.c
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (5 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 06/58] iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 08/58] iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common Mostafa Saleh
` (51 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Move more code to arm-smmu-v3-common.c, so that the KVM driver can reuse
it.
Also, make sure that the allocated memory is page-aligned, as it is going
to be protected by the hypervisor stage-2.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/arm-smmu-v3-common.h | 29 ++++
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 136 ++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 151 +-----------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 ++-
4 files changed, 179 insertions(+), 163 deletions(-)
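To illustrate why the allocations below are rounded up with PAGE_ALIGN()
(not part of the patch): the hypervisor's stage-2 protects memory at page
granularity, so a table or queue that shares a page with unrelated kernel
data could not be protected without also taking that data away from the
host. A hypothetical sketch, where kvm_hyp_protect_range() stands in for
whatever donation/protection interface the hypervisor exposes:

/*
 * Hand a table over for hypervisor stage-2 protection. Whole pages
 * only; kvm_hyp_protect_range() is a placeholder name, not an
 * existing API.
 */
static int example_protect_table(void *table, size_t size)
{
	if (!PAGE_ALIGNED(table) || !PAGE_ALIGNED(size))
		return -EINVAL;

	return kvm_hyp_protect_range(virt_to_phys(table), size);
}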
diff --git a/arch/arm64/include/asm/arm-smmu-v3-common.h b/arch/arm64/include/asm/arm-smmu-v3-common.h
index e6e339248816..f2fbd286f674 100644
--- a/arch/arm64/include/asm/arm-smmu-v3-common.h
+++ b/arch/arm64/include/asm/arm-smmu-v3-common.h
@@ -3,6 +3,7 @@
#define _ARM_SMMU_V3_COMMON_H
#include <linux/bitfield.h>
+#include <linux/bits.h>
/* MMIO registers */
#define ARM_SMMU_IDR0 0x0
@@ -198,6 +199,22 @@ struct arm_smmu_strtab_l1 {
};
#define STRTAB_MAX_L1_ENTRIES (1 << 17)
+struct arm_smmu_strtab_cfg {
+ union {
+ struct {
+ struct arm_smmu_ste *table;
+ dma_addr_t ste_dma;
+ unsigned int num_ents;
+ } linear;
+ struct {
+ struct arm_smmu_strtab_l1 *l1tab;
+ struct arm_smmu_strtab_l2 **l2ptrs;
+ dma_addr_t l1_dma;
+ unsigned int num_l1_ents;
+ } l2;
+ };
+};
+
static inline u32 arm_smmu_strtab_l1_idx(u32 sid)
{
return sid / STRTAB_NUM_L2_STES;
@@ -208,6 +225,18 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
return sid % STRTAB_NUM_L2_STES;
}
+static inline void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
+ dma_addr_t l2ptr_dma)
+{
+ u64 val = 0;
+
+ val |= FIELD_PREP(STRTAB_L1_DESC_SPAN, STRTAB_SPLIT + 1);
+ val |= l2ptr_dma & STRTAB_L1_DESC_L2PTR_MASK;
+
+ /* The HW has 64 bit atomicity with stores to the L2 STE table */
+ WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
+}
+
#define STRTAB_STE_0_V (1UL << 0)
#define STRTAB_STE_0_CFG GENMASK_ULL(3, 1)
#define STRTAB_STE_0_CFG_ABORT 0
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
index cfd5ba69e67e..80ac13b0dc06 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
@@ -348,6 +348,7 @@ int arm_smmu_of_xlate(struct device *dev, const struct of_phandle_args *args)
return iommu_fwspec_add_ids(dev, args->args, 1);
}
+
void arm_smmu_get_resv_regions(struct device *dev,
struct list_head *head)
{
@@ -363,3 +364,138 @@ void arm_smmu_get_resv_regions(struct device *dev,
iommu_dma_get_resv_regions(dev, head);
}
+
+int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
+ struct arm_smmu_queue *q, void __iomem *page,
+ unsigned long prod_off, unsigned long cons_off,
+ size_t dwords, const char *name)
+{
+ size_t qsz;
+
+ do {
+ qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
+ q->base = dmam_alloc_coherent(smmu->dev, PAGE_ALIGN(qsz), &q->base_dma,
+ GFP_KERNEL);
+ if (q->base || qsz < PAGE_SIZE)
+ break;
+
+ q->llq.max_n_shift--;
+ } while (1);
+
+ if (!q->base) {
+ dev_err(smmu->dev,
+ "failed to allocate queue (0x%zx bytes) for %s\n",
+ qsz, name);
+ return -ENOMEM;
+ }
+
+ if (!WARN_ON(q->base_dma & (qsz - 1))) {
+ dev_info(smmu->dev, "allocated %u entries for %s\n",
+ 1 << q->llq.max_n_shift, name);
+ }
+
+ q->prod_reg = page + prod_off;
+ q->cons_reg = page + cons_off;
+ q->ent_dwords = dwords;
+
+ q->q_base = Q_BASE_RWA;
+ q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
+ q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
+
+ q->llq.prod = q->llq.cons = 0;
+ return 0;
+}
+
+/* Stream table initialization functions */
+static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
+{
+ u32 l1size;
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+ unsigned int last_sid_idx =
+ arm_smmu_strtab_l1_idx((1 << smmu->sid_bits) - 1);
+
+ /* Calculate the L1 size, capped to the SIDSIZE. */
+ cfg->l2.num_l1_ents = min(last_sid_idx + 1, STRTAB_MAX_L1_ENTRIES);
+ if (cfg->l2.num_l1_ents <= last_sid_idx)
+ dev_warn(smmu->dev,
+ "2-level strtab only covers %u/%u bits of SID\n",
+ ilog2(cfg->l2.num_l1_ents * STRTAB_NUM_L2_STES),
+ smmu->sid_bits);
+
+ l1size = cfg->l2.num_l1_ents * sizeof(struct arm_smmu_strtab_l1);
+ cfg->l2.l1tab = dmam_alloc_coherent(smmu->dev, PAGE_ALIGN(l1size), &cfg->l2.l1_dma,
+ GFP_KERNEL);
+ if (!cfg->l2.l1tab) {
+ dev_err(smmu->dev,
+ "failed to allocate l1 stream table (%u bytes)\n",
+ l1size);
+ return -ENOMEM;
+ }
+
+ cfg->l2.l2ptrs = devm_kcalloc(smmu->dev, cfg->l2.num_l1_ents,
+ sizeof(*cfg->l2.l2ptrs), GFP_KERNEL);
+ if (!cfg->l2.l2ptrs)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
+{
+ u32 size;
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+
+ size = (1 << smmu->sid_bits) * sizeof(struct arm_smmu_ste);
+ cfg->linear.table = dmam_alloc_coherent(smmu->dev, PAGE_ALIGN(size),
+ &cfg->linear.ste_dma,
+ GFP_KERNEL);
+ if (!cfg->linear.table) {
+ dev_err(smmu->dev,
+ "failed to allocate linear stream table (%u bytes)\n",
+ size);
+ return -ENOMEM;
+ }
+ cfg->linear.num_ents = 1 << smmu->sid_bits;
+
+ return 0;
+}
+
+int arm_smmu_init_strtab(struct arm_smmu_device *smmu)
+{
+ int ret;
+
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
+ ret = arm_smmu_init_strtab_2lvl(smmu);
+ else
+ ret = arm_smmu_init_strtab_linear(smmu);
+ if (ret)
+ return ret;
+
+ ida_init(&smmu->vmid_map);
+
+ return 0;
+}
+
+void arm_smmu_write_strtab(struct arm_smmu_device *smmu)
+{
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+ dma_addr_t dma;
+ u32 reg;
+
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+ reg = FIELD_PREP(STRTAB_BASE_CFG_FMT,
+ STRTAB_BASE_CFG_FMT_2LVL) |
+ FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE,
+ ilog2(cfg->l2.num_l1_ents) + STRTAB_SPLIT) |
+ FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
+ dma = cfg->l2.l1_dma;
+ } else {
+ reg = FIELD_PREP(STRTAB_BASE_CFG_FMT,
+ STRTAB_BASE_CFG_FMT_LINEAR) |
+ FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
+ dma = cfg->linear.ste_dma;
+ }
+ writeq_relaxed((dma & STRTAB_BASE_ADDR_MASK) | STRTAB_BASE_RA,
+ smmu->base + ARM_SMMU_STRTAB_BASE);
+ writel_relaxed(reg, smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
+}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8741b8f57a8d..cfee7f9b5afc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1483,18 +1483,6 @@ static void arm_smmu_free_cd_tables(struct arm_smmu_master *master)
}
/* Stream table manipulation functions */
-static void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
- dma_addr_t l2ptr_dma)
-{
- u64 val = 0;
-
- val |= FIELD_PREP(STRTAB_L1_DESC_SPAN, STRTAB_SPLIT + 1);
- val |= l2ptr_dma & STRTAB_L1_DESC_L2PTR_MASK;
-
- /* The HW has 64 bit atomicity with stores to the L2 STE table */
- WRITE_ONCE(dst->l2ptr, cpu_to_le64(val));
-}
-
struct arm_smmu_ste_writer {
struct arm_smmu_entry_writer writer;
u32 sid;
@@ -3482,47 +3470,6 @@ static struct iommu_dirty_ops arm_smmu_dirty_ops = {
};
/* Probing and initialisation functions */
-int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
- struct arm_smmu_queue *q, void __iomem *page,
- unsigned long prod_off, unsigned long cons_off,
- size_t dwords, const char *name)
-{
- size_t qsz;
-
- do {
- qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
- q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
- GFP_KERNEL);
- if (q->base || qsz < PAGE_SIZE)
- break;
-
- q->llq.max_n_shift--;
- } while (1);
-
- if (!q->base) {
- dev_err(smmu->dev,
- "failed to allocate queue (0x%zx bytes) for %s\n",
- qsz, name);
- return -ENOMEM;
- }
-
- if (!WARN_ON(q->base_dma & (qsz - 1))) {
- dev_info(smmu->dev, "allocated %u entries for %s\n",
- 1 << q->llq.max_n_shift, name);
- }
-
- q->prod_reg = page + prod_off;
- q->cons_reg = page + cons_off;
- q->ent_dwords = dwords;
-
- q->q_base = Q_BASE_RWA;
- q->q_base |= q->base_dma & Q_BASE_ADDR_MASK;
- q->q_base |= FIELD_PREP(Q_BASE_LOG2SIZE, q->llq.max_n_shift);
-
- q->llq.prod = q->llq.cons = 0;
- return 0;
-}
-
int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq *cmdq)
{
@@ -3577,76 +3524,6 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
PRIQ_ENT_DWORDS, "priq");
}
-static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
-{
- u32 l1size;
- struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
- unsigned int last_sid_idx =
- arm_smmu_strtab_l1_idx((1 << smmu->sid_bits) - 1);
-
- /* Calculate the L1 size, capped to the SIDSIZE. */
- cfg->l2.num_l1_ents = min(last_sid_idx + 1, STRTAB_MAX_L1_ENTRIES);
- if (cfg->l2.num_l1_ents <= last_sid_idx)
- dev_warn(smmu->dev,
- "2-level strtab only covers %u/%u bits of SID\n",
- ilog2(cfg->l2.num_l1_ents * STRTAB_NUM_L2_STES),
- smmu->sid_bits);
-
- l1size = cfg->l2.num_l1_ents * sizeof(struct arm_smmu_strtab_l1);
- cfg->l2.l1tab = dmam_alloc_coherent(smmu->dev, l1size, &cfg->l2.l1_dma,
- GFP_KERNEL);
- if (!cfg->l2.l1tab) {
- dev_err(smmu->dev,
- "failed to allocate l1 stream table (%u bytes)\n",
- l1size);
- return -ENOMEM;
- }
-
- cfg->l2.l2ptrs = devm_kcalloc(smmu->dev, cfg->l2.num_l1_ents,
- sizeof(*cfg->l2.l2ptrs), GFP_KERNEL);
- if (!cfg->l2.l2ptrs)
- return -ENOMEM;
-
- return 0;
-}
-
-static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
-{
- u32 size;
- struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
-
- size = (1 << smmu->sid_bits) * sizeof(struct arm_smmu_ste);
- cfg->linear.table = dmam_alloc_coherent(smmu->dev, size,
- &cfg->linear.ste_dma,
- GFP_KERNEL);
- if (!cfg->linear.table) {
- dev_err(smmu->dev,
- "failed to allocate linear stream table (%u bytes)\n",
- size);
- return -ENOMEM;
- }
- cfg->linear.num_ents = 1 << smmu->sid_bits;
-
- arm_smmu_init_initial_stes(cfg->linear.table, cfg->linear.num_ents);
- return 0;
-}
-
-static int arm_smmu_init_strtab(struct arm_smmu_device *smmu)
-{
- int ret;
-
- if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
- ret = arm_smmu_init_strtab_2lvl(smmu);
- else
- ret = arm_smmu_init_strtab_linear(smmu);
- if (ret)
- return ret;
-
- ida_init(&smmu->vmid_map);
-
- return 0;
-}
-
static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
{
int ret;
@@ -3662,6 +3539,10 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
if (ret)
return ret;
+ if (!(smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB))
+ arm_smmu_init_initial_stes(smmu->strtab_cfg.linear.table,
+ smmu->strtab_cfg.linear.num_ents);
+
if (smmu->impl_ops && smmu->impl_ops->init_structures)
return smmu->impl_ops->init_structures(smmu);
@@ -3814,30 +3695,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
return 0;
}
-static void arm_smmu_write_strtab(struct arm_smmu_device *smmu)
-{
- struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
- dma_addr_t dma;
- u32 reg;
-
- if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
- reg = FIELD_PREP(STRTAB_BASE_CFG_FMT,
- STRTAB_BASE_CFG_FMT_2LVL) |
- FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE,
- ilog2(cfg->l2.num_l1_ents) + STRTAB_SPLIT) |
- FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
- dma = cfg->l2.l1_dma;
- } else {
- reg = FIELD_PREP(STRTAB_BASE_CFG_FMT,
- STRTAB_BASE_CFG_FMT_LINEAR) |
- FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
- dma = cfg->linear.ste_dma;
- }
- writeq_relaxed((dma & STRTAB_BASE_ADDR_MASK) | STRTAB_BASE_RA,
- smmu->base + ARM_SMMU_STRTAB_BASE);
- writel_relaxed(reg, smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
-}
-
static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
{
int ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1ffc8320b846..1a3452554ca8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -156,22 +156,6 @@ struct arm_smmu_s2_cfg {
u16 vmid;
};
-struct arm_smmu_strtab_cfg {
- union {
- struct {
- struct arm_smmu_ste *table;
- dma_addr_t ste_dma;
- unsigned int num_ents;
- } linear;
- struct {
- struct arm_smmu_strtab_l1 *l1tab;
- struct arm_smmu_strtab_l2 **l2ptrs;
- dma_addr_t l1_dma;
- unsigned int num_l1_ents;
- } l2;
- };
-};
-
struct arm_smmu_impl_ops {
int (*device_reset)(struct arm_smmu_device *smmu);
void (*device_remove)(struct arm_smmu_device *smmu);
@@ -351,6 +335,16 @@ void arm_smmu_get_resv_regions(struct device *dev,
struct list_head *head);
int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu);
+int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
+ struct arm_smmu_queue *q,
+ void __iomem *page,
+ unsigned long prod_off,
+ unsigned long cons_off,
+ size_t dwords, const char *name);
+int arm_smmu_init_strtab(struct arm_smmu_device *smmu);
+void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
+ dma_addr_t l2ptr_dma);
+void arm_smmu_write_strtab(struct arm_smmu_device *smmu);
void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid);
void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 08/58] iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (6 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 07/58] iommu/arm-smmu-v3: Move queue and table allocation " Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 09/58] iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c Mostafa Saleh
` (50 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Move the FW probe functions to the common source.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 146 ++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 142 +----------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 +
3 files changed, 150 insertions(+), 141 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
index 80ac13b0dc06..04f1e2f1c458 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
@@ -1,11 +1,157 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/acpi.h>
#include <linux/dma-mapping.h>
#include <linux/iopoll.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
#include <linux/pci.h>
+#include <linux/platform_device.h>
#include "arm-smmu-v3.h"
#include "../../dma-iommu.h"
+struct arm_smmu_option_prop {
+ u32 opt;
+ const char *prop;
+};
+
+static struct arm_smmu_option_prop arm_smmu_options[] = {
+ { ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" },
+ { ARM_SMMU_OPT_PAGE0_REGS_ONLY, "cavium,cn9900-broken-page1-regspace"},
+ { 0, NULL},
+};
+
+static void parse_driver_options(struct arm_smmu_device *smmu)
+{
+ int i = 0;
+
+ do {
+ if (of_property_read_bool(smmu->dev->of_node,
+ arm_smmu_options[i].prop)) {
+ smmu->options |= arm_smmu_options[i].opt;
+ dev_notice(smmu->dev, "option %s\n",
+ arm_smmu_options[i].prop);
+ }
+ } while (arm_smmu_options[++i].opt);
+}
+
+#ifdef CONFIG_ACPI
+#ifdef CONFIG_TEGRA241_CMDQV
+static void acpi_smmu_dsdt_probe_tegra241_cmdqv(struct acpi_iort_node *node,
+ struct arm_smmu_device *smmu)
+{
+ const char *uid = kasprintf(GFP_KERNEL, "%u", node->identifier);
+ struct acpi_device *adev;
+
+ /* Look for an NVDA200C node whose _UID matches the SMMU node ID */
+ adev = acpi_dev_get_first_match_dev("NVDA200C", uid, -1);
+ if (adev) {
+ /* Tegra241 CMDQV driver is responsible for put_device() */
+ smmu->impl_dev = &adev->dev;
+ smmu->options |= ARM_SMMU_OPT_TEGRA241_CMDQV;
+ dev_info(smmu->dev, "found companion CMDQV device: %s\n",
+ dev_name(smmu->impl_dev));
+ }
+ kfree(uid);
+}
+#else
+static void acpi_smmu_dsdt_probe_tegra241_cmdqv(struct acpi_iort_node *node,
+ struct arm_smmu_device *smmu)
+{
+}
+#endif
+
+static int acpi_smmu_iort_probe_model(struct acpi_iort_node *node,
+ struct arm_smmu_device *smmu)
+{
+ struct acpi_iort_smmu_v3 *iort_smmu =
+ (struct acpi_iort_smmu_v3 *)node->node_data;
+
+ switch (iort_smmu->model) {
+ case ACPI_IORT_SMMU_V3_CAVIUM_CN99XX:
+ smmu->options |= ARM_SMMU_OPT_PAGE0_REGS_ONLY;
+ break;
+ case ACPI_IORT_SMMU_V3_HISILICON_HI161X:
+ smmu->options |= ARM_SMMU_OPT_SKIP_PREFETCH;
+ break;
+ case ACPI_IORT_SMMU_V3_GENERIC:
+ /*
+ * Tegra241 implementation stores its SMMU options and impl_dev
+ * in DSDT. Thus, go through the ACPI tables unconditionally.
+ */
+ acpi_smmu_dsdt_probe_tegra241_cmdqv(node, smmu);
+ break;
+ }
+
+ dev_notice(smmu->dev, "option mask 0x%x\n", smmu->options);
+ return 0;
+}
+
+static int arm_smmu_device_acpi_probe(struct platform_device *pdev,
+ struct arm_smmu_device *smmu)
+{
+ struct acpi_iort_smmu_v3 *iort_smmu;
+ struct device *dev = smmu->dev;
+ struct acpi_iort_node *node;
+
+ node = *(struct acpi_iort_node **)dev_get_platdata(dev);
+
+ /* Retrieve SMMUv3 specific data */
+ iort_smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
+
+ if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
+ smmu->features |= ARM_SMMU_FEAT_COHERENCY;
+
+ switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
+ case IDR0_HTTU_ACCESS_DIRTY:
+ smmu->features |= ARM_SMMU_FEAT_HD;
+ fallthrough;
+ case IDR0_HTTU_ACCESS:
+ smmu->features |= ARM_SMMU_FEAT_HA;
+ }
+
+ return acpi_smmu_iort_probe_model(node, smmu);
+}
+#else
+static inline int arm_smmu_device_acpi_probe(struct platform_device *pdev,
+ struct arm_smmu_device *smmu)
+{
+ return -ENODEV;
+}
+#endif
+
+static int arm_smmu_device_dt_probe(struct platform_device *pdev,
+ struct arm_smmu_device *smmu)
+{
+ struct device *dev = &pdev->dev;
+ u32 cells;
+ int ret = -EINVAL;
+
+ if (of_property_read_u32(dev->of_node, "#iommu-cells", &cells))
+ dev_err(dev, "missing #iommu-cells property\n");
+ else if (cells != 1)
+ dev_err(dev, "invalid #iommu-cells value (%d)\n", cells);
+ else
+ ret = 0;
+
+ parse_driver_options(smmu);
+
+ if (of_dma_is_coherent(dev->of_node))
+ smmu->features |= ARM_SMMU_FEAT_COHERENCY;
+
+ return ret;
+}
+
+int arm_smmu_fw_probe(struct platform_device *pdev,
+ struct arm_smmu_device *smmu)
+{
+ if (smmu->dev->of_node)
+ return arm_smmu_device_dt_probe(pdev, smmu);
+ else
+ return arm_smmu_device_acpi_probe(pdev, smmu);
+}
+
#define IIDR_IMPLEMENTER_ARM 0x43b
#define IIDR_PRODUCTID_ARM_MMU_600 0x483
#define IIDR_PRODUCTID_ARM_MMU_700 0x487
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index cfee7f9b5afc..91f64416900b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -9,7 +9,6 @@
* This driver is powered by bad coffee and bombay mix.
*/
-#include <linux/acpi.h>
#include <linux/acpi_iort.h>
#include <linux/bitops.h>
#include <linux/crash_dump.h>
@@ -19,9 +18,6 @@
#include <linux/io-pgtable.h>
#include <linux/module.h>
#include <linux/msi.h>
-#include <linux/of.h>
-#include <linux/of_address.h>
-#include <linux/of_platform.h>
#include <linux/pci-ats.h>
#include <linux/platform_device.h>
#include <kunit/visibility.h>
@@ -67,38 +63,13 @@ static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
},
};
-struct arm_smmu_option_prop {
- u32 opt;
- const char *prop;
-};
-
DEFINE_XARRAY_ALLOC1(arm_smmu_asid_xa);
DEFINE_MUTEX(arm_smmu_asid_lock);
-static struct arm_smmu_option_prop arm_smmu_options[] = {
- { ARM_SMMU_OPT_SKIP_PREFETCH, "hisilicon,broken-prefetch-cmd" },
- { ARM_SMMU_OPT_PAGE0_REGS_ONLY, "cavium,cn9900-broken-page1-regspace"},
- { 0, NULL},
-};
-
static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
struct arm_smmu_device *smmu, u32 flags);
static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
-static void parse_driver_options(struct arm_smmu_device *smmu)
-{
- int i = 0;
-
- do {
- if (of_property_read_bool(smmu->dev->of_node,
- arm_smmu_options[i].prop)) {
- smmu->options |= arm_smmu_options[i].opt;
- dev_notice(smmu->dev, "option %s\n",
- arm_smmu_options[i].prop);
- }
- } while (arm_smmu_options[++i].opt);
-}
-
/* Low-level queue manipulation functions */
static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
{
@@ -3828,113 +3799,6 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
return 0;
}
-#ifdef CONFIG_ACPI
-#ifdef CONFIG_TEGRA241_CMDQV
-static void acpi_smmu_dsdt_probe_tegra241_cmdqv(struct acpi_iort_node *node,
- struct arm_smmu_device *smmu)
-{
- const char *uid = kasprintf(GFP_KERNEL, "%u", node->identifier);
- struct acpi_device *adev;
-
- /* Look for an NVDA200C node whose _UID matches the SMMU node ID */
- adev = acpi_dev_get_first_match_dev("NVDA200C", uid, -1);
- if (adev) {
- /* Tegra241 CMDQV driver is responsible for put_device() */
- smmu->impl_dev = &adev->dev;
- smmu->options |= ARM_SMMU_OPT_TEGRA241_CMDQV;
- dev_info(smmu->dev, "found companion CMDQV device: %s\n",
- dev_name(smmu->impl_dev));
- }
- kfree(uid);
-}
-#else
-static void acpi_smmu_dsdt_probe_tegra241_cmdqv(struct acpi_iort_node *node,
- struct arm_smmu_device *smmu)
-{
-}
-#endif
-
-static int acpi_smmu_iort_probe_model(struct acpi_iort_node *node,
- struct arm_smmu_device *smmu)
-{
- struct acpi_iort_smmu_v3 *iort_smmu =
- (struct acpi_iort_smmu_v3 *)node->node_data;
-
- switch (iort_smmu->model) {
- case ACPI_IORT_SMMU_V3_CAVIUM_CN99XX:
- smmu->options |= ARM_SMMU_OPT_PAGE0_REGS_ONLY;
- break;
- case ACPI_IORT_SMMU_V3_HISILICON_HI161X:
- smmu->options |= ARM_SMMU_OPT_SKIP_PREFETCH;
- break;
- case ACPI_IORT_SMMU_V3_GENERIC:
- /*
- * Tegra241 implementation stores its SMMU options and impl_dev
- * in DSDT. Thus, go through the ACPI tables unconditionally.
- */
- acpi_smmu_dsdt_probe_tegra241_cmdqv(node, smmu);
- break;
- }
-
- dev_notice(smmu->dev, "option mask 0x%x\n", smmu->options);
- return 0;
-}
-
-static int arm_smmu_device_acpi_probe(struct platform_device *pdev,
- struct arm_smmu_device *smmu)
-{
- struct acpi_iort_smmu_v3 *iort_smmu;
- struct device *dev = smmu->dev;
- struct acpi_iort_node *node;
-
- node = *(struct acpi_iort_node **)dev_get_platdata(dev);
-
- /* Retrieve SMMUv3 specific data */
- iort_smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
-
- if (iort_smmu->flags & ACPI_IORT_SMMU_V3_COHACC_OVERRIDE)
- smmu->features |= ARM_SMMU_FEAT_COHERENCY;
-
- switch (FIELD_GET(ACPI_IORT_SMMU_V3_HTTU_OVERRIDE, iort_smmu->flags)) {
- case IDR0_HTTU_ACCESS_DIRTY:
- smmu->features |= ARM_SMMU_FEAT_HD;
- fallthrough;
- case IDR0_HTTU_ACCESS:
- smmu->features |= ARM_SMMU_FEAT_HA;
- }
-
- return acpi_smmu_iort_probe_model(node, smmu);
-}
-#else
-static inline int arm_smmu_device_acpi_probe(struct platform_device *pdev,
- struct arm_smmu_device *smmu)
-{
- return -ENODEV;
-}
-#endif
-
-static int arm_smmu_device_dt_probe(struct platform_device *pdev,
- struct arm_smmu_device *smmu)
-{
- struct device *dev = &pdev->dev;
- u32 cells;
- int ret = -EINVAL;
-
- if (of_property_read_u32(dev->of_node, "#iommu-cells", &cells))
- dev_err(dev, "missing #iommu-cells property\n");
- else if (cells != 1)
- dev_err(dev, "invalid #iommu-cells value (%d)\n", cells);
- else
- ret = 0;
-
- parse_driver_options(smmu);
-
- if (of_dma_is_coherent(dev->of_node))
- smmu->features |= ARM_SMMU_FEAT_COHERENCY;
-
- return ret;
-}
-
static unsigned long arm_smmu_resource_size(struct arm_smmu_device *smmu)
{
if (smmu->options & ARM_SMMU_OPT_PAGE0_REGS_ONLY)
@@ -4030,11 +3894,7 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
return -ENOMEM;
smmu->dev = dev;
- if (dev->of_node) {
- ret = arm_smmu_device_dt_probe(pdev, smmu);
- } else {
- ret = arm_smmu_device_acpi_probe(pdev, smmu);
- }
+ ret = arm_smmu_fw_probe(pdev, smmu);
if (ret)
return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1a3452554ca8..2d658f15973a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -334,6 +334,9 @@ int arm_smmu_of_xlate(struct device *dev, const struct of_phandle_args *args);
void arm_smmu_get_resv_regions(struct device *dev,
struct list_head *head);
+struct platform_device;
+int arm_smmu_fw_probe(struct platform_device *pdev,
+ struct arm_smmu_device *smmu);
int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu);
int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
struct arm_smmu_queue *q,
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 09/58] iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (7 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 08/58] iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 10/58] iommu/arm-smmu-v3: Move common irq code to common file Mostafa Saleh
` (49 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
The KVM driver will need to implement a few IOMMU ops, so move the
helpers to arm-smmu-v3-common.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 27 +++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 ++----------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 4 +++
3 files changed, 33 insertions(+), 15 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
index 04f1e2f1c458..b7bcac51cc7d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
@@ -645,3 +645,30 @@ void arm_smmu_write_strtab(struct arm_smmu_device *smmu)
smmu->base + ARM_SMMU_STRTAB_BASE);
writel_relaxed(reg, smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
}
+
+int arm_smmu_register_iommu(struct arm_smmu_device *smmu,
+ struct iommu_ops *ops, phys_addr_t ioaddr)
+{
+ int ret;
+ struct device *dev = smmu->dev;
+
+ ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
+ "smmu3.%pa", &ioaddr);
+ if (ret)
+ return ret;
+
+ ret = iommu_device_register(&smmu->iommu, ops, dev);
+ if (ret) {
+ dev_err(dev, "Failed to register iommu\n");
+ iommu_device_sysfs_remove(&smmu->iommu);
+ return ret;
+ }
+
+ return 0;
+}
+
+void arm_smmu_unregister_iommu(struct arm_smmu_device *smmu)
+{
+ iommu_device_unregister(&smmu->iommu);
+ iommu_device_sysfs_remove(&smmu->iommu);
+}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 91f64416900b..bcefa361f3d3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3980,27 +3980,14 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
return ret;
/* And we're up. Go go go! */
- ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
- "smmu3.%pa", &ioaddr);
- if (ret)
- return ret;
-
- ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
- if (ret) {
- dev_err(dev, "Failed to register iommu\n");
- iommu_device_sysfs_remove(&smmu->iommu);
- return ret;
- }
-
- return 0;
+ return arm_smmu_register_iommu(smmu, &arm_smmu_ops, ioaddr);
}
static void arm_smmu_device_remove(struct platform_device *pdev)
{
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
- iommu_device_unregister(&smmu->iommu);
- iommu_device_sysfs_remove(&smmu->iommu);
+ arm_smmu_unregister_iommu(smmu);
arm_smmu_device_disable(smmu);
iopf_queue_free(smmu->evtq.iopf);
ida_destroy(&smmu->vmid_map);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 2d658f15973a..63545fdf55f9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -349,6 +349,10 @@ void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
dma_addr_t l2ptr_dma);
void arm_smmu_write_strtab(struct arm_smmu_device *smmu);
+int arm_smmu_register_iommu(struct arm_smmu_device *smmu,
+ struct iommu_ops *ops, phys_addr_t ioaddr);
+void arm_smmu_unregister_iommu(struct arm_smmu_device *smmu);
+
void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid);
void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
size_t granule, bool leaf,
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 10/58] iommu/arm-smmu-v3: Move common irq code to common file
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (8 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 09/58] iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay() Mostafa Saleh
` (48 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Soon, the KVM SMMUv3 driver will support IRQs. Instead of re-implementing
the architectural bits and the firmware bindings, move this code to
the common file shared with KVM.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
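As a quick illustration of the new interface, below is a minimal sketch
(not part of this patch) of how another consumer of the common code, such
as the KVM SMMUv3 driver added later in the series, could wire up its own
handlers; all example_* names are placeholders:

static irqreturn_t example_evtq_thread(int irq, void *dev)
{
        return IRQ_HANDLED;
}

static irqreturn_t example_gerror_handler(int irq, void *dev)
{
        return IRQ_HANDLED;
}

static irqreturn_t example_priq_thread(int irq, void *dev)
{
        return IRQ_HANDLED;
}

static void example_setup_irqs(struct platform_device *pdev,
                               struct arm_smmu_device *smmu)
{
        /* Parse "combined"/"eventq"/"priq"/"gerror" from firmware */
        arm_smmu_probe_irq(pdev, smmu);
        /* Request MSIs or wired interrupts with the caller's handlers */
        arm_smmu_setup_unique_irqs(smmu, example_evtq_thread,
                                   example_gerror_handler,
                                   example_priq_thread);
}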
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 150 +++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 297 +-----------------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 159 ++++++++++
3 files changed, 313 insertions(+), 293 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
index b7bcac51cc7d..d842e592b351 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/acpi.h>
#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
#include <linux/iopoll.h>
#include <linux/of.h>
#include <linux/of_address.h>
@@ -22,6 +23,24 @@ static struct arm_smmu_option_prop arm_smmu_options[] = {
{ 0, NULL},
};
+static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
+ [EVTQ_MSI_INDEX] = {
+ ARM_SMMU_EVTQ_IRQ_CFG0,
+ ARM_SMMU_EVTQ_IRQ_CFG1,
+ ARM_SMMU_EVTQ_IRQ_CFG2,
+ },
+ [GERROR_MSI_INDEX] = {
+ ARM_SMMU_GERROR_IRQ_CFG0,
+ ARM_SMMU_GERROR_IRQ_CFG1,
+ ARM_SMMU_GERROR_IRQ_CFG2,
+ },
+ [PRIQ_MSI_INDEX] = {
+ ARM_SMMU_PRIQ_IRQ_CFG0,
+ ARM_SMMU_PRIQ_IRQ_CFG1,
+ ARM_SMMU_PRIQ_IRQ_CFG2,
+ },
+};
+
static void parse_driver_options(struct arm_smmu_device *smmu)
{
int i = 0;
@@ -646,6 +665,137 @@ void arm_smmu_write_strtab(struct arm_smmu_device *smmu)
writel_relaxed(reg, smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
}
+static void arm_smmu_free_msis(void *data)
+{
+ struct device *dev = data;
+
+ platform_device_msi_free_irqs_all(dev);
+}
+
+static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
+{
+ phys_addr_t doorbell;
+ struct device *dev = msi_desc_to_dev(desc);
+ struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+ phys_addr_t *cfg = arm_smmu_msi_cfg[desc->msi_index];
+
+ doorbell = (((u64)msg->address_hi) << 32) | msg->address_lo;
+ doorbell &= MSI_CFG0_ADDR_MASK;
+
+ writeq_relaxed(doorbell, smmu->base + cfg[0]);
+ writel_relaxed(msg->data, smmu->base + cfg[1]);
+ writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
+}
+
+static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
+{
+ int ret, nvec = ARM_SMMU_MAX_MSIS;
+ struct device *dev = smmu->dev;
+
+ /* Clear the MSI address regs */
+ writeq_relaxed(0, smmu->base + ARM_SMMU_GERROR_IRQ_CFG0);
+ writeq_relaxed(0, smmu->base + ARM_SMMU_EVTQ_IRQ_CFG0);
+
+ if (smmu->features & ARM_SMMU_FEAT_PRI)
+ writeq_relaxed(0, smmu->base + ARM_SMMU_PRIQ_IRQ_CFG0);
+ else
+ nvec--;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_MSI))
+ return;
+
+ if (!dev->msi.domain) {
+ dev_info(smmu->dev, "msi_domain absent - falling back to wired irqs\n");
+ return;
+ }
+
+ /* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
+ ret = platform_device_msi_init_and_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
+ if (ret) {
+ dev_warn(dev, "failed to allocate MSIs - falling back to wired irqs\n");
+ return;
+ }
+
+ smmu->evtq.q.irq = msi_get_virq(dev, EVTQ_MSI_INDEX);
+ smmu->gerr_irq = msi_get_virq(dev, GERROR_MSI_INDEX);
+ smmu->priq.q.irq = msi_get_virq(dev, PRIQ_MSI_INDEX);
+
+ /* Add callback to free MSIs on teardown */
+ devm_add_action_or_reset(dev, arm_smmu_free_msis, dev);
+}
+
+void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu,
+ irqreturn_t evtqirq(int irq, void *dev),
+ irqreturn_t gerrorirq(int irq, void *dev),
+ irqreturn_t priirq(int irq, void *dev))
+{
+ int irq, ret;
+
+ arm_smmu_setup_msis(smmu);
+
+ /* Request interrupt lines */
+ irq = smmu->evtq.q.irq;
+ if (irq) {
+ ret = devm_request_threaded_irq(smmu->dev, irq, NULL,
+ evtqirq,
+ IRQF_ONESHOT,
+ "arm-smmu-v3-evtq", smmu);
+ if (ret < 0)
+ dev_warn(smmu->dev, "failed to enable evtq irq\n");
+ } else {
+ dev_warn(smmu->dev, "no evtq irq - events will not be reported!\n");
+ }
+
+ irq = smmu->gerr_irq;
+ if (irq) {
+ ret = devm_request_irq(smmu->dev, irq, gerrorirq,
+ 0, "arm-smmu-v3-gerror", smmu);
+ if (ret < 0)
+ dev_warn(smmu->dev, "failed to enable gerror irq\n");
+ } else {
+ dev_warn(smmu->dev, "no gerr irq - errors will not be reported!\n");
+ }
+
+ if (smmu->features & ARM_SMMU_FEAT_PRI) {
+ irq = smmu->priq.q.irq;
+ if (irq) {
+ ret = devm_request_threaded_irq(smmu->dev, irq, NULL,
+ priirq,
+ IRQF_ONESHOT,
+ "arm-smmu-v3-priq",
+ smmu);
+ if (ret < 0)
+ dev_warn(smmu->dev,
+ "failed to enable priq irq\n");
+ } else {
+ dev_warn(smmu->dev, "no priq irq - PRI will be broken\n");
+ }
+ }
+}
+
+void arm_smmu_probe_irq(struct platform_device *pdev,
+ struct arm_smmu_device *smmu)
+{
+ int irq;
+
+ irq = platform_get_irq_byname_optional(pdev, "combined");
+ if (irq > 0)
+ smmu->combined_irq = irq;
+ else {
+ irq = platform_get_irq_byname_optional(pdev, "eventq");
+ if (irq > 0)
+ smmu->evtq.q.irq = irq;
+
+ irq = platform_get_irq_byname_optional(pdev, "priq");
+ if (irq > 0)
+ smmu->priq.q.irq = irq;
+
+ irq = platform_get_irq_byname_optional(pdev, "gerror");
+ if (irq > 0)
+ smmu->gerr_irq = irq;
+ }
+}
+
int arm_smmu_register_iommu(struct arm_smmu_device *smmu,
struct iommu_ops *ops, phys_addr_t ioaddr)
{
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bcefa361f3d3..8234a9754a04 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -12,7 +12,6 @@
#include <linux/acpi_iort.h>
#include <linux/bitops.h>
#include <linux/crash_dump.h>
-#include <linux/delay.h>
#include <linux/err.h>
#include <linux/interrupt.h>
#include <linux/io-pgtable.h>
@@ -34,35 +33,10 @@ MODULE_PARM_DESC(disable_msipolling,
static struct iommu_ops arm_smmu_ops;
static struct iommu_dirty_ops arm_smmu_dirty_ops;
-enum arm_smmu_msi_index {
- EVTQ_MSI_INDEX,
- GERROR_MSI_INDEX,
- PRIQ_MSI_INDEX,
- ARM_SMMU_MAX_MSIS,
-};
-
#define NUM_ENTRY_QWORDS 8
static_assert(sizeof(struct arm_smmu_ste) == NUM_ENTRY_QWORDS * sizeof(u64));
static_assert(sizeof(struct arm_smmu_cd) == NUM_ENTRY_QWORDS * sizeof(u64));
-static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
- [EVTQ_MSI_INDEX] = {
- ARM_SMMU_EVTQ_IRQ_CFG0,
- ARM_SMMU_EVTQ_IRQ_CFG1,
- ARM_SMMU_EVTQ_IRQ_CFG2,
- },
- [GERROR_MSI_INDEX] = {
- ARM_SMMU_GERROR_IRQ_CFG0,
- ARM_SMMU_GERROR_IRQ_CFG1,
- ARM_SMMU_GERROR_IRQ_CFG2,
- },
- [PRIQ_MSI_INDEX] = {
- ARM_SMMU_PRIQ_IRQ_CFG0,
- ARM_SMMU_PRIQ_IRQ_CFG1,
- ARM_SMMU_PRIQ_IRQ_CFG2,
- },
-};
-
DEFINE_XARRAY_ALLOC1(arm_smmu_asid_xa);
DEFINE_MUTEX(arm_smmu_asid_lock);
@@ -70,149 +44,6 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
struct arm_smmu_device *smmu, u32 flags);
static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
-/* Low-level queue manipulation functions */
-static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
-{
- u32 space, prod, cons;
-
- prod = Q_IDX(q, q->prod);
- cons = Q_IDX(q, q->cons);
-
- if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
- space = (1 << q->max_n_shift) - (prod - cons);
- else
- space = cons - prod;
-
- return space >= n;
-}
-
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
-static bool queue_empty(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
-}
-
-static bool queue_consumed(struct arm_smmu_ll_queue *q, u32 prod)
-{
- return ((Q_WRP(q, q->cons) == Q_WRP(q, prod)) &&
- (Q_IDX(q, q->cons) > Q_IDX(q, prod))) ||
- ((Q_WRP(q, q->cons) != Q_WRP(q, prod)) &&
- (Q_IDX(q, q->cons) <= Q_IDX(q, prod)));
-}
-
-static void queue_sync_cons_out(struct arm_smmu_queue *q)
-{
- /*
- * Ensure that all CPU accesses (reads and writes) to the queue
- * are complete before we update the cons pointer.
- */
- __iomb();
- writel_relaxed(q->llq.cons, q->cons_reg);
-}
-
-static void queue_inc_cons(struct arm_smmu_ll_queue *q)
-{
- u32 cons = (Q_WRP(q, q->cons) | Q_IDX(q, q->cons)) + 1;
- q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
-}
-
-static void queue_sync_cons_ovf(struct arm_smmu_queue *q)
-{
- struct arm_smmu_ll_queue *llq = &q->llq;
-
- if (likely(Q_OVF(llq->prod) == Q_OVF(llq->cons)))
- return;
-
- llq->cons = Q_OVF(llq->prod) | Q_WRP(llq, llq->cons) |
- Q_IDX(llq, llq->cons);
- queue_sync_cons_out(q);
-}
-
-static int queue_sync_prod_in(struct arm_smmu_queue *q)
-{
- u32 prod;
- int ret = 0;
-
- /*
- * We can't use the _relaxed() variant here, as we must prevent
- * speculative reads of the queue before we have determined that
- * prod has indeed moved.
- */
- prod = readl(q->prod_reg);
-
- if (Q_OVF(prod) != Q_OVF(q->llq.prod))
- ret = -EOVERFLOW;
-
- q->llq.prod = prod;
- return ret;
-}
-
-static u32 queue_inc_prod_n(struct arm_smmu_ll_queue *q, int n)
-{
- u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + n;
- return Q_OVF(q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
-}
-
-static void queue_poll_init(struct arm_smmu_device *smmu,
- struct arm_smmu_queue_poll *qp)
-{
- qp->delay = 1;
- qp->spin_cnt = 0;
- qp->wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
- qp->timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
-}
-
-static int queue_poll(struct arm_smmu_queue_poll *qp)
-{
- if (ktime_compare(ktime_get(), qp->timeout) > 0)
- return -ETIMEDOUT;
-
- if (qp->wfe) {
- wfe();
- } else if (++qp->spin_cnt < ARM_SMMU_POLL_SPIN_COUNT) {
- cpu_relax();
- } else {
- udelay(qp->delay);
- qp->delay *= 2;
- qp->spin_cnt = 0;
- }
-
- return 0;
-}
-
-static void queue_write(__le64 *dst, u64 *src, size_t n_dwords)
-{
- int i;
-
- for (i = 0; i < n_dwords; ++i)
- *dst++ = cpu_to_le64(*src++);
-}
-
-static void queue_read(u64 *dst, __le64 *src, size_t n_dwords)
-{
- int i;
-
- for (i = 0; i < n_dwords; ++i)
- *dst++ = le64_to_cpu(*src++);
-}
-
-static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
-{
- if (queue_empty(&q->llq))
- return -EAGAIN;
-
- queue_read(ent, Q_ENT(q, q->llq.cons), q->ent_dwords);
- queue_inc_cons(&q->llq);
- queue_sync_cons_out(q);
- return 0;
-}
-
/* High-level queue accessors */
static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
@@ -3520,111 +3351,6 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
return 0;
}
-static void arm_smmu_free_msis(void *data)
-{
- struct device *dev = data;
-
- platform_device_msi_free_irqs_all(dev);
-}
-
-static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
-{
- phys_addr_t doorbell;
- struct device *dev = msi_desc_to_dev(desc);
- struct arm_smmu_device *smmu = dev_get_drvdata(dev);
- phys_addr_t *cfg = arm_smmu_msi_cfg[desc->msi_index];
-
- doorbell = (((u64)msg->address_hi) << 32) | msg->address_lo;
- doorbell &= MSI_CFG0_ADDR_MASK;
-
- writeq_relaxed(doorbell, smmu->base + cfg[0]);
- writel_relaxed(msg->data, smmu->base + cfg[1]);
- writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
-}
-
-static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
-{
- int ret, nvec = ARM_SMMU_MAX_MSIS;
- struct device *dev = smmu->dev;
-
- /* Clear the MSI address regs */
- writeq_relaxed(0, smmu->base + ARM_SMMU_GERROR_IRQ_CFG0);
- writeq_relaxed(0, smmu->base + ARM_SMMU_EVTQ_IRQ_CFG0);
-
- if (smmu->features & ARM_SMMU_FEAT_PRI)
- writeq_relaxed(0, smmu->base + ARM_SMMU_PRIQ_IRQ_CFG0);
- else
- nvec--;
-
- if (!(smmu->features & ARM_SMMU_FEAT_MSI))
- return;
-
- if (!dev->msi.domain) {
- dev_info(smmu->dev, "msi_domain absent - falling back to wired irqs\n");
- return;
- }
-
- /* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
- ret = platform_device_msi_init_and_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
- if (ret) {
- dev_warn(dev, "failed to allocate MSIs - falling back to wired irqs\n");
- return;
- }
-
- smmu->evtq.q.irq = msi_get_virq(dev, EVTQ_MSI_INDEX);
- smmu->gerr_irq = msi_get_virq(dev, GERROR_MSI_INDEX);
- smmu->priq.q.irq = msi_get_virq(dev, PRIQ_MSI_INDEX);
-
- /* Add callback to free MSIs on teardown */
- devm_add_action_or_reset(dev, arm_smmu_free_msis, dev);
-}
-
-static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
-{
- int irq, ret;
-
- arm_smmu_setup_msis(smmu);
-
- /* Request interrupt lines */
- irq = smmu->evtq.q.irq;
- if (irq) {
- ret = devm_request_threaded_irq(smmu->dev, irq, NULL,
- arm_smmu_evtq_thread,
- IRQF_ONESHOT,
- "arm-smmu-v3-evtq", smmu);
- if (ret < 0)
- dev_warn(smmu->dev, "failed to enable evtq irq\n");
- } else {
- dev_warn(smmu->dev, "no evtq irq - events will not be reported!\n");
- }
-
- irq = smmu->gerr_irq;
- if (irq) {
- ret = devm_request_irq(smmu->dev, irq, arm_smmu_gerror_handler,
- 0, "arm-smmu-v3-gerror", smmu);
- if (ret < 0)
- dev_warn(smmu->dev, "failed to enable gerror irq\n");
- } else {
- dev_warn(smmu->dev, "no gerr irq - errors will not be reported!\n");
- }
-
- if (smmu->features & ARM_SMMU_FEAT_PRI) {
- irq = smmu->priq.q.irq;
- if (irq) {
- ret = devm_request_threaded_irq(smmu->dev, irq, NULL,
- arm_smmu_priq_thread,
- IRQF_ONESHOT,
- "arm-smmu-v3-priq",
- smmu);
- if (ret < 0)
- dev_warn(smmu->dev,
- "failed to enable priq irq\n");
- } else {
- dev_warn(smmu->dev, "no priq irq - PRI will be broken\n");
- }
- }
-}
-
static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
{
int ret, irq;
@@ -3652,7 +3378,8 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
if (ret < 0)
dev_warn(smmu->dev, "failed to enable combined irq\n");
} else
- arm_smmu_setup_unique_irqs(smmu);
+ arm_smmu_setup_unique_irqs(smmu, arm_smmu_evtq_thread,
+ arm_smmu_gerror_handler, arm_smmu_priq_thread);
if (smmu->features & ARM_SMMU_FEAT_PRI)
irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
@@ -3883,7 +3610,7 @@ static struct arm_smmu_device *arm_smmu_impl_probe(struct arm_smmu_device *smmu)
static int arm_smmu_device_probe(struct platform_device *pdev)
{
- int irq, ret;
+ int ret;
struct resource *res;
resource_size_t ioaddr;
struct arm_smmu_device *smmu;
@@ -3929,24 +3656,8 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
smmu->page1 = smmu->base;
}
- /* Interrupt lines */
-
- irq = platform_get_irq_byname_optional(pdev, "combined");
- if (irq > 0)
- smmu->combined_irq = irq;
- else {
- irq = platform_get_irq_byname_optional(pdev, "eventq");
- if (irq > 0)
- smmu->evtq.q.irq = irq;
+ arm_smmu_probe_irq(pdev, smmu);
- irq = platform_get_irq_byname_optional(pdev, "priq");
- if (irq > 0)
- smmu->priq.q.irq = irq;
-
- irq = platform_get_irq_byname_optional(pdev, "gerror");
- if (irq > 0)
- smmu->gerr_irq = irq;
- }
/* Probe the h/w */
ret = arm_smmu_device_hw_probe(smmu);
if (ret)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 63545fdf55f9..d91dfe55835d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -8,6 +8,7 @@
#ifndef _ARM_SMMU_V3_H
#define _ARM_SMMU_V3_H
+#include <linux/delay.h>
#include <linux/iommu.h>
#include <linux/kernel.h>
#include <linux/mmzone.h>
@@ -349,6 +350,13 @@ void arm_smmu_write_strtab_l1_desc(struct arm_smmu_strtab_l1 *dst,
dma_addr_t l2ptr_dma);
void arm_smmu_write_strtab(struct arm_smmu_device *smmu);
+void arm_smmu_probe_irq(struct platform_device *pdev,
+ struct arm_smmu_device *smmu);
+void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu,
+ irqreturn_t evtqirq(int irq, void *dev),
+ irqreturn_t gerrorirq(int irq, void *dev),
+ irqreturn_t priirq(int irq, void *dev));
+
int arm_smmu_register_iommu(struct arm_smmu_device *smmu,
struct iommu_ops *ops, phys_addr_t ioaddr);
void arm_smmu_unregister_iommu(struct arm_smmu_device *smmu);
@@ -425,4 +433,155 @@ tegra241_cmdqv_probe(struct arm_smmu_device *smmu)
return ERR_PTR(-ENODEV);
}
#endif /* CONFIG_TEGRA241_CMDQV */
+
+/* Queue functions shared with common and kernel drivers */
+static bool __maybe_unused queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
+{
+ u32 space, prod, cons;
+
+ prod = Q_IDX(q, q->prod);
+ cons = Q_IDX(q, q->cons);
+
+ if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+ space = (1 << q->max_n_shift) - (prod - cons);
+ else
+ space = cons - prod;
+
+ return space >= n;
+}
+
+static bool __maybe_unused queue_full(struct arm_smmu_ll_queue *q)
+{
+ return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
+ Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
+}
+
+static bool __maybe_unused queue_empty(struct arm_smmu_ll_queue *q)
+{
+ return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
+ Q_WRP(q, q->prod) == Q_WRP(q, q->cons);
+}
+
+static bool __maybe_unused queue_consumed(struct arm_smmu_ll_queue *q, u32 prod)
+{
+ return ((Q_WRP(q, q->cons) == Q_WRP(q, prod)) &&
+ (Q_IDX(q, q->cons) > Q_IDX(q, prod))) ||
+ ((Q_WRP(q, q->cons) != Q_WRP(q, prod)) &&
+ (Q_IDX(q, q->cons) <= Q_IDX(q, prod)));
+}
+
+static void __maybe_unused queue_sync_cons_out(struct arm_smmu_queue *q)
+{
+ /*
+ * Ensure that all CPU accesses (reads and writes) to the queue
+ * are complete before we update the cons pointer.
+ */
+ __iomb();
+ writel_relaxed(q->llq.cons, q->cons_reg);
+}
+
+static void __maybe_unused queue_inc_cons(struct arm_smmu_ll_queue *q)
+{
+ u32 cons = (Q_WRP(q, q->cons) | Q_IDX(q, q->cons)) + 1;
+ q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
+}
+
+static void __maybe_unused queue_sync_cons_ovf(struct arm_smmu_queue *q)
+{
+ struct arm_smmu_ll_queue *llq = &q->llq;
+
+ if (likely(Q_OVF(llq->prod) == Q_OVF(llq->cons)))
+ return;
+
+ llq->cons = Q_OVF(llq->prod) | Q_WRP(llq, llq->cons) |
+ Q_IDX(llq, llq->cons);
+ queue_sync_cons_out(q);
+}
+
+static int __maybe_unused queue_sync_prod_in(struct arm_smmu_queue *q)
+{
+ u32 prod;
+ int ret = 0;
+
+ /*
+ * We can't use the _relaxed() variant here, as we must prevent
+ * speculative reads of the queue before we have determined that
+ * prod has indeed moved.
+ */
+ prod = readl(q->prod_reg);
+
+ if (Q_OVF(prod) != Q_OVF(q->llq.prod))
+ ret = -EOVERFLOW;
+
+ q->llq.prod = prod;
+ return ret;
+}
+
+static u32 __maybe_unused queue_inc_prod_n(struct arm_smmu_ll_queue *q, int n)
+{
+ u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + n;
+ return Q_OVF(q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
+}
+
+static void __maybe_unused queue_poll_init(struct arm_smmu_device *smmu,
+ struct arm_smmu_queue_poll *qp)
+{
+ qp->delay = 1;
+ qp->spin_cnt = 0;
+ qp->wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ qp->timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
+}
+
+static int __maybe_unused queue_poll(struct arm_smmu_queue_poll *qp)
+{
+ if (ktime_compare(ktime_get(), qp->timeout) > 0)
+ return -ETIMEDOUT;
+
+ if (qp->wfe) {
+ wfe();
+ } else if (++qp->spin_cnt < ARM_SMMU_POLL_SPIN_COUNT) {
+ cpu_relax();
+ } else {
+ udelay(qp->delay);
+ qp->delay *= 2;
+ qp->spin_cnt = 0;
+ }
+
+ return 0;
+}
+
+static void __maybe_unused queue_write(__le64 *dst, u64 *src, size_t n_dwords)
+{
+ int i;
+
+ for (i = 0; i < n_dwords; ++i)
+ *dst++ = cpu_to_le64(*src++);
+}
+
+static void __maybe_unused queue_read(u64 *dst, __le64 *src, size_t n_dwords)
+{
+ int i;
+
+ for (i = 0; i < n_dwords; ++i)
+ *dst++ = le64_to_cpu(*src++);
+}
+
+static int __maybe_unused queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
+{
+ if (queue_empty(&q->llq))
+ return -EAGAIN;
+
+ queue_read(ent, Q_ENT(q, q->llq.cons), q->ent_dwords);
+ queue_inc_cons(&q->llq);
+ queue_sync_cons_out(q);
+ return 0;
+}
+
+enum arm_smmu_msi_index {
+ EVTQ_MSI_INDEX,
+ GERROR_MSI_INDEX,
+ PRIQ_MSI_INDEX,
+ ARM_SMMU_MAX_MSIS,
+};
+
#endif /* _ARM_SMMU_V3_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay()
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (9 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 10/58] iommu/arm-smmu-v3: Move common irq code to common file Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-19 11:14 ` Quentin Perret
2024-12-12 18:03 ` [RFC PATCH v2 12/58] KVM: arm64: Add __pkvm_{use, unuse}_dma() Mostafa Saleh
` (47 subsequent siblings)
58 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Add a simple delay loop for drivers.
This could use more work. It should be possible to insert a wfe and save
power, but I haven't studied whether it is safe to do so with the host
in control of the event stream. The SMMU driver will use wfe anyway for
frequent waits (provided the implementation can send command queue
events).
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
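For illustration only, a minimal sketch of how an nVHE driver could build a
bounded poll on top of pkvm_udelay(); the register, mask, step and budget
below are assumptions, not part of this patch:

static int example_poll_until_clear(void __iomem *reg, u32 mask)
{
        unsigned long budget_us = 100 * 1000;   /* give up after ~100ms */

        while (readl_relaxed(reg) & mask) {
                if (budget_us < 10)
                        return -ETIMEDOUT;
                pkvm_udelay(10);                /* burn ~10us per iteration */
                budget_us -= 10;
        }

        return 0;
}

With the conversion used by the patch, a 50 MHz counter (cntfrq_el0 =
50000000) makes pkvm_time_us_to_ticks(10) come out at 500 counter ticks.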
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 3 ++
arch/arm64/kvm/hyp/nvhe/setup.c | 4 +++
arch/arm64/kvm/hyp/nvhe/timer-sr.c | 42 ++++++++++++++++++++++++++
3 files changed, 49 insertions(+)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
index 3b515ce4c433..8a5554615e40 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
@@ -143,4 +143,7 @@ int pkvm_load_pvmfw_pages(struct pkvm_hyp_vm *vm, u64 ipa, phys_addr_t phys,
u64 size);
void pkvm_poison_pvmfw_pages(void);
+int pkvm_timer_init(void);
+void pkvm_udelay(unsigned long usecs);
+
#endif /* __ARM64_KVM_NVHE_PKVM_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 46dd68161979..9d09f5f471b9 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -356,6 +356,10 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;
+ ret = pkvm_timer_init();
+ if (ret)
+ goto out;
+
ret = fix_host_ownership();
if (ret)
goto out;
diff --git a/arch/arm64/kvm/hyp/nvhe/timer-sr.c b/arch/arm64/kvm/hyp/nvhe/timer-sr.c
index 3aaab20ae5b4..732beb5fe24b 100644
--- a/arch/arm64/kvm/hyp/nvhe/timer-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/timer-sr.c
@@ -11,6 +11,10 @@
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
+#include <nvhe/pkvm.h>
+
+static u32 timer_freq;
+
void __kvm_timer_set_cntvoff(u64 cntvoff)
{
write_sysreg(cntvoff, cntvoff_el2);
@@ -60,3 +64,41 @@ void __timer_enable_traps(struct kvm_vcpu *vcpu)
sysreg_clear_set(cnthctl_el2, clr, set);
}
+
+static u64 pkvm_ticks_get(void)
+{
+ return __arch_counter_get_cntvct();
+}
+
+#define SEC_TO_US 1000000
+
+int pkvm_timer_init(void)
+{
+ timer_freq = read_sysreg(cntfrq_el0);
+ /*
+ * TODO: The highest privileged level is supposed to initialize this
+ * register. But on some systems (which?), this information is only
+ * contained in the device-tree, so we'll need to find it out some other
+ * way.
+ */
+ if (!timer_freq || timer_freq < SEC_TO_US)
+ return -ENODEV;
+ return 0;
+}
+
+#define pkvm_time_us_to_ticks(us) ((u64)(us) * timer_freq / SEC_TO_US)
+
+void pkvm_udelay(unsigned long usecs)
+{
+ u64 ticks = pkvm_time_us_to_ticks(usecs);
+ u64 start = pkvm_ticks_get();
+
+ while (true) {
+ u64 cur = pkvm_ticks_get();
+
+ if ((cur - start) >= ticks || cur < start)
+ break;
+ /* TODO wfe */
+ cpu_relax();
+ }
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 12/58] KVM: arm64: Add __pkvm_{use, unuse}_dma()
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (10 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay() Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-19 11:23 ` Quentin Perret
2024-12-12 18:03 ` [RFC PATCH v2 13/58] KVM: arm64: Introduce IOMMU driver infrastructure Mostafa Saleh
` (46 subsequent siblings)
58 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
When a page is mapped in an IOMMU page table for DMA, it must
not be donated to a guest or the hypervisor. We ensure this with:
- The host can only map pages that are OWNED
- Any page that is mapped is refcounted
- Donation/sharing is prevented by the refcount check in
host_request_owned_transition()
- No MMIO transition is allowed beyond the IOMMU MMIO, which
happens during de-privilege.
If shared pages are allowed to be mapped in the future,
similar checks are needed in host_request_unshare() and
host_ack_unshare().
Add two functions to be called before each IOMMU map
and after each successful IOMMU unmap.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
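To show the intended calling pattern, here is a small sketch under the
assumption of a hypothetical hypervisor page-table helper
(driver_map_pages() below is a placeholder, not part of this patch):

static int example_iommu_map(unsigned long iova, phys_addr_t paddr, size_t size)
{
        int ret;

        /* Pin the range before it becomes reachable through DMA */
        ret = __pkvm_host_use_dma(paddr, size);
        if (ret)
                return ret;

        ret = driver_map_pages(iova, paddr, size);
        if (ret)
                /* Nothing was mapped, drop the references again */
                __pkvm_host_unuse_dma(paddr, size);

        return ret;
}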
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 97 +++++++++++++++++++
2 files changed, 99 insertions(+)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 67466b4941b4..d75e64e59596 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -92,6 +92,8 @@ int __pkvm_remove_ioguard_page(struct pkvm_hyp_vcpu *hyp_vcpu, u64 ipa);
bool __pkvm_check_ioguard_page(struct pkvm_hyp_vcpu *hyp_vcpu);
int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
u64 ipa, u64 *ppa);
+int __pkvm_host_use_dma(u64 phys_addr, size_t size);
+int __pkvm_host_unuse_dma(u64 phys_addr, size_t size);
bool addr_is_memory(phys_addr_t phys);
int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index d14f4d63eb8b..0840af20c366 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -513,6 +513,20 @@ bool addr_is_memory(phys_addr_t phys)
return !!find_mem_range(phys, &range);
}
+static bool is_range_refcounted(phys_addr_t addr, u64 nr_pages)
+{
+ struct hyp_page *p;
+ int i;
+
+ for (i = 0 ; i < nr_pages ; ++i) {
+ p = hyp_phys_to_page(addr + i * PAGE_SIZE);
+ if (hyp_refcount_get(p->refcount))
+ return true;
+ }
+
+ return false;
+}
+
static bool addr_is_allowed_memory(phys_addr_t phys)
{
struct memblock_region *reg;
@@ -927,6 +941,9 @@ static int host_request_owned_transition(u64 *completer_addr,
u64 size = tx->nr_pages * PAGE_SIZE;
u64 addr = tx->initiator.addr;
+ if (range_is_memory(addr, addr + size) && is_range_refcounted(addr, tx->nr_pages))
+ return -EINVAL;
+
*completer_addr = tx->initiator.host.completer_addr;
return __host_check_page_state_range(addr, size, PKVM_PAGE_OWNED);
}
@@ -938,6 +955,7 @@ static int host_request_unshare(u64 *completer_addr,
u64 addr = tx->initiator.addr;
*completer_addr = tx->initiator.host.completer_addr;
+
return __host_check_page_state_range(addr, size, PKVM_PAGE_SHARED_OWNED);
}
@@ -2047,6 +2065,85 @@ int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages)
return ret;
}
+static void __pkvm_host_use_dma_page(phys_addr_t phys_addr)
+{
+ struct hyp_page *p = hyp_phys_to_page(phys_addr);
+
+ hyp_page_ref_inc(p);
+}
+
+static void __pkvm_host_unuse_dma_page(phys_addr_t phys_addr)
+{
+ struct hyp_page *p = hyp_phys_to_page(phys_addr);
+
+ hyp_page_ref_dec(p);
+}
+
+/*
+ * __pkvm_host_use_dma - Mark host memory as used for DMA
+ * @phys_addr: physical address of the DMA region
+ * @size: size of the DMA region
+ * When a page is mapped in an IOMMU page table for DMA, it must
+ * not be donated to a guest or the hypervisor. We ensure this with:
+ * - Host can only map pages that are OWNED
+ * - Any page that is mapped is refcounted
+ * - Donation/Sharing is prevented from refcount check in
+ * host_request_owned_transition()
+ * - No MMIO transition is allowed beyond IOMMU MMIO which
+ * happens during de-privilege.
+ * In case in the future shared pages are allowed to be mapped,
+ * similar checks are needed in host_request_unshare() and
+ * host_ack_unshare()
+ */
+int __pkvm_host_use_dma(phys_addr_t phys_addr, size_t size)
+{
+ int i;
+ int ret = 0;
+ size_t nr_pages = size >> PAGE_SHIFT;
+
+ if (WARN_ON(!PAGE_ALIGNED(phys_addr | size)))
+ return -EINVAL;
+
+ host_lock_component();
+ ret = __host_check_page_state_range(phys_addr, size, PKVM_PAGE_OWNED);
+ if (ret)
+ goto out_ret;
+
+ if (!range_is_memory(phys_addr, phys_addr + size))
+ goto out_ret;
+
+ for (i = 0; i < nr_pages; i++)
+ __pkvm_host_use_dma_page(phys_addr + i * PAGE_SIZE);
+
+out_ret:
+ host_unlock_component();
+ return ret;
+}
+
+int __pkvm_host_unuse_dma(phys_addr_t phys_addr, size_t size)
+{
+ int i;
+ size_t nr_pages = size >> PAGE_SHIFT;
+
+ if (WARN_ON(!PAGE_ALIGNED(phys_addr | size)))
+ return -EINVAL;
+
+ host_lock_component();
+ if (!range_is_memory(phys_addr, phys_addr + size))
+ goto out_ret;
+ /*
+ * We end up here after the caller successfully unmapped the page from
+ * the IOMMU table. Which means that a ref is held, the page is shared
+ * in the host s2, there can be no failure.
+ */
+ for (i = 0; i < nr_pages; i++)
+ __pkvm_host_unuse_dma_page(phys_addr + i * PAGE_SIZE);
+
+out_ret:
+ host_unlock_component();
+ return 0;
+}
+
int __pkvm_host_share_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu,
enum kvm_pgtable_prot prot)
{
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 13/58] KVM: arm64: Introduce IOMMU driver infrastructure
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (11 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 12/58] KVM: arm64: Add __pkvm_{use, unuse}_dma() Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 14/58] KVM: arm64: pkvm: Add IOMMU hypercalls Mostafa Saleh
` (45 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
To establish DMA isolation, KVM needs an IOMMU driver that provides
certain ops. These ops are defined outside of the iommu_ops and have
two components:
- kvm_iommu_driver (kernel): Implements simple interaction with
the kernel (init, remove, ...)
- kvm_iommu_ops (hypervisor): Implements the paravirtual interface
(map, unmap, attach, detach, ...)
Only one driver can be used; it is registered with
kvm_iommu_register_driver() by passing pointers to both ops.
KVM initialises the driver after its own initialisation and before the
de-privilege point, which is a suitable point to establish trusted
interaction between the host and the hypervisor. This also allows the
host kernel to perform the one-time initialisation from the kernel and
avoid that complexity in the hypervisor, as the kernel is still trusted
at this point.
During the registration call, the pointer to the hypervisor ops is also
initialised.
The hypervisor init part is called from the finalise hypercall, which
is executed after the kernel KVM IOMMU driver init.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
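A minimal sketch of a registration, assuming a kernel-side example driver
and a hypervisor-side ops structure defined elsewhere (all example_* names
are placeholders, not part of this patch):

static int example_init_driver(void)
{
        /* Probe and prepare the IOMMUs while the kernel is still trusted */
        return 0;
}

static void example_remove_driver(void)
{
}

static struct kvm_iommu_driver example_kern_ops = {
        .init_driver   = example_init_driver,
        .remove_driver = example_remove_driver,
};

/* Hypervisor-side ops, built in the nVHE object and referenced by symbol */
extern struct kvm_iommu_ops example_el2_ops;

static int __init example_register(void)
{
        /* Must run before KVM finalises and de-privileges the host */
        return kvm_iommu_register_driver(&example_kern_ops, &example_el2_ops);
}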
arch/arm64/include/asm/kvm_host.h | 11 ++++++
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 8 ++++-
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 13 +++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 18 ++++++++++
arch/arm64/kvm/hyp/nvhe/setup.c | 5 +++
arch/arm64/kvm/iommu.c | 47 +++++++++++++++++++++++++
8 files changed, 103 insertions(+), 3 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
create mode 100644 arch/arm64/kvm/iommu.c
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 53916a7f0def..54416cfea573 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1628,4 +1628,15 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
unsigned long __pkvm_reclaim_hyp_alloc_mgt(unsigned long nr_pages);
+struct kvm_iommu_driver {
+ int (*init_driver)(void);
+ void (*remove_driver)(void);
+};
+
+struct kvm_iommu_ops;
+int kvm_iommu_register_driver(struct kvm_iommu_driver *kern_ops,
+ struct kvm_iommu_ops *el2_ops);
+int kvm_iommu_init_driver(void);
+void kvm_iommu_remove_driver(void);
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index f9e208273031..440897366e88 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-v3.o vgic/vgic-v4.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
- vgic/vgic-its.o vgic/vgic-debug.o
+ vgic/vgic-its.o vgic/vgic-debug.o iommu.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 94b210f36573..4b486323c0c9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2510,9 +2510,15 @@ static int __init kvm_hyp_init_protection(u32 hyp_va_bits)
if (ret)
return ret;
+ ret = kvm_iommu_init_driver();
+ if (ret < 0)
+ return ret;
+
ret = do_pkvm_init(hyp_va_bits);
- if (ret)
+ if (ret) {
+ kvm_iommu_remove_driver();
return ret;
+ }
free_hyp_pgds();
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
new file mode 100644
index 000000000000..1ac70cc28a9e
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARM64_KVM_NVHE_IOMMU_H__
+#define __ARM64_KVM_NVHE_IOMMU_H__
+
+#include <asm/kvm_host.h>
+
+struct kvm_iommu_ops {
+ int (*init)(void);
+};
+
+int kvm_iommu_init(void);
+
+#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 415cc51fe391..9e1b74c661d2 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -8,7 +8,7 @@ CFLAGS_switch.nvhe.o += -Wno-override-init
hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o alloc.o early_alloc.o page_alloc.o \
cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o \
- serial.o alloc_mgt.o
+ serial.o alloc_mgt.o iommu/iommu.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
new file mode 100644
index 000000000000..3bd87d2084e9
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * IOMMU operations for pKVM
+ *
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+#include <nvhe/iommu.h>
+
+/* Only one set of ops supported, similarly to the kernel */
+struct kvm_iommu_ops *kvm_iommu_ops;
+
+int kvm_iommu_init(void)
+{
+ if (!kvm_iommu_ops || !kvm_iommu_ops->init)
+ return -ENODEV;
+
+ return kvm_iommu_ops->init();
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 9d09f5f471b9..4d36616a7f02 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -14,6 +14,7 @@
#include <nvhe/early_alloc.h>
#include <nvhe/ffa.h>
#include <nvhe/gfp.h>
+#include <nvhe/iommu.h>
#include <nvhe/memory.h>
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
@@ -360,6 +361,10 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;
+ ret = kvm_iommu_init();
+ if (ret)
+ goto out;
+
ret = fix_host_ownership();
if (ret)
goto out;
diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
new file mode 100644
index 000000000000..ed77ea0d12bb
--- /dev/null
+++ b/arch/arm64/kvm/iommu.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ * Author: Mostafa Saleh <smostafa@google.com>
+ */
+
+#include <asm/kvm_mmu.h>
+#include <linux/kvm_host.h>
+
+struct kvm_iommu_driver *iommu_driver;
+extern struct kvm_iommu_ops *kvm_nvhe_sym(kvm_iommu_ops);
+
+int kvm_iommu_register_driver(struct kvm_iommu_driver *kern_ops, struct kvm_iommu_ops *el2_ops)
+{
+ int ret;
+
+ if (WARN_ON(!kern_ops || !el2_ops))
+ return -EINVAL;
+
+ /*
+ * Paired with smp_load_acquire(&iommu_driver)
+ * Ensure memory stores happening during a driver
+ * init are observed before executing kvm iommu callbacks.
+ */
+ ret = cmpxchg_release(&iommu_driver, NULL, kern_ops) ? -EBUSY : 0;
+ if (ret)
+ return ret;
+
+ kvm_nvhe_sym(kvm_iommu_ops) = el2_ops;
+ return 0;
+}
+
+int kvm_iommu_init_driver(void)
+{
+ if (WARN_ON(!smp_load_acquire(&iommu_driver))) {
+ kvm_err("pKVM enabled without an IOMMU driver, do not run confidential workloads in virtual machines\n");
+ return -ENODEV;
+ }
+
+ return iommu_driver->init_driver();
+}
+
+void kvm_iommu_remove_driver(void)
+{
+ if (smp_load_acquire(&iommu_driver))
+ iommu_driver->remove_driver();
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 14/58] KVM: arm64: pkvm: Add IOMMU hypercalls
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (12 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 13/58] KVM: arm64: Introduce IOMMU driver infrastructure Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 15/58] KVM: arm64: iommu: Add a memory pool for the IOMMU Mostafa Saleh
` (44 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
The unprivileged host IOMMU driver forwards some of the IOMMU API calls
to the hypervisor, which installs and populates the page tables.
Note that this is not a stable ABI. Those hypercalls change with the
kernel just like internal function calls.
One special thing about some of the IOMMU hypercalls is that they use
the newly added hyp_reqs_smccc_encode() to encode memory requests in
the HVC return, leveraging the X1, X2 and X3 registers as allowed by SMCCC.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
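As a sketch of the host side of the interface (an assumption for
illustration; the actual host driver comes later in the series), a map
request could be forwarded with the existing kvm_call_hyp_nvhe() helper:

static size_t example_host_map_pages(pkvm_handle_t domain_id, unsigned long iova,
                                     phys_addr_t paddr, size_t pgsize,
                                     size_t pgcount, int prot)
{
        /*
         * Returns the mapped size reported by the hypervisor; the real
         * caller also has to inspect the hyp_reqs_smccc_encode() payload
         * in X1-X3 and refill the hypervisor allocator when asked to.
         */
        return kvm_call_hyp_nvhe(__pkvm_host_iommu_map_pages, domain_id,
                                 iova, paddr, pgsize, pgcount, prot);
}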
arch/arm64/include/asm/kvm_asm.h | 7 ++
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 14 ++++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 89 +++++++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 40 +++++++++++
4 files changed, 150 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index e4b391bdfdac..9ea155a04332 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -107,6 +107,13 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_hyp_alloc_mgt_refill,
__KVM_HOST_SMCCC_FUNC___pkvm_hyp_alloc_mgt_reclaimable,
__KVM_HOST_SMCCC_FUNC___pkvm_hyp_alloc_mgt_reclaim,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_alloc_domain,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_free_domain,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_attach_dev,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_detach_dev,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_map_pages,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_unmap_pages,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_iova_to_phys,
/*
* Start of the dynamically registered hypercalls. Start a bit
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 1ac70cc28a9e..908863f07b0b 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -4,6 +4,20 @@
#include <asm/kvm_host.h>
+/* Hypercall handlers */
+int kvm_iommu_alloc_domain(pkvm_handle_t domain_id, int type);
+int kvm_iommu_free_domain(pkvm_handle_t domain_id);
+int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+ u32 endpoint_id, u32 pasid, u32 pasid_bits);
+int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+ u32 endpoint_id, u32 pasid);
+size_t kvm_iommu_map_pages(pkvm_handle_t domain_id,
+ unsigned long iova, phys_addr_t paddr, size_t pgsize,
+ size_t pgcount, int prot);
+size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
+ size_t pgsize, size_t pgcount);
+phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova);
+
struct kvm_iommu_ops {
int (*init)(void);
};
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 08c0ff823a55..9b224842c487 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -19,6 +19,7 @@
#include <nvhe/alloc.h>
#include <nvhe/alloc_mgt.h>
#include <nvhe/ffa.h>
+#include <nvhe/iommu.h>
#include <nvhe/mem_protect.h>
#include <nvhe/modules.h>
#include <nvhe/mm.h>
@@ -1592,6 +1593,87 @@ static void handle___pkvm_hyp_alloc_mgt_reclaim(struct kvm_cpu_context *host_ctx
cpu_reg(host_ctxt, 2) = mc.nr_pages;
}
+static void handle___pkvm_host_iommu_alloc_domain(struct kvm_cpu_context *host_ctxt)
+{
+ int ret;
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 1);
+ DECLARE_REG(int, type, host_ctxt, 2);
+
+ ret = kvm_iommu_alloc_domain(domain, type);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
+static void handle___pkvm_host_iommu_free_domain(struct kvm_cpu_context *host_ctxt)
+{
+ int ret;
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 1);
+
+ ret = kvm_iommu_free_domain(domain);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
+static void handle___pkvm_host_iommu_attach_dev(struct kvm_cpu_context *host_ctxt)
+{
+ int ret;
+ DECLARE_REG(pkvm_handle_t, iommu, host_ctxt, 1);
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 2);
+ DECLARE_REG(unsigned int, endpoint, host_ctxt, 3);
+ DECLARE_REG(unsigned int, pasid, host_ctxt, 4);
+ DECLARE_REG(unsigned int, pasid_bits, host_ctxt, 5);
+
+ ret = kvm_iommu_attach_dev(iommu, domain, endpoint,
+ pasid, pasid_bits);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
+static void handle___pkvm_host_iommu_detach_dev(struct kvm_cpu_context *host_ctxt)
+{
+ int ret;
+ DECLARE_REG(pkvm_handle_t, iommu, host_ctxt, 1);
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 2);
+ DECLARE_REG(unsigned int, endpoint, host_ctxt, 3);
+ DECLARE_REG(unsigned int, pasid, host_ctxt, 4);
+
+ ret = kvm_iommu_detach_dev(iommu, domain, endpoint, pasid);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
+static void handle___pkvm_host_iommu_map_pages(struct kvm_cpu_context *host_ctxt)
+{
+ unsigned long ret;
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 1);
+ DECLARE_REG(unsigned long, iova, host_ctxt, 2);
+ DECLARE_REG(phys_addr_t, paddr, host_ctxt, 3);
+ DECLARE_REG(size_t, pgsize, host_ctxt, 4);
+ DECLARE_REG(size_t, pgcount, host_ctxt, 5);
+ DECLARE_REG(unsigned int, prot, host_ctxt, 6);
+
+ ret = kvm_iommu_map_pages(domain, iova, paddr,
+ pgsize, pgcount, prot);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
+static void handle___pkvm_host_iommu_unmap_pages(struct kvm_cpu_context *host_ctxt)
+{
+ unsigned long ret;
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 1);
+ DECLARE_REG(unsigned long, iova, host_ctxt, 2);
+ DECLARE_REG(size_t, pgsize, host_ctxt, 3);
+ DECLARE_REG(size_t, pgcount, host_ctxt, 4);
+
+ ret = kvm_iommu_unmap_pages(domain, iova,
+ pgsize, pgcount);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
+static void handle___pkvm_host_iommu_iova_to_phys(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 1);
+ DECLARE_REG(unsigned long, iova, host_ctxt, 2);
+
+ cpu_reg(host_ctxt, 1) = kvm_iommu_iova_to_phys(domain, iova);
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -1649,6 +1731,13 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_hyp_alloc_mgt_refill),
HANDLE_FUNC(__pkvm_hyp_alloc_mgt_reclaimable),
HANDLE_FUNC(__pkvm_hyp_alloc_mgt_reclaim),
+ HANDLE_FUNC(__pkvm_host_iommu_alloc_domain),
+ HANDLE_FUNC(__pkvm_host_iommu_free_domain),
+ HANDLE_FUNC(__pkvm_host_iommu_attach_dev),
+ HANDLE_FUNC(__pkvm_host_iommu_detach_dev),
+ HANDLE_FUNC(__pkvm_host_iommu_map_pages),
+ HANDLE_FUNC(__pkvm_host_iommu_unmap_pages),
+ HANDLE_FUNC(__pkvm_host_iommu_iova_to_phys),
};
static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index 3bd87d2084e9..9022fd612a49 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -16,3 +16,43 @@ int kvm_iommu_init(void)
return kvm_iommu_ops->init();
}
+
+int kvm_iommu_alloc_domain(pkvm_handle_t domain_id, int type)
+{
+ return -ENODEV;
+}
+
+int kvm_iommu_free_domain(pkvm_handle_t domain_id)
+{
+ return -ENODEV;
+}
+
+int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+ u32 endpoint_id, u32 pasid, u32 pasid_bits)
+{
+ return -ENODEV;
+}
+
+int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
+ u32 endpoint_id, u32 pasid)
+{
+ return -ENODEV;
+}
+
+size_t kvm_iommu_map_pages(pkvm_handle_t domain_id,
+ unsigned long iova, phys_addr_t paddr, size_t pgsize,
+ size_t pgcount, int prot)
+{
+ return 0;
+}
+
+size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
+ size_t pgsize, size_t pgcount)
+{
+ return 0;
+}
+
+phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova)
+{
+ return 0;
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 15/58] KVM: arm64: iommu: Add a memory pool for the IOMMU
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (13 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 14/58] KVM: arm64: pkvm: Add IOMMU hypercalls Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 16/58] KVM: arm64: iommu: Add domains Mostafa Saleh
` (43 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
This patch defines a new hypervisor allocator, which is an instance of
the hyp buddy allocator.
The IOMMU drivers use this allocator for their page tables; generally
these pages have 2 properties:
- Can be multi-order
- Can be non-coherent
The interface provides functions and wrappers for those types of
allocations.
The hypervisor IOMMU code leverages the allocator-management interface,
which provides a standardized way for the kernel part of the IOMMU
driver to top up the allocator, and for pKVM to reclaim memory from it
through the shrinker.
Also, the allocation function automatically creates a request when it
fails to allocate memory from the pool, so it is sufficient for the
driver to return an error code; the kernel part of the driver then
checks the requests encoded in the hypercall return and refills the
hypervisor allocator.
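As an illustration of the intended flow (sketch only; the function name
below is hypothetical and not part of this patch), a hypervisor IOMMU
driver would allocate its page-table memory like this:

/*
 * Sketch: allocate one page-table page from the common IOMMU pool. On
 * failure, kvm_iommu_donate_pages() has already recorded a
 * KVM_HYP_REQ_TYPE_MEM request in the per-CPU host_hyp_reqs, so the driver
 * only propagates an error; the kernel part of the driver then sees the
 * request in the hypercall return, tops up the pool through the
 * allocator-management interface (HYP_ALLOC_MGT_IOMMU_ID) and retries.
 */
static void *driver_alloc_pgtable_page(void)
{
	return kvm_iommu_donate_page();
}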
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_host.h | 1 +
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 13 ++++
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/nvhe/alloc_mgt.c | 2 +
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 86 +++++++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/mm.c | 17 +++++
6 files changed, 120 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 54416cfea573..a3b5d8dd8995 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1625,6 +1625,7 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
/* Allocator interface IDs. */
#define HYP_ALLOC_MGT_HEAP_ID 0
+#define HYP_ALLOC_MGT_IOMMU_ID 1
unsigned long __pkvm_reclaim_hyp_alloc_mgt(unsigned long nr_pages);
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 908863f07b0b..5f91605cd48a 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -4,6 +4,8 @@
#include <asm/kvm_host.h>
+#include <nvhe/alloc_mgt.h>
+
/* Hypercall handlers */
int kvm_iommu_alloc_domain(pkvm_handle_t domain_id, int type);
int kvm_iommu_free_domain(pkvm_handle_t domain_id);
@@ -18,10 +20,21 @@ size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
size_t pgsize, size_t pgcount);
phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova);
+/* Flags for memory allocation for IOMMU drivers */
+#define IOMMU_PAGE_NOCACHE BIT(0)
+void *kvm_iommu_donate_pages(u8 order, int flags);
+void kvm_iommu_reclaim_pages(void *p, u8 order);
+
+#define kvm_iommu_donate_page() kvm_iommu_donate_pages(0, 0)
+#define kvm_iommu_donate_page_nc() kvm_iommu_donate_pages(0, IOMMU_PAGE_NOCACHE)
+#define kvm_iommu_reclaim_page(p) kvm_iommu_reclaim_pages(p, 0)
+
struct kvm_iommu_ops {
int (*init)(void);
};
int kvm_iommu_init(void);
+extern struct hyp_mgt_allocator_ops kvm_iommu_allocator_ops;
+
#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index 5d33aca7d686..7b425f811efb 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -37,6 +37,7 @@ int __hyp_allocator_map(unsigned long start, phys_addr_t phys);
int __pkvm_map_module_page(u64 pfn, void *va, enum kvm_pgtable_prot prot, bool is_protected);
void __pkvm_unmap_module_page(u64 pfn, void *va);
void *__pkvm_alloc_module_va(u64 nr_pages);
+int pkvm_remap_range(void *va, int nr_pages, bool nc);
#ifdef CONFIG_NVHE_EL2_DEBUG
void assert_in_mod_range(unsigned long addr);
#else
diff --git a/arch/arm64/kvm/hyp/nvhe/alloc_mgt.c b/arch/arm64/kvm/hyp/nvhe/alloc_mgt.c
index 4a0f33b9820a..cfd903b30427 100644
--- a/arch/arm64/kvm/hyp/nvhe/alloc_mgt.c
+++ b/arch/arm64/kvm/hyp/nvhe/alloc_mgt.c
@@ -7,9 +7,11 @@
#include <nvhe/alloc.h>
#include <nvhe/alloc_mgt.h>
+#include <nvhe/iommu.h>
static struct hyp_mgt_allocator_ops *registered_allocators[] = {
[HYP_ALLOC_MGT_HEAP_ID] = &hyp_alloc_ops,
+ [HYP_ALLOC_MGT_IOMMU_ID] = &kvm_iommu_allocator_ops,
};
#define MAX_ALLOC_ID (ARRAY_SIZE(registered_allocators))
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index 9022fd612a49..af6ae9b4dc51 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -5,15 +5,101 @@
* Copyright (C) 2022 Linaro Ltd.
*/
#include <nvhe/iommu.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
/* Only one set of ops supported, similary to the kernel */
struct kvm_iommu_ops *kvm_iommu_ops;
+/*
+ * Common pool that can be used by IOMMU driver to allocate pages.
+ */
+static struct hyp_pool iommu_host_pool;
+
+DECLARE_PER_CPU(struct kvm_hyp_req, host_hyp_reqs);
+
+static int kvm_iommu_refill(struct kvm_hyp_memcache *host_mc)
+{
+ if (!kvm_iommu_ops)
+ return -EINVAL;
+
+ return refill_hyp_pool(&iommu_host_pool, host_mc);
+}
+
+static void kvm_iommu_reclaim(struct kvm_hyp_memcache *host_mc, int target)
+{
+ if (!kvm_iommu_ops)
+ return;
+
+ reclaim_hyp_pool(&iommu_host_pool, host_mc, target);
+}
+
+static int kvm_iommu_reclaimable(void)
+{
+ if (!kvm_iommu_ops)
+ return 0;
+
+ return hyp_pool_free_pages(&iommu_host_pool);
+}
+
+struct hyp_mgt_allocator_ops kvm_iommu_allocator_ops = {
+ .refill = kvm_iommu_refill,
+ .reclaim = kvm_iommu_reclaim,
+ .reclaimable = kvm_iommu_reclaimable,
+};
+
+void *kvm_iommu_donate_pages(u8 order, int flags)
+{
+ void *p;
+ struct kvm_hyp_req *req = this_cpu_ptr(&host_hyp_reqs);
+ int ret;
+
+ p = hyp_alloc_pages(&iommu_host_pool, order);
+ if (p) {
+ /*
+ * If the page is requested as non-cacheable, remap it as such,
+ * as all pages in the pool are mapped beforehand and
+ * assumed to be cacheable.
+ */
+ if (flags & IOMMU_PAGE_NOCACHE) {
+ ret = pkvm_remap_range(p, 1 << order, true);
+ if (ret) {
+ hyp_put_page(&iommu_host_pool, p);
+ return NULL;
+ }
+ }
+ return p;
+ }
+
+ req->type = KVM_HYP_REQ_TYPE_MEM;
+ req->mem.dest = REQ_MEM_DEST_HYP_IOMMU;
+ req->mem.sz_alloc = (1 << order) * PAGE_SIZE;
+ req->mem.nr_pages = 1;
+ return NULL;
+}
+
+void kvm_iommu_reclaim_pages(void *p, u8 order)
+{
+ /*
+ * Remap all pages to cacheable, as we don't know their current state;
+ * maybe use a flag in the vmemmap, or trust the driver to pass the
+ * same cacheability on free as at allocation?
+ */
+ pkvm_remap_range(p, 1 << order, false);
+ hyp_put_page(&iommu_host_pool, p);
+}
+
int kvm_iommu_init(void)
{
+ int ret;
+
if (!kvm_iommu_ops || !kvm_iommu_ops->init)
return -ENODEV;
+ ret = hyp_pool_init_empty(&iommu_host_pool, 64);
+ if (ret)
+ return ret;
+
return kvm_iommu_ops->init();
}
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index 76bbb4c9012e..7a18b31538ae 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -564,3 +564,20 @@ int reclaim_hyp_pool(struct hyp_pool *pool, struct kvm_hyp_memcache *host_mc,
return 0;
}
+
+/* Remap hyp memory with different cacheability */
+int pkvm_remap_range(void *va, int nr_pages, bool nc)
+{
+ size_t size = nr_pages << PAGE_SHIFT;
+ phys_addr_t phys = hyp_virt_to_phys(va);
+ enum kvm_pgtable_prot prot = PKVM_HOST_MEM_PROT;
+ int ret;
+
+ if (nc)
+ prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+ hyp_spin_lock(&pkvm_pgd_lock);
+ WARN_ON(kvm_pgtable_hyp_unmap(&pkvm_pgtable, (u64)va, size) != size);
+ ret = kvm_pgtable_hyp_map(&pkvm_pgtable, (u64)va, size, phys, prot);
+ hyp_spin_unlock(&pkvm_pgd_lock);
+ return ret;
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 16/58] KVM: arm64: iommu: Add domains
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (14 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 15/58] KVM: arm64: iommu: Add a memory pool for the IOMMU Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 17/58] KVM: arm64: iommu: Add {attach, detach}_dev Mostafa Saleh
` (42 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
The IOMMU domain abstraction allows sharing the same page tables
between multiple devices. That may be necessary due to hardware
constraints, if multiple devices cannot be isolated by the IOMMU
(conventional PCI bus for example). It may also help with optimizing
resource or TLB use. For pKVM in particular, it may be useful to reduce
the amount of memory required for page tables. All devices owned by the
host kernel could be attached to the same domain (though that requires
host changes).
There is one domain space shared by all IOMMUs, holding up to 2^16
domains.
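For illustration, the two-level lookup boils down to the following index
computation (a sketch of what handle_to_domain() in this patch does,
ignoring the lazy leaf allocation):

/*
 * Sketch: with 4kB pages and 16-byte domains, KVM_IOMMU_DOMAINS_PER_PAGE is
 * 256, so e.g. domain ID 4660 lives at root entry 18, slot 52 of that leaf.
 */
static struct kvm_hyp_iommu_domain *
domain_lookup_sketch(pkvm_handle_t domain_id)
{
	struct kvm_hyp_iommu_domain *leaf;

	if (domain_id >= KVM_IOMMU_MAX_DOMAINS)
		return NULL;
	leaf = kvm_hyp_iommu_domains[domain_id / KVM_IOMMU_DOMAINS_PER_PAGE];
	if (!leaf)
		return NULL;	/* leaf page not allocated yet */
	return &leaf[domain_id % KVM_IOMMU_DOMAINS_PER_PAGE];
}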
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/hyp-constants.c | 1 +
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 4 +
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 102 +++++++++++++++++++++++-
arch/arm64/kvm/iommu.c | 10 +++
include/kvm/iommu.h | 48 +++++++++++
5 files changed, 161 insertions(+), 4 deletions(-)
create mode 100644 include/kvm/iommu.h
diff --git a/arch/arm64/kvm/hyp/hyp-constants.c b/arch/arm64/kvm/hyp/hyp-constants.c
index 5fb26cabd606..96a6b45b424a 100644
--- a/arch/arm64/kvm/hyp/hyp-constants.c
+++ b/arch/arm64/kvm/hyp/hyp-constants.c
@@ -8,5 +8,6 @@
int main(void)
{
DEFINE(STRUCT_HYP_PAGE_SIZE, sizeof(struct hyp_page));
+ DEFINE(HYP_SPINLOCK_SIZE, sizeof(hyp_spinlock_t));
return 0;
}
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 5f91605cd48a..8f619f415d1f 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -4,6 +4,8 @@
#include <asm/kvm_host.h>
+#include <kvm/iommu.h>
+
#include <nvhe/alloc_mgt.h>
/* Hypercall handlers */
@@ -31,6 +33,8 @@ void kvm_iommu_reclaim_pages(void *p, u8 order);
struct kvm_iommu_ops {
int (*init)(void);
+ int (*alloc_domain)(struct kvm_hyp_iommu_domain *domain, int type);
+ void (*free_domain)(struct kvm_hyp_iommu_domain *domain);
};
int kvm_iommu_init(void);
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index af6ae9b4dc51..ba2aed52a74f 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -4,12 +4,15 @@
*
* Copyright (C) 2022 Linaro Ltd.
*/
+#include <kvm/iommu.h>
+
#include <nvhe/iommu.h>
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
/* Only one set of ops supported, similary to the kernel */
struct kvm_iommu_ops *kvm_iommu_ops;
+void **kvm_hyp_iommu_domains;
/*
* Common pool that can be used by IOMMU driver to allocate pages.
@@ -18,6 +21,9 @@ static struct hyp_pool iommu_host_pool;
DECLARE_PER_CPU(struct kvm_hyp_req, host_hyp_reqs);
+/* Protects domains in kvm_hyp_iommu_domains */
+static DEFINE_HYP_SPINLOCK(kvm_iommu_domain_lock);
+
static int kvm_iommu_refill(struct kvm_hyp_memcache *host_mc)
{
if (!kvm_iommu_ops)
@@ -89,28 +95,116 @@ void kvm_iommu_reclaim_pages(void *p, u8 order)
hyp_put_page(&iommu_host_pool, p);
}
+static struct kvm_hyp_iommu_domain *
+handle_to_domain(pkvm_handle_t domain_id)
+{
+ int idx;
+ struct kvm_hyp_iommu_domain *domains;
+
+ if (domain_id >= KVM_IOMMU_MAX_DOMAINS)
+ return NULL;
+ domain_id = array_index_nospec(domain_id, KVM_IOMMU_MAX_DOMAINS);
+
+ idx = domain_id / KVM_IOMMU_DOMAINS_PER_PAGE;
+ domains = (struct kvm_hyp_iommu_domain *)READ_ONCE(kvm_hyp_iommu_domains[idx]);
+ if (!domains) {
+ domains = kvm_iommu_donate_page();
+ if (!domains)
+ return NULL;
+ /*
+ * handle_to_domain() does not have to be called under a lock,
+ * but even though we allocate a leaf in all cases, it's only
+ * really a valid thing to do under alloc_domain(), which uses a
+ * lock. Races are therefore a host bug and we don't need to be
+ * delicate about it.
+ */
+ if (WARN_ON(cmpxchg64_relaxed(&kvm_hyp_iommu_domains[idx], 0,
+ (void *)domains) != 0)) {
+ kvm_iommu_reclaim_page(domains);
+ return NULL;
+ }
+ }
+ return &domains[domain_id % KVM_IOMMU_DOMAINS_PER_PAGE];
+}
+
int kvm_iommu_init(void)
{
int ret;
+ u64 domain_root_pfn = __hyp_pa(kvm_hyp_iommu_domains) >> PAGE_SHIFT;
- if (!kvm_iommu_ops || !kvm_iommu_ops->init)
+ if (!kvm_iommu_ops ||
+ !kvm_iommu_ops->init ||
+ !kvm_iommu_ops->alloc_domain ||
+ !kvm_iommu_ops->free_domain)
return -ENODEV;
ret = hyp_pool_init_empty(&iommu_host_pool, 64);
if (ret)
return ret;
- return kvm_iommu_ops->init();
+ ret = __pkvm_host_donate_hyp(domain_root_pfn,
+ KVM_IOMMU_DOMAINS_ROOT_ORDER_NR);
+ if (ret)
+ return ret;
+
+ ret = kvm_iommu_ops->init();
+ if (ret)
+ goto out_reclaim_domain;
+
+ return ret;
+
+out_reclaim_domain:
+ __pkvm_hyp_donate_host(domain_root_pfn, KVM_IOMMU_DOMAINS_ROOT_ORDER_NR);
+ return ret;
}
int kvm_iommu_alloc_domain(pkvm_handle_t domain_id, int type)
{
- return -ENODEV;
+ int ret = -EINVAL;
+ struct kvm_hyp_iommu_domain *domain;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain)
+ return -ENOMEM;
+
+ hyp_spin_lock(&kvm_iommu_domain_lock);
+ if (atomic_read(&domain->refs))
+ goto out_unlock;
+
+ domain->domain_id = domain_id;
+ ret = kvm_iommu_ops->alloc_domain(domain, type);
+ if (ret)
+ goto out_unlock;
+
+ atomic_set_release(&domain->refs, 1);
+out_unlock:
+ hyp_spin_unlock(&kvm_iommu_domain_lock);
+ return ret;
}
int kvm_iommu_free_domain(pkvm_handle_t domain_id)
{
- return -ENODEV;
+ int ret = 0;
+ struct kvm_hyp_iommu_domain *domain;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain)
+ return -EINVAL;
+
+ hyp_spin_lock(&kvm_iommu_domain_lock);
+ if (WARN_ON(atomic_cmpxchg_acquire(&domain->refs, 1, 0) != 1)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ kvm_iommu_ops->free_domain(domain);
+
+ memset(domain, 0, sizeof(*domain));
+
+out_unlock:
+ hyp_spin_unlock(&kvm_iommu_domain_lock);
+
+ return ret;
}
int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
index ed77ea0d12bb..af3417e6259d 100644
--- a/arch/arm64/kvm/iommu.c
+++ b/arch/arm64/kvm/iommu.c
@@ -5,6 +5,9 @@
*/
#include <asm/kvm_mmu.h>
+
+#include <kvm/iommu.h>
+
#include <linux/kvm_host.h>
struct kvm_iommu_driver *iommu_driver;
@@ -37,6 +40,13 @@ int kvm_iommu_init_driver(void)
return -ENODEV;
}
+ kvm_hyp_iommu_domains = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(KVM_IOMMU_DOMAINS_ROOT_SIZE));
+ if (!kvm_hyp_iommu_domains)
+ return -ENOMEM;
+
+ kvm_hyp_iommu_domains = kern_hyp_va(kvm_hyp_iommu_domains);
+
return iommu_driver->init_driver();
}
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
new file mode 100644
index 000000000000..10ecaae0f6a3
--- /dev/null
+++ b/include/kvm/iommu.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_IOMMU_H
+#define __KVM_IOMMU_H
+
+#include <asm/kvm_host.h>
+#include <linux/io-pgtable.h>
+#ifdef __KVM_NVHE_HYPERVISOR__
+#include <nvhe/spinlock.h>
+#else
+#include "hyp_constants.h"
+#endif
+
+struct kvm_hyp_iommu_domain {
+ atomic_t refs;
+ pkvm_handle_t domain_id;
+ void *priv;
+};
+
+extern void **kvm_nvhe_sym(kvm_hyp_iommu_domains);
+#define kvm_hyp_iommu_domains kvm_nvhe_sym(kvm_hyp_iommu_domains)
+
+/*
+ * At the moment the number of domains is limited to 2^16
+ * In practice we're rarely going to need a lot of domains. To avoid allocating
+ * a large domain table, we use a two-level table, indexed by domain ID. With
+ * 4kB pages and 16-byte domains, the leaf table contains 256 domains, and the
+ * root table 256 pointers. With 64kB pages, the leaf table contains 4096
+ * domains and the root table 16 pointers. In this case, or when using 8-bit
+ * VMIDs, it may be more advantageous to use a single level. But using two
+ * levels makes it easy to extend the domain size.
+ */
+#define KVM_IOMMU_MAX_DOMAINS (1 << 16)
+
+/* Number of entries in the level-2 domain table */
+#define KVM_IOMMU_DOMAINS_PER_PAGE \
+ (PAGE_SIZE / sizeof(struct kvm_hyp_iommu_domain))
+
+/* Number of entries in the root domain table */
+#define KVM_IOMMU_DOMAINS_ROOT_ENTRIES \
+ (KVM_IOMMU_MAX_DOMAINS / KVM_IOMMU_DOMAINS_PER_PAGE)
+
+#define KVM_IOMMU_DOMAINS_ROOT_SIZE \
+ (KVM_IOMMU_DOMAINS_ROOT_ENTRIES * sizeof(void *))
+
+#define KVM_IOMMU_DOMAINS_ROOT_ORDER_NR \
+ (1 << get_order(KVM_IOMMU_DOMAINS_ROOT_SIZE))
+
+#endif /* __KVM_IOMMU_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 17/58] KVM: arm64: iommu: Add {attach, detach}_dev
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (15 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 16/58] KVM: arm64: iommu: Add domains Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 18/58] KVM: arm64: iommu: Add map/unmap() operations Mostafa Saleh
` (41 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add attach/detach device operations, which are forwarded to the driver.
To avoid races between domain alloc/free and device attach/detach,
the domain refcount is used.
Although attach/detach operations are per-IOMMU and would require
some sort of locking, nothing in the IOMMU core code needs the lock,
so that is delegated to the driver, which takes locks where needed;
the hypervisor core only guarantees that there are no races with
domain alloc/free.
Also, add a new function kvm_iommu_init_device() to initialise the
common fields of the IOMMU struct, which at the moment is only the lock.
The IOMMU core code will need this lock next for power management.
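The resulting refcount lifecycle looks roughly like this (illustrative
sequence, not a new API):

/*
 * alloc_domain(id)            refs 0 -> 1
 * attach_dev(iommu0, id, ..)  refs 1 -> 2   (domain_get)
 * attach_dev(iommu1, id, ..)  refs 2 -> 3
 * detach_dev(iommu1, id, ..)  refs 3 -> 2   (domain_put)
 * detach_dev(iommu0, id, ..)  refs 2 -> 1
 * free_domain(id)             refs 1 -> 0   (only succeeds at refcount 1)
 *
 * free_domain() therefore fails while any device is still attached;
 * per-IOMMU serialization of attach/detach is left to the driver.
 */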
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 29 +++++++++++++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 56 ++++++++++++++++++++++++-
include/kvm/iommu.h | 8 ++++
3 files changed, 91 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 8f619f415d1f..d6d7447fbac8 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -35,10 +35,39 @@ struct kvm_iommu_ops {
int (*init)(void);
int (*alloc_domain)(struct kvm_hyp_iommu_domain *domain, int type);
void (*free_domain)(struct kvm_hyp_iommu_domain *domain);
+ struct kvm_hyp_iommu *(*get_iommu_by_id)(pkvm_handle_t iommu_id);
+ int (*attach_dev)(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_domain *domain,
+ u32 endpoint_id, u32 pasid, u32 pasid_bits);
+ int (*detach_dev)(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_domain *domain,
+ u32 endpoint_id, u32 pasid);
};
int kvm_iommu_init(void);
+int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu);
+
+static inline hyp_spinlock_t *kvm_iommu_get_lock(struct kvm_hyp_iommu *iommu)
+{
+ /* See struct kvm_hyp_iommu */
+ BUILD_BUG_ON(sizeof(iommu->lock) != sizeof(hyp_spinlock_t));
+ return (hyp_spinlock_t *)(&iommu->lock);
+}
+
+static inline void kvm_iommu_lock_init(struct kvm_hyp_iommu *iommu)
+{
+ hyp_spin_lock_init(kvm_iommu_get_lock(iommu));
+}
+
+static inline void kvm_iommu_lock(struct kvm_hyp_iommu *iommu)
+{
+ hyp_spin_lock(kvm_iommu_get_lock(iommu));
+}
+
+static inline void kvm_iommu_unlock(struct kvm_hyp_iommu *iommu)
+{
+ hyp_spin_unlock(kvm_iommu_get_lock(iommu));
+}
+
extern struct hyp_mgt_allocator_ops kvm_iommu_allocator_ops;
#endif /* __ARM64_KVM_NVHE_IOMMU_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index ba2aed52a74f..df2dbe4c0121 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -127,6 +127,19 @@ handle_to_domain(pkvm_handle_t domain_id)
return &domains[domain_id % KVM_IOMMU_DOMAINS_PER_PAGE];
}
+static int domain_get(struct kvm_hyp_iommu_domain *domain)
+{
+ int old = atomic_fetch_inc_acquire(&domain->refs);
+
+ BUG_ON(!old || (old + 1 < 0));
+ return 0;
+}
+
+static void domain_put(struct kvm_hyp_iommu_domain *domain)
+{
+ BUG_ON(!atomic_dec_return_release(&domain->refs));
+}
+
int kvm_iommu_init(void)
{
int ret;
@@ -210,13 +223,44 @@ int kvm_iommu_free_domain(pkvm_handle_t domain_id)
int kvm_iommu_attach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
u32 endpoint_id, u32 pasid, u32 pasid_bits)
{
- return -ENODEV;
+ int ret;
+ struct kvm_hyp_iommu *iommu;
+ struct kvm_hyp_iommu_domain *domain;
+
+ iommu = kvm_iommu_ops->get_iommu_by_id(iommu_id);
+ if (!iommu)
+ return -EINVAL;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain || domain_get(domain))
+ return -EINVAL;
+
+ ret = kvm_iommu_ops->attach_dev(iommu, domain, endpoint_id, pasid, pasid_bits);
+ if (ret)
+ domain_put(domain);
+ return ret;
}
int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
u32 endpoint_id, u32 pasid)
{
- return -ENODEV;
+ int ret;
+ struct kvm_hyp_iommu *iommu;
+ struct kvm_hyp_iommu_domain *domain;
+
+ iommu = kvm_iommu_ops->get_iommu_by_id(iommu_id);
+ if (!iommu)
+ return -EINVAL;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain || atomic_read(&domain->refs) <= 1)
+ return -EINVAL;
+
+ ret = kvm_iommu_ops->detach_dev(iommu, domain, endpoint_id, pasid);
+ if (ret)
+ return ret;
+ domain_put(domain);
+ return ret;
}
size_t kvm_iommu_map_pages(pkvm_handle_t domain_id,
@@ -236,3 +280,11 @@ phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova)
{
return 0;
}
+
+/* Must be called from the IOMMU driver per IOMMU */
+int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu)
+{
+ kvm_iommu_lock_init(iommu);
+
+ return 0;
+}
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 10ecaae0f6a3..6ff78d766466 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -45,4 +45,12 @@ extern void **kvm_nvhe_sym(kvm_hyp_iommu_domains);
#define KVM_IOMMU_DOMAINS_ROOT_ORDER_NR \
(1 << get_order(KVM_IOMMU_DOMAINS_ROOT_SIZE))
+struct kvm_hyp_iommu {
+#ifdef __KVM_NVHE_HYPERVISOR__
+ hyp_spinlock_t lock;
+#else
+ u32 unused;
+#endif
+};
+
#endif /* __KVM_IOMMU_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 18/58] KVM: arm64: iommu: Add map/unmap() operations
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (16 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 17/58] KVM: arm64: iommu: Add {attach, detach}_dev Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 19/58] KVM: arm64: iommu: support iommu_iotlb_gather Mostafa Saleh
` (40 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Handle the map(), unmap() and iova_to_phys() hypercalls.
In addition to mapping and unmapping, the hypervisor has to ensure that
all mapped pages are tracked, so before each map()
__pkvm_host_use_dma() is called to do that.
Similarly, on unmap() we need to decrement the refcount using
__pkvm_host_unuse_dma().
However, doing this in the common code is challenging, as explained in
the code comments, so we leave it to the driver.
Also, the hypervisor only guarantees that there are no races with
the domain alloc/free operations, using the domain refcount to
avoid extra locks.
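As a sketch of the resulting host-side contract (illustrative only; the
real kernel caller is added later in the series), a partial map is handled
by simply calling again, since the hypervisor un-uses whatever it did not
manage to map:

/* Sketch: loop until the whole range is mapped or the hypervisor gives up. */
static int host_map_sketch(pkvm_handle_t domain, unsigned long iova,
			   phys_addr_t paddr, size_t pgsize, size_t pgcount,
			   int prot)
{
	while (pgcount) {
		size_t mapped = kvm_call_hyp_nvhe(__pkvm_host_iommu_map_pages,
						  domain, iova, paddr, pgsize,
						  pgcount, prot);
		if (!mapped)
			return -EINVAL;	/* or satisfy a memory request and retry */
		iova += mapped;
		paddr += mapped;
		pgcount -= mapped / pgsize;
	}
	return 0;
}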
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 7 +++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 80 ++++++++++++++++++++++++-
2 files changed, 84 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index d6d7447fbac8..17f24a8eb1b9 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -40,6 +40,13 @@ struct kvm_iommu_ops {
u32 endpoint_id, u32 pasid, u32 pasid_bits);
int (*detach_dev)(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_domain *domain,
u32 endpoint_id, u32 pasid);
+ int (*map_pages)(struct kvm_hyp_iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t pgsize,
+ size_t pgcount, int prot, size_t *total_mapped);
+ size_t (*unmap_pages)(struct kvm_hyp_iommu_domain *domain, unsigned long iova,
+ size_t pgsize, size_t pgcount);
+ phys_addr_t (*iova_to_phys)(struct kvm_hyp_iommu_domain *domain, unsigned long iova);
+
};
int kvm_iommu_init(void);
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index df2dbe4c0121..83321cc5f466 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -263,22 +263,96 @@ int kvm_iommu_detach_dev(pkvm_handle_t iommu_id, pkvm_handle_t domain_id,
return ret;
}
+#define IOMMU_PROT_MASK (IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE |\
+ IOMMU_NOEXEC | IOMMU_MMIO | IOMMU_PRIV)
+
size_t kvm_iommu_map_pages(pkvm_handle_t domain_id,
unsigned long iova, phys_addr_t paddr, size_t pgsize,
size_t pgcount, int prot)
{
- return 0;
+ size_t size;
+ int ret;
+ size_t total_mapped = 0;
+ struct kvm_hyp_iommu_domain *domain;
+
+ if (prot & ~IOMMU_PROT_MASK)
+ return 0;
+
+ if (__builtin_mul_overflow(pgsize, pgcount, &size) ||
+ iova + size < iova || paddr + size < paddr)
+ return 0;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain || domain_get(domain))
+ return 0;
+
+ ret = __pkvm_host_use_dma(paddr, size);
+ if (ret)
+ return 0;
+
+ kvm_iommu_ops->map_pages(domain, iova, paddr, pgsize, pgcount, prot, &total_mapped);
+
+ pgcount -= total_mapped / pgsize;
+ /*
+ * Unuse the pages that haven't been mapped yet. The host calls back
+ * either to continue mapping, or to unmap and unuse what's been done
+ * so far.
+ */
+ if (pgcount)
+ __pkvm_host_unuse_dma(paddr + total_mapped, pgcount * pgsize);
+
+ domain_put(domain);
+ return total_mapped;
}
size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
size_t pgsize, size_t pgcount)
{
- return 0;
+ size_t size;
+ size_t unmapped;
+ struct kvm_hyp_iommu_domain *domain;
+
+ if (!pgsize || !pgcount)
+ return 0;
+
+ if (__builtin_mul_overflow(pgsize, pgcount, &size) ||
+ iova + size < iova)
+ return 0;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain || domain_get(domain))
+ return 0;
+
+ /*
+ * Unlike map, the common code doesn't call __pkvm_host_unuse_dma(),
+ * because that would require either walking the table with iova_to_phys
+ * (similar to VFIO) then unmapping and calling it, or unmapping one
+ * leaf (page or block) at a time, both of which might be suboptimal.
+ * For some IOMMUs, we can do two walks, where one only invalidates the
+ * pages and the other decrements the refcount.
+ * As the semantics might differ between IOMMUs and are hard to
+ * standardize, we leave that to the driver.
+ */
+ unmapped = kvm_iommu_ops->unmap_pages(domain, iova, pgsize,
+ pgcount);
+
+ domain_put(domain);
+ return unmapped;
}
phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova)
{
- return 0;
+ phys_addr_t phys = 0;
+ struct kvm_hyp_iommu_domain *domain;
+
+ domain = handle_to_domain(domain_id);
+
+ if (!domain || domain_get(domain))
+ return 0;
+
+ phys = kvm_iommu_ops->iova_to_phys(domain, iova);
+ domain_put(domain);
+ return phys;
}
/* Must be called from the IOMMU driver per IOMMU */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 19/58] KVM: arm64: iommu: support iommu_iotlb_gather
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (17 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 18/58] KVM: arm64: iommu: Add map/unmap() operations Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 20/58] KVM: arm64: Support power domains Mostafa Saleh
` (39 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
To improve unmap performance, we can batch TLB invalidations at the
end of the unmap, similarly to what the kernel does.
We use the same data structure as the kernel and most of the same
code.
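To illustrate (sketch only, with a made-up driver callback), a hypervisor
driver's unmap_pages() implementation is expected to record ranges in the
gather instead of invalidating per page, and the core syncs once at the end:

static size_t driver_unmap_sketch(struct kvm_hyp_iommu_domain *domain,
				  unsigned long iova, size_t pgsize,
				  size_t pgcount,
				  struct iommu_iotlb_gather *gather)
{
	size_t unmapped = 0;

	while (pgcount--) {
		/* ...clear the page-table entry for @iova here... */
		kvm_iommu_iotlb_gather_add_page(domain, gather, iova, pgsize);
		iova += pgsize;
		unmapped += pgsize;
	}
	/* The batched invalidation happens in kvm_iommu_iotlb_sync(). */
	return unmapped;
}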
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 11 +++++++++--
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 22 +++++++++++++++++++++-
include/linux/iommu.h | 24 +++++++++++++-----------
3 files changed, 43 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 17f24a8eb1b9..06d12b35fa3e 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -44,15 +44,22 @@ struct kvm_iommu_ops {
phys_addr_t paddr, size_t pgsize,
size_t pgcount, int prot, size_t *total_mapped);
size_t (*unmap_pages)(struct kvm_hyp_iommu_domain *domain, unsigned long iova,
- size_t pgsize, size_t pgcount);
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather);
phys_addr_t (*iova_to_phys)(struct kvm_hyp_iommu_domain *domain, unsigned long iova);
-
+ void (*iotlb_sync)(struct kvm_hyp_iommu_domain *domain,
+ struct iommu_iotlb_gather *gather);
};
int kvm_iommu_init(void);
int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu);
+void kvm_iommu_iotlb_gather_add_page(struct kvm_hyp_iommu_domain *domain,
+ struct iommu_iotlb_gather *gather,
+ unsigned long iova,
+ size_t size);
+
static inline hyp_spinlock_t *kvm_iommu_get_lock(struct kvm_hyp_iommu *iommu)
{
/* See struct kvm_hyp_iommu */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index 83321cc5f466..a6e0f3634756 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -305,12 +305,30 @@ size_t kvm_iommu_map_pages(pkvm_handle_t domain_id,
return total_mapped;
}
+static inline void kvm_iommu_iotlb_sync(struct kvm_hyp_iommu_domain *domain,
+ struct iommu_iotlb_gather *iotlb_gather)
+{
+ if (kvm_iommu_ops->iotlb_sync)
+ kvm_iommu_ops->iotlb_sync(domain, iotlb_gather);
+
+ iommu_iotlb_gather_init(iotlb_gather);
+}
+
+void kvm_iommu_iotlb_gather_add_page(struct kvm_hyp_iommu_domain *domain,
+ struct iommu_iotlb_gather *gather,
+ unsigned long iova,
+ size_t size)
+{
+ _iommu_iotlb_add_page(domain, gather, iova, size, kvm_iommu_iotlb_sync);
+}
+
size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
size_t pgsize, size_t pgcount)
{
size_t size;
size_t unmapped;
struct kvm_hyp_iommu_domain *domain;
+ struct iommu_iotlb_gather iotlb_gather;
if (!pgsize || !pgcount)
return 0;
@@ -323,6 +341,7 @@ size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
if (!domain || domain_get(domain))
return 0;
+ iommu_iotlb_gather_init(&iotlb_gather);
/*
* Unlike map, the common code doesn't call __pkvm_host_unuse_dma(),
* because this means that we need either walk the table using iova_to_phys
@@ -334,7 +353,8 @@ size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
* standardized, we leave that to the driver.
*/
unmapped = kvm_iommu_ops->unmap_pages(domain, iova, pgsize,
- pgcount);
+ pgcount, &iotlb_gather);
+ kvm_iommu_iotlb_sync(domain, &iotlb_gather);
domain_put(domain);
return unmapped;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index bd722f473635..c75877044185 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -911,6 +911,18 @@ static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gathe
gather->end = end;
}
+/*
+ * If the new page is disjoint from the current range or is mapped at
+ * a different granularity, then sync the TLB so that the gather
+ * structure can be rewritten.
+ */
+#define _iommu_iotlb_add_page(domain, gather, iova, size, sync) \
+ if (((gather)->pgsize && (gather)->pgsize != (size)) || \
+ iommu_iotlb_gather_is_disjoint((gather), (iova), (size))) \
+ sync((domain), (gather)); \
+ (gather)->pgsize = (size); \
+ iommu_iotlb_gather_add_range((gather), (iova), (size))
+
/**
* iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
* @domain: IOMMU domain to be invalidated
@@ -926,17 +938,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
struct iommu_iotlb_gather *gather,
unsigned long iova, size_t size)
{
- /*
- * If the new page is disjoint from the current range or is mapped at
- * a different granularity, then sync the TLB so that the gather
- * structure can be rewritten.
- */
- if ((gather->pgsize && gather->pgsize != size) ||
- iommu_iotlb_gather_is_disjoint(gather, iova, size))
- iommu_iotlb_sync(domain, gather);
-
- gather->pgsize = size;
- iommu_iotlb_gather_add_range(gather, iova, size);
+ _iommu_iotlb_add_page(domain, gather, iova, size, iommu_iotlb_sync);
}
static inline bool iommu_iotlb_gather_queued(struct iommu_iotlb_gather *gather)
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 20/58] KVM: arm64: Support power domains
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (18 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 19/58] KVM: arm64: iommu: support iommu_iotlb_gather Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page() Mostafa Saleh
` (38 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Unfortunately, as power management is not widely standardized, we have
to work around that.
One implementation we can support is HOST_HVC, where the host is in
control of power management and notifies the hypervisor about state
updates.
This adds an extra constraint on the IOMMUs: to be usable with this PD
interface, they must reset to blocking DMA traffic.
Unfortunately again, for SMMUv3, which is the only IOMMU currently
supported, there is no architectural way to discover this, so we rely
on this driver only being enabled when it fits the constraints; the
driver also sets GBPA and assumes the SMMU retains it across power
cycling.
SCMI support is added in the next patch.
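For HOST_HVC, the host-side flow is simply a notification hypercall around
its own power-management operations, roughly as follows (sketch; the
wrapper name is hypothetical):

/* Sketch: tell the hypervisor a device's power domain changed state. */
static int host_notify_pd_sketch(u64 device_id, bool on)
{
	return kvm_call_hyp_nvhe(__pkvm_host_hvc_pd, device_id, (u64)on);
}

The host would call this after powering the domain on and before powering
it off, so the hypervisor driver's power_on()/power_off() ops stay in sync
with the hardware.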
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 30 ++++++++++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 9 +++++
arch/arm64/kvm/hyp/nvhe/power/hvc.c | 47 ++++++++++++++++++++++++++
include/kvm/power_domain.h | 17 ++++++++++
6 files changed, 105 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/nvhe/power/hvc.c
create mode 100644 include/kvm/power_domain.h
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 9ea155a04332..3dbf30cd10f3 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -114,6 +114,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_map_pages,
__KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_unmap_pages,
__KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_iova_to_phys,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_hvc_pd,
/*
* Start of the dynamically registered hypercalls. Start a bit
diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
index 8a5554615e40..e4a94696b10e 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
@@ -8,6 +8,7 @@
#define __ARM64_KVM_NVHE_PKVM_H__
#include <asm/kvm_pkvm.h>
+#include <kvm/power_domain.h>
#include <nvhe/gfp.h>
#include <nvhe/spinlock.h>
@@ -146,4 +147,33 @@ void pkvm_poison_pvmfw_pages(void);
int pkvm_timer_init(void);
void pkvm_udelay(unsigned long usecs);
+#define MAX_POWER_DOMAINS 32
+
+struct kvm_power_domain_ops {
+ int (*power_on)(struct kvm_power_domain *pd);
+ int (*power_off)(struct kvm_power_domain *pd);
+};
+
+int pkvm_init_hvc_pd(struct kvm_power_domain *pd,
+ const struct kvm_power_domain_ops *ops);
+
+int pkvm_host_hvc_pd(u64 device_id, u64 on);
+
+/*
+ * Register a power domain. When the hypervisor catches power requests from the
+ * host for this power domain, it calls the power ops with @pd as argument.
+ */
+static inline int pkvm_init_power_domain(struct kvm_power_domain *pd,
+ const struct kvm_power_domain_ops *ops)
+{
+ switch (pd->type) {
+ case KVM_POWER_DOMAIN_NONE:
+ return 0;
+ case KVM_POWER_DOMAIN_HOST_HVC:
+ return pkvm_init_hvc_pd(pd, ops);
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
#endif /* __ARM64_KVM_NVHE_PKVM_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 9e1b74c661d2..950d34ba6e50 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -8,7 +8,7 @@ CFLAGS_switch.nvhe.o += -Wno-override-init
hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o alloc.o early_alloc.o page_alloc.o \
cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o \
- serial.o alloc_mgt.o iommu/iommu.o
+ serial.o alloc_mgt.o iommu/iommu.o power/hvc.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 9b224842c487..5df98bf04ef4 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -1674,6 +1674,14 @@ static void handle___pkvm_host_iommu_iova_to_phys(struct kvm_cpu_context *host_c
cpu_reg(host_ctxt, 1) = kvm_iommu_iova_to_phys(domain, iova);
}
+static void handle___pkvm_host_hvc_pd(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(u64, device_id, host_ctxt, 1);
+ DECLARE_REG(u64, on, host_ctxt, 2);
+
+ cpu_reg(host_ctxt, 1) = pkvm_host_hvc_pd(device_id, on);
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -1738,6 +1746,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_host_iommu_map_pages),
HANDLE_FUNC(__pkvm_host_iommu_unmap_pages),
HANDLE_FUNC(__pkvm_host_iommu_iova_to_phys),
+ HANDLE_FUNC(__pkvm_host_hvc_pd),
};
static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/power/hvc.c b/arch/arm64/kvm/hyp/nvhe/power/hvc.c
new file mode 100644
index 000000000000..f4d811847e73
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/power/hvc.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ * Author: Mostafa Saleh <smostafa@google.com>
+ */
+
+#include <nvhe/pkvm.h>
+
+struct hvc_power_domain {
+ struct kvm_power_domain *pd;
+ const struct kvm_power_domain_ops *ops;
+};
+
+struct hvc_power_domain handlers[MAX_POWER_DOMAINS];
+
+int pkvm_init_hvc_pd(struct kvm_power_domain *pd,
+ const struct kvm_power_domain_ops *ops)
+{
+ if (pd->device_id >= MAX_POWER_DOMAINS)
+ return -E2BIG;
+
+ handlers[pd->device_id].ops = ops;
+ handlers[pd->device_id].pd = pd;
+
+ return 0;
+}
+
+int pkvm_host_hvc_pd(u64 device_id, u64 on)
+{
+ struct hvc_power_domain *pd;
+
+ if (device_id >= MAX_POWER_DOMAINS)
+ return -E2BIG;
+
+ device_id = array_index_nospec(device_id, MAX_POWER_DOMAINS);
+ pd = &handlers[device_id];
+
+ if (!pd->ops)
+ return -ENOENT;
+
+ if (on)
+ pd->ops->power_on(pd->pd);
+ else
+ pd->ops->power_off(pd->pd);
+
+ return 0;
+}
diff --git a/include/kvm/power_domain.h b/include/kvm/power_domain.h
new file mode 100644
index 000000000000..f6a9c5cdfebb
--- /dev/null
+++ b/include/kvm/power_domain.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_POWER_DOMAIN_H
+#define __KVM_POWER_DOMAIN_H
+
+enum kvm_power_domain_type {
+ KVM_POWER_DOMAIN_NONE,
+ KVM_POWER_DOMAIN_HOST_HVC,
+};
+
+struct kvm_power_domain {
+ enum kvm_power_domain_type type;
+ union {
+ u64 device_id; /* HOST_HVC device ID */
+ };
+};
+
+#endif /* __KVM_POWER_DOMAIN_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (19 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 20/58] KVM: arm64: Support power domains Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-19 11:10 ` Quentin Perret
2024-12-12 18:03 ` [RFC PATCH v2 22/58] KVM: arm64: pkvm: Support SCMI power domain Mostafa Saleh
` (37 subsequent siblings)
58 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Add a small helper to remove and add back a page from the host stage-2.
This will be used to temporarily unmap a piece of shared SRAM (device
memory) from the host while we handle an SCMI request, preventing the
host from modifying the request after it has been verified.
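A minimal usage sketch (the SCMI handler added in the next patch does
something along these lines; the function name here is made up):

/* Sketch: hide the shared page from the host while the request is handled. */
static void scmi_request_sketch(u64 shmem_pfn)
{
	WARN_ON(__pkvm_host_add_remove_page(shmem_pfn, true));	/* remove */
	/* ...validate the SCMI message and forward it to the server... */
	WARN_ON(__pkvm_host_add_remove_page(shmem_pfn, false));	/* add back */
}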
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index d75e64e59596..c8f49b335093 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -94,6 +94,7 @@ int __pkvm_guest_relinquish_to_host(struct pkvm_hyp_vcpu *vcpu,
u64 ipa, u64 *ppa);
int __pkvm_host_use_dma(u64 phys_addr, size_t size);
int __pkvm_host_unuse_dma(u64 phys_addr, size_t size);
+int __pkvm_host_add_remove_page(u64 pfn, bool remove);
bool addr_is_memory(phys_addr_t phys);
int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 0840af20c366..a428ad9ca871 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -2521,3 +2521,20 @@ int host_stage2_get_leaf(phys_addr_t phys, kvm_pte_t *ptep, s8 *level)
return ret;
}
+
+/*
+ * Temporarily unmap a page from the host stage-2, if @remove is true, or put it
+ * back. After restoring ownership to the host, the page will be lazy-mapped.
+ */
+int __pkvm_host_add_remove_page(u64 pfn, bool remove)
+{
+ int ret;
+ u64 host_addr = hyp_pfn_to_phys(pfn);
+ u8 owner = remove ? PKVM_ID_HYP : PKVM_ID_HOST;
+
+ host_lock_component();
+ ret = host_stage2_set_owner_locked(host_addr, PAGE_SIZE, owner);
+ host_unlock_component();
+
+ return ret;
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 22/58] KVM: arm64: pkvm: Support SCMI power domain
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (20 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page() Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 23/58] KVM: arm64: iommu: Support power management Mostafa Saleh
` (36 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
The hypervisor needs to catch power domain changes for devices it owns,
such as the SMMU. Possible reasons:
* Ensure that software and hardware states are consistent. The driver
does not attempt to modify the state while the device is off.
* Save and restore the device state.
* Enforce dependency between consumers and suppliers. For example ensure
that endpoints are off before turning the SMMU off, in case a powered
off SMMU lets DMA through. However this is normally enforced by
firmware.
Add a SCMI power domain, as the standard method for device power
management on Arm. Other methods can be added to kvm_power_domain later.
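As an illustration (sketch only; everything except the .type field is
omitted because the SCMI-specific members are defined in this patch's
power_domain.h change and not repeated here), a hypervisor device driver
describes its power domain and registers it through
pkvm_init_power_domain(), which routes SCMI domains to pkvm_init_scmi_pd():

/* Sketch: describe an SCMI-managed power domain for a hypervisor-owned device. */
static const struct kvm_power_domain example_pd = {
	.type = KVM_POWER_DOMAIN_ARM_SCMI,
	/* ...SMC ID, shared-memory area and SCMI domain ID go here... */
};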
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_hyp.h | 2 +
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 4 +
.../arm64/kvm/hyp/include/nvhe/trap_handler.h | 2 +
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 2 +
arch/arm64/kvm/hyp/nvhe/power/scmi.c | 231 ++++++++++++++++++
include/kvm/power_domain.h | 7 +
7 files changed, 249 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/kvm/hyp/nvhe/power/scmi.c
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index ee85c6dfb504..0257e8e37434 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -119,7 +119,9 @@ void __sve_restore_state(void *sve_pffr, u32 *fpsr, int restore_ffr);
u64 __guest_enter(struct kvm_vcpu *vcpu);
+
bool kvm_host_psci_handler(struct kvm_cpu_context *host_ctxt, u32 func_id);
+bool kvm_host_scmi_handler(struct kvm_cpu_context *host_ctxt);
#ifdef __KVM_NVHE_HYPERVISOR__
void __noreturn __hyp_do_panic(struct kvm_cpu_context *host_ctxt, u64 spsr,
diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
index e4a94696b10e..4d40c536d26a 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h
@@ -158,6 +158,8 @@ int pkvm_init_hvc_pd(struct kvm_power_domain *pd,
const struct kvm_power_domain_ops *ops);
int pkvm_host_hvc_pd(u64 device_id, u64 on);
+int pkvm_init_scmi_pd(struct kvm_power_domain *pd,
+ const struct kvm_power_domain_ops *ops);
/*
* Register a power domain. When the hypervisor catches power requests from the
@@ -171,6 +173,8 @@ static inline int pkvm_init_power_domain(struct kvm_power_domain *pd,
return 0;
case KVM_POWER_DOMAIN_HOST_HVC:
return pkvm_init_hvc_pd(pd, ops);
+ case KVM_POWER_DOMAIN_ARM_SCMI:
+ return pkvm_init_scmi_pd(pd, ops);
default:
return -EOPNOTSUPP;
}
diff --git a/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h b/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h
index 1e6d995968a1..0e6bb92ccdb7 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h
@@ -15,4 +15,6 @@
#define DECLARE_REG(type, name, ctxt, reg) \
type name = (type)cpu_reg(ctxt, (reg))
+void __kvm_hyp_host_forward_smc(struct kvm_cpu_context *host_ctxt);
+
#endif /* __ARM64_KVM_NVHE_TRAP_HANDLER_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 950d34ba6e50..d846962e7246 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -8,7 +8,7 @@ CFLAGS_switch.nvhe.o += -Wno-override-init
hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o alloc.o early_alloc.o page_alloc.o \
cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o \
- serial.o alloc_mgt.o iommu/iommu.o power/hvc.o
+ serial.o alloc_mgt.o iommu/iommu.o power/hvc.o power/scmi.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 5df98bf04ef4..1ab8e5507825 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -1806,6 +1806,8 @@ static void handle_host_smc(struct kvm_cpu_context *host_ctxt)
handled = kvm_host_psci_handler(host_ctxt, func_id);
if (!handled)
handled = kvm_host_ffa_handler(host_ctxt, func_id);
+ if (!handled)
+ handled = kvm_host_scmi_handler(host_ctxt);
if (!handled && smp_load_acquire(&default_host_smc_handler))
handled = default_host_smc_handler(&host_ctxt->regs);
if (!handled) {
diff --git a/arch/arm64/kvm/hyp/nvhe/power/scmi.c b/arch/arm64/kvm/hyp/nvhe/power/scmi.c
new file mode 100644
index 000000000000..7de3feb2f1d9
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/power/scmi.c
@@ -0,0 +1,231 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+
+#include <linux/bitfield.h>
+
+#include <nvhe/pkvm.h>
+#include <nvhe/mm.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/trap_handler.h>
+
+/* SCMI protocol */
+#define SCMI_PROTOCOL_POWER_DOMAIN 0x11
+
+/* shmem registers */
+#define SCMI_SHM_CHANNEL_STATUS 0x4
+#define SCMI_SHM_CHANNEL_FLAGS 0x10
+#define SCMI_SHM_LENGTH 0x14
+#define SCMI_SHM_MESSAGE_HEADER 0x18
+#define SCMI_SHM_MESSAGE_PAYLOAD 0x1c
+
+/* channel status */
+#define SCMI_CHN_FREE (1U << 0)
+#define SCMI_CHN_ERROR (1U << 1)
+
+/* channel flags */
+#define SCMI_CHN_IRQ (1U << 0)
+
+/* message header */
+#define SCMI_HDR_TOKEN GENMASK(27, 18)
+#define SCMI_HDR_PROTOCOL_ID GENMASK(17, 10)
+#define SCMI_HDR_MESSAGE_TYPE GENMASK(9, 8)
+#define SCMI_HDR_MESSAGE_ID GENMASK(7, 0)
+
+/* power domain */
+#define SCMI_PD_STATE_SET 0x4
+#define SCMI_PD_STATE_SET_FLAGS 0x0
+#define SCMI_PD_STATE_SET_DOMAIN_ID 0x4
+#define SCMI_PD_STATE_SET_POWER_STATE 0x8
+
+#define SCMI_PD_STATE_SET_STATUS 0x0
+
+#define SCMI_PD_STATE_SET_FLAGS_ASYNC (1U << 0)
+
+#define SCMI_PD_POWER_ON 0
+#define SCMI_PD_POWER_OFF (1U << 30)
+
+#define SCMI_SUCCESS 0
+
+
+static struct {
+ u32 smc_id;
+ phys_addr_t shmem_pfn;
+ size_t shmem_size;
+ void __iomem *shmem;
+} scmi_channel;
+
+struct scmi_power_domain {
+ struct kvm_power_domain *pd;
+ const struct kvm_power_domain_ops *ops;
+};
+
+static struct scmi_power_domain scmi_power_domains[MAX_POWER_DOMAINS];
+static int scmi_power_domain_count;
+
+#define SCMI_POLL_TIMEOUT_US 1000000 /* 1s! */
+
+/* Forward the command to EL3, and wait for completion */
+static int scmi_run_command(struct kvm_cpu_context *host_ctxt)
+{
+ u32 reg;
+ unsigned long i = 0;
+
+ __kvm_hyp_host_forward_smc(host_ctxt);
+
+ do {
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_CHANNEL_STATUS);
+ if (reg & SCMI_CHN_FREE)
+ break;
+
+ if (WARN_ON(++i > SCMI_POLL_TIMEOUT_US))
+ return -ETIMEDOUT;
+
+ pkvm_udelay(1);
+ } while (!(reg & (SCMI_CHN_FREE | SCMI_CHN_ERROR)));
+
+ if (reg & SCMI_CHN_ERROR)
+ return -EIO;
+
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_MESSAGE_PAYLOAD +
+ SCMI_PD_STATE_SET_STATUS);
+ if (reg != SCMI_SUCCESS)
+ return -EIO;
+
+ return 0;
+}
+
+static void __kvm_host_scmi_handler(struct kvm_cpu_context *host_ctxt)
+{
+ int i;
+ u32 reg;
+ struct scmi_power_domain *scmi_pd = NULL;
+
+ /*
+ * FIXME: the spec does not really allow for an intermediary filtering
+ * messages on the channel: as soon as the host clears SCMI_CHN_FREE,
+ * the server may process the message. It doesn't have to wait for a
+ * doorbell and could just poll on the shared mem. Unlikely in practice,
+ * but this code is not correct without a spec change requiring the
+ * server to observe an SMC before processing the message.
+ */
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_CHANNEL_STATUS);
+ if (reg & (SCMI_CHN_FREE | SCMI_CHN_ERROR))
+ return;
+
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_MESSAGE_HEADER);
+ if (FIELD_GET(SCMI_HDR_PROTOCOL_ID, reg) != SCMI_PROTOCOL_POWER_DOMAIN)
+ goto out_forward_smc;
+
+ if (FIELD_GET(SCMI_HDR_MESSAGE_ID, reg) != SCMI_PD_STATE_SET)
+ goto out_forward_smc;
+
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_MESSAGE_PAYLOAD +
+ SCMI_PD_STATE_SET_FLAGS);
+ if (WARN_ON(reg & SCMI_PD_STATE_SET_FLAGS_ASYNC))
+ /* We don't support async requests at the moment */
+ return;
+
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_MESSAGE_PAYLOAD +
+ SCMI_PD_STATE_SET_DOMAIN_ID);
+
+ for (i = 0; i < MAX_POWER_DOMAINS; i++) {
+ if (!scmi_power_domains[i].pd)
+ break;
+
+ if (reg == scmi_power_domains[i].pd->arm_scmi.domain_id) {
+ scmi_pd = &scmi_power_domains[i];
+ break;
+ }
+ }
+ if (!scmi_pd)
+ goto out_forward_smc;
+
+ reg = readl_relaxed(scmi_channel.shmem + SCMI_SHM_MESSAGE_PAYLOAD +
+ SCMI_PD_STATE_SET_POWER_STATE);
+ switch (reg) {
+ case SCMI_PD_POWER_ON:
+ if (scmi_run_command(host_ctxt))
+ break;
+
+ scmi_pd->ops->power_on(scmi_pd->pd);
+ break;
+ case SCMI_PD_POWER_OFF:
+ scmi_pd->ops->power_off(scmi_pd->pd);
+
+ if (scmi_run_command(host_ctxt))
+ scmi_pd->ops->power_on(scmi_pd->pd);
+ break;
+ }
+ return;
+
+out_forward_smc:
+ __kvm_hyp_host_forward_smc(host_ctxt);
+}
+
+bool kvm_host_scmi_handler(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(u64, func_id, host_ctxt, 0);
+
+ if (!scmi_channel.shmem || func_id != scmi_channel.smc_id)
+ return false; /* Unhandled */
+
+ /*
+ * Prevent the host from modifying the request while it is in flight.
+ * One page is enough, SCMI messages are smaller than that.
+ *
+ * FIXME: the host is allowed to poll the shmem while the request is in
+ * flight, or read shmem when receiving the SCMI interrupt. Although
+ * it's unlikely with the SMC-based transport, this too requires some
+ * tightening in the spec.
+ */
+ if (WARN_ON(__pkvm_host_add_remove_page(scmi_channel.shmem_pfn, true)))
+ return true;
+
+ __kvm_host_scmi_handler(host_ctxt);
+
+ WARN_ON(__pkvm_host_add_remove_page(scmi_channel.shmem_pfn, false));
+ return true; /* Handled */
+}
+
+int pkvm_init_scmi_pd(struct kvm_power_domain *pd,
+ const struct kvm_power_domain_ops *ops)
+{
+ int ret;
+
+ if (!IS_ALIGNED(pd->arm_scmi.shmem_base, PAGE_SIZE) ||
+ pd->arm_scmi.shmem_size < PAGE_SIZE) {
+ return -EINVAL;
+ }
+
+ if (!scmi_channel.shmem) {
+ unsigned long shmem;
+
+ /* FIXME: Do we need to mark those pages shared in the host s2? */
+ ret = __pkvm_create_private_mapping(pd->arm_scmi.shmem_base,
+ pd->arm_scmi.shmem_size,
+ PAGE_HYP_DEVICE,
+ &shmem);
+ if (ret)
+ return ret;
+
+ scmi_channel.smc_id = pd->arm_scmi.smc_id;
+ scmi_channel.shmem_pfn = hyp_phys_to_pfn(pd->arm_scmi.shmem_base);
+ scmi_channel.shmem = (void *)shmem;
+
+ } else if (scmi_channel.shmem_pfn !=
+ hyp_phys_to_pfn(pd->arm_scmi.shmem_base) ||
+ scmi_channel.smc_id != pd->arm_scmi.smc_id) {
+ /* We support a single channel at the moment */
+ return -ENXIO;
+ }
+
+ if (scmi_power_domain_count == MAX_POWER_DOMAINS)
+ return -ENOSPC;
+
+ scmi_power_domains[scmi_power_domain_count].pd = pd;
+ scmi_power_domains[scmi_power_domain_count].ops = ops;
+ scmi_power_domain_count++;
+ return 0;
+}
diff --git a/include/kvm/power_domain.h b/include/kvm/power_domain.h
index f6a9c5cdfebb..9ade1d60f5f5 100644
--- a/include/kvm/power_domain.h
+++ b/include/kvm/power_domain.h
@@ -5,12 +5,19 @@
enum kvm_power_domain_type {
KVM_POWER_DOMAIN_NONE,
KVM_POWER_DOMAIN_HOST_HVC,
+ KVM_POWER_DOMAIN_ARM_SCMI,
};
struct kvm_power_domain {
enum kvm_power_domain_type type;
union {
u64 device_id; /* HOST_HVC device ID*/
+ struct {
+ u32 smc_id;
+ u32 domain_id;
+ phys_addr_t shmem_base;
+ size_t shmem_size;
+ } arm_scmi; /*ARM_SCMI channel */
};
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
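For readers following along: the filtering done in __kvm_host_scmi_handler() above is just a pair of mask-and-shift checks on the message header. A minimal stand-alone sketch of that decoding (plain user-space C, with the field masks open-coded and a made-up header value; not part of the patch) looks like this:

#include <stdint.h>
#include <stdio.h>

/* Field masks from the patch; FIELD_GET() is open-coded here. */
#define SCMI_HDR_PROTOCOL_ID_SHIFT	10
#define SCMI_HDR_PROTOCOL_ID_MASK	0xffu
#define SCMI_HDR_MESSAGE_ID_MASK	0xffu

#define SCMI_PROTOCOL_POWER_DOMAIN	0x11
#define SCMI_PD_STATE_SET		0x4

int main(void)
{
	/* Hypothetical header word as read from SCMI_SHM_MESSAGE_HEADER */
	uint32_t hdr = (SCMI_PROTOCOL_POWER_DOMAIN << SCMI_HDR_PROTOCOL_ID_SHIFT) |
		       SCMI_PD_STATE_SET;

	uint32_t protocol = (hdr >> SCMI_HDR_PROTOCOL_ID_SHIFT) & SCMI_HDR_PROTOCOL_ID_MASK;
	uint32_t message  = hdr & SCMI_HDR_MESSAGE_ID_MASK;

	/* Only POWER_DOMAIN_STATE_SET requests are intercepted; everything
	 * else is forwarded to EL3 untouched. */
	if (protocol == SCMI_PROTOCOL_POWER_DOMAIN && message == SCMI_PD_STATE_SET)
		printf("intercept: power domain state change\n");
	else
		printf("forward to EL3\n");

	return 0;
}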
* [RFC PATCH v2 23/58] KVM: arm64: iommu: Support power management
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (21 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 22/58] KVM: arm64: pkvm: Support SCMI power domain Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 24/58] KVM: arm64: iommu: Support DABT for IOMMU Mostafa Saleh
` (35 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Add power domain ops to the hypervisor IOMMU driver. We currently make
these assumptions:
* The register state is retained across power off.
* The TLBs are clean on power on.
* Other privileged software (EL3 or SCP FW) handles dependencies
between the SMMU and endpoints.
So we just need to make sure that the CPU does not touch the SMMU
registers while it is powered off.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 33 ++++++++++++++++++++++++++-
include/kvm/iommu.h | 3 +++
2 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index a6e0f3634756..fbab335d3490 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -375,10 +375,41 @@ phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova)
return phys;
}
+static int iommu_power_on(struct kvm_power_domain *pd)
+{
+ struct kvm_hyp_iommu *iommu = container_of(pd, struct kvm_hyp_iommu,
+ power_domain);
+
+ /*
+ * We currently assume that the device retains its architectural state
+ * across power off, hence no save/restore.
+ */
+ kvm_iommu_lock(iommu);
+ iommu->power_is_off = false;
+ kvm_iommu_unlock(iommu);
+ return 0;
+}
+
+static int iommu_power_off(struct kvm_power_domain *pd)
+{
+ struct kvm_hyp_iommu *iommu = container_of(pd, struct kvm_hyp_iommu,
+ power_domain);
+
+ kvm_iommu_lock(iommu);
+ iommu->power_is_off = true;
+ kvm_iommu_unlock(iommu);
+ return 0;
+}
+
+static const struct kvm_power_domain_ops iommu_power_ops = {
+ .power_on = iommu_power_on,
+ .power_off = iommu_power_off,
+};
+
/* Must be called from the IOMMU driver per IOMMU */
int kvm_iommu_init_device(struct kvm_hyp_iommu *iommu)
{
kvm_iommu_lock_init(iommu);
- return 0;
+ return pkvm_init_power_domain(&iommu->power_domain, &iommu_power_ops);
}
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 6ff78d766466..c524ba84a9cf 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -3,6 +3,7 @@
#define __KVM_IOMMU_H
#include <asm/kvm_host.h>
+#include <kvm/power_domain.h>
#include <linux/io-pgtable.h>
#ifdef __KVM_NVHE_HYPERVISOR__
#include <nvhe/spinlock.h>
@@ -51,6 +52,8 @@ struct kvm_hyp_iommu {
#else
u32 unused;
#endif
+ struct kvm_power_domain power_domain;
+ bool power_is_off;
};
#endif /* __KVM_IOMMU_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
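The power-domain callbacks in this patch recover the owning kvm_hyp_iommu from the embedded kvm_power_domain with container_of(). A tiny stand-alone sketch of that pattern, with the structures stripped down to the fields used here (assumed stand-ins, not the real definitions):

#include <stddef.h>
#include <stdio.h>
#include <stdbool.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stripped-down stand-ins for the structures used in the patch. */
struct kvm_power_domain { int type; };

struct kvm_hyp_iommu {
	struct kvm_power_domain power_domain;
	bool power_is_off;
};

static int iommu_power_off(struct kvm_power_domain *pd)
{
	struct kvm_hyp_iommu *iommu =
		container_of(pd, struct kvm_hyp_iommu, power_domain);

	/* The real driver takes the IOMMU lock around this. */
	iommu->power_is_off = true;
	return 0;
}

int main(void)
{
	struct kvm_hyp_iommu smmu = { .power_is_off = false };

	iommu_power_off(&smmu.power_domain);
	printf("power_is_off = %d\n", smmu.power_is_off);
	return 0;
}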
* [RFC PATCH v2 24/58] KVM: arm64: iommu: Support DABT for IOMMU
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (22 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 23/58] KVM: arm64: iommu: Support power management Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 25/58] KVM: arm64: iommu: Add SMMUv3 driver Mostafa Saleh
` (34 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
An SMMUv3 driver will be added soon, and it will need to emulate
access to some of its MMIO space.
Add a DABT handler so that IOMMU drivers are able to do so.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 2 ++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 17 +++++++++++++++++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 19 +++++++++++++++++--
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index 06d12b35fa3e..cff75d67d807 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -21,6 +21,7 @@ size_t kvm_iommu_map_pages(pkvm_handle_t domain_id,
size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
size_t pgsize, size_t pgcount);
phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova);
+bool kvm_iommu_host_dabt_handler(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr);
/* Flags for memory allocation for IOMMU drivers */
#define IOMMU_PAGE_NOCACHE BIT(0)
@@ -49,6 +50,7 @@ struct kvm_iommu_ops {
phys_addr_t (*iova_to_phys)(struct kvm_hyp_iommu_domain *domain, unsigned long iova);
void (*iotlb_sync)(struct kvm_hyp_iommu_domain *domain,
struct iommu_iotlb_gather *gather);
+ bool (*dabt_handler)(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr);
};
int kvm_iommu_init(void);
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index fbab335d3490..e45dadd0c4aa 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -4,6 +4,10 @@
*
* Copyright (C) 2022 Linaro Ltd.
*/
+#include <asm/kvm_hyp.h>
+
+#include <hyp/adjust_pc.h>
+
#include <kvm/iommu.h>
#include <nvhe/iommu.h>
@@ -375,6 +379,19 @@ phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova)
return phys;
}
+bool kvm_iommu_host_dabt_handler(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr)
+{
+ bool ret = false;
+
+ if (kvm_iommu_ops && kvm_iommu_ops->dabt_handler)
+ ret = kvm_iommu_ops->dabt_handler(host_ctxt, esr, addr);
+
+ if (ret)
+ kvm_skip_host_instr();
+
+ return ret;
+}
+
static int iommu_power_on(struct kvm_power_domain *pd)
{
struct kvm_hyp_iommu *iommu = container_of(pd, struct kvm_hyp_iommu,
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index a428ad9ca871..0fae651107db 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -16,6 +16,7 @@
#include <hyp/fault.h>
#include <nvhe/gfp.h>
+#include <nvhe/iommu.h>
#include <nvhe/memory.h>
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
@@ -799,11 +800,16 @@ static int handle_host_perm_fault(struct kvm_cpu_context *host_ctxt, u64 esr, u6
return handled ? 0 : -EPERM;
}
+static bool is_dabt(u64 esr)
+{
+ return ESR_ELx_EC(esr) == ESR_ELx_EC_DABT_LOW;
+}
+
void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
{
struct kvm_vcpu_fault_info fault;
u64 esr, addr;
- int ret = 0;
+ int ret = -EPERM;
esr = read_sysreg_el2(SYS_ESR);
if (!__get_fault_info(esr, &fault)) {
@@ -817,7 +823,15 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
}
addr = (fault.hpfar_el2 & HPFAR_MASK) << 8;
- ret = host_stage2_idmap(addr);
+ addr |= fault.far_el2 & FAR_MASK;
+
+ if (is_dabt(esr) && !addr_is_memory(addr) &&
+ kvm_iommu_host_dabt_handler(host_ctxt, esr, addr))
+ goto return_to_host;
+
+ /* If not handled, attempt to map the page. */
+ if (ret == -EPERM)
+ ret = host_stage2_idmap(addr);
if ((esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_PERM)
ret = handle_host_perm_fault(host_ctxt, esr, addr);
@@ -827,6 +841,7 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
else
BUG_ON(ret && ret != -EAGAIN);
+return_to_host:
trace_host_mem_abort(esr, addr);
}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
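A driver's dabt_handler() will typically decode the faulting access from the ESR before emulating the MMIO register. The sketch below, which is only an illustration and not part of the series, extracts the Data Abort ISS fields (ISV, SAS, SRT, WnR) from a hypothetical syndrome value:

#include <stdint.h>
#include <stdio.h>

/* ISS fields for a Data Abort, as defined by the Arm ARM */
#define ESR_ISV		(1u << 24)	/* syndrome is valid */
#define ESR_SAS_SHIFT	22		/* access size: 0=byte .. 3=doubleword */
#define ESR_SAS_MASK	0x3u
#define ESR_SRT_SHIFT	16		/* register number of the transfer */
#define ESR_SRT_MASK	0x1fu
#define ESR_WNR		(1u << 6)	/* write, not read */

int main(void)
{
	/* Hypothetical syndrome: valid, 32-bit write from x3 */
	uint32_t esr = ESR_ISV | (2u << ESR_SAS_SHIFT) | (3u << ESR_SRT_SHIFT) | ESR_WNR;

	if (!(esr & ESR_ISV)) {
		printf("cannot emulate: no syndrome information\n");
		return 1;
	}

	unsigned int size = 1u << ((esr >> ESR_SAS_SHIFT) & ESR_SAS_MASK);
	unsigned int reg  = (esr >> ESR_SRT_SHIFT) & ESR_SRT_MASK;

	printf("%s of %u bytes via x%u\n",
	       (esr & ESR_WNR) ? "write" : "read", size, reg);
	return 0;
}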
* [RFC PATCH v2 25/58] KVM: arm64: iommu: Add SMMUv3 driver
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (23 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 24/58] KVM: arm64: iommu: Support DABT for IOMMU Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 26/58] KVM: arm64: smmu-v3: Initialize registers Mostafa Saleh
` (33 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Add the skeleton for an Arm SMMUv3 driver at EL2.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/Makefile | 2 ++
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 22 +++++++++++++++++++++
drivers/iommu/Kconfig | 9 +++++++++
include/kvm/arm_smmu_v3.h | 18 +++++++++++++++++
4 files changed, 51 insertions(+)
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
create mode 100644 include/kvm/arm_smmu_v3.h
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index d846962e7246..edfd8a11ac90 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -16,6 +16,8 @@ hyp-obj-$(CONFIG_TRACING) += clock.o events.o trace.o
hyp-obj-$(CONFIG_MODULES) += modules.o
hyp-obj-y += $(lib-objs)
+hyp-obj-$(CONFIG_ARM_SMMU_V3_PKVM) += iommu/arm-smmu-v3.o
+
$(obj)/hyp.lds: $(src)/hyp.lds.S FORCE
$(call if_changed_dep,cpp_lds_S)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
new file mode 100644
index 000000000000..d2a570c9f3ec
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pKVM hyp driver for the Arm SMMUv3
+ *
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+#include <asm/kvm_hyp.h>
+#include <kvm/arm_smmu_v3.h>
+#include <nvhe/iommu.h>
+
+size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
+struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
+
+static int smmu_init(void)
+{
+ return -ENOSYS;
+}
+
+/* Shared with the kernel driver in EL1 */
+struct kvm_iommu_ops smmu_ops = {
+ .init = smmu_init,
+};
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index b3aa1f5d5321..fea5d6a8b90b 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -437,6 +437,15 @@ config TEGRA241_CMDQV
CMDQ-V extension.
endif
+config ARM_SMMU_V3_PKVM
+ bool "ARM SMMUv3 support for protected Virtual Machines"
+ depends on KVM && ARM64
+ help
+ Enable a SMMUv3 driver in the KVM hypervisor, to protect VMs against
+ memory accesses from devices owned by the host.
+
+ Say Y here if you intend to enable KVM in protected mode.
+
config S390_IOMMU
def_bool y if S390 && PCI
depends on S390 && PCI
diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
new file mode 100644
index 000000000000..521028b3ff71
--- /dev/null
+++ b/include/kvm/arm_smmu_v3.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_ARM_SMMU_V3_H
+#define __KVM_ARM_SMMU_V3_H
+
+#include <asm/kvm_asm.h>
+#include <kvm/iommu.h>
+
+struct hyp_arm_smmu_v3_device {
+ struct kvm_hyp_iommu iommu;
+};
+
+extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
+#define kvm_hyp_arm_smmu_v3_count kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count)
+
+extern struct hyp_arm_smmu_v3_device *kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_smmus);
+#define kvm_hyp_arm_smmu_v3_smmus kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_smmus)
+
+#endif /* __KVM_ARM_SMMU_V3_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 26/58] KVM: arm64: smmu-v3: Initialize registers
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (24 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 25/58] KVM: arm64: iommu: Add SMMUv3 driver Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue Mostafa Saleh
` (32 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Ensure all writable registers are properly initialized. We do not touch
registers that will not be read by the SMMU due to disabled features.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 130 +++++++++++++++++++-
include/kvm/arm_smmu_v3.h | 11 ++
2 files changed, 140 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index d2a570c9f3ec..f7e60c188cb0 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -4,16 +4,144 @@
*
* Copyright (C) 2022 Linaro Ltd.
*/
+#include <asm/arm-smmu-v3-common.h>
#include <asm/kvm_hyp.h>
#include <kvm/arm_smmu_v3.h>
#include <nvhe/iommu.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
+#include <nvhe/pkvm.h>
+
+#define ARM_SMMU_POLL_TIMEOUT_US 100000 /* 100ms arbitrary timeout */
size_t __ro_after_init kvm_hyp_arm_smmu_v3_count;
struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
+#define for_each_smmu(smmu) \
+ for ((smmu) = kvm_hyp_arm_smmu_v3_smmus; \
+ (smmu) != &kvm_hyp_arm_smmu_v3_smmus[kvm_hyp_arm_smmu_v3_count]; \
+ (smmu)++)
+
+/*
+ * Wait until @cond is true.
+ * Return 0 on success, or -ETIMEDOUT
+ */
+#define smmu_wait(_cond) \
+({ \
+ int __i = 0; \
+ int __ret = 0; \
+ \
+ while (!(_cond)) { \
+ if (++__i > ARM_SMMU_POLL_TIMEOUT_US) { \
+ __ret = -ETIMEDOUT; \
+ break; \
+ } \
+ pkvm_udelay(1); \
+ } \
+ __ret; \
+})
+
+static int smmu_write_cr0(struct hyp_arm_smmu_v3_device *smmu, u32 val)
+{
+ writel_relaxed(val, smmu->base + ARM_SMMU_CR0);
+ return smmu_wait(readl_relaxed(smmu->base + ARM_SMMU_CR0ACK) == val);
+}
+
+/* Transfer ownership of structures from host to hyp */
+static int smmu_take_pages(u64 phys, size_t size)
+{
+ WARN_ON(!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size));
+ return __pkvm_host_donate_hyp(phys >> PAGE_SHIFT, size >> PAGE_SHIFT);
+}
+
+static void smmu_reclaim_pages(u64 phys, size_t size)
+{
+ WARN_ON(!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size));
+ WARN_ON(__pkvm_hyp_donate_host(phys >> PAGE_SHIFT, size >> PAGE_SHIFT));
+}
+
+static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u64 val, old;
+ int ret;
+
+ if (!(readl_relaxed(smmu->base + ARM_SMMU_GBPA) & GBPA_ABORT))
+ return -EINVAL;
+
+ /* Initialize all RW registers that will be read by the SMMU */
+ ret = smmu_write_cr0(smmu, 0);
+ if (ret)
+ return ret;
+
+ val = FIELD_PREP(CR1_TABLE_SH, ARM_SMMU_SH_ISH) |
+ FIELD_PREP(CR1_TABLE_OC, CR1_CACHE_WB) |
+ FIELD_PREP(CR1_TABLE_IC, CR1_CACHE_WB) |
+ FIELD_PREP(CR1_QUEUE_SH, ARM_SMMU_SH_ISH) |
+ FIELD_PREP(CR1_QUEUE_OC, CR1_CACHE_WB) |
+ FIELD_PREP(CR1_QUEUE_IC, CR1_CACHE_WB);
+ writel_relaxed(val, smmu->base + ARM_SMMU_CR1);
+ writel_relaxed(CR2_PTM, smmu->base + ARM_SMMU_CR2);
+ writel_relaxed(0, smmu->base + ARM_SMMU_IRQ_CTRL);
+
+ val = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
+ old = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
+ /* Service Failure Mode is fatal */
+ if ((val ^ old) & GERROR_SFM_ERR)
+ return -EIO;
+ /* Clear pending errors */
+ writel_relaxed(val, smmu->base + ARM_SMMU_GERRORN);
+
+ return 0;
+}
+
+static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int ret;
+
+ if (!PAGE_ALIGNED(smmu->mmio_addr | smmu->mmio_size))
+ return -EINVAL;
+
+ ret = ___pkvm_host_donate_hyp(smmu->mmio_addr >> PAGE_SHIFT,
+ smmu->mmio_size >> PAGE_SHIFT,
+ /* accept_mmio */ true);
+ if (ret)
+ return ret;
+
+ smmu->base = hyp_phys_to_virt(smmu->mmio_addr);
+
+ ret = smmu_init_registers(smmu);
+ if (ret)
+ return ret;
+
+ return kvm_iommu_init_device(&smmu->iommu);
+}
+
static int smmu_init(void)
{
- return -ENOSYS;
+ int ret;
+ struct hyp_arm_smmu_v3_device *smmu;
+ size_t smmu_arr_size = PAGE_ALIGN(sizeof(*kvm_hyp_arm_smmu_v3_smmus) *
+ kvm_hyp_arm_smmu_v3_count);
+ phys_addr_t smmu_arr_phys;
+
+ kvm_hyp_arm_smmu_v3_smmus = kern_hyp_va(kvm_hyp_arm_smmu_v3_smmus);
+
+ smmu_arr_phys = hyp_virt_to_phys(kvm_hyp_arm_smmu_v3_smmus);
+
+ ret = smmu_take_pages(smmu_arr_phys, smmu_arr_size);
+ if (ret)
+ return ret;
+
+ for_each_smmu(smmu) {
+ ret = smmu_init_device(smmu);
+ if (ret)
+ goto out_reclaim_smmu;
+ }
+
+ return 0;
+out_reclaim_smmu:
+ smmu_reclaim_pages(smmu_arr_phys, smmu_arr_size);
+ return ret;
}
/* Shared with the kernel driver in EL1 */
diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
index 521028b3ff71..fb24bcef1624 100644
--- a/include/kvm/arm_smmu_v3.h
+++ b/include/kvm/arm_smmu_v3.h
@@ -5,8 +5,19 @@
#include <asm/kvm_asm.h>
#include <kvm/iommu.h>
+/*
+ * Parameters from the trusted host:
+ * @mmio_addr base address of the SMMU registers
+ * @mmio_size size of the registers resource
+ *
+ * Other members are filled and used at runtime by the SMMU driver.
+ */
struct hyp_arm_smmu_v3_device {
struct kvm_hyp_iommu iommu;
+ phys_addr_t mmio_addr;
+ size_t mmio_size;
+
+ void __iomem *base;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
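The CR1 value programmed above is a handful of 2-bit fields packed into one word. As a rough stand-alone illustration (field offsets per the SMMUv3 register layout; the SH_ISH/CACHE_WB values are assumed to match the ARM_SMMU_SH_ISH and CR1_CACHE_WB definitions used by the patch), requesting inner-shareable, write-back table and queue accesses amounts to:

#include <stdint.h>
#include <stdio.h>

/* CR1 field offsets, following the SMMUv3 register layout. */
#define CR1_TABLE_SH_SHIFT	10
#define CR1_TABLE_OC_SHIFT	8
#define CR1_TABLE_IC_SHIFT	6
#define CR1_QUEUE_SH_SHIFT	4
#define CR1_QUEUE_OC_SHIFT	2
#define CR1_QUEUE_IC_SHIFT	0

#define SH_ISH		3u	/* inner shareable */
#define CACHE_WB	1u	/* write-back cacheable */

int main(void)
{
	uint32_t cr1 = (SH_ISH   << CR1_TABLE_SH_SHIFT) |
		       (CACHE_WB << CR1_TABLE_OC_SHIFT) |
		       (CACHE_WB << CR1_TABLE_IC_SHIFT) |
		       (SH_ISH   << CR1_QUEUE_SH_SHIFT) |
		       (CACHE_WB << CR1_QUEUE_OC_SHIFT) |
		       (CACHE_WB << CR1_QUEUE_IC_SHIFT);

	printf("ARM_SMMU_CR1 = 0x%x\n", cr1);
	return 0;
}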
* [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (25 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 26/58] KVM: arm64: smmu-v3: Initialize registers Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2025-01-23 13:01 ` Robin Murphy
2024-12-12 18:03 ` [RFC PATCH v2 28/58] KVM: arm64: smmu-v3: Setup stream table Mostafa Saleh
` (31 subsequent siblings)
58 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Map the command queue allocated by the host into the hypervisor address
space. When the host mappings are finalized, the queue is unmapped from
the host.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 165 ++++++++++++++++++++
include/kvm/arm_smmu_v3.h | 4 +
2 files changed, 169 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index f7e60c188cb0..e15356509424 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -41,6 +41,15 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
__ret; \
})
+#define smmu_wait_event(_smmu, _cond) \
+({ \
+ if ((_smmu)->features & ARM_SMMU_FEAT_SEV) { \
+ while (!(_cond)) \
+ wfe(); \
+ } \
+ smmu_wait(_cond); \
+})
+
static int smmu_write_cr0(struct hyp_arm_smmu_v3_device *smmu, u32 val)
{
writel_relaxed(val, smmu->base + ARM_SMMU_CR0);
@@ -60,6 +69,123 @@ static void smmu_reclaim_pages(u64 phys, size_t size)
WARN_ON(__pkvm_hyp_donate_host(phys >> PAGE_SHIFT, size >> PAGE_SHIFT));
}
+#define Q_WRAP(smmu, reg) ((reg) & (1 << (smmu)->cmdq_log2size))
+#define Q_IDX(smmu, reg) ((reg) & ((1 << (smmu)->cmdq_log2size) - 1))
+
+static bool smmu_cmdq_full(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
+
+ return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
+ Q_WRAP(smmu, smmu->cmdq_prod) != Q_WRAP(smmu, cons);
+}
+
+static bool smmu_cmdq_empty(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
+
+ return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
+ Q_WRAP(smmu, smmu->cmdq_prod) == Q_WRAP(smmu, cons);
+}
+
+static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_ent *ent)
+{
+ int i;
+ int ret;
+ u64 cmd[CMDQ_ENT_DWORDS] = {};
+ int idx = Q_IDX(smmu, smmu->cmdq_prod);
+ u64 *slot = smmu->cmdq_base + idx * CMDQ_ENT_DWORDS;
+
+ if (smmu->iommu.power_is_off)
+ return -EPIPE;
+
+ ret = smmu_wait_event(smmu, !smmu_cmdq_full(smmu));
+ if (ret)
+ return ret;
+
+ cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
+
+ switch (ent->opcode) {
+ case CMDQ_OP_CFGI_ALL:
+ cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
+ break;
+ case CMDQ_OP_CFGI_CD:
+ cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SSID, ent->cfgi.ssid);
+ fallthrough;
+ case CMDQ_OP_CFGI_STE:
+ cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
+ cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_LEAF, ent->cfgi.leaf);
+ break;
+ case CMDQ_OP_TLBI_NH_VA:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
+ cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
+ break;
+ case CMDQ_OP_TLBI_NSNH_ALL:
+ break;
+ case CMDQ_OP_TLBI_NH_ASID:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
+ fallthrough;
+ case CMDQ_OP_TLBI_S12_VMALL:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+ break;
+ case CMDQ_OP_TLBI_S2_IPA:
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
+ cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
+ cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
+ cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
+ break;
+ case CMDQ_OP_CMD_SYNC:
+ cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ for (i = 0; i < CMDQ_ENT_DWORDS; i++)
+ slot[i] = cpu_to_le64(cmd[i]);
+
+ smmu->cmdq_prod++;
+ writel(Q_IDX(smmu, smmu->cmdq_prod) | Q_WRAP(smmu, smmu->cmdq_prod),
+ smmu->base + ARM_SMMU_CMDQ_PROD);
+ return 0;
+}
+
+static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int ret;
+ struct arm_smmu_cmdq_ent cmd = {
+ .opcode = CMDQ_OP_CMD_SYNC,
+ };
+
+ ret = smmu_add_cmd(smmu, &cmd);
+ if (ret)
+ return ret;
+
+ return smmu_wait_event(smmu, smmu_cmdq_empty(smmu));
+}
+
+__maybe_unused
+static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_ent *cmd)
+{
+ int ret = smmu_add_cmd(smmu, cmd);
+
+ if (ret)
+ return ret;
+
+ return smmu_sync_cmd(smmu);
+}
+
static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
{
u64 val, old;
@@ -94,6 +220,41 @@ static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u64 cmdq_base;
+ size_t cmdq_nr_entries, cmdq_size;
+ int ret;
+ enum kvm_pgtable_prot prot = PAGE_HYP;
+
+ cmdq_base = readq_relaxed(smmu->base + ARM_SMMU_CMDQ_BASE);
+ if (cmdq_base & ~(Q_BASE_RWA | Q_BASE_ADDR_MASK | Q_BASE_LOG2SIZE))
+ return -EINVAL;
+
+ smmu->cmdq_log2size = cmdq_base & Q_BASE_LOG2SIZE;
+ cmdq_nr_entries = 1 << smmu->cmdq_log2size;
+ cmdq_size = cmdq_nr_entries * CMDQ_ENT_DWORDS * 8;
+
+ cmdq_base &= Q_BASE_ADDR_MASK;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+
+ ret = ___pkvm_host_donate_hyp_prot(cmdq_base >> PAGE_SHIFT,
+ PAGE_ALIGN(cmdq_size) >> PAGE_SHIFT,
+ false, prot);
+ if (ret)
+ return ret;
+
+ smmu->cmdq_base = hyp_phys_to_virt(cmdq_base);
+
+ memset(smmu->cmdq_base, 0, cmdq_size);
+ writel_relaxed(0, smmu->base + ARM_SMMU_CMDQ_PROD);
+ writel_relaxed(0, smmu->base + ARM_SMMU_CMDQ_CONS);
+
+ return 0;
+}
+
static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
{
int ret;
@@ -113,6 +274,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
if (ret)
return ret;
+ ret = smmu_init_cmdq(smmu);
+ if (ret)
+ return ret;
+
return kvm_iommu_init_device(&smmu->iommu);
}
diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
index fb24bcef1624..393a1a04edba 100644
--- a/include/kvm/arm_smmu_v3.h
+++ b/include/kvm/arm_smmu_v3.h
@@ -16,8 +16,12 @@ struct hyp_arm_smmu_v3_device {
struct kvm_hyp_iommu iommu;
phys_addr_t mmio_addr;
size_t mmio_size;
+ unsigned long features;
void __iomem *base;
+ u32 cmdq_prod;
+ u64 *cmdq_base;
+ size_t cmdq_log2size;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
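The full/empty tests above use the usual trick of carrying one wrap bit above the index bits of PROD and CONS. A stand-alone rendition of the Q_IDX()/Q_WRAP() arithmetic, using a hypothetical 4-entry queue (log2size = 2), shows how the same index with differing wrap bits means full while matching wrap bits mean empty:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Same arithmetic as the Q_IDX()/Q_WRAP() macros in the patch. */
static const unsigned int log2size = 2;	/* hypothetical 4-entry queue */

static uint32_t q_idx(uint32_t reg)  { return reg & ((1u << log2size) - 1); }
static uint32_t q_wrap(uint32_t reg) { return reg & (1u << log2size); }

static bool queue_full(uint32_t prod, uint32_t cons)
{
	return q_idx(prod) == q_idx(cons) && q_wrap(prod) != q_wrap(cons);
}

static bool queue_empty(uint32_t prod, uint32_t cons)
{
	return q_idx(prod) == q_idx(cons) && q_wrap(prod) == q_wrap(cons);
}

int main(void)
{
	uint32_t cons = 0, prod = 0;

	printf("empty at start:    %d\n", queue_empty(prod, cons));

	/* Producer fills all four slots: the indexes match again but the
	 * wrap bits now differ, so the queue reads as full, not empty. */
	for (int i = 0; i < 4; i++)
		prod++;
	printf("full after 4 adds: %d\n", queue_full(prod, cons));

	/* Consumer catches up: wrap bits match again -> empty. */
	cons = prod;
	printf("empty after drain: %d\n", queue_empty(prod, cons));
	return 0;
}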
* [RFC PATCH v2 28/58] KVM: arm64: smmu-v3: Setup stream table
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (26 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 29/58] KVM: arm64: smmu-v3: Setup event queue Mostafa Saleh
` (30 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Map the stream table allocated by the host into the hypervisor address
space. When the host mappings are finalized, the table is unmapped from
the host. Depending on the host configuration, the stream table may have
one or two levels. Populate the level-2 stream table lazily.
Also, add accessors for STEs.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 157 +++++++++++++++++++-
include/kvm/arm_smmu_v3.h | 3 +
2 files changed, 159 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index e15356509424..43d2ce7828c1 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -174,7 +174,6 @@ static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
return smmu_wait_event(smmu, smmu_cmdq_empty(smmu));
}
-__maybe_unused
static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
struct arm_smmu_cmdq_ent *cmd)
{
@@ -186,6 +185,94 @@ static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
return smmu_sync_cmd(smmu);
}
+__maybe_unused
+static int smmu_sync_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
+{
+ struct arm_smmu_cmdq_ent cmd = {
+ .opcode = CMDQ_OP_CFGI_STE,
+ .cfgi.sid = sid,
+ .cfgi.leaf = true,
+ };
+
+ return smmu_send_cmd(smmu, &cmd);
+}
+
+static int smmu_alloc_l2_strtab(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
+{
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+ struct arm_smmu_strtab_l1 *l1_desc;
+ dma_addr_t l2ptr_dma;
+ struct arm_smmu_strtab_l2 *l2table;
+ size_t l2_order = get_order(sizeof(struct arm_smmu_strtab_l2));
+ int flags = 0;
+
+ l1_desc = &cfg->l2.l1tab[arm_smmu_strtab_l1_idx(sid)];
+ if (l1_desc->l2ptr)
+ return 0;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ flags |= IOMMU_PAGE_NOCACHE;
+
+ l2table = kvm_iommu_donate_pages(l2_order, flags);
+ if (!l2table)
+ return -ENOMEM;
+
+ l2ptr_dma = hyp_virt_to_phys(l2table);
+
+ if (l2ptr_dma & (~STRTAB_L1_DESC_L2PTR_MASK | ~PAGE_MASK)) {
+ kvm_iommu_reclaim_pages(l2table, l2_order);
+ return -EINVAL;
+ }
+
+ /* Ensure the empty stream table is visible before the descriptor write */
+ wmb();
+
+ arm_smmu_write_strtab_l1_desc(l1_desc, l2ptr_dma);
+ return 0;
+}
+
+static struct arm_smmu_ste *
+smmu_get_ste_ptr(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
+{
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+ struct arm_smmu_strtab_l1 *l1_desc =
+ &cfg->l2.l1tab[arm_smmu_strtab_l1_idx(sid)];
+ struct arm_smmu_strtab_l2 *l2ptr;
+
+ if (arm_smmu_strtab_l1_idx(sid) > cfg->l2.num_l1_ents)
+ return NULL;
+ /* L2 should be allocated before calling this. */
+ if (WARN_ON(!l1_desc->l2ptr))
+ return NULL;
+
+ l2ptr = hyp_phys_to_virt(l1_desc->l2ptr & STRTAB_L1_DESC_L2PTR_MASK);
+ /* Two-level walk */
+ return &l2ptr->stes[arm_smmu_strtab_l2_idx(sid)];
+ }
+
+ if (sid > cfg->linear.num_ents)
+ return NULL;
+ /* Simple linear lookup */
+ return &cfg->linear.table[sid];
+}
+
+__maybe_unused
+static struct arm_smmu_ste *
+smmu_get_alloc_ste_ptr(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
+{
+ if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
+ int ret = smmu_alloc_l2_strtab(smmu, sid);
+
+ if (ret) {
+ WARN_ON(ret != -ENOMEM);
+ return NULL;
+ }
+ }
+ return smmu_get_ste_ptr(smmu, sid);
+}
+
static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
{
u64 val, old;
@@ -255,6 +342,70 @@ static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+static int smmu_init_strtab(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int ret;
+ u64 strtab_base;
+ size_t strtab_size;
+ u32 strtab_cfg, fmt;
+ int split, log2size;
+ struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
+ enum kvm_pgtable_prot prot = PAGE_HYP;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ prot |= KVM_PGTABLE_PROT_NORMAL_NC;
+
+ strtab_base = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
+ if (strtab_base & ~(STRTAB_BASE_ADDR_MASK | STRTAB_BASE_RA))
+ return -EINVAL;
+
+ strtab_cfg = readl_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE_CFG);
+ if (strtab_cfg & ~(STRTAB_BASE_CFG_FMT | STRTAB_BASE_CFG_SPLIT |
+ STRTAB_BASE_CFG_LOG2SIZE))
+ return -EINVAL;
+
+ fmt = FIELD_GET(STRTAB_BASE_CFG_FMT, strtab_cfg);
+ split = FIELD_GET(STRTAB_BASE_CFG_SPLIT, strtab_cfg);
+ log2size = FIELD_GET(STRTAB_BASE_CFG_LOG2SIZE, strtab_cfg);
+ strtab_base &= STRTAB_BASE_ADDR_MASK;
+
+ switch (fmt) {
+ case STRTAB_BASE_CFG_FMT_LINEAR:
+ if (split)
+ return -EINVAL;
+ cfg->linear.num_ents = 1 << log2size;
+ strtab_size = cfg->linear.num_ents * sizeof(struct arm_smmu_ste);
+ cfg->linear.ste_dma = strtab_base;
+ ret = ___pkvm_host_donate_hyp_prot(strtab_base >> PAGE_SHIFT,
+ PAGE_ALIGN(strtab_size) >> PAGE_SHIFT,
+ false, prot);
+ if (ret)
+ return -EINVAL;
+ cfg->linear.table = hyp_phys_to_virt(strtab_base);
+ /* Disable all STEs */
+ memset(cfg->linear.table, 0, strtab_size);
+ break;
+ case STRTAB_BASE_CFG_FMT_2LVL:
+ if (split != STRTAB_SPLIT)
+ return -EINVAL;
+ cfg->l2.num_l1_ents = 1 << max(0, log2size - split);
+ strtab_size = cfg->l2.num_l1_ents * sizeof(struct arm_smmu_strtab_l1);
+ cfg->l2.l1_dma = strtab_base;
+ ret = ___pkvm_host_donate_hyp_prot(strtab_base >> PAGE_SHIFT,
+ PAGE_ALIGN(strtab_size) >> PAGE_SHIFT,
+ false, prot);
+ if (ret)
+ return -EINVAL;
+ cfg->l2.l1tab = hyp_phys_to_virt(strtab_base);
+ /* Disable all STEs */
+ memset(cfg->l2.l1tab, 0, strtab_size);
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
{
int ret;
@@ -278,6 +429,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
if (ret)
return ret;
+ ret = smmu_init_strtab(smmu);
+ if (ret)
+ return ret;
+
return kvm_iommu_init_device(&smmu->iommu);
}
diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
index 393a1a04edba..352c1b2dc72a 100644
--- a/include/kvm/arm_smmu_v3.h
+++ b/include/kvm/arm_smmu_v3.h
@@ -2,6 +2,7 @@
#ifndef __KVM_ARM_SMMU_V3_H
#define __KVM_ARM_SMMU_V3_H
+#include <asm/arm-smmu-v3-common.h>
#include <asm/kvm_asm.h>
#include <kvm/iommu.h>
@@ -22,6 +23,8 @@ struct hyp_arm_smmu_v3_device {
u32 cmdq_prod;
u64 *cmdq_base;
size_t cmdq_log2size;
+ /* strtab_cfg.l2.l2ptrs is not used, instead computed from L1 */
+ struct arm_smmu_strtab_cfg strtab_cfg;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
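For the two-level case, a StreamID is simply split into an L1 index and an L2 index. Assuming the 8-bit STRTAB_SPLIT used by the Linux driver (256 STEs per L2 table), the lookup performed by smmu_get_ste_ptr() boils down to:

#include <stdint.h>
#include <stdio.h>

#define STRTAB_SPLIT	8	/* assumption: the split used by the Linux driver */

/* Stand-ins for arm_smmu_strtab_l1_idx()/arm_smmu_strtab_l2_idx() */
static uint32_t strtab_l1_idx(uint32_t sid) { return sid >> STRTAB_SPLIT; }
static uint32_t strtab_l2_idx(uint32_t sid) { return sid & ((1u << STRTAB_SPLIT) - 1); }

int main(void)
{
	uint32_t sid = 0x1234;	/* hypothetical StreamID */

	/* The STE lives in L2 table number l1_idx, at slot l2_idx. */
	printf("sid 0x%x -> L1 index %u, L2 index %u\n",
	       sid, strtab_l1_idx(sid), strtab_l2_idx(sid));
	return 0;
}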
* [RFC PATCH v2 29/58] KVM: arm64: smmu-v3: Setup event queue
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (27 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 28/58] KVM: arm64: smmu-v3: Setup stream table Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 30/58] KVM: arm64: smmu-v3: Reset the device Mostafa Saleh
` (29 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
The host can use the event queue for debugging and, unlike the command
queue, it is managed by the kernel.
However, it must be set to a shared state so that it can't be donated to
the hypervisor later.
This relies on ARM_SMMU_EVTQ_BASE not being changeable after
de-privilege.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 39 +++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 43d2ce7828c1..5020f74421ad 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -342,6 +342,41 @@ static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+/*
+ * Event queue support is optional and managed by the kernel.
+ * However, it must be set to a shared state so that it can't be
+ * donated to the hypervisor later.
+ * This relies on ARM_SMMU_EVTQ_BASE not being changeable after
+ * de-privilege.
+ */
+static int smmu_init_evtq(struct hyp_arm_smmu_v3_device *smmu)
+{
+ u64 evtq_base, evtq_pfn;
+ size_t evtq_nr_entries, evtq_size, evtq_nr_pages;
+ size_t i;
+ int ret;
+
+ evtq_base = readq_relaxed(smmu->base + ARM_SMMU_EVTQ_BASE);
+ if (!evtq_base)
+ return 0;
+
+ if (evtq_base & ~(Q_BASE_RWA | Q_BASE_ADDR_MASK | Q_BASE_LOG2SIZE))
+ return -EINVAL;
+
+ evtq_nr_entries = 1 << (evtq_base & Q_BASE_LOG2SIZE);
+ evtq_size = evtq_nr_entries * EVTQ_ENT_DWORDS * 8;
+ evtq_nr_pages = PAGE_ALIGN(evtq_size) >> PAGE_SHIFT;
+
+ evtq_pfn = PAGE_ALIGN(evtq_base & Q_BASE_ADDR_MASK) >> PAGE_SHIFT;
+
+ for (i = 0 ; i < evtq_nr_pages ; ++i) {
+ ret = __pkvm_host_share_hyp(evtq_pfn + i);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
static int smmu_init_strtab(struct hyp_arm_smmu_v3_device *smmu)
{
int ret;
@@ -429,6 +464,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
if (ret)
return ret;
+ ret = smmu_init_evtq(smmu);
+ if (ret)
+ return ret;
+
ret = smmu_init_strtab(smmu);
if (ret)
return ret;
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
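The number of pages shared with the hypervisor follows directly from the LOG2SIZE field of EVTQ_BASE: each event record is EVTQ_ENT_DWORDS (4) doublewords, i.e. 32 bytes. A quick stand-alone version of that sizing, assuming 4K pages and a hypothetical log2size of 7:

#include <stddef.h>
#include <stdio.h>

#define EVTQ_ENT_DWORDS	4	/* 32-byte event records */
#define PAGE_SIZE	4096UL	/* assumption: 4K pages */

int main(void)
{
	unsigned int log2size = 7;	/* hypothetical Q_BASE.LOG2SIZE */
	size_t nr_entries = 1UL << log2size;
	size_t q_size = nr_entries * EVTQ_ENT_DWORDS * 8;
	/* PAGE_ALIGN(q_size) >> PAGE_SHIFT in the patch */
	size_t nr_pages = (q_size + PAGE_SIZE - 1) / PAGE_SIZE;

	printf("%zu entries, %zu bytes, %zu page(s) shared with the hyp\n",
	       nr_entries, q_size, nr_pages);
	return 0;
}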
* [RFC PATCH v2 30/58] KVM: arm64: smmu-v3: Reset the device
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (28 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 29/58] KVM: arm64: smmu-v3: Setup event queue Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 31/58] KVM: arm64: smmu-v3: Support io-pgtable Mostafa Saleh
` (28 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Now that all structures are initialized, send global invalidations and
reset the SMMUv3 device.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 38 +++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 5020f74421ad..58662c2c4c97 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -441,6 +441,40 @@ static int smmu_init_strtab(struct hyp_arm_smmu_v3_device *smmu)
return 0;
}
+static int smmu_reset_device(struct hyp_arm_smmu_v3_device *smmu)
+{
+ int ret;
+ struct arm_smmu_cmdq_ent cfgi_cmd = {
+ .opcode = CMDQ_OP_CFGI_ALL,
+ };
+ struct arm_smmu_cmdq_ent tlbi_cmd = {
+ .opcode = CMDQ_OP_TLBI_NSNH_ALL,
+ };
+
+ /* Invalidate all cached configs and TLBs */
+ ret = smmu_write_cr0(smmu, CR0_CMDQEN);
+ if (ret)
+ return ret;
+
+ ret = smmu_add_cmd(smmu, &cfgi_cmd);
+ if (ret)
+ goto err_disable_cmdq;
+
+ ret = smmu_add_cmd(smmu, &tlbi_cmd);
+ if (ret)
+ goto err_disable_cmdq;
+
+ ret = smmu_sync_cmd(smmu);
+ if (ret)
+ goto err_disable_cmdq;
+
+ /* Enable translation */
+ return smmu_write_cr0(smmu, CR0_SMMUEN | CR0_CMDQEN | CR0_ATSCHK);
+
+err_disable_cmdq:
+ return smmu_write_cr0(smmu, 0);
+}
+
static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
{
int ret;
@@ -472,6 +506,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
if (ret)
return ret;
+ ret = smmu_reset_device(smmu);
+ if (ret)
+ return ret;
+
return kvm_iommu_init_device(&smmu->iommu);
}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 31/58] KVM: arm64: smmu-v3: Support io-pgtable
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (29 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 30/58] KVM: arm64: smmu-v3: Reset the device Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 32/58] KVM: arm64: smmu-v3: Add {alloc/free}_domain Mostafa Saleh
` (27 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Implement the hypervisor version of io-pgtable allocation functions,
mirroring drivers/iommu/io-pgtable-arm.c. Page allocation uses the
IOMMU pool filled by the host.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +
.../arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c | 153 ++++++++++++++++++
include/linux/io-pgtable-arm.h | 11 ++
3 files changed, 166 insertions(+)
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index edfd8a11ac90..e4f662b1447f 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -17,6 +17,8 @@ hyp-obj-$(CONFIG_MODULES) += modules.o
hyp-obj-y += $(lib-objs)
hyp-obj-$(CONFIG_ARM_SMMU_V3_PKVM) += iommu/arm-smmu-v3.o
+hyp-obj-$(CONFIG_ARM_SMMU_V3_PKVM) += iommu/io-pgtable-arm.o \
+ ../../../../../drivers/iommu/io-pgtable-arm-common.o
$(obj)/hyp.lds: $(src)/hyp.lds.S FORCE
$(call if_changed_dep,cpp_lds_S)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c b/arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
new file mode 100644
index 000000000000..aa5bf7c0ed03
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Arm Ltd.
+ */
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+#include <kvm/arm_smmu_v3.h>
+#include <linux/types.h>
+#include <linux/gfp_types.h>
+#include <linux/io-pgtable-arm.h>
+
+#include <nvhe/alloc.h>
+#include <nvhe/iommu.h>
+#include <nvhe/mem_protect.h>
+
+int arm_lpae_map_exists(void)
+{
+ return -EEXIST;
+}
+
+int arm_lpae_unmap_empty(void)
+{
+ return -EEXIST;
+}
+
+void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
+ struct io_pgtable_cfg *cfg, void *cookie)
+{
+ void *addr;
+
+ if (!PAGE_ALIGNED(size))
+ return NULL;
+
+ addr = kvm_iommu_donate_pages(get_order(size), 0);
+
+ if (addr && !cfg->coherent_walk)
+ kvm_flush_dcache_to_poc(addr, size);
+
+ return addr;
+}
+
+void __arm_lpae_free_pages(void *addr, size_t size, struct io_pgtable_cfg *cfg,
+ void *cookie)
+{
+ u8 order;
+
+ /*
+ * It's guaranteed that all allocations are aligned, but core code
+ * might free the PGD with its actual size.
+ */
+ order = get_order(PAGE_ALIGN(size));
+
+ if (!cfg->coherent_walk)
+ kvm_flush_dcache_to_poc(addr, size);
+
+ kvm_iommu_reclaim_pages(addr, order);
+}
+
+void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
+ struct io_pgtable_cfg *cfg)
+{
+ if (!cfg->coherent_walk)
+ kvm_flush_dcache_to_poc(ptep, sizeof(*ptep) * num_entries);
+}
+
+static int kvm_arm_io_pgtable_init(struct io_pgtable_cfg *cfg,
+ struct arm_lpae_io_pgtable *data)
+{
+ int ret = -EINVAL;
+
+ if (cfg->fmt == ARM_64_LPAE_S2)
+ ret = arm_lpae_init_pgtable_s2(cfg, data);
+ else if (cfg->fmt == ARM_64_LPAE_S1)
+ ret = arm_lpae_init_pgtable_s1(cfg, data);
+
+ if (ret)
+ return ret;
+
+ data->iop.cfg = *cfg;
+ data->iop.fmt = cfg->fmt;
+
+ return 0;
+}
+
+struct io_pgtable *kvm_arm_io_pgtable_alloc(struct io_pgtable_cfg *cfg,
+ void *cookie,
+ int *out_ret)
+{
+ size_t pgd_size, alignment;
+ struct arm_lpae_io_pgtable *data;
+ int ret;
+
+ data = hyp_alloc(sizeof(*data));
+ if (!data) {
+ *out_ret = hyp_alloc_errno();
+ return NULL;
+ }
+
+ ret = kvm_arm_io_pgtable_init(cfg, data);
+ if (ret)
+ goto out_free;
+
+ pgd_size = PAGE_ALIGN(ARM_LPAE_PGD_SIZE(data));
+ data->pgd = __arm_lpae_alloc_pages(pgd_size, 0, &data->iop.cfg, cookie);
+ if (!data->pgd) {
+ ret = -ENOMEM;
+ goto out_free;
+ }
+ /*
+ * If it has eight or more entries, the table must be aligned on
+ * its size. Otherwise 64 bytes.
+ */
+ alignment = max(pgd_size, 8 * sizeof(arm_lpae_iopte));
+ if (!IS_ALIGNED(hyp_virt_to_phys(data->pgd), alignment)) {
+ __arm_lpae_free_pages(data->pgd, pgd_size,
+ &data->iop.cfg, cookie);
+ ret = -EINVAL;
+ goto out_free;
+ }
+
+ data->iop.cookie = cookie;
+ if (cfg->fmt == ARM_64_LPAE_S2)
+ data->iop.cfg.arm_lpae_s2_cfg.vttbr = __arm_lpae_virt_to_phys(data->pgd);
+ else if (cfg->fmt == ARM_64_LPAE_S1)
+ data->iop.cfg.arm_lpae_s1_cfg.ttbr = __arm_lpae_virt_to_phys(data->pgd);
+
+ if (!data->iop.cfg.coherent_walk)
+ kvm_flush_dcache_to_poc(data->pgd, pgd_size);
+
+ /* Ensure the empty pgd is visible before any actual TTBR write */
+ wmb();
+
+ *out_ret = 0;
+ return &data->iop;
+out_free:
+ hyp_free(data);
+ *out_ret = ret;
+ return NULL;
+}
+
+int kvm_arm_io_pgtable_free(struct io_pgtable *iopt)
+{
+ struct arm_lpae_io_pgtable *data = io_pgtable_to_data(iopt);
+ size_t pgd_size = ARM_LPAE_PGD_SIZE(data);
+
+ if (!data->iop.cfg.coherent_walk)
+ kvm_flush_dcache_to_poc(data->pgd, pgd_size);
+
+ io_pgtable_tlb_flush_all(iopt);
+ __arm_lpae_free_pgtable(data, data->start_level, data->pgd);
+ hyp_free(data);
+ return 0;
+}
diff --git a/include/linux/io-pgtable-arm.h b/include/linux/io-pgtable-arm.h
index 337e9254fdbd..88922314157d 100644
--- a/include/linux/io-pgtable-arm.h
+++ b/include/linux/io-pgtable-arm.h
@@ -191,8 +191,19 @@ static inline bool iopte_table(arm_lpae_iopte pte, int lvl)
return iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE;
}
+#ifdef __KVM_NVHE_HYPERVISOR__
+#include <nvhe/memory.h>
+#define __arm_lpae_virt_to_phys hyp_virt_to_phys
+#define __arm_lpae_phys_to_virt hyp_phys_to_virt
+
+struct io_pgtable *kvm_arm_io_pgtable_alloc(struct io_pgtable_cfg *cfg,
+ void *cookie,
+ int *out_ret);
+int kvm_arm_io_pgtable_free(struct io_pgtable *iop);
+#else
#define __arm_lpae_virt_to_phys __pa
#define __arm_lpae_phys_to_virt __va
+#endif
/* Generic functions */
void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
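The PGD alignment rule enforced in kvm_arm_io_pgtable_alloc() is the one stated in the comment: a table of eight or more entries must be naturally aligned to its size, smaller ones to 64 bytes. A small stand-alone check of that rule against a few hypothetical physical addresses:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Same rule as in kvm_arm_io_pgtable_alloc(): a PGD with eight or more
 * entries must be aligned to its own size, smaller ones to 64 bytes. */
static size_t pgd_alignment(size_t pgd_size)
{
	size_t min_align = 8 * sizeof(uint64_t);	/* 8 * sizeof(arm_lpae_iopte) */

	return pgd_size > min_align ? pgd_size : min_align;
}

static bool pgd_ok(uint64_t phys, size_t pgd_size)
{
	return (phys & (pgd_alignment(pgd_size) - 1)) == 0;
}

int main(void)
{
	/* Hypothetical physical addresses returned by the page allocator. */
	printf("32-byte PGD at 0x1040: %s\n", pgd_ok(0x1040, 32)   ? "ok" : "bad");
	printf("4K PGD at 0x5000:      %s\n", pgd_ok(0x5000, 4096) ? "ok" : "bad");
	printf("4K PGD at 0x5800:      %s\n", pgd_ok(0x5800, 4096) ? "ok" : "bad");
	return 0;
}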
* [RFC PATCH v2 32/58] KVM: arm64: smmu-v3: Add {alloc/free}_domain
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (30 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 31/58] KVM: arm64: smmu-v3: Support io-pgtable Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 33/58] KVM: arm64: smmu-v3: Add TLB ops Mostafa Saleh
` (26 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add the SMMUv3 alloc/free domain operations. As these operations are
not tied to a specific IOMMU, we can't do much with the io-pgtable
allocation or configuration yet.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 64 +++++++++++++++++++++
include/kvm/arm_smmu_v3.h | 6 ++
2 files changed, 70 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 58662c2c4c97..3181933e9a34 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -7,6 +7,8 @@
#include <asm/arm-smmu-v3-common.h>
#include <asm/kvm_hyp.h>
#include <kvm/arm_smmu_v3.h>
+#include <linux/io-pgtable-arm.h>
+#include <nvhe/alloc.h>
#include <nvhe/iommu.h>
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
@@ -50,6 +52,22 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
smmu_wait(_cond); \
})
+/*
+ * SMMUv3 domain:
+ * @domain: Pointer to the IOMMU domain.
+ * @smmu: SMMU instance for this domain.
+ * @type: Type of domain (S1, S2)
+ * @pgt_lock: Lock for page table
+ * @pgtable: io_pgtable instance for this domain
+ */
+struct hyp_arm_smmu_v3_domain {
+ struct kvm_hyp_iommu_domain *domain;
+ struct hyp_arm_smmu_v3_device *smmu;
+ u32 type;
+ hyp_spinlock_t pgt_lock;
+ struct io_pgtable *pgtable;
+};
+
static int smmu_write_cr0(struct hyp_arm_smmu_v3_device *smmu, u32 val)
{
writel_relaxed(val, smmu->base + ARM_SMMU_CR0);
@@ -541,7 +559,53 @@ static int smmu_init(void)
return ret;
}
+static struct kvm_hyp_iommu *smmu_id_to_iommu(pkvm_handle_t smmu_id)
+{
+ if (smmu_id >= kvm_hyp_arm_smmu_v3_count)
+ return NULL;
+ smmu_id = array_index_nospec(smmu_id, kvm_hyp_arm_smmu_v3_count);
+
+ return &kvm_hyp_arm_smmu_v3_smmus[smmu_id].iommu;
+}
+
+static int smmu_alloc_domain(struct kvm_hyp_iommu_domain *domain, int type)
+{
+ struct hyp_arm_smmu_v3_domain *smmu_domain;
+
+ if (type >= KVM_ARM_SMMU_DOMAIN_MAX)
+ return -EINVAL;
+
+ smmu_domain = hyp_alloc(sizeof(*smmu_domain));
+ if (!smmu_domain)
+ return -ENOMEM;
+
+ /*
+ * Can't do much without knowing the SMMUv3.
+ * The page table will be allocated at attach_dev, but can be
+ * freed from free_domain.
+ */
+ smmu_domain->domain = domain;
+ smmu_domain->type = type;
+ hyp_spin_lock_init(&smmu_domain->pgt_lock);
+ domain->priv = (void *)smmu_domain;
+
+ return 0;
+}
+
+static void smmu_free_domain(struct kvm_hyp_iommu_domain *domain)
+{
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+
+ if (smmu_domain->pgtable)
+ kvm_arm_io_pgtable_free(smmu_domain->pgtable);
+
+ hyp_free(smmu_domain);
+}
+
/* Shared with the kernel driver in EL1 */
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
+ .get_iommu_by_id = smmu_id_to_iommu,
+ .alloc_domain = smmu_alloc_domain,
+ .free_domain = smmu_free_domain,
};
diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
index 352c1b2dc72a..ded98cbaebc1 100644
--- a/include/kvm/arm_smmu_v3.h
+++ b/include/kvm/arm_smmu_v3.h
@@ -33,4 +33,10 @@ extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
extern struct hyp_arm_smmu_v3_device *kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_smmus);
#define kvm_hyp_arm_smmu_v3_smmus kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_smmus)
+enum kvm_arm_smmu_domain_type {
+ KVM_ARM_SMMU_DOMAIN_S1,
+ KVM_ARM_SMMU_DOMAIN_S2,
+ KVM_ARM_SMMU_DOMAIN_MAX,
+};
+
#endif /* __KVM_ARM_SMMU_V3_H */
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 33/58] KVM: arm64: smmu-v3: Add TLB ops
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (31 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 32/58] KVM: arm64: smmu-v3: Add {alloc/free}_domain Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 34/58] KVM: arm64: smmu-v3: Add context descriptor functions Mostafa Saleh
` (25 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add the TLB invalidation functions that will be used next by the page
table code and the attach/detach functions.
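For reference, a worked example of the scale/num encoding used by the
range-invalidation path below, assuming a 4K leaf granule; the numbers are
illustrative only:
/*
 * Worked example, 4K granule so tg = 12 and the TG field is (12 - 10) / 2 = 1:
 *
 *   size      = 0x180000            -> num_pages = size >> 12 = 384
 *   scale     = __ffs(384)          = 7
 *   num       = (384 >> 7) & 0x1f   = 3   -> cmd.tlbi.num = num - 1 = 2
 *   inv_range = num << (scale + tg) = 3 << 19 = 0x180000
 *
 * so the whole 1.5MB range is covered by a single range TLBI command
 * instead of one invalidation per page.
 */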
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 167 ++++++++++++++++++++
1 file changed, 167 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 3181933e9a34..5f00d5cdf5bc 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -602,10 +602,177 @@ static void smmu_free_domain(struct kvm_hyp_iommu_domain *domain)
hyp_free(smmu_domain);
}
+static void smmu_inv_domain(struct hyp_arm_smmu_v3_domain *smmu_domain)
+{
+ struct kvm_hyp_iommu_domain *domain = smmu_domain->domain;
+ struct hyp_arm_smmu_v3_device *smmu = smmu_domain->smmu;
+ struct arm_smmu_cmdq_ent cmd = {};
+
+ if (smmu_domain->pgtable->cfg.fmt == ARM_64_LPAE_S2) {
+ cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
+ cmd.tlbi.vmid = domain->domain_id;
+ } else {
+ cmd.opcode = CMDQ_OP_TLBI_NH_ASID;
+ cmd.tlbi.asid = domain->domain_id;
+ }
+
+ if (smmu->iommu.power_is_off)
+ return;
+
+ WARN_ON(smmu_send_cmd(smmu, &cmd));
+}
+
+static void smmu_tlb_flush_all(void *cookie)
+{
+ struct kvm_hyp_iommu_domain *domain = cookie;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ struct hyp_arm_smmu_v3_device *smmu = smmu_domain->smmu;
+
+ kvm_iommu_lock(&smmu->iommu);
+ smmu_inv_domain(smmu_domain);
+ kvm_iommu_unlock(&smmu->iommu);
+}
+
+static int smmu_tlb_inv_range_smmu(struct hyp_arm_smmu_v3_device *smmu,
+ struct kvm_hyp_iommu_domain *domain,
+ struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size, size_t granule)
+{
+ int ret = 0;
+ unsigned long end = iova + size, num_pages = 0, tg = 0;
+ size_t inv_range = granule;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+
+ kvm_iommu_lock(&smmu->iommu);
+ if (smmu->iommu.power_is_off)
+ goto out_ret;
+
+ /* Almost a copy-paste from the kernel driver. */
+ if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+ /* Get the leaf page size */
+ tg = __ffs(smmu_domain->pgtable->cfg.pgsize_bitmap);
+
+ num_pages = size >> tg;
+
+ /* Convert page size of 12,14,16 (log2) to 1,2,3 */
+ cmd->tlbi.tg = (tg - 10) / 2;
+
+ /*
+ * Determine what level the granule is at. For non-leaf, both
+ * io-pgtable and SVA pass a nominal last-level granule because
+ * they don't know what level(s) actually apply, so ignore that
+ * and leave TTL=0. However for various errata reasons we still
+ * want to use a range command, so avoid the SVA corner case
+ * where both scale and num could be 0 as well.
+ */
+ if (cmd->tlbi.leaf)
+ cmd->tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
+ else if ((num_pages & CMDQ_TLBI_RANGE_NUM_MAX) == 1)
+ num_pages++;
+ }
+
+ while (iova < end) {
+ if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+ /*
+ * On each iteration of the loop, the range is 5 bits
+ * worth of the aligned size remaining.
+ * The range in pages is:
+ *
+ * range = (num_pages & (0x1f << __ffs(num_pages)))
+ */
+ unsigned long scale, num;
+
+ /* Determine the power of 2 multiple number of pages */
+ scale = __ffs(num_pages);
+ cmd->tlbi.scale = scale;
+
+ /* Determine how many chunks of 2^scale size we have */
+ num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX;
+ cmd->tlbi.num = num - 1;
+
+ /* range is num * 2^scale * pgsize */
+ inv_range = num << (scale + tg);
+
+ /* Clear out the lower order bits for the next iteration */
+ num_pages -= num << scale;
+ }
+ cmd->tlbi.addr = iova;
+ WARN_ON(smmu_add_cmd(smmu, cmd));
+ BUG_ON(iova + inv_range < iova);
+ iova += inv_range;
+ }
+
+ ret = smmu_sync_cmd(smmu);
+out_ret:
+ kvm_iommu_unlock(&smmu->iommu);
+ return ret;
+}
+
+static void smmu_tlb_inv_range(struct kvm_hyp_iommu_domain *domain,
+ unsigned long iova, size_t size, size_t granule,
+ bool leaf)
+{
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ unsigned long end = iova + size;
+ struct arm_smmu_cmdq_ent cmd;
+
+ cmd.tlbi.leaf = leaf;
+ if (smmu_domain->pgtable->cfg.fmt == ARM_64_LPAE_S2) {
+ cmd.opcode = CMDQ_OP_TLBI_S2_IPA;
+ cmd.tlbi.vmid = domain->domain_id;
+ } else {
+ cmd.opcode = CMDQ_OP_TLBI_NH_VA;
+ cmd.tlbi.asid = domain->domain_id;
+ cmd.tlbi.vmid = 0;
+ }
+ /*
+ * There are no mappings at high addresses since we don't use TTB1, so
+ * no overflow possible.
+ */
+ BUG_ON(end < iova);
+ WARN_ON(smmu_tlb_inv_range_smmu(smmu_domain->smmu, domain,
+ &cmd, iova, size, granule));
+}
+
+static void smmu_tlb_flush_walk(unsigned long iova, size_t size,
+ size_t granule, void *cookie)
+{
+ smmu_tlb_inv_range(cookie, iova, size, granule, false);
+}
+
+static void smmu_tlb_add_page(struct iommu_iotlb_gather *gather,
+ unsigned long iova, size_t granule,
+ void *cookie)
+{
+ if (gather)
+ kvm_iommu_iotlb_gather_add_page(cookie, gather, iova, granule);
+ else
+ smmu_tlb_inv_range(cookie, iova, granule, granule, true);
+}
+
+__maybe_unused
+static const struct iommu_flush_ops smmu_tlb_ops = {
+ .tlb_flush_all = smmu_tlb_flush_all,
+ .tlb_flush_walk = smmu_tlb_flush_walk,
+ .tlb_add_page = smmu_tlb_add_page,
+};
+
+static void smmu_iotlb_sync(struct kvm_hyp_iommu_domain *domain,
+ struct iommu_iotlb_gather *gather)
+{
+ size_t size;
+
+ if (!gather->pgsize)
+ return;
+ size = gather->end - gather->start + 1;
+ smmu_tlb_inv_range(domain, gather->start, size, gather->pgsize, true);
+}
+
/* Shared with the kernel driver in EL1 */
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
.get_iommu_by_id = smmu_id_to_iommu,
.alloc_domain = smmu_alloc_domain,
.free_domain = smmu_free_domain,
+ .iotlb_sync = smmu_iotlb_sync,
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 34/58] KVM: arm64: smmu-v3: Add context descriptor functions
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (32 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 33/58] KVM: arm64: smmu-v3: Add TLB ops Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 35/58] KVM: arm64: smmu-v3: Add attach_dev Mostafa Saleh
` (24 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add functions to allocate and access context descriptors that will
be used by the stage-1 attach path.
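Each context descriptor is CTXDESC_CD_DWORDS (8) double-words, i.e. 64 bytes,
so the linear CD table sizing below works out as in this sketch; the numbers
are illustrative and assume a 4K PAGE_SIZE:
/*
 * Example: pasid_bits = 6 (64 pasids)
 *
 *   table_bytes     = (1 << 6) * (CTXDESC_CD_DWORDS << 3) = 64 * 64 = 4096
 *   requested_order = get_order(4096) = 0   (a single 4K page)
 *
 * The allocation below refuses anything above order 4, i.e. 64K with 4K pages.
 */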
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 53 +++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 5f00d5cdf5bc..d58424e45e1d 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -215,6 +215,19 @@ static int smmu_sync_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
return smmu_send_cmd(smmu, &cmd);
}
+__maybe_unused
+static int smmu_sync_cd(struct hyp_arm_smmu_v3_device *smmu, u32 sid, u32 ssid)
+{
+ struct arm_smmu_cmdq_ent cmd = {
+ .opcode = CMDQ_OP_CFGI_CD,
+ .cfgi.sid = sid,
+ .cfgi.ssid = ssid,
+ .cfgi.leaf = true,
+ };
+
+ return smmu_send_cmd(smmu, &cmd);
+}
+
static int smmu_alloc_l2_strtab(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
{
struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
@@ -291,6 +304,46 @@ smmu_get_alloc_ste_ptr(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
return smmu_get_ste_ptr(smmu, sid);
}
+__maybe_unused
+static u64 *smmu_get_cd_ptr(u64 *cdtab, u32 ssid)
+{
+ /* Only linear supported for now. */
+ return cdtab + ssid * CTXDESC_CD_DWORDS;
+}
+
+__maybe_unused
+static u64 *smmu_alloc_cd(struct hyp_arm_smmu_v3_device *smmu, u32 pasid_bits)
+{
+ u64 *cd_table;
+ int flags = 0;
+ u32 requested_order = get_order((1 << pasid_bits) *
+ (CTXDESC_CD_DWORDS << 3));
+
+ /*
+ * We only support linear tables of up to 64K; this should be enough
+ * for 128 pasids.
+ */
+ if (WARN_ON(requested_order > 4))
+ return NULL;
+
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ flags |= IOMMU_PAGE_NOCACHE;
+
+ cd_table = kvm_iommu_donate_pages(requested_order, flags);
+ if (!cd_table)
+ return NULL;
+ return (u64 *)hyp_virt_to_phys(cd_table);
+}
+
+__maybe_unused
+static void smmu_free_cd(u64 *cd_table, u32 pasid_bits)
+{
+ u32 order = get_order((1 << pasid_bits) *
+ (CTXDESC_CD_DWORDS << 3));
+
+ kvm_iommu_reclaim_pages(cd_table, order);
+}
+
static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
{
u64 val, old;
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 35/58] KVM: arm64: smmu-v3: Add attach_dev
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (33 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 34/58] KVM: arm64: smmu-v3: Add context descriptor functions Mostafa Saleh
@ 2024-12-12 18:03 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 36/58] KVM: arm64: smmu-v3: Add detach_dev Mostafa Saleh
` (23 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:03 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add the attach_dev HVC code, which handles both stage-1 and stage-2.
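To make the pasid handling below easier to follow, here is a rough sketch of
the ordering the hypervisor expects from the EL1 kernel; the wrapper names are
made up, only the ordering matters:
/*
 * Illustrative host-side ordering (hypothetical wrappers):
 *
 *   attach_dev(domain,  sid, pasid = 0)   - allocates the linear CD table
 *   attach_dev(domainN, sid, pasid = N)   - reuses the CD table from pasid 0
 *   ...
 *   detach_dev(domainN, sid, pasid = N)   - pasids other than 0 detached first
 *   detach_dev(domain,  sid, pasid = 0)   - pasid 0 last, frees the CD table
 */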
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 244 +++++++++++++++++++-
include/kvm/arm_smmu_v3.h | 4 +
2 files changed, 242 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index d58424e45e1d..a96eb6625c48 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -68,6 +68,11 @@ struct hyp_arm_smmu_v3_domain {
struct io_pgtable *pgtable;
};
+static struct hyp_arm_smmu_v3_device *to_smmu(struct kvm_hyp_iommu *iommu)
+{
+ return container_of(iommu, struct hyp_arm_smmu_v3_device, iommu);
+}
+
static int smmu_write_cr0(struct hyp_arm_smmu_v3_device *smmu, u32 val)
{
writel_relaxed(val, smmu->base + ARM_SMMU_CR0);
@@ -203,7 +208,6 @@ static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
return smmu_sync_cmd(smmu);
}
-__maybe_unused
static int smmu_sync_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
{
struct arm_smmu_cmdq_ent cmd = {
@@ -215,7 +219,6 @@ static int smmu_sync_ste(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
return smmu_send_cmd(smmu, &cmd);
}
-__maybe_unused
static int smmu_sync_cd(struct hyp_arm_smmu_v3_device *smmu, u32 sid, u32 ssid)
{
struct arm_smmu_cmdq_ent cmd = {
@@ -289,7 +292,6 @@ smmu_get_ste_ptr(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
return &cfg->linear.table[sid];
}
-__maybe_unused
static struct arm_smmu_ste *
smmu_get_alloc_ste_ptr(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
{
@@ -304,14 +306,12 @@ smmu_get_alloc_ste_ptr(struct hyp_arm_smmu_v3_device *smmu, u32 sid)
return smmu_get_ste_ptr(smmu, sid);
}
-__maybe_unused
static u64 *smmu_get_cd_ptr(u64 *cdtab, u32 ssid)
{
/* Only linear supported for now. */
return cdtab + ssid * CTXDESC_CD_DWORDS;
}
-__maybe_unused
static u64 *smmu_alloc_cd(struct hyp_arm_smmu_v3_device *smmu, u32 pasid_bits)
{
u64 *cd_table;
@@ -803,7 +803,6 @@ static void smmu_tlb_add_page(struct iommu_iotlb_gather *gather,
smmu_tlb_inv_range(cookie, iova, granule, granule, true);
}
-__maybe_unused
static const struct iommu_flush_ops smmu_tlb_ops = {
.tlb_flush_all = smmu_tlb_flush_all,
.tlb_flush_walk = smmu_tlb_flush_walk,
@@ -821,6 +820,238 @@ static void smmu_iotlb_sync(struct kvm_hyp_iommu_domain *domain,
smmu_tlb_inv_range(domain, gather->start, size, gather->pgsize, true);
}
+static int smmu_domain_config_s2(struct kvm_hyp_iommu_domain *domain,
+ struct arm_smmu_ste *ste)
+{
+ struct io_pgtable_cfg *cfg;
+ u64 ts, sl, ic, oc, sh, tg, ps;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+
+ cfg = &smmu_domain->pgtable->cfg;
+ ps = cfg->arm_lpae_s2_cfg.vtcr.ps;
+ tg = cfg->arm_lpae_s2_cfg.vtcr.tg;
+ sh = cfg->arm_lpae_s2_cfg.vtcr.sh;
+ oc = cfg->arm_lpae_s2_cfg.vtcr.orgn;
+ ic = cfg->arm_lpae_s2_cfg.vtcr.irgn;
+ sl = cfg->arm_lpae_s2_cfg.vtcr.sl;
+ ts = cfg->arm_lpae_s2_cfg.vtcr.tsz;
+
+ ste->data[0] = STRTAB_STE_0_V |
+ FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
+ ste->data[1] = FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING);
+ ste->data[2] = FIELD_PREP(STRTAB_STE_2_VTCR,
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, ps) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, tg) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, sh) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, oc) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, ic) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, sl) |
+ FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, ts)) |
+ FIELD_PREP(STRTAB_STE_2_S2VMID, domain->domain_id) |
+ STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2R;
+ ste->data[3] = cfg->arm_lpae_s2_cfg.vttbr & STRTAB_STE_3_S2TTB_MASK;
+
+ return 0;
+}
+
+static u64 *smmu_domain_config_s1_ste(struct hyp_arm_smmu_v3_device *smmu,
+ u32 pasid_bits, struct arm_smmu_ste *ste)
+{
+ u64 *cd_table;
+
+ cd_table = smmu_alloc_cd(smmu, pasid_bits);
+ if (!cd_table)
+ return NULL;
+
+ ste->data[1] = FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
+ FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+ FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+ FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH);
+ ste->data[0] = ((u64)cd_table & STRTAB_STE_0_S1CTXPTR_MASK) |
+ FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
+ FIELD_PREP(STRTAB_STE_0_S1CDMAX, pasid_bits) |
+ FIELD_PREP(STRTAB_STE_0_S1FMT, STRTAB_STE_0_S1FMT_LINEAR) |
+ STRTAB_STE_0_V;
+
+ return cd_table;
+}
+
+/*
+ * This function handles configuration for pasid and non-pasid domains
+ * with the following assumptions:
+ * - pasid 0 is always attached first; this is the typical flow
+ * for the kernel, where attach_dev is always called before set_dev_pasid.
+ * In that case only pasid 0 is allowed to allocate memory for the CD
+ * table, and the other pasids expect to find the table already there.
+ * - pasid 0 is detached last, which is also guaranteed by the kernel.
+ */
+static int smmu_domain_config_s1(struct hyp_arm_smmu_v3_device *smmu,
+ struct kvm_hyp_iommu_domain *domain,
+ u32 sid, u32 pasid, u32 pasid_bits,
+ struct arm_smmu_ste *ste)
+{
+ struct arm_smmu_ste *dst;
+ u64 val;
+ u64 *cd_entry, *cd_table;
+ struct io_pgtable_cfg *cfg;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+
+ cfg = &smmu_domain->pgtable->cfg;
+ dst = smmu_get_ste_ptr(smmu, sid);
+ val = dst->data[0];
+
+ if (FIELD_GET(STRTAB_STE_0_CFG, val) == STRTAB_STE_0_CFG_S2_TRANS)
+ return -EBUSY;
+
+ if (pasid == 0) {
+ cd_table = smmu_domain_config_s1_ste(smmu, pasid_bits, ste);
+ if (!cd_table)
+ return -ENOMEM;
+ } else {
+ u32 nr_entries;
+
+ cd_table = (u64 *)(FIELD_GET(STRTAB_STE_0_S1CTXPTR_MASK, val) << 6);
+ if (!cd_table)
+ return -EINVAL;
+ nr_entries = 1 << FIELD_GET(STRTAB_STE_0_S1CDMAX, val);
+ if (pasid >= nr_entries)
+ return -E2BIG;
+ }
+
+ /* Write CD. */
+ cd_entry = smmu_get_cd_ptr(hyp_phys_to_virt((u64)cd_table), pasid);
+
+ /* CD already used by another device. */
+ if (cd_entry[0])
+ return -EBUSY;
+
+ cd_entry[1] = cpu_to_le64(cfg->arm_lpae_s1_cfg.ttbr & CTXDESC_CD_1_TTB0_MASK);
+ cd_entry[2] = 0;
+ cd_entry[3] = cpu_to_le64(cfg->arm_lpae_s1_cfg.mair);
+
+ /* STE is live. */
+ if (pasid)
+ smmu_sync_cd(smmu, sid, pasid);
+ val = FIELD_PREP(CTXDESC_CD_0_TCR_T0SZ, cfg->arm_lpae_s1_cfg.tcr.tsz) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_TG0, cfg->arm_lpae_s1_cfg.tcr.tg) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_IRGN0, cfg->arm_lpae_s1_cfg.tcr.irgn) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_ORGN0, cfg->arm_lpae_s1_cfg.tcr.orgn) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_SH0, cfg->arm_lpae_s1_cfg.tcr.sh) |
+ FIELD_PREP(CTXDESC_CD_0_TCR_IPS, cfg->arm_lpae_s1_cfg.tcr.ips) |
+ CTXDESC_CD_0_TCR_EPD1 | CTXDESC_CD_0_AA64 |
+ CTXDESC_CD_0_R | CTXDESC_CD_0_A |
+ CTXDESC_CD_0_ASET |
+ FIELD_PREP(CTXDESC_CD_0_ASID, domain->domain_id) |
+ CTXDESC_CD_0_V;
+ WRITE_ONCE(cd_entry[0], cpu_to_le64(val));
+ /* STE is live. */
+ if (pasid)
+ smmu_sync_cd(smmu, sid, pasid);
+ return 0;
+}
+
+static int smmu_domain_finalise(struct hyp_arm_smmu_v3_device *smmu,
+ struct kvm_hyp_iommu_domain *domain)
+{
+ int ret;
+ struct io_pgtable_cfg cfg;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+
+ if (smmu_domain->type == KVM_ARM_SMMU_DOMAIN_S1) {
+ size_t ias = (smmu->features & ARM_SMMU_FEAT_VAX) ? 52 : 48;
+
+ cfg = (struct io_pgtable_cfg) {
+ .fmt = ARM_64_LPAE_S1,
+ .pgsize_bitmap = smmu->pgsize_bitmap,
+ .ias = min_t(unsigned long, ias, VA_BITS),
+ .oas = smmu->ias,
+ .coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY,
+ .tlb = &smmu_tlb_ops,
+ };
+ } else {
+ cfg = (struct io_pgtable_cfg) {
+ .fmt = ARM_64_LPAE_S2,
+ .pgsize_bitmap = smmu->pgsize_bitmap,
+ .ias = smmu->ias,
+ .oas = smmu->oas,
+ .coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY,
+ .tlb = &smmu_tlb_ops,
+ };
+ }
+
+ hyp_spin_lock(&smmu_domain->pgt_lock);
+ smmu_domain->pgtable = kvm_arm_io_pgtable_alloc(&cfg, domain, &ret);
+ hyp_spin_unlock(&smmu_domain->pgt_lock);
+ return ret;
+}
+
+static int smmu_attach_dev(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_domain *domain,
+ u32 sid, u32 pasid, u32 pasid_bits)
+{
+ int i;
+ int ret;
+ struct arm_smmu_ste *dst;
+ struct arm_smmu_ste ste = {};
+ struct hyp_arm_smmu_v3_device *smmu = to_smmu(iommu);
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+
+ kvm_iommu_lock(iommu);
+ dst = smmu_get_alloc_ste_ptr(smmu, sid);
+ if (!dst) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ if (smmu_domain->smmu && (smmu != smmu_domain->smmu)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ if (!smmu_domain->pgtable) {
+ ret = smmu_domain_finalise(smmu, domain);
+ if (ret)
+ goto out_unlock;
+ }
+
+ if (smmu_domain->type == KVM_ARM_SMMU_DOMAIN_S2) {
+ /* Device already attached or pasid for s2. */
+ if (dst->data[0] || pasid) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ ret = smmu_domain_config_s2(domain, &ste);
+ } else {
+ /*
+ * Allocate and configure the CD, and update the CD if possible.
+ */
+ pasid_bits = min(pasid_bits, smmu->ssid_bits);
+ ret = smmu_domain_config_s1(smmu, domain, sid, pasid,
+ pasid_bits, &ste);
+ }
+ smmu_domain->smmu = smmu;
+ /* We don't update STEs for pasid domains. */
+ if (ret || pasid)
+ goto out_unlock;
+
+ /*
+ * The SMMU may cache a disabled STE.
+ * Initialize all fields, sync, then enable it.
+ */
+ for (i = 1; i < STRTAB_STE_DWORDS; i++)
+ dst->data[i] = ste.data[i];
+
+ ret = smmu_sync_ste(smmu, sid);
+ if (ret)
+ goto out_unlock;
+
+ WRITE_ONCE(dst->data[0], ste.data[0]);
+ ret = smmu_sync_ste(smmu, sid);
+ WARN_ON(ret);
+out_unlock:
+ kvm_iommu_unlock(iommu);
+ return ret;
+}
+
/* Shared with the kernel driver in EL1 */
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
@@ -828,4 +1059,5 @@ struct kvm_iommu_ops smmu_ops = {
.alloc_domain = smmu_alloc_domain,
.free_domain = smmu_free_domain,
.iotlb_sync = smmu_iotlb_sync,
+ .attach_dev = smmu_attach_dev,
};
diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
index ded98cbaebc1..e8616ec5a048 100644
--- a/include/kvm/arm_smmu_v3.h
+++ b/include/kvm/arm_smmu_v3.h
@@ -25,6 +25,10 @@ struct hyp_arm_smmu_v3_device {
size_t cmdq_log2size;
/* strtab_cfg.l2.l2ptrs is not used, instead computed from L1 */
struct arm_smmu_strtab_cfg strtab_cfg;
+ size_t oas;
+ size_t ias;
+ size_t pgsize_bitmap;
+ size_t ssid_bits;
};
extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 36/58] KVM: arm64: smmu-v3: Add detach_dev
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (34 preceding siblings ...)
2024-12-12 18:03 ` [RFC PATCH v2 35/58] KVM: arm64: smmu-v3: Add attach_dev Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 37/58] iommu/io-pgtable: Generalize walker interface Mostafa Saleh
` (22 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add detach_dev for stage-1 and stage-2 domains.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 76 ++++++++++++++++++++-
1 file changed, 75 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index a96eb6625c48..ec3f8d9749d3 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -335,7 +335,6 @@ static u64 *smmu_alloc_cd(struct hyp_arm_smmu_v3_device *smmu, u32 pasid_bits)
return (u64 *)hyp_virt_to_phys(cd_table);
}
-__maybe_unused
static void smmu_free_cd(u64 *cd_table, u32 pasid_bits)
{
u32 order = get_order((1 << pasid_bits) *
@@ -1052,6 +1051,80 @@ static int smmu_attach_dev(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_dom
return ret;
}
+static int smmu_detach_dev(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_domain *domain,
+ u32 sid, u32 pasid)
+{
+ struct arm_smmu_ste *dst;
+ int i, ret;
+ struct hyp_arm_smmu_v3_device *smmu = to_smmu(iommu);
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ u32 pasid_bits = 0;
+ u64 *cd_table = NULL, *cd;
+
+ kvm_iommu_lock(iommu);
+ dst = smmu_get_ste_ptr(smmu, sid);
+ if (!dst) {
+ ret = -ENODEV;
+ goto out_unlock;
+ }
+
+ /*
+ * For stage-1:
+ * - The kernel has to detach pasid 0 last.
+ * - Detaching pasid 0 frees the CD table.
+ */
+ if (smmu_domain->type == KVM_ARM_SMMU_DOMAIN_S1) {
+ pasid_bits = FIELD_GET(STRTAB_STE_0_S1CDMAX, dst->data[0]);
+ if (pasid >= (1 << pasid_bits)) {
+ ret = -E2BIG;
+ goto out_unlock;
+ }
+ cd_table = (u64 *)(dst->data[0] & STRTAB_STE_0_S1CTXPTR_MASK);
+ if (WARN_ON(!cd_table)) {
+ ret = -ENODEV;
+ goto out_unlock;
+ }
+
+ cd_table = hyp_phys_to_virt((phys_addr_t)cd_table);
+ if (pasid == 0) {
+ int j;
+
+ /* Ensure other pasids are detached. */
+ for (j = 1 ; j < (1 << pasid_bits) ; ++j) {
+ cd = smmu_get_cd_ptr(cd_table, j);
+ if (cd[0] & CTXDESC_CD_0_V) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ }
+ } else {
+ cd = smmu_get_cd_ptr(cd_table, pasid);
+ cd[0] = 0;
+ smmu_sync_cd(smmu, sid, pasid);
+ cd[1] = 0;
+ cd[2] = 0;
+ cd[3] = 0;
+ ret = smmu_sync_cd(smmu, sid, pasid);
+ goto out_unlock;
+ }
+ }
+ /* For stage-2 and pasid = 0 */
+ dst->data[0] = 0;
+ ret = smmu_sync_ste(smmu, sid);
+ if (ret)
+ goto out_unlock;
+ for (i = 1; i < STRTAB_STE_DWORDS; i++)
+ dst->data[i] = 0;
+
+ ret = smmu_sync_ste(smmu, sid);
+
+ if (cd_table)
+ smmu_free_cd(cd_table, pasid_bits);
+
+out_unlock:
+ kvm_iommu_unlock(iommu);
+ return ret;
+}
+
/* Shared with the kernel driver in EL1 */
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
@@ -1060,4 +1133,5 @@ struct kvm_iommu_ops smmu_ops = {
.free_domain = smmu_free_domain,
.iotlb_sync = smmu_iotlb_sync,
.attach_dev = smmu_attach_dev,
+ .detach_dev = smmu_detach_dev,
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 37/58] iommu/io-pgtable: Generalize walker interface
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (35 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 36/58] KVM: arm64: smmu-v3: Add detach_dev Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 38/58] iommu/io-pgtable-arm: Add post table walker callback Mostafa Saleh
` (21 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add a common struct for the walker, which has a visitor callback
taking HW-agnostic info (phys addr + size).
Add a size argument to the walker so it can walk a range of IOVAs.
Also, add a cookie for the arm walker.
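As an example of how a user of the new interface looks, here is a minimal
sketch of a walker that counts the mapped bytes in an IOVA range; the helper
names are made up, only the callback wiring follows this patch:
/* Example only: count the mapped bytes in [iova, iova + size). */
static void count_visit_leaf(phys_addr_t paddr, size_t size,
			     struct io_pgtable_walk_common *walker, void *wd)
{
	struct arm_lpae_io_pgtable_walk_data *data = walker->data;
	size_t *bytes = data->cookie;

	*bytes += size;
}

static size_t count_mapped(struct io_pgtable_ops *ops, unsigned long iova,
			   size_t size)
{
	size_t bytes = 0;
	struct arm_lpae_io_pgtable_walk_data wd = {
		.cookie = &bytes,
	};
	struct io_pgtable_walk_common walker = {
		.visit_leaf = count_visit_leaf,
		.data = &wd,
	};

	if (ops->pgtable_walk(ops, iova, size, &walker))
		return 0;
	return bytes;
}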
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/gpu/drm/msm/msm_iommu.c | 5 +++-
drivers/iommu/io-pgtable-arm-common.c | 35 ++++++++++++++++++---------
include/linux/io-pgtable-arm.h | 2 +-
include/linux/io-pgtable.h | 18 ++++++++++++--
4 files changed, 45 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 3e692818ba1f..8516861dd626 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -200,6 +200,9 @@ msm_iommu_pagetable_walk(struct msm_mmu *mmu, unsigned long iova, uint64_t ptes[
{
struct msm_iommu_pagetable *pagetable;
struct arm_lpae_io_pgtable_walk_data wd = {};
+ struct io_pgtable_walk_common walker = {
+ .data = &wd,
+ };
if (mmu->type != MSM_MMU_IOMMU_PAGETABLE)
return -EINVAL;
@@ -209,7 +212,7 @@ msm_iommu_pagetable_walk(struct msm_mmu *mmu, unsigned long iova, uint64_t ptes[
if (!pagetable->pgtbl_ops->pgtable_walk)
return -EINVAL;
- pagetable->pgtbl_ops->pgtable_walk(pagetable->pgtbl_ops, iova, &wd);
+ pagetable->pgtbl_ops->pgtable_walk(pagetable->pgtbl_ops, iova, 1, &walker);
for (int i = 0; i < ARRAY_SIZE(wd.ptes); i++)
ptes[i] = wd.ptes[i];
diff --git a/drivers/iommu/io-pgtable-arm-common.c b/drivers/iommu/io-pgtable-arm-common.c
index 21ee8ff7c881..4fc0b03494e3 100644
--- a/drivers/iommu/io-pgtable-arm-common.c
+++ b/drivers/iommu/io-pgtable-arm-common.c
@@ -481,7 +481,8 @@ struct iova_to_phys_data {
static int visit_iova_to_phys(struct io_pgtable_walk_data *walk_data, int lvl,
arm_lpae_iopte *ptep, size_t size)
{
- struct iova_to_phys_data *data = walk_data->data;
+ struct io_pgtable_walk_common *walker = walk_data->data;
+ struct iova_to_phys_data *data = walker->data;
data->pte = *ptep;
data->lvl = lvl;
return 0;
@@ -492,8 +493,11 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct iova_to_phys_data d;
- struct io_pgtable_walk_data walk_data = {
+ struct io_pgtable_walk_common walker = {
.data = &d,
+ };
+ struct io_pgtable_walk_data walk_data = {
+ .data = &walker,
.visit = visit_iova_to_phys,
.addr = iova,
.end = iova + 1,
@@ -511,23 +515,25 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
static int visit_pgtable_walk(struct io_pgtable_walk_data *walk_data, int lvl,
arm_lpae_iopte *ptep, size_t size)
{
- struct arm_lpae_io_pgtable_walk_data *data = walk_data->data;
- data->ptes[data->level++] = *ptep;
+ struct io_pgtable_walk_common *walker = walk_data->data;
+ struct arm_lpae_io_pgtable_walk_data *data = walker->data;
+
+ data->ptes[lvl] = *ptep;
+ data->level = lvl + 1;
return 0;
}
-static int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova, void *wd)
+static int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova,
+ size_t size, struct io_pgtable_walk_common *walker)
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct io_pgtable_walk_data walk_data = {
- .data = wd,
+ .data = walker,
.visit = visit_pgtable_walk,
.addr = iova,
- .end = iova + 1,
+ .end = iova + size,
};
- ((struct arm_lpae_io_pgtable_walk_data *)wd)->level = 0;
-
return __arm_lpae_iopte_walk(data, &walk_data, data->pgd, data->start_level);
}
@@ -537,6 +543,7 @@ static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
{
struct io_pgtable *iop = &data->iop;
arm_lpae_iopte pte = READ_ONCE(*ptep);
+ struct io_pgtable_walk_common *walker = walk_data->data;
size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);
int ret = walk_data->visit(walk_data, lvl, ptep, size);
@@ -544,6 +551,8 @@ static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
return ret;
if (iopte_leaf(pte, lvl, iop->fmt)) {
+ if (walker->visit_leaf)
+ walker->visit_leaf(iopte_to_paddr(pte, data), size, walker, ptep);
walk_data->addr += size;
return 0;
}
@@ -585,7 +594,8 @@ static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
static int visit_dirty(struct io_pgtable_walk_data *walk_data, int lvl,
arm_lpae_iopte *ptep, size_t size)
{
- struct iommu_dirty_bitmap *dirty = walk_data->data;
+ struct io_pgtable_walk_common *walker = walk_data->data;
+ struct iommu_dirty_bitmap *dirty = walker->data;
if (!iopte_leaf(*ptep, lvl, walk_data->iop->fmt))
return 0;
@@ -606,9 +616,12 @@ static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
{
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
struct io_pgtable_cfg *cfg = &data->iop.cfg;
+ struct io_pgtable_walk_common walker = {
+ .data = dirty,
+ };
struct io_pgtable_walk_data walk_data = {
.iop = &data->iop,
- .data = dirty,
+ .data = &walker,
.visit = visit_dirty,
.flags = flags,
.addr = iova,
diff --git a/include/linux/io-pgtable-arm.h b/include/linux/io-pgtable-arm.h
index 88922314157d..9e5878c37d78 100644
--- a/include/linux/io-pgtable-arm.h
+++ b/include/linux/io-pgtable-arm.h
@@ -18,7 +18,7 @@ struct arm_lpae_io_pgtable {
struct io_pgtable_walk_data {
struct io_pgtable *iop;
- void *data;
+ struct io_pgtable_walk_common *data;
int (*visit)(struct io_pgtable_walk_data *walk_data, int lvl,
arm_lpae_iopte *ptep, size_t size);
unsigned long flags;
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index f789234c703b..da50e17b0177 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -185,12 +185,25 @@ struct io_pgtable_cfg {
*
* @ptes: The recorded PTE values from the walk
* @level: The level of the last PTE
+ * @cookie: Cookie set by caller to identify it
*
* @level also specifies the last valid index in @ptes
*/
struct arm_lpae_io_pgtable_walk_data {
u64 ptes[4];
int level;
+ void *cookie;
+};
+
+/**
+ * struct io_pgtable_walk_common - common information from a pgtable walk
+ * @visit_leaf: callback for each leaf providing its physical address and size
+ */
+struct io_pgtable_walk_common {
+ void (*visit_leaf)(phys_addr_t paddr, size_t size,
+ struct io_pgtable_walk_common *data,
+ void *wd);
+ void *data; /* pointer to walk data as arm_lpae_io_pgtable_walk_data */
};
/**
@@ -199,7 +212,7 @@ struct arm_lpae_io_pgtable_walk_data {
* @map_pages: Map a physically contiguous range of pages of the same size.
* @unmap_pages: Unmap a range of virtually contiguous pages of the same size.
* @iova_to_phys: Translate iova to physical address.
- * @pgtable_walk: (optional) Perform a page table walk for a given iova.
+ * @pgtable_walk: (optional) Perform a page table walk for a given iova and size.
*
* These functions map directly onto the iommu_ops member functions with
* the same names.
@@ -213,7 +226,8 @@ struct io_pgtable_ops {
struct iommu_iotlb_gather *gather);
phys_addr_t (*iova_to_phys)(struct io_pgtable_ops *ops,
unsigned long iova);
- int (*pgtable_walk)(struct io_pgtable_ops *ops, unsigned long iova, void *wd);
+ int (*pgtable_walk)(struct io_pgtable_ops *ops, unsigned long iova,
+ size_t size, struct io_pgtable_walk_common *walker);
int (*read_and_clear_dirty)(struct io_pgtable_ops *ops,
unsigned long iova, size_t size,
unsigned long flags,
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 38/58] iommu/io-pgtable-arm: Add post table walker callback
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (36 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 37/58] iommu/io-pgtable: Generalize walker interface Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 39/58] drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL Mostafa Saleh
` (20 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add a post-table walker callback; this will be used next by pKVM to
clean up page tables.
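To clarify when the new callback fires, the ordering for a walk that crosses
a table entry looks roughly like this (illustrative only):
/*
 * visit(entry, lvl)          - table entry at lvl
 *     visit(entry, lvl + 1)  - leaf -> visit_leaf()
 *     visit(entry, lvl + 1)  - leaf -> visit_leaf()
 *     ...
 * visit_post_table(lvl)      - called once the table pointed to by the lvl
 *                              entry has been fully walked, so it is now
 *                              safe to free it
 */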
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/io-pgtable-arm-common.c | 15 ++++++++++++++-
include/linux/io-pgtable-arm.h | 2 ++
include/linux/io-pgtable.h | 2 ++
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/io-pgtable-arm-common.c b/drivers/iommu/io-pgtable-arm-common.c
index 4fc0b03494e3..076240eaec19 100644
--- a/drivers/iommu/io-pgtable-arm-common.c
+++ b/drivers/iommu/io-pgtable-arm-common.c
@@ -523,6 +523,13 @@ static int visit_pgtable_walk(struct io_pgtable_walk_data *walk_data, int lvl,
return 0;
}
+static void visit_pgtable_post_table(struct arm_lpae_io_pgtable_walk_data *data,
+ arm_lpae_iopte *ptep, int lvl)
+{
+ if (data->visit_post_table)
+ data->visit_post_table(data, ptep, lvl);
+}
+
static int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova,
size_t size, struct io_pgtable_walk_common *walker)
{
@@ -530,6 +537,7 @@ static int arm_lpae_pgtable_walk(struct io_pgtable_ops *ops, unsigned long iova,
struct io_pgtable_walk_data walk_data = {
.data = walker,
.visit = visit_pgtable_walk,
+ .visit_post_table = visit_pgtable_post_table,
.addr = iova,
.end = iova + size,
};
@@ -562,7 +570,12 @@ static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
}
ptep = iopte_deref(pte, data);
- return __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
+ ret = __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
+
+ if (walk_data->visit_post_table)
+ walk_data->visit_post_table(data, ptep, lvl);
+
+ return ret;
}
static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
diff --git a/include/linux/io-pgtable-arm.h b/include/linux/io-pgtable-arm.h
index 9e5878c37d78..c00eb0cb7e43 100644
--- a/include/linux/io-pgtable-arm.h
+++ b/include/linux/io-pgtable-arm.h
@@ -21,6 +21,8 @@ struct io_pgtable_walk_data {
struct io_pgtable_walk_common *data;
int (*visit)(struct io_pgtable_walk_data *walk_data, int lvl,
arm_lpae_iopte *ptep, size_t size);
+ void (*visit_post_table)(struct arm_lpae_io_pgtable_walk_data *data,
+ arm_lpae_iopte *ptep, int lvl);
unsigned long flags;
u64 addr;
const u64 end;
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index da50e17b0177..86226571cdb8 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -193,6 +193,8 @@ struct arm_lpae_io_pgtable_walk_data {
u64 ptes[4];
int level;
void *cookie;
+ void (*visit_post_table)(struct arm_lpae_io_pgtable_walk_data *data,
+ arm_lpae_iopte *ptep, int lvl);
};
/**
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 39/58] drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (37 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 38/58] iommu/io-pgtable-arm: Add post table walker callback Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys Mostafa Saleh
` (19 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Only invalidate the PTE on unmap without clearing it.
For io-pgtable-arm this also leaves tables allocated after an unmap,
as they can't be freed at that point.
This quirk also allows the page table walker to traverse tables
invalidated by an unmap, so the caller can do any bookkeeping and
free the tables afterwards.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/io-pgtable-arm-common.c | 50 +++++++++++++++++++--------
include/linux/io-pgtable-arm.h | 7 +++-
include/linux/io-pgtable.h | 5 ++-
3 files changed, 45 insertions(+), 17 deletions(-)
diff --git a/drivers/iommu/io-pgtable-arm-common.c b/drivers/iommu/io-pgtable-arm-common.c
index 076240eaec19..89be1aa72a6b 100644
--- a/drivers/iommu/io-pgtable-arm-common.c
+++ b/drivers/iommu/io-pgtable-arm-common.c
@@ -42,7 +42,10 @@ static phys_addr_t iopte_to_paddr(arm_lpae_iopte pte,
static void __arm_lpae_clear_pte(arm_lpae_iopte *ptep, struct io_pgtable_cfg *cfg, int num_entries)
{
for (int i = 0; i < num_entries; i++)
- ptep[i] = 0;
+ if (cfg->quirks & IO_PGTABLE_QUIRK_UNMAP_INVAL)
+ ptep[i] &= ~ARM_LPAE_PTE_VALID;
+ else
+ ptep[i] = 0;
if (!cfg->coherent_walk && num_entries)
__arm_lpae_sync_pte(ptep, num_entries, cfg);
@@ -170,7 +173,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
/* Grab a pointer to the next level */
pte = READ_ONCE(*ptep);
- if (!pte) {
+ if (!iopte_valid(pte)) {
cptep = __arm_lpae_alloc_pages(tblsz, gfp, cfg, data->iop.cookie);
if (!cptep)
return -ENOMEM;
@@ -182,9 +185,9 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
__arm_lpae_sync_pte(ptep, 1, cfg);
}
- if (pte && !iopte_leaf(pte, lvl, data->iop.fmt)) {
+ if (iopte_valid(pte) && !iopte_leaf(pte, lvl, data->iop.fmt)) {
cptep = iopte_deref(pte, data);
- } else if (pte) {
+ } else if (iopte_valid(pte)) {
/* We require an unmap first */
return arm_lpae_unmap_empty();
}
@@ -316,7 +319,7 @@ void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
while (ptep != end) {
arm_lpae_iopte pte = *ptep++;
- if (!pte || iopte_leaf(pte, lvl, data->iop.fmt))
+ if (!iopte_valid(pte) || iopte_leaf(pte, lvl, data->iop.fmt))
continue;
__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
@@ -401,7 +404,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
unmap_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
ptep += unmap_idx_start;
pte = READ_ONCE(*ptep);
- if (WARN_ON(!pte))
+ if (WARN_ON(!iopte_valid(pte)))
return 0;
/* If the size matches this level, we're in the right place */
@@ -412,7 +415,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
/* Find and handle non-leaf entries */
for (i = 0; i < num_entries; i++) {
pte = READ_ONCE(ptep[i]);
- if (WARN_ON(!pte))
+ if (WARN_ON(!iopte_valid(pte)))
break;
if (!iopte_leaf(pte, lvl, iop->fmt)) {
@@ -421,7 +424,9 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
/* Also flush any partial walks */
io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
ARM_LPAE_GRANULE(data));
- __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
+ if (!(iop->cfg.quirks & IO_PGTABLE_QUIRK_UNMAP_INVAL))
+ __arm_lpae_free_pgtable(data, lvl + 1,
+ iopte_deref(pte, data));
}
}
@@ -523,9 +528,12 @@ static int visit_pgtable_walk(struct io_pgtable_walk_data *walk_data, int lvl,
return 0;
}
-static void visit_pgtable_post_table(struct arm_lpae_io_pgtable_walk_data *data,
+static void visit_pgtable_post_table(struct io_pgtable_walk_data *walk_data,
arm_lpae_iopte *ptep, int lvl)
{
+ struct io_pgtable_walk_common *walker = walk_data->data;
+ struct arm_lpae_io_pgtable_walk_data *data = walker->data;
+
if (data->visit_post_table)
data->visit_post_table(data, ptep, lvl);
}
@@ -550,30 +558,41 @@ static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
arm_lpae_iopte *ptep, int lvl)
{
struct io_pgtable *iop = &data->iop;
+ struct io_pgtable_cfg *cfg = &iop->cfg;
arm_lpae_iopte pte = READ_ONCE(*ptep);
struct io_pgtable_walk_common *walker = walk_data->data;
+ arm_lpae_iopte *old_ptep = ptep;
+ bool is_leaf, is_table;
size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);
int ret = walk_data->visit(walk_data, lvl, ptep, size);
if (ret)
return ret;
- if (iopte_leaf(pte, lvl, iop->fmt)) {
+ if (cfg->quirks & IO_PGTABLE_QUIRK_UNMAP_INVAL) {
+ /* Visit invalid tables as they may still have entries. */
+ is_table = pte && iopte_table(pte | ARM_LPAE_PTE_VALID, lvl);
+ is_leaf = pte && iopte_leaf(pte | ARM_LPAE_PTE_VALID, lvl, iop->fmt);
+ } else {
+ is_table = iopte_table(pte, lvl);
+ is_leaf = iopte_leaf(pte, lvl, iop->fmt);
+ }
+
+ if (is_leaf) {
if (walker->visit_leaf)
walker->visit_leaf(iopte_to_paddr(pte, data), size, walker, ptep);
walk_data->addr += size;
return 0;
}
- if (!iopte_table(pte, lvl)) {
+ if (!is_table)
return -EINVAL;
- }
ptep = iopte_deref(pte, data);
ret = __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
if (walk_data->visit_post_table)
- walk_data->visit_post_table(data, ptep, lvl);
+ walk_data->visit_post_table(walk_data, old_ptep, lvl);
return ret;
}
@@ -744,7 +763,8 @@ int arm_lpae_init_pgtable_s1(struct io_pgtable_cfg *cfg,
if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
IO_PGTABLE_QUIRK_ARM_TTBR1 |
IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
- IO_PGTABLE_QUIRK_ARM_HD))
+ IO_PGTABLE_QUIRK_ARM_HD |
+ IO_PGTABLE_QUIRK_UNMAP_INVAL))
return -EINVAL;
ret = arm_lpae_init_pgtable(cfg, data);
@@ -830,7 +850,7 @@ int arm_lpae_init_pgtable_s2(struct io_pgtable_cfg *cfg,
typeof(&cfg->arm_lpae_s2_cfg.vtcr) vtcr = &cfg->arm_lpae_s2_cfg.vtcr;
/* The NS quirk doesn't apply at stage 2 */
- if (cfg->quirks)
+ if (cfg->quirks & ~IO_PGTABLE_QUIRK_UNMAP_INVAL)
return -EINVAL;
ret = arm_lpae_init_pgtable(cfg, data);
diff --git a/include/linux/io-pgtable-arm.h b/include/linux/io-pgtable-arm.h
index c00eb0cb7e43..407f05fb300a 100644
--- a/include/linux/io-pgtable-arm.h
+++ b/include/linux/io-pgtable-arm.h
@@ -21,7 +21,7 @@ struct io_pgtable_walk_data {
struct io_pgtable_walk_common *data;
int (*visit)(struct io_pgtable_walk_data *walk_data, int lvl,
arm_lpae_iopte *ptep, size_t size);
- void (*visit_post_table)(struct arm_lpae_io_pgtable_walk_data *data,
+ void (*visit_post_table)(struct io_pgtable_walk_data *walk_data,
arm_lpae_iopte *ptep, int lvl);
unsigned long flags;
u64 addr;
@@ -193,6 +193,11 @@ static inline bool iopte_table(arm_lpae_iopte pte, int lvl)
return iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE;
}
+static inline bool iopte_valid(arm_lpae_iopte pte)
+{
+ return pte & ARM_LPAE_PTE_VALID;
+}
+
#ifdef __KVM_NVHE_HYPERVISOR__
#include <nvhe/memory.h>
#define __arm_lpae_virt_to_phys hyp_virt_to_phys
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 86226571cdb8..ce0aed9c87d2 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -89,6 +89,8 @@ struct io_pgtable_cfg {
* attributes set in the TCR for a non-coherent page-table walker.
*
* IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable.
+ *
+ * IO_PGTABLE_QUIRK_UNMAP_INVAL: Only invalidate PTE on unmap, don't clear it.
*/
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS BIT(1)
@@ -97,6 +99,7 @@ struct io_pgtable_cfg {
#define IO_PGTABLE_QUIRK_ARM_TTBR1 BIT(5)
#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6)
#define IO_PGTABLE_QUIRK_ARM_HD BIT(7)
+ #define IO_PGTABLE_QUIRK_UNMAP_INVAL BIT(8)
unsigned long quirks;
unsigned long pgsize_bitmap;
unsigned int ias;
@@ -194,7 +197,7 @@ struct arm_lpae_io_pgtable_walk_data {
int level;
void *cookie;
void (*visit_post_table)(struct arm_lpae_io_pgtable_walk_data *data,
- arm_lpae_iopte *ptep, int lvl);
+ u64 *ptep, int lvl);
};
/**
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (38 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 39/58] drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 19:44 ` Jason Gunthorpe
2024-12-12 18:04 ` [RFC PATCH v2 41/58] KVM: arm64: smmu-v3: Add DABT handler Mostafa Saleh
` (18 subsequent siblings)
58 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add map_pages and iova_to_phys HVC code, which mainly calls into
the io-pgtable.
For unmap_pages, we rely on IO_PGTABLE_QUIRK_UNMAP_INVAL: the driver
first calls unmap_pages, which invalidates all the pages as in a
typical unmap, issuing all the necessary TLB invalidations.
Then we start a page table walk with 2 callbacks (sketched below):
- visit_leaf: for each unmapped leaf, decrement the refcount of the
page using __pkvm_host_unuse_dma(), reversing what the IOMMU core
did on map.
- visit_post_table: free any invalidated tables, as they weren't
freed by the unmap because of the quirk.
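A rough sketch of that flow as seen from the hypervisor, simplified from the
smmu_unmap_pages()/smmu_iotlb_sync() code in this patch; the wrapper function
is hypothetical, while the callbacks and ops are the ones added here:
static void example_unmap_and_sync(struct kvm_hyp_iommu_domain *domain,
				   struct io_pgtable *pgtable,
				   unsigned long iova, size_t pgsize,
				   size_t pgcount,
				   struct iommu_iotlb_gather *gather)
{
	size_t size = pgsize * pgcount;
	struct arm_lpae_io_pgtable_walk_data wd = {
		.cookie = io_pgtable_to_data(pgtable),
		/* Frees tables left behind by IO_PGTABLE_QUIRK_UNMAP_INVAL. */
		.visit_post_table = smmu_unmap_visit_post_table,
	};
	struct io_pgtable_walk_common walker = {
		/* __pkvm_host_unuse_dma() for every unmapped leaf. */
		.visit_leaf = smmu_unmap_visit_leaf,
		.data = &wd,
	};

	/* 1) Invalidate the PTEs; with the quirk nothing is cleared or freed. */
	pgtable->ops.unmap_pages(&pgtable->ops, iova, pgsize, pgcount, gather);

	/* 2) Invalidate the TLB for the whole range. */
	smmu_tlb_inv_range(domain, iova, size, pgsize, true);

	/* 3) Walk the range: drop DMA refcounts and free the stale tables. */
	pgtable->ops.pgtable_walk(&pgtable->ops, iova, size, &walker);
}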
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 147 ++++++++++++++++++++
1 file changed, 147 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index ec3f8d9749d3..1821a3420a4d 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -808,15 +808,74 @@ static const struct iommu_flush_ops smmu_tlb_ops = {
.tlb_add_page = smmu_tlb_add_page,
};
+static void smmu_unmap_visit_leaf(phys_addr_t addr, size_t size,
+ struct io_pgtable_walk_common *data,
+ void *wd)
+{
+ u64 *ptep = wd;
+
+ WARN_ON(__pkvm_host_unuse_dma(addr, size));
+ *ptep = 0;
+}
+
+/*
+ * With IO_PGTABLE_QUIRK_UNMAP_INVAL, unmap doesn't clear or free any
+ * tables, so after the unmap we walk the range and free the invalid
+ * tables from the post-table callback.
+ * One caveat is that a table can be unmapped while it still points to
+ * other, valid, tables, and those need to be freed as well.
+ * The simplest solution is to look at the walk PTE info: if any of the
+ * parents is invalid, this table also needs to be freed.
+ */
+static void smmu_unmap_visit_post_table(struct arm_lpae_io_pgtable_walk_data *walk_data,
+ arm_lpae_iopte *ptep, int lvl)
+{
+ struct arm_lpae_io_pgtable *data = walk_data->cookie;
+ size_t table_size;
+ int i;
+ bool invalid = false;
+
+ if (lvl == data->start_level)
+ table_size = ARM_LPAE_PGD_SIZE(data);
+ else
+ table_size = ARM_LPAE_GRANULE(data);
+
+ for (i = 0 ; i <= lvl ; ++i)
+ invalid |= !iopte_valid(walk_data->ptes[i]);
+
+ if (!invalid)
+ return;
+
+ __arm_lpae_free_pages(ptep, table_size, &data->iop.cfg, data->iop.cookie);
+ *ptep = 0;
+}
+
static void smmu_iotlb_sync(struct kvm_hyp_iommu_domain *domain,
struct iommu_iotlb_gather *gather)
{
size_t size;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ struct io_pgtable *pgtable = smmu_domain->pgtable;
+ struct arm_lpae_io_pgtable *data = io_pgtable_to_data(pgtable);
+ struct arm_lpae_io_pgtable_walk_data wd = {
+ .cookie = data,
+ .visit_post_table = smmu_unmap_visit_post_table,
+ };
+ struct io_pgtable_walk_common walk_data = {
+ .visit_leaf = smmu_unmap_visit_leaf,
+ .data = &wd,
+ };
if (!gather->pgsize)
return;
size = gather->end - gather->start + 1;
smmu_tlb_inv_range(domain, gather->start, size, gather->pgsize, true);
+
+ /*
+ * Now decrement the refcount of unmapped pages thanks to
+ * IO_PGTABLE_QUIRK_UNMAP_INVAL
+ */
+ pgtable->ops.pgtable_walk(&pgtable->ops, gather->start, size, &walk_data);
}
static int smmu_domain_config_s2(struct kvm_hyp_iommu_domain *domain,
@@ -966,6 +1025,7 @@ static int smmu_domain_finalise(struct hyp_arm_smmu_v3_device *smmu,
.oas = smmu->ias,
.coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY,
.tlb = &smmu_tlb_ops,
+ .quirks = IO_PGTABLE_QUIRK_UNMAP_INVAL,
};
} else {
cfg = (struct io_pgtable_cfg) {
@@ -975,6 +1035,7 @@ static int smmu_domain_finalise(struct hyp_arm_smmu_v3_device *smmu,
.oas = smmu->oas,
.coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY,
.tlb = &smmu_tlb_ops,
+ .quirks = IO_PGTABLE_QUIRK_UNMAP_INVAL,
};
}
@@ -1125,6 +1186,89 @@ static int smmu_detach_dev(struct kvm_hyp_iommu *iommu, struct kvm_hyp_iommu_dom
return ret;
}
+static int smmu_map_pages(struct kvm_hyp_iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t pgsize,
+ size_t pgcount, int prot, size_t *total_mapped)
+{
+ size_t mapped;
+ size_t granule;
+ int ret = 0;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ struct io_pgtable *pgtable = smmu_domain->pgtable;
+
+ if (!pgtable)
+ return -EINVAL;
+
+ granule = 1UL << __ffs(smmu_domain->pgtable->cfg.pgsize_bitmap);
+ if (!IS_ALIGNED(iova | paddr | pgsize, granule))
+ return -EINVAL;
+
+ hyp_spin_lock(&smmu_domain->pgt_lock);
+ while (pgcount && !ret) {
+ mapped = 0;
+ ret = pgtable->ops.map_pages(&pgtable->ops, iova, paddr,
+ pgsize, pgcount, prot, 0, &mapped);
+ if (ret)
+ break;
+ WARN_ON(!IS_ALIGNED(mapped, pgsize));
+ WARN_ON(mapped > pgcount * pgsize);
+
+ pgcount -= mapped / pgsize;
+ *total_mapped += mapped;
+ iova += mapped;
+ paddr += mapped;
+ }
+ hyp_spin_unlock(&smmu_domain->pgt_lock);
+
+ return 0;
+}
+
+static size_t smmu_unmap_pages(struct kvm_hyp_iommu_domain *domain, unsigned long iova,
+ size_t pgsize, size_t pgcount, struct iommu_iotlb_gather *gather)
+{
+ size_t granule, unmapped, total_unmapped = 0;
+ size_t size = pgsize * pgcount;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ struct io_pgtable *pgtable = smmu_domain->pgtable;
+
+ if (!pgtable)
+ return 0;
+
+ granule = 1UL << __ffs(smmu_domain->pgtable->cfg.pgsize_bitmap);
+ if (!IS_ALIGNED(iova | pgsize, granule))
+ return 0;
+
+ hyp_spin_lock(&smmu_domain->pgt_lock);
+ while (total_unmapped < size) {
+ unmapped = pgtable->ops.unmap_pages(&pgtable->ops, iova, pgsize,
+ pgcount, gather);
+ if (!unmapped)
+ break;
+ iova += unmapped;
+ total_unmapped += unmapped;
+ pgcount -= unmapped / pgsize;
+ }
+ hyp_spin_unlock(&smmu_domain->pgt_lock);
+ return total_unmapped;
+}
+
+static phys_addr_t smmu_iova_to_phys(struct kvm_hyp_iommu_domain *domain,
+ unsigned long iova)
+{
+ phys_addr_t paddr;
+ struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ struct io_pgtable *pgtable = smmu_domain->pgtable;
+
+ if (!pgtable)
+ return -EINVAL;
+
+ hyp_spin_lock(&smmu_domain->pgt_lock);
+ paddr = pgtable->ops.iova_to_phys(&pgtable->ops, iova);
+ hyp_spin_unlock(&smmu_domain->pgt_lock);
+
+ return paddr;
+}
+
/* Shared with the kernel driver in EL1 */
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
@@ -1134,4 +1278,7 @@ struct kvm_iommu_ops smmu_ops = {
.iotlb_sync = smmu_iotlb_sync,
.attach_dev = smmu_attach_dev,
.detach_dev = smmu_detach_dev,
+ .map_pages = smmu_map_pages,
+ .unmap_pages = smmu_unmap_pages,
+ .iova_to_phys = smmu_iova_to_phys,
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 41/58] KVM: arm64: smmu-v3: Add DABT handler
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (39 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 42/58] iommu/arm-smmu-v3-kvm: Add host driver for pKVM Mostafa Saleh
` (17 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add a data abort handler for the SMMUv3; we allow access to the EVTQ
and GERROR registers for debugging purposes.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 58 +++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 1821a3420a4d..2a99873d980f 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -13,6 +13,7 @@
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
#include <nvhe/pkvm.h>
+#include <nvhe/trap_handler.h>
#define ARM_SMMU_POLL_TIMEOUT_US 100000 /* 100ms arbitrary timeout */
@@ -1269,6 +1270,62 @@ static phys_addr_t smmu_iova_to_phys(struct kvm_hyp_iommu_domain *domain,
return paddr;
}
+static bool smmu_dabt_device(struct hyp_arm_smmu_v3_device *smmu,
+ struct kvm_cpu_context *host_ctxt,
+ u64 esr, u32 off)
+{
+ bool is_write = esr & ESR_ELx_WNR;
+ unsigned int len = BIT((esr & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT);
+ int rd = (esr & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
+ const u32 no_access = 0;
+ const u32 read_write = (u32)(-1);
+ const u32 read_only = is_write ? no_access : read_write;
+ u32 mask = no_access;
+
+ /*
+ * Only handle MMIO access with u32 size and alignment.
+ * We don't need to change 64-bit registers for now.
+ */
+ if ((len != sizeof(u32)) || (off & (sizeof(u32) - 1)))
+ return false;
+
+ switch (off) {
+ case ARM_SMMU_EVTQ_PROD + SZ_64K:
+ mask = read_write;
+ break;
+ case ARM_SMMU_EVTQ_CONS + SZ_64K:
+ mask = read_write;
+ break;
+ case ARM_SMMU_GERROR:
+ mask = read_only;
+ break;
+ case ARM_SMMU_GERRORN:
+ mask = read_write;
+ break;
+ };
+
+ if (!mask)
+ return false;
+ if (is_write)
+ writel_relaxed(cpu_reg(host_ctxt, rd) & mask, smmu->base + off);
+ else
+ cpu_reg(host_ctxt, rd) = readl_relaxed(smmu->base + off);
+
+ return true;
+}
+
+static bool smmu_dabt_handler(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr)
+{
+ struct hyp_arm_smmu_v3_device *smmu;
+
+ for_each_smmu(smmu) {
+ if (addr < smmu->mmio_addr || addr >= smmu->mmio_addr + smmu->mmio_size)
+ continue;
+ return smmu_dabt_device(smmu, host_ctxt, esr, addr - smmu->mmio_addr);
+ }
+ return false;
+}
+
/* Shared with the kernel driver in EL1 */
struct kvm_iommu_ops smmu_ops = {
.init = smmu_init,
@@ -1281,4 +1338,5 @@ struct kvm_iommu_ops smmu_ops = {
.map_pages = smmu_map_pages,
.unmap_pages = smmu_unmap_pages,
.iova_to_phys = smmu_iova_to_phys,
+ .dabt_handler = smmu_dabt_handler,
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 42/58] iommu/arm-smmu-v3-kvm: Add host driver for pKVM
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (40 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 41/58] KVM: arm64: smmu-v3: Add DABT handler Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 43/58] iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor Mostafa Saleh
` (16 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Under protected KVM (pKVM), the host does not have access to guest or
hypervisor memory. This means that devices owned by the host must be
isolated by the SMMU, and the hypervisor is in charge of the SMMU.
Introduce the host component that replaces the normal SMMUv3 driver when
pKVM is enabled, and sends configuration and requests to the actual
driver running in the hypervisor (EL2).
Rather than rely on regular driver probe, pKVM directly calls
kvm_arm_smmu_v3_init(), which synchronously finds all SMMUs and hands
them to the hypervisor. If the regular driver is enabled, it will not
find any free SMMU to drive once it gets probed.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/arm/arm-smmu-v3/Makefile | 6 ++
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 64 +++++++++++++++++++
2 files changed, 70 insertions(+)
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 515a84f14783..7a182adbebc1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -6,3 +6,9 @@ arm_smmu_v3-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
arm_smmu_v3-$(CONFIG_TEGRA241_CMDQV) += tegra241-cmdqv.o
obj-$(CONFIG_ARM_SMMU_V3_KUNIT_TEST) += arm-smmu-v3-test.o
+
+obj-$(CONFIG_ARM_SMMU_V3_PKVM) += arm_smmu_v3_kvm.o
+ccflags-$(CONFIG_ARM_SMMU_V3_PKVM) += -Iarch/arm64/kvm/
+arm_smmu_v3_kvm-objs-y += arm-smmu-v3-kvm.o
+arm_smmu_v3_kvm-objs-y += arm-smmu-v3-common.o
+arm_smmu_v3_kvm-objs := $(arm_smmu_v3_kvm-objs-y)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
new file mode 100644
index 000000000000..8cea33d15e08
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pKVM host driver for the Arm SMMUv3
+ *
+ * Copyright (C) 2022 Linaro Ltd.
+ */
+#include <asm/kvm_mmu.h>
+
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+
+#include <kvm/arm_smmu_v3.h>
+
+#include "arm-smmu-v3.h"
+
+extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
+
+static int kvm_arm_smmu_probe(struct platform_device *pdev)
+{
+ return -ENOSYS;
+}
+
+static void kvm_arm_smmu_remove(struct platform_device *pdev)
+{
+}
+
+static const struct of_device_id arm_smmu_of_match[] = {
+ { .compatible = "arm,smmu-v3", },
+ { },
+};
+
+static struct platform_driver kvm_arm_smmu_driver = {
+ .driver = {
+ .name = "kvm-arm-smmu-v3",
+ .of_match_table = arm_smmu_of_match,
+ },
+ .remove = kvm_arm_smmu_remove,
+};
+
+static int kvm_arm_smmu_v3_init_drv(void)
+{
+ return platform_driver_probe(&kvm_arm_smmu_driver, kvm_arm_smmu_probe);
+}
+
+static void kvm_arm_smmu_v3_remove_drv(void)
+{
+ platform_driver_unregister(&kvm_arm_smmu_driver);
+}
+
+struct kvm_iommu_driver kvm_smmu_v3_ops = {
+ .init_driver = kvm_arm_smmu_v3_init_drv,
+ .remove_driver = kvm_arm_smmu_v3_remove_drv,
+};
+
+static int kvm_arm_smmu_v3_register(void)
+{
+ if (!is_protected_kvm_enabled())
+ return 0;
+
+ return kvm_iommu_register_driver(&kvm_smmu_v3_ops,
+ kern_hyp_va(lm_alias(&kvm_nvhe_sym(smmu_ops))));
+};
+
+core_initcall(kvm_arm_smmu_v3_register);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 43/58] iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (41 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 42/58] iommu/arm-smmu-v3-kvm: Add host driver for pKVM Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 44/58] iommu/arm-smmu-v3-kvm: Validate device features Mostafa Saleh
` (15 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Build a list of devices and donate the page to the hypervisor. At this
point the host is trusted and this would be a good opportunity to
provide more information about the system. For example, which devices
are owned by the host (perhaps via the VMID and SW bits in the stream
table, although we populate the stream table lazily at the moment.)
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 128 +++++++++++++++++-
1 file changed, 126 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 8cea33d15e08..e2d9bd97ddc5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -15,9 +15,73 @@
extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
+struct host_arm_smmu_device {
+ struct arm_smmu_device smmu;
+ pkvm_handle_t id;
+};
+
+#define smmu_to_host(_smmu) \
+ container_of(_smmu, struct host_arm_smmu_device, smmu);
+
+static size_t kvm_arm_smmu_cur;
+static size_t kvm_arm_smmu_count;
+static struct hyp_arm_smmu_v3_device *kvm_arm_smmu_array;
+
static int kvm_arm_smmu_probe(struct platform_device *pdev)
{
- return -ENOSYS;
+ int ret;
+ size_t size;
+ phys_addr_t ioaddr;
+ struct resource *res;
+ struct arm_smmu_device *smmu;
+ struct device *dev = &pdev->dev;
+ struct host_arm_smmu_device *host_smmu;
+ struct hyp_arm_smmu_v3_device *hyp_smmu;
+
+ if (kvm_arm_smmu_cur >= kvm_arm_smmu_count)
+ return -ENOSPC;
+
+ hyp_smmu = &kvm_arm_smmu_array[kvm_arm_smmu_cur];
+
+ host_smmu = devm_kzalloc(dev, sizeof(*host_smmu), GFP_KERNEL);
+ if (!host_smmu)
+ return -ENOMEM;
+
+ smmu = &host_smmu->smmu;
+ smmu->dev = dev;
+
+ ret = arm_smmu_fw_probe(pdev, smmu);
+ if (ret)
+ return ret;
+
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ size = resource_size(res);
+ if (size < SZ_128K) {
+ dev_err(dev, "unsupported MMIO region size (%pr)\n", res);
+ return -EINVAL;
+ }
+ ioaddr = res->start;
+ host_smmu->id = kvm_arm_smmu_cur;
+
+ smmu->base = devm_ioremap_resource(dev, res);
+ if (IS_ERR(smmu->base))
+ return PTR_ERR(smmu->base);
+
+ ret = arm_smmu_device_hw_probe(smmu);
+ if (ret)
+ return ret;
+
+ platform_set_drvdata(pdev, smmu);
+
+ /* Hypervisor parameters */
+ hyp_smmu->pgsize_bitmap = smmu->pgsize_bitmap;
+ hyp_smmu->oas = smmu->oas;
+ hyp_smmu->ias = smmu->ias;
+ hyp_smmu->mmio_addr = ioaddr;
+ hyp_smmu->mmio_size = size;
+ kvm_arm_smmu_cur++;
+
+ return arm_smmu_register_iommu(smmu, &kvm_arm_smmu_ops, ioaddr);
}
static void kvm_arm_smmu_remove(struct platform_device *pdev)
@@ -37,9 +101,69 @@ static struct platform_driver kvm_arm_smmu_driver = {
.remove = kvm_arm_smmu_remove,
};
+static int kvm_arm_smmu_array_alloc(void)
+{
+ int smmu_order;
+ struct device_node *np;
+
+ kvm_arm_smmu_count = 0;
+ for_each_compatible_node(np, NULL, "arm,smmu-v3")
+ kvm_arm_smmu_count++;
+
+ if (!kvm_arm_smmu_count)
+ return 0;
+
+ /* Allocate the parameter list shared with the hypervisor */
+ smmu_order = get_order(kvm_arm_smmu_count * sizeof(*kvm_arm_smmu_array));
+ kvm_arm_smmu_array = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ smmu_order);
+ if (!kvm_arm_smmu_array)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void kvm_arm_smmu_array_free(void)
+{
+ int order;
+
+ order = get_order(kvm_arm_smmu_count * sizeof(*kvm_arm_smmu_array));
+ free_pages((unsigned long)kvm_arm_smmu_array, order);
+}
+
static int kvm_arm_smmu_v3_init_drv(void)
{
- return platform_driver_probe(&kvm_arm_smmu_driver, kvm_arm_smmu_probe);
+ int ret;
+
+ /*
+ * Check whether any device owned by the host is behind an SMMU.
+ */
+ ret = kvm_arm_smmu_array_alloc();
+ if (ret || !kvm_arm_smmu_count)
+ return ret;
+
+ ret = platform_driver_probe(&kvm_arm_smmu_driver, kvm_arm_smmu_probe);
+ if (ret)
+ goto err_free;
+
+ if (kvm_arm_smmu_cur != kvm_arm_smmu_count) {
+ /* A device exists but failed to probe */
+ ret = -EUNATCH;
+ goto err_free;
+ }
+
+ /*
+ * These variables are stored in the nVHE image, and won't be accessible
+ * after KVM initialization. Ownership of kvm_arm_smmu_array will be
+ * transferred to the hypervisor as well.
+ */
+ kvm_hyp_arm_smmu_v3_smmus = kvm_arm_smmu_array;
+ kvm_hyp_arm_smmu_v3_count = kvm_arm_smmu_count;
+ return 0;
+
+err_free:
+ kvm_arm_smmu_array_free();
+ return ret;
}
static void kvm_arm_smmu_v3_remove_drv(void)
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 44/58] iommu/arm-smmu-v3-kvm: Validate device features
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (42 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 43/58] iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 45/58] iommu/arm-smmu-v3-kvm: Allocate structures and reset device Mostafa Saleh
` (14 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
The KVM hypervisor driver supports a small subset of features. Ensure
the implementation is compatible, and disable some unused features.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 43 +++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index e2d9bd97ddc5..4b0c9ff6e7f1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -27,6 +27,45 @@ static size_t kvm_arm_smmu_cur;
static size_t kvm_arm_smmu_count;
static struct hyp_arm_smmu_v3_device *kvm_arm_smmu_array;
+static bool kvm_arm_smmu_validate_features(struct arm_smmu_device *smmu)
+{
+ unsigned int required_features =
+ ARM_SMMU_FEAT_TT_LE;
+ unsigned int forbidden_features =
+ ARM_SMMU_FEAT_STALL_FORCE;
+ unsigned int keep_features =
+ ARM_SMMU_FEAT_2_LVL_STRTAB |
+ ARM_SMMU_FEAT_2_LVL_CDTAB |
+ ARM_SMMU_FEAT_TT_LE |
+ ARM_SMMU_FEAT_SEV |
+ ARM_SMMU_FEAT_COHERENCY |
+ ARM_SMMU_FEAT_TRANS_S1 |
+ ARM_SMMU_FEAT_TRANS_S2 |
+ ARM_SMMU_FEAT_VAX |
+ ARM_SMMU_FEAT_RANGE_INV;
+
+ if (smmu->options & ARM_SMMU_OPT_PAGE0_REGS_ONLY) {
+ dev_err(smmu->dev, "unsupported layout\n");
+ return false;
+ }
+
+ if ((smmu->features & required_features) != required_features) {
+ dev_err(smmu->dev, "missing features 0x%x\n",
+ required_features & ~smmu->features);
+ return false;
+ }
+
+ if (smmu->features & forbidden_features) {
+ dev_err(smmu->dev, "features 0x%x forbidden\n",
+ smmu->features & forbidden_features);
+ return false;
+ }
+
+ smmu->features &= keep_features;
+
+ return true;
+}
+
static int kvm_arm_smmu_probe(struct platform_device *pdev)
{
int ret;
@@ -71,6 +110,9 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
if (ret)
return ret;
+ if (!kvm_arm_smmu_validate_features(smmu))
+ return -ENODEV;
+
platform_set_drvdata(pdev, smmu);
/* Hypervisor parameters */
@@ -79,6 +121,7 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
hyp_smmu->ias = smmu->ias;
hyp_smmu->mmio_addr = ioaddr;
hyp_smmu->mmio_size = size;
+ hyp_smmu->features = smmu->features;
kvm_arm_smmu_cur++;
return arm_smmu_register_iommu(smmu, &kvm_arm_smmu_ops, ioaddr);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 45/58] iommu/arm-smmu-v3-kvm: Allocate structures and reset device
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (43 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 44/58] iommu/arm-smmu-v3-kvm: Validate device features Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 46/58] KVM: arm64: Add function to topup generic allocator Mostafa Saleh
` (13 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Allocate the structures that will be shared between hypervisor and SMMU:
command queue and stream table. Install them in the MMIO registers,
along with some configuration bits. After hyp initialization, the host
won't have access to those pages anymore.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 53 +++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 4b0c9ff6e7f1..e4a5bdc830bc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -18,6 +18,7 @@ extern struct kvm_iommu_ops kvm_nvhe_sym(smmu_ops);
struct host_arm_smmu_device {
struct arm_smmu_device smmu;
pkvm_handle_t id;
+ u32 boot_gbpa;
};
#define smmu_to_host(_smmu) \
@@ -66,6 +67,35 @@ static bool kvm_arm_smmu_validate_features(struct arm_smmu_device *smmu)
return true;
}
+static int kvm_arm_smmu_device_reset(struct host_arm_smmu_device *host_smmu)
+{
+ int ret;
+ u32 reg;
+ struct arm_smmu_device *smmu = &host_smmu->smmu;
+
+ reg = readl_relaxed(smmu->base + ARM_SMMU_CR0);
+ if (reg & CR0_SMMUEN)
+ dev_warn(smmu->dev, "SMMU currently enabled! Resetting...\n");
+
+ /* Disable bypass */
+ host_smmu->boot_gbpa = readl_relaxed(smmu->base + ARM_SMMU_GBPA);
+ ret = arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
+ if (ret)
+ return ret;
+
+ ret = arm_smmu_device_disable(smmu);
+ if (ret)
+ return ret;
+
+ /* Stream table */
+ arm_smmu_write_strtab(smmu);
+
+ /* Command queue */
+ writeq_relaxed(smmu->cmdq.q.q_base, smmu->base + ARM_SMMU_CMDQ_BASE);
+
+ return 0;
+}
+
static int kvm_arm_smmu_probe(struct platform_device *pdev)
{
int ret;
@@ -113,6 +143,20 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
if (!kvm_arm_smmu_validate_features(smmu))
return -ENODEV;
+ ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, smmu->base,
+ ARM_SMMU_CMDQ_PROD, ARM_SMMU_CMDQ_CONS,
+ CMDQ_ENT_DWORDS, "cmdq");
+ if (ret)
+ return ret;
+
+ ret = arm_smmu_init_strtab(smmu);
+ if (ret)
+ return ret;
+
+ ret = kvm_arm_smmu_device_reset(host_smmu);
+ if (ret)
+ return ret;
+
platform_set_drvdata(pdev, smmu);
/* Hypervisor parameters */
@@ -129,6 +173,15 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
static void kvm_arm_smmu_remove(struct platform_device *pdev)
{
+ struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
+ struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);
+
+ /*
+ * There was an error during hypervisor setup. The hyp driver may
+ * have already enabled the device, so disable it.
+ */
+ arm_smmu_device_disable(smmu);
+ arm_smmu_update_gbpa(smmu, host_smmu->boot_gbpa, GBPA_ABORT);
}
static const struct of_device_id arm_smmu_of_match[] = {
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 46/58] KVM: arm64: Add function to topup generic allocator
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (44 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 45/58] iommu/arm-smmu-v3-kvm: Allocate structures and reset device Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 47/58] KVM: arm64: Add macro for SMCCC call with all returns Mostafa Saleh
` (12 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Soon, the IOMMU driver might need to top up the IOMMU pool from the
map_pages IOMMU operation, which takes a gfp flag as it may be called
from atomic context. Add a function that tops up an allocator,
identified by ID, and also accepts gfp flags.
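For illustration, a hypothetical caller refilling the hypervisor IOMMU
allocator from an atomic path could look as follows. Only
__pkvm_topup_hyp_alloc_mgt_gfp() and HYP_ALLOC_MGT_IOMMU_ID (both
declared in asm/kvm_host.h by this patch) are real; the wrapper is made
up:

  /* Hypothetical helper: refill the hypervisor IOMMU pool from atomic context */
  static int refill_iommu_pool_atomic(unsigned long nr_pages, unsigned long sz_alloc)
  {
          /* GFP_ATOMIC because map_pages may run with locks held */
          return __pkvm_topup_hyp_alloc_mgt_gfp(HYP_ALLOC_MGT_IOMMU_ID,
                                                nr_pages, sz_alloc, GFP_ATOMIC);
  }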
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_host.h | 4 ++++
arch/arm64/kvm/mmu.c | 20 ++++++++++++++++++++
arch/arm64/kvm/pkvm.c | 20 ++++++++++++++++++++
3 files changed, 44 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index a3b5d8dd8995..59a23828bd0e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -155,6 +155,8 @@ static inline void __free_hyp_memcache(struct kvm_hyp_memcache *mc,
void free_hyp_memcache(struct kvm_hyp_memcache *mc);
int topup_hyp_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages, unsigned long order);
+int topup_hyp_memcache_gfp(struct kvm_hyp_memcache *mc, unsigned long min_pages,
+ unsigned long order, gfp_t gfp);
static inline void init_hyp_memcache(struct kvm_hyp_memcache *mc)
{
@@ -1628,6 +1630,8 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
#define HYP_ALLOC_MGT_IOMMU_ID 1
unsigned long __pkvm_reclaim_hyp_alloc_mgt(unsigned long nr_pages);
+int __pkvm_topup_hyp_alloc_mgt_gfp(unsigned long id, unsigned long nr_pages,
+ unsigned long sz_alloc, gfp_t gfp);
struct kvm_iommu_driver {
int (*init_driver)(void);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ef7e8c156afb..229338877c59 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1225,6 +1225,11 @@ static void *hyp_mc_alloc_fn(void *flags, unsigned long order)
return addr;
}
+static void *hyp_mc_alloc_gfp_fn(void *flags, unsigned long order)
+{
+ return (void *)__get_free_pages(*(gfp_t *)flags, order);
+}
+
void free_hyp_memcache(struct kvm_hyp_memcache *mc)
{
unsigned long flags = mc->flags;
@@ -1249,6 +1254,21 @@ int topup_hyp_memcache(struct kvm_hyp_memcache *mc, unsigned long min_pages,
kvm_host_pa, (void *)flags, order);
}
+int topup_hyp_memcache_gfp(struct kvm_hyp_memcache *mc, unsigned long min_pages,
+ unsigned long order, gfp_t gfp)
+{
+ void *flags = &gfp;
+
+ if (!is_protected_kvm_enabled())
+ return 0;
+
+ if (order > PAGE_SHIFT)
+ return -E2BIG;
+
+ return __topup_hyp_memcache(mc, min_pages, hyp_mc_alloc_gfp_fn,
+ kvm_host_pa, flags, order);
+}
+
/**
* kvm_phys_addr_ioremap - map a device range to guest IPA
*
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index e6df35aae840..0c45acbbff6e 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -1114,3 +1114,23 @@ unsigned long __pkvm_reclaim_hyp_alloc_mgt(unsigned long nr_pages)
return reclaimed;
}
+
+int __pkvm_topup_hyp_alloc_mgt_gfp(unsigned long id, unsigned long nr_pages,
+ unsigned long sz_alloc, gfp_t gfp)
+{
+ struct kvm_hyp_memcache mc;
+ int ret;
+
+ init_hyp_memcache(&mc);
+
+ ret = topup_hyp_memcache_gfp(&mc, nr_pages, get_order(sz_alloc), gfp);
+ if (ret)
+ return ret;
+
+ ret = kvm_call_hyp_nvhe(__pkvm_hyp_alloc_mgt_refill, id,
+ mc.head, mc.nr_pages);
+ if (ret)
+ free_hyp_memcache(&mc);
+
+ return ret;
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 47/58] KVM: arm64: Add macro for SMCCC call with all returns
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (45 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 46/58] KVM: arm64: Add function to topup generic allocator Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 48/58] iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator Mostafa Saleh
` (11 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add a macro that performs a hypercall and returns the full set of SMCCC
return values, not just the error code.
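As an example, a caller that needs the a1 return value (mirroring how
the SMMU driver uses the macro later in this series) might look like
the sketch below; everything except the hypercall name and the macro is
illustrative:

  static size_t map_pages_once(pkvm_handle_t domain_id, unsigned long iova,
                               phys_addr_t paddr, size_t pgsize, size_t pgcount,
                               int prot)
  {
          struct arm_smccc_res res;

          res = kvm_call_hyp_nvhe_smccc(__pkvm_host_iommu_map_pages, domain_id,
                                        iova, paddr, pgsize, pgcount, prot);
          /* a0 is checked (WARN_ON) by the macro; a1 carries the bytes mapped */
          return res.a1;
  }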
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_host.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 59a23828bd0e..3cdc99ebdd0d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1248,7 +1248,18 @@ void kvm_arm_resume_guest(struct kvm *kvm);
#define vcpu_has_run_once(vcpu) !!rcu_access_pointer((vcpu)->pid)
#ifndef __KVM_NVHE_HYPERVISOR__
-#define kvm_call_hyp_nvhe(f, ...) \
+#define kvm_call_hyp_nvhe_smccc(f, ...) \
+ ({ \
+ struct arm_smccc_res res; \
+ \
+ arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(f), \
+ ##__VA_ARGS__, &res); \
+ WARN_ON(res.a0 != SMCCC_RET_SUCCESS); \
+ \
+ res; \
+ })
+
+#define kvm_call_hyp_nvhe(f, ...) \
({ \
struct arm_smccc_res res; \
\
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 48/58] iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (46 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 47/58] KVM: arm64: Add macro for SMCCC call with all returns Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 49/58] iommu/arm-smmu-v3-kvm: Add IOMMU ops Mostafa Saleh
` (10 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
The hypervisor returns memory allocation requests from HVCs, encoded
in the return registers.
Add a function that checks those return values and tops up the IOMMU
allocator in response to such requests, and a macro that wraps an HVC
with a call to this function.
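A later patch uses the macro as in the condensed sketch below; only the
hypercall name and kvm_call_hyp_nvhe_mc() are real, the wrapper
function is illustrative:

  static int attach_one_sid(pkvm_handle_t smmu_id, pkvm_handle_t domain_id, u32 sid)
  {
          /* Retries the HVC, topping up the allocator on hypervisor memory requests */
          return kvm_call_hyp_nvhe_mc(__pkvm_host_iommu_attach_dev,
                                      smmu_id, domain_id, sid, 0, 0);
  }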
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 40 +++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index e4a5bdc830bc..dab2d59b5a88 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -4,6 +4,7 @@
*
* Copyright (C) 2022 Linaro Ltd.
*/
+#include <asm/kvm_pkvm.h>
#include <asm/kvm_mmu.h>
#include <linux/of_platform.h>
@@ -28,6 +29,45 @@ static size_t kvm_arm_smmu_cur;
static size_t kvm_arm_smmu_count;
static struct hyp_arm_smmu_v3_device *kvm_arm_smmu_array;
+static int kvm_arm_smmu_topup_memcache(struct arm_smccc_res *res, gfp_t gfp)
+{
+ struct kvm_hyp_req req;
+
+ hyp_reqs_smccc_decode(res, &req);
+
+ if ((res->a1 == -ENOMEM) && (req.type != KVM_HYP_REQ_TYPE_MEM)) {
+ /*
+ * There is no way for drivers to populate hyp_alloc requests,
+ * so -ENOMEM + no request indicates that.
+ */
+ return __pkvm_topup_hyp_alloc(1);
+ } else if (req.type != KVM_HYP_REQ_TYPE_MEM) {
+ return -EBADE;
+ }
+
+ if (req.mem.dest == REQ_MEM_DEST_HYP_IOMMU) {
+ return __pkvm_topup_hyp_alloc_mgt_gfp(HYP_ALLOC_MGT_IOMMU_ID,
+ req.mem.nr_pages,
+ req.mem.sz_alloc,
+ gfp);
+ } else if (req.mem.dest == REQ_MEM_DEST_HYP_ALLOC) {
+ /* Fill hyp alloc*/
+ return __pkvm_topup_hyp_alloc(req.mem.nr_pages);
+ }
+
+ pr_err("Bogus mem request");
+ return -EBADE;
+}
+
+#define kvm_call_hyp_nvhe_mc(...) \
+({ \
+ struct arm_smccc_res __res; \
+ do { \
+ __res = kvm_call_hyp_nvhe_smccc(__VA_ARGS__); \
+ } while (__res.a1 && !kvm_arm_smmu_topup_memcache(&__res, GFP_KERNEL));\
+ __res.a1; \
+})
+
static bool kvm_arm_smmu_validate_features(struct arm_smmu_device *smmu)
{
unsigned int required_features =
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 49/58] iommu/arm-smmu-v3-kvm: Add IOMMU ops
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (47 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 48/58] iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 50/58] iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations Mostafa Saleh
` (9 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add iommu_ops: attach_dev, release_device, probe_device, domain_alloc/
free and capable, plus some ops shared with the kernel SMMUv3 driver:
device_group, of_xlate and get_resv_regions.
Other ops such as map/unmap and iova_to_phys are added next.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 284 ++++++++++++++++++
1 file changed, 284 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index dab2d59b5a88..071743f5acf9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -7,6 +7,7 @@
#include <asm/kvm_pkvm.h>
#include <asm/kvm_mmu.h>
+#include <linux/arm-smccc.h>
#include <linux/of_platform.h>
#include <linux/platform_device.h>
@@ -25,9 +26,26 @@ struct host_arm_smmu_device {
#define smmu_to_host(_smmu) \
container_of(_smmu, struct host_arm_smmu_device, smmu);
+struct kvm_arm_smmu_master {
+ struct arm_smmu_device *smmu;
+ struct device *dev;
+ struct kvm_arm_smmu_domain *domain;
+};
+
+struct kvm_arm_smmu_domain {
+ struct iommu_domain domain;
+ struct arm_smmu_device *smmu;
+ struct mutex init_mutex;
+ pkvm_handle_t id;
+};
+
+#define to_kvm_smmu_domain(_domain) \
+ container_of(_domain, struct kvm_arm_smmu_domain, domain)
+
static size_t kvm_arm_smmu_cur;
static size_t kvm_arm_smmu_count;
static struct hyp_arm_smmu_v3_device *kvm_arm_smmu_array;
+static DEFINE_IDA(kvm_arm_smmu_domain_ida);
static int kvm_arm_smmu_topup_memcache(struct arm_smccc_res *res, gfp_t gfp)
{
@@ -68,6 +86,267 @@ static int kvm_arm_smmu_topup_memcache(struct arm_smccc_res *res, gfp_t gfp)
__res.a1; \
})
+static struct platform_driver kvm_arm_smmu_driver;
+
+static struct arm_smmu_device *
+kvm_arm_smmu_get_by_fwnode(struct fwnode_handle *fwnode)
+{
+ struct device *dev;
+
+ dev = driver_find_device_by_fwnode(&kvm_arm_smmu_driver.driver, fwnode);
+ put_device(dev);
+ return dev ? dev_get_drvdata(dev) : NULL;
+}
+
+static struct iommu_ops kvm_arm_smmu_ops;
+
+static struct iommu_device *kvm_arm_smmu_probe_device(struct device *dev)
+{
+ struct arm_smmu_device *smmu;
+ struct kvm_arm_smmu_master *master;
+ struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+
+ if (WARN_ON_ONCE(dev_iommu_priv_get(dev)))
+ return ERR_PTR(-EBUSY);
+
+ smmu = kvm_arm_smmu_get_by_fwnode(fwspec->iommu_fwnode);
+ if (!smmu)
+ return ERR_PTR(-ENODEV);
+
+ master = kzalloc(sizeof(*master), GFP_KERNEL);
+ if (!master)
+ return ERR_PTR(-ENOMEM);
+
+ master->dev = dev;
+ master->smmu = smmu;
+ dev_iommu_priv_set(dev, master);
+
+ return &smmu->iommu;
+}
+
+static struct iommu_domain *kvm_arm_smmu_domain_alloc(unsigned type)
+{
+ struct kvm_arm_smmu_domain *kvm_smmu_domain;
+
+ /*
+ * We don't support
+ * - IOMMU_DOMAIN_DMA_FQ because lazy unmap would clash with memory
+ * donation to guests.
+ * - IOMMU_DOMAIN_IDENTITY: Requires a stage-2 only transparent domain.
+ */
+ if (type != IOMMU_DOMAIN_DMA &&
+ type != IOMMU_DOMAIN_UNMANAGED)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ kvm_smmu_domain = kzalloc(sizeof(*kvm_smmu_domain), GFP_KERNEL);
+ if (!kvm_smmu_domain)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&kvm_smmu_domain->init_mutex);
+
+ return &kvm_smmu_domain->domain;
+}
+
+static int kvm_arm_smmu_domain_finalize(struct kvm_arm_smmu_domain *kvm_smmu_domain,
+ struct kvm_arm_smmu_master *master)
+{
+ int ret = 0;
+ struct arm_smmu_device *smmu = master->smmu;
+ unsigned int max_domains;
+ enum kvm_arm_smmu_domain_type type;
+ struct io_pgtable_cfg cfg;
+ unsigned long ias;
+
+ if (kvm_smmu_domain->smmu && (kvm_smmu_domain->smmu != smmu))
+ return -EINVAL;
+
+ if (kvm_smmu_domain->smmu)
+ return 0;
+ /* Default to stage-1. */
+ if (smmu->features & ARM_SMMU_FEAT_TRANS_S1) {
+ ias = (smmu->features & ARM_SMMU_FEAT_VAX) ? 52 : 48;
+ cfg = (struct io_pgtable_cfg) {
+ .fmt = ARM_64_LPAE_S1,
+ .pgsize_bitmap = smmu->pgsize_bitmap,
+ .ias = min_t(unsigned long, ias, VA_BITS),
+ .oas = smmu->ias,
+ .coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY,
+ };
+ ret = io_pgtable_configure(&cfg);
+ if (ret)
+ return ret;
+
+ type = KVM_ARM_SMMU_DOMAIN_S1;
+ kvm_smmu_domain->domain.pgsize_bitmap = cfg.pgsize_bitmap;
+ kvm_smmu_domain->domain.geometry.aperture_end = (1UL << cfg.ias) - 1;
+ max_domains = 1 << smmu->asid_bits;
+ } else {
+ cfg = (struct io_pgtable_cfg) {
+ .fmt = ARM_64_LPAE_S2,
+ .pgsize_bitmap = smmu->pgsize_bitmap,
+ .ias = smmu->ias,
+ .oas = smmu->oas,
+ .coherent_walk = smmu->features & ARM_SMMU_FEAT_COHERENCY,
+ };
+ ret = io_pgtable_configure(&cfg);
+ if (ret)
+ return ret;
+
+ type = KVM_ARM_SMMU_DOMAIN_S2;
+ kvm_smmu_domain->domain.pgsize_bitmap = cfg.pgsize_bitmap;
+ kvm_smmu_domain->domain.geometry.aperture_end = (1UL << cfg.ias) - 1;
+ max_domains = 1 << smmu->vmid_bits;
+ }
+ kvm_smmu_domain->domain.geometry.force_aperture = true;
+
+ /*
+ * The hypervisor uses the domain_id for asid/vmid so it has to be
+ * unique, and it has to be in range of this smmu, which can be
+ * either 8 or 16 bits.
+ */
+ ret = ida_alloc_range(&kvm_arm_smmu_domain_ida, 0,
+ min(KVM_IOMMU_MAX_DOMAINS, max_domains), GFP_KERNEL);
+ if (ret < 0)
+ return ret;
+
+ kvm_smmu_domain->id = ret;
+
+ ret = kvm_call_hyp_nvhe_mc(__pkvm_host_iommu_alloc_domain,
+ kvm_smmu_domain->id, type);
+ if (ret) {
+ ida_free(&kvm_arm_smmu_domain_ida, kvm_smmu_domain->id);
+ return ret;
+ }
+
+ kvm_smmu_domain->smmu = smmu;
+ return 0;
+}
+
+static void kvm_arm_smmu_domain_free(struct iommu_domain *domain)
+{
+ int ret;
+ struct kvm_arm_smmu_domain *kvm_smmu_domain = to_kvm_smmu_domain(domain);
+ struct arm_smmu_device *smmu = kvm_smmu_domain->smmu;
+
+ if (smmu) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_iommu_free_domain, kvm_smmu_domain->id);
+ ida_free(&kvm_arm_smmu_domain_ida, kvm_smmu_domain->id);
+ }
+ kfree(kvm_smmu_domain);
+}
+
+static int kvm_arm_smmu_detach_dev(struct host_arm_smmu_device *host_smmu,
+ struct kvm_arm_smmu_master *master)
+{
+ int i, ret;
+ struct arm_smmu_device *smmu = &host_smmu->smmu;
+ struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
+ struct kvm_arm_smmu_domain *domain = master->domain;
+
+ if (!domain)
+ return 0;
+
+ for (i = 0; i < fwspec->num_ids; i++) {
+ int sid = fwspec->ids[i];
+
+ ret = kvm_call_hyp_nvhe(__pkvm_host_iommu_detach_dev,
+ host_smmu->id, domain->id, sid, 0);
+ if (ret) {
+ dev_err(smmu->dev, "cannot detach device %s (0x%x): %d\n",
+ dev_name(master->dev), sid, ret);
+ break;
+ }
+ }
+
+ master->domain = NULL;
+
+ return ret;
+}
+
+static void kvm_arm_smmu_release_device(struct device *dev)
+{
+ struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct host_arm_smmu_device *host_smmu = smmu_to_host(master->smmu);
+
+ kvm_arm_smmu_detach_dev(host_smmu, master);
+ kfree(master);
+ iommu_fwspec_free(dev);
+}
+
+static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
+{
+ int i, ret;
+ struct arm_smmu_device *smmu;
+ struct host_arm_smmu_device *host_smmu;
+ struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+ struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct kvm_arm_smmu_domain *kvm_smmu_domain = to_kvm_smmu_domain(domain);
+
+ if (!master)
+ return -ENODEV;
+
+ smmu = master->smmu;
+ host_smmu = smmu_to_host(smmu);
+
+ ret = kvm_arm_smmu_detach_dev(host_smmu, master);
+ if (ret)
+ return ret;
+
+ mutex_lock(&kvm_smmu_domain->init_mutex);
+ ret = kvm_arm_smmu_domain_finalize(kvm_smmu_domain, master);
+ mutex_unlock(&kvm_smmu_domain->init_mutex);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < fwspec->num_ids; i++) {
+ int sid = fwspec->ids[i];
+
+ ret = kvm_call_hyp_nvhe_mc(__pkvm_host_iommu_attach_dev,
+ host_smmu->id, kvm_smmu_domain->id,
+ sid, 0, 0);
+ if (ret) {
+ dev_err(smmu->dev, "cannot attach device %s (0x%x): %d\n",
+ dev_name(dev), sid, ret);
+ goto out_ret;
+ }
+ }
+ master->domain = kvm_smmu_domain;
+
+out_ret:
+ if (ret)
+ kvm_arm_smmu_detach_dev(host_smmu, master);
+ return ret;
+}
+
+static bool kvm_arm_smmu_capable(struct device *dev, enum iommu_cap cap)
+{
+ struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+ switch (cap) {
+ case IOMMU_CAP_CACHE_COHERENCY:
+ return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
+ case IOMMU_CAP_NOEXEC:
+ default:
+ return false;
+ }
+}
+
+static struct iommu_ops kvm_arm_smmu_ops = {
+ .capable = kvm_arm_smmu_capable,
+ .device_group = arm_smmu_device_group,
+ .of_xlate = arm_smmu_of_xlate,
+ .get_resv_regions = arm_smmu_get_resv_regions,
+ .probe_device = kvm_arm_smmu_probe_device,
+ .release_device = kvm_arm_smmu_release_device,
+ .domain_alloc = kvm_arm_smmu_domain_alloc,
+ .pgsize_bitmap = -1UL,
+ .owner = THIS_MODULE,
+ .default_domain_ops = &(const struct iommu_domain_ops) {
+ .attach_dev = kvm_arm_smmu_attach_dev,
+ .free = kvm_arm_smmu_domain_free,
+ }
+};
+
static bool kvm_arm_smmu_validate_features(struct arm_smmu_device *smmu)
{
unsigned int required_features =
@@ -183,6 +462,11 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
if (!kvm_arm_smmu_validate_features(smmu))
return -ENODEV;
+ if (kvm_arm_smmu_ops.pgsize_bitmap == -1UL)
+ kvm_arm_smmu_ops.pgsize_bitmap = smmu->pgsize_bitmap;
+ else
+ kvm_arm_smmu_ops.pgsize_bitmap |= smmu->pgsize_bitmap;
+
ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, smmu->base,
ARM_SMMU_CMDQ_PROD, ARM_SMMU_CMDQ_CONS,
CMDQ_ENT_DWORDS, "cmdq");
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 50/58] iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (48 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 49/58] iommu/arm-smmu-v3-kvm: Add IOMMU ops Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 51/58] iommu/arm-smmu-v3-kvm: Support PASID operations Mostafa Saleh
` (8 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add map, unmap and iova_to_phys, which are forwarded to the
hypervisor.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 72 +++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 071743f5acf9..82f0191b222c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -331,6 +331,75 @@ static bool kvm_arm_smmu_capable(struct device *dev, enum iommu_cap cap)
}
}
+static int kvm_arm_smmu_map_pages(struct iommu_domain *domain,
+ unsigned long iova, phys_addr_t paddr,
+ size_t pgsize, size_t pgcount, int prot,
+ gfp_t gfp, size_t *total_mapped)
+{
+ size_t mapped;
+ size_t size = pgsize * pgcount;
+ struct kvm_arm_smmu_domain *kvm_smmu_domain = to_kvm_smmu_domain(domain);
+ struct arm_smccc_res res;
+
+ do {
+ res = kvm_call_hyp_nvhe_smccc(__pkvm_host_iommu_map_pages,
+ kvm_smmu_domain->id,
+ iova, paddr, pgsize, pgcount, prot);
+ mapped = res.a1;
+ iova += mapped;
+ paddr += mapped;
+ WARN_ON(mapped % pgsize);
+ WARN_ON(mapped > pgcount * pgsize);
+ pgcount -= mapped / pgsize;
+ *total_mapped += mapped;
+ } while (*total_mapped < size && !kvm_arm_smmu_topup_memcache(&res, gfp));
+ if (*total_mapped < size)
+ return -EINVAL;
+
+ return 0;
+}
+
+static size_t kvm_arm_smmu_unmap_pages(struct iommu_domain *domain,
+ unsigned long iova, size_t pgsize,
+ size_t pgcount,
+ struct iommu_iotlb_gather *iotlb_gather)
+{
+ size_t unmapped;
+ size_t total_unmapped = 0;
+ size_t size = pgsize * pgcount;
+ struct kvm_arm_smmu_domain *kvm_smmu_domain = to_kvm_smmu_domain(domain);
+ struct arm_smccc_res res;
+
+ do {
+ res = kvm_call_hyp_nvhe_smccc(__pkvm_host_iommu_unmap_pages,
+ kvm_smmu_domain->id,
+ iova, pgsize, pgcount);
+ unmapped = res.a1;
+ total_unmapped += unmapped;
+ iova += unmapped;
+ WARN_ON(unmapped % pgsize);
+ pgcount -= unmapped / pgsize;
+
+ /*
+ * The page table driver can unmap less than we asked for. If it
+ * didn't unmap anything at all, then it either reached the end
+ * of the range, or it needs a page in the memcache to break a
+ * block mapping.
+ */
+ } while (total_unmapped < size &&
+ (unmapped || !kvm_arm_smmu_topup_memcache(&res, GFP_ATOMIC)));
+
+ return total_unmapped;
+}
+
+static phys_addr_t kvm_arm_smmu_iova_to_phys(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ struct kvm_arm_smmu_domain *kvm_smmu_domain = to_kvm_smmu_domain(domain);
+
+ return kvm_call_hyp_nvhe(__pkvm_host_iommu_iova_to_phys, kvm_smmu_domain->id, iova);
+}
+
static struct iommu_ops kvm_arm_smmu_ops = {
.capable = kvm_arm_smmu_capable,
.device_group = arm_smmu_device_group,
@@ -344,6 +413,9 @@ static struct iommu_ops kvm_arm_smmu_ops = {
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = kvm_arm_smmu_attach_dev,
.free = kvm_arm_smmu_domain_free,
+ .map_pages = kvm_arm_smmu_map_pages,
+ .unmap_pages = kvm_arm_smmu_unmap_pages,
+ .iova_to_phys = kvm_arm_smmu_iova_to_phys,
}
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 51/58] iommu/arm-smmu-v3-kvm: Support PASID operations
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (49 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 50/58] iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 52/58] iommu/arm-smmu-v3-kvm: Add IRQs for the driver Mostafa Saleh
` (7 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add support for set_dev_pasid and remove_dev_pasid. The hypervisor
already supports PASID, so we only need to forward the PASID in the
hypercalls, in addition to proper tracking of domains per master.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 58 +++++++++++++++----
1 file changed, 48 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 82f0191b222c..cbcd8a75d562 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -29,7 +29,8 @@ struct host_arm_smmu_device {
struct kvm_arm_smmu_master {
struct arm_smmu_device *smmu;
struct device *dev;
- struct kvm_arm_smmu_domain *domain;
+ struct xarray domains;
+ u32 ssid_bits;
};
struct kvm_arm_smmu_domain {
@@ -119,6 +120,10 @@ static struct iommu_device *kvm_arm_smmu_probe_device(struct device *dev)
master->dev = dev;
master->smmu = smmu;
+
+ device_property_read_u32(dev, "pasid-num-bits", &master->ssid_bits);
+ master->ssid_bits = min(smmu->ssid_bits, master->ssid_bits);
+ xa_init(&master->domains);
dev_iommu_priv_set(dev, master);
return &smmu->iommu;
@@ -235,13 +240,14 @@ static void kvm_arm_smmu_domain_free(struct iommu_domain *domain)
kfree(kvm_smmu_domain);
}
-static int kvm_arm_smmu_detach_dev(struct host_arm_smmu_device *host_smmu,
- struct kvm_arm_smmu_master *master)
+static int kvm_arm_smmu_detach_dev_pasid(struct host_arm_smmu_device *host_smmu,
+ struct kvm_arm_smmu_master *master,
+ ioasid_t pasid)
{
int i, ret;
struct arm_smmu_device *smmu = &host_smmu->smmu;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
- struct kvm_arm_smmu_domain *domain = master->domain;
+ struct kvm_arm_smmu_domain *domain = xa_load(&master->domains, pasid);
if (!domain)
return 0;
@@ -250,7 +256,7 @@ static int kvm_arm_smmu_detach_dev(struct host_arm_smmu_device *host_smmu,
int sid = fwspec->ids[i];
ret = kvm_call_hyp_nvhe(__pkvm_host_iommu_detach_dev,
- host_smmu->id, domain->id, sid, 0);
+ host_smmu->id, domain->id, sid, pasid);
if (ret) {
dev_err(smmu->dev, "cannot detach device %s (0x%x): %d\n",
dev_name(master->dev), sid, ret);
@@ -258,22 +264,39 @@ static int kvm_arm_smmu_detach_dev(struct host_arm_smmu_device *host_smmu,
}
}
- master->domain = NULL;
+ xa_erase(&master->domains, pasid);
return ret;
}
+static int kvm_arm_smmu_detach_dev(struct host_arm_smmu_device *host_smmu,
+ struct kvm_arm_smmu_master *master)
+{
+ return kvm_arm_smmu_detach_dev_pasid(host_smmu, master, 0);
+}
+
+static void kvm_arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid,
+ struct iommu_domain *domain)
+{
+ struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct host_arm_smmu_device *host_smmu = smmu_to_host(master->smmu);
+
+ kvm_arm_smmu_detach_dev_pasid(host_smmu, master, pasid);
+}
+
static void kvm_arm_smmu_release_device(struct device *dev)
{
struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
struct host_arm_smmu_device *host_smmu = smmu_to_host(master->smmu);
kvm_arm_smmu_detach_dev(host_smmu, master);
+ xa_destroy(&master->domains);
kfree(master);
iommu_fwspec_free(dev);
}
-static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
+static int kvm_arm_smmu_set_dev_pasid(struct iommu_domain *domain,
+ struct device *dev, ioasid_t pasid)
{
int i, ret;
struct arm_smmu_device *smmu;
@@ -288,7 +311,7 @@ static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain, struct device *d
smmu = master->smmu;
host_smmu = smmu_to_host(smmu);
- ret = kvm_arm_smmu_detach_dev(host_smmu, master);
+ ret = kvm_arm_smmu_detach_dev_pasid(host_smmu, master, pasid);
if (ret)
return ret;
@@ -303,14 +326,14 @@ static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain, struct device *d
ret = kvm_call_hyp_nvhe_mc(__pkvm_host_iommu_attach_dev,
host_smmu->id, kvm_smmu_domain->id,
- sid, 0, 0);
+ sid, pasid, master->ssid_bits);
if (ret) {
dev_err(smmu->dev, "cannot attach device %s (0x%x): %d\n",
dev_name(dev), sid, ret);
goto out_ret;
}
}
- master->domain = kvm_smmu_domain;
+ ret = xa_insert(&master->domains, pasid, kvm_smmu_domain, GFP_KERNEL);
out_ret:
if (ret)
@@ -318,6 +341,19 @@ static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain, struct device *d
return ret;
}
+static int kvm_arm_smmu_attach_dev(struct iommu_domain *domain,
+ struct device *dev)
+{
+ struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
+ unsigned long pasid = 0;
+
+ /* All pasids must be removed first. */
+ if (xa_find_after(&master->domains, &pasid, ULONG_MAX, XA_PRESENT))
+ return -EBUSY;
+
+ return kvm_arm_smmu_set_dev_pasid(domain, dev, 0);
+}
+
static bool kvm_arm_smmu_capable(struct device *dev, enum iommu_cap cap)
{
struct kvm_arm_smmu_master *master = dev_iommu_priv_get(dev);
@@ -409,6 +445,7 @@ static struct iommu_ops kvm_arm_smmu_ops = {
.release_device = kvm_arm_smmu_release_device,
.domain_alloc = kvm_arm_smmu_domain_alloc,
.pgsize_bitmap = -1UL,
+ .remove_dev_pasid = kvm_arm_smmu_remove_dev_pasid,
.owner = THIS_MODULE,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = kvm_arm_smmu_attach_dev,
@@ -416,6 +453,7 @@ static struct iommu_ops kvm_arm_smmu_ops = {
.map_pages = kvm_arm_smmu_map_pages,
.unmap_pages = kvm_arm_smmu_unmap_pages,
.iova_to_phys = kvm_arm_smmu_iova_to_phys,
+ .set_dev_pasid = kvm_arm_smmu_set_dev_pasid,
}
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 52/58] iommu/arm-smmu-v3-kvm: Add IRQs for the driver
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (50 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 51/58] iommu/arm-smmu-v3-kvm: Support PASID operations Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 53/58] iommu/arm-smmu-v3-kvm: Probe power domains Mostafa Saleh
` (6 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Handle IRQs in the KVM kernel driver. It should be safe to do this
from the kernel as it is a debug feature.
Only the GERROR and EVTQ IRQs are handled.
Unlike the kernel driver, we don't do much here (no SMMU reset, no
interaction with the cmdq), just printing.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 3 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 138 ++++++++++++++++++
2 files changed, 139 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 2a99873d980f..60f0760f49eb 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -365,7 +365,6 @@ static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
FIELD_PREP(CR1_QUEUE_IC, CR1_CACHE_WB);
writel_relaxed(val, smmu->base + ARM_SMMU_CR1);
writel_relaxed(CR2_PTM, smmu->base + ARM_SMMU_CR2);
- writel_relaxed(0, smmu->base + ARM_SMMU_IRQ_CTRL);
val = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
old = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
@@ -540,7 +539,7 @@ static int smmu_reset_device(struct hyp_arm_smmu_v3_device *smmu)
goto err_disable_cmdq;
/* Enable translation */
- return smmu_write_cr0(smmu, CR0_SMMUEN | CR0_CMDQEN | CR0_ATSCHK);
+ return smmu_write_cr0(smmu, CR0_SMMUEN | CR0_CMDQEN | CR0_ATSCHK | CR0_EVTQEN);
err_disable_cmdq:
return smmu_write_cr0(smmu, 0);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index cbcd8a75d562..674ce2b02a4b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -496,11 +496,107 @@ static bool kvm_arm_smmu_validate_features(struct arm_smmu_device *smmu)
return true;
}
+static irqreturn_t kvm_arm_smmu_evt_handler(int irq, void *dev)
+{
+ int i;
+ struct arm_smmu_device *smmu = dev;
+ struct arm_smmu_queue *q = &smmu->evtq.q;
+ struct arm_smmu_ll_queue *llq = &q->llq;
+ static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+ u64 evt[EVTQ_ENT_DWORDS];
+
+ do {
+ while (!queue_remove_raw(q, evt)) {
+ u8 id = FIELD_GET(EVTQ_0_ID, evt[0]);
+
+ if (!__ratelimit(&rs))
+ continue;
+
+ dev_info(smmu->dev, "event 0x%02x received:\n", id);
+ for (i = 0; i < ARRAY_SIZE(evt); ++i)
+ dev_info(smmu->dev, "\t0x%016llx\n",
+ (unsigned long long)evt[i]);
+
+ cond_resched();
+ }
+
+ /*
+ * Not much we can do on overflow, so scream and pretend we're
+ * trying harder.
+ */
+ if (queue_sync_prod_in(q) == -EOVERFLOW)
+ dev_err(smmu->dev, "EVTQ overflow detected -- events lost\n");
+ } while (!queue_empty(llq));
+
+ /* Sync our overflow flag, as we believe we're up to speed */
+ queue_sync_cons_ovf(q);
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t kvm_arm_smmu_gerror_handler(int irq, void *dev)
+{
+ u32 gerror, gerrorn, active;
+ struct arm_smmu_device *smmu = dev;
+
+ gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
+ gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
+
+ active = gerror ^ gerrorn;
+ if (!(active & GERROR_ERR_MASK))
+ return IRQ_NONE; /* No errors pending */
+
+ dev_warn(smmu->dev,
+ "unexpected global error reported (0x%08x), this could be serious\n",
+ active);
+
+ /* There is no API to reconfigure the device at the moment.*/
+ if (active & GERROR_SFM_ERR)
+ dev_err(smmu->dev, "device has entered Service Failure Mode!\n");
+
+ if (active & GERROR_MSI_GERROR_ABT_ERR)
+ dev_warn(smmu->dev, "GERROR MSI write aborted\n");
+
+ if (active & GERROR_MSI_PRIQ_ABT_ERR)
+ dev_warn(smmu->dev, "PRIQ MSI write aborted\n");
+
+ if (active & GERROR_MSI_EVTQ_ABT_ERR)
+ dev_warn(smmu->dev, "EVTQ MSI write aborted\n");
+
+ if (active & GERROR_MSI_CMDQ_ABT_ERR)
+ dev_warn(smmu->dev, "CMDQ MSI write aborted\n");
+
+ if (active & GERROR_PRIQ_ABT_ERR)
+ dev_err(smmu->dev, "PRIQ write aborted -- events may have been lost\n");
+
+ if (active & GERROR_EVTQ_ABT_ERR)
+ dev_err(smmu->dev, "EVTQ write aborted -- events may have been lost\n");
+
+ if (active & GERROR_CMDQ_ERR) {
+ dev_err(smmu->dev, "CMDQ ERR -- Hypervisor cmdq corrupted?\n");
+ BUG();
+ }
+
+ writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t kvm_arm_smmu_pri_handler(int irq, void *dev)
+{
+ struct arm_smmu_device *smmu = dev;
+
+ dev_err(smmu->dev, "PRI not supported in KVM driver!\n");
+
+ return IRQ_HANDLED;
+}
+
static int kvm_arm_smmu_device_reset(struct host_arm_smmu_device *host_smmu)
{
int ret;
u32 reg;
struct arm_smmu_device *smmu = &host_smmu->smmu;
+ u32 irqen_flags = IRQ_CTRL_EVTQ_IRQEN | IRQ_CTRL_GERROR_IRQEN;
reg = readl_relaxed(smmu->base + ARM_SMMU_CR0);
if (reg & CR0_SMMUEN)
@@ -522,6 +618,39 @@ static int kvm_arm_smmu_device_reset(struct host_arm_smmu_device *host_smmu)
/* Command queue */
writeq_relaxed(smmu->cmdq.q.q_base, smmu->base + ARM_SMMU_CMDQ_BASE);
+ /* Event queue */
+ writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
+ writel_relaxed(smmu->evtq.q.llq.prod, smmu->base + SZ_64K + ARM_SMMU_EVTQ_PROD);
+ writel_relaxed(smmu->evtq.q.llq.cons, smmu->base + SZ_64K + ARM_SMMU_EVTQ_CONS);
+
+ /* Disable IRQs first */
+ ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
+ ARM_SMMU_IRQ_CTRLACK);
+ if (ret) {
+ dev_err(smmu->dev, "failed to disable irqs\n");
+ return ret;
+ }
+
+ /*
+ * We don't support combined irqs for now, no specific reason, they are uncommon
+ * so we just try to avoid bloating the code.
+ */
+ if (smmu->combined_irq)
+ dev_err(smmu->dev, "Combined irqs not supported by this driver\n");
+ else
+ arm_smmu_setup_unique_irqs(smmu, kvm_arm_smmu_evt_handler,
+ kvm_arm_smmu_gerror_handler,
+ kvm_arm_smmu_pri_handler);
+
+ if (smmu->features & ARM_SMMU_FEAT_PRI)
+ irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
+
+ /* Enable interrupt generation on the SMMU */
+ ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
+ ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
+ if (ret)
+ dev_warn(smmu->dev, "failed to enable irqs\n");
+
return 0;
}
@@ -565,6 +694,8 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
if (IS_ERR(smmu->base))
return PTR_ERR(smmu->base);
+ arm_smmu_probe_irq(pdev, smmu);
+
ret = arm_smmu_device_hw_probe(smmu);
if (ret)
return ret;
@@ -583,6 +714,13 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
if (ret)
return ret;
+ /* evtq */
+ ret = arm_smmu_init_one_queue(smmu, &smmu->evtq.q, smmu->base + SZ_64K,
+ ARM_SMMU_EVTQ_PROD, ARM_SMMU_EVTQ_CONS,
+ EVTQ_ENT_DWORDS, "evtq");
+ if (ret)
+ return ret;
+
ret = arm_smmu_init_strtab(smmu);
if (ret)
return ret;
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 53/58] iommu/arm-smmu-v3-kvm: Probe power domains
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (51 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 52/58] iommu/arm-smmu-v3-kvm: Add IRQs for the driver Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 54/58] iommu/arm-smmu-v3-kvm: Enable runtime PM Mostafa Saleh
` (5 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
From: Jean-Philippe Brucker <jean-philippe@linaro.org>
Try to use SCMI if possible; otherwise rely on an HVC to the hypervisor
to notify it about power changes, which is ONLY safe if the SMMU resets
to a state that blocks DMA.
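The HVC fallback amounts to notifying the hypervisor on every power
transition. A sketch of that path is below; the hypercall is the one
used by the suspend/resume callbacks added in the next patch, the
wrapper is illustrative, and passing 1 on power-on is assumed from the
suspend path, which passes 0:

  static int notify_smmu_power(struct host_arm_smmu_device *host_smmu, bool on)
  {
          if (host_smmu->power_domain.type != KVM_POWER_DOMAIN_HOST_HVC)
                  return 0;

          /* Second argument: 1 when powering on, 0 when powering off */
          return kvm_call_hyp_nvhe(__pkvm_host_hvc_pd, host_smmu->id, on);
  }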
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 78 +++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index 674ce2b02a4b..deeed994a131 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -8,6 +8,7 @@
#include <asm/kvm_mmu.h>
#include <linux/arm-smccc.h>
+#include <linux/of_address.h>
#include <linux/of_platform.h>
#include <linux/platform_device.h>
@@ -21,6 +22,7 @@ struct host_arm_smmu_device {
struct arm_smmu_device smmu;
pkvm_handle_t id;
u32 boot_gbpa;
+ struct kvm_power_domain power_domain;
};
#define smmu_to_host(_smmu) \
@@ -654,6 +656,77 @@ static int kvm_arm_smmu_device_reset(struct host_arm_smmu_device *host_smmu)
return 0;
}
+static int kvm_arm_probe_scmi_pd(struct device_node *scmi_node,
+ struct kvm_power_domain *pd)
+{
+ int ret;
+ struct resource res;
+ struct of_phandle_args args;
+
+ pd->type = KVM_POWER_DOMAIN_ARM_SCMI;
+
+ ret = of_parse_phandle_with_args(scmi_node, "shmem", NULL, 0, &args);
+ if (ret)
+ return ret;
+
+ ret = of_address_to_resource(args.np, 0, &res);
+ if (ret)
+ goto out_put_nodes;
+
+ ret = of_property_read_u32(scmi_node, "arm,smc-id",
+ &pd->arm_scmi.smc_id);
+ if (ret)
+ goto out_put_nodes;
+
+ /*
+ * The shared buffer is unmapped from the host while a request is in
+ * flight, so it has to be on its own page.
+ */
+ if (!IS_ALIGNED(res.start, SZ_64K) || resource_size(&res) < SZ_64K) {
+ ret = -EINVAL;
+ goto out_put_nodes;
+ }
+
+ pd->arm_scmi.shmem_base = res.start;
+ pd->arm_scmi.shmem_size = resource_size(&res);
+
+out_put_nodes:
+ of_node_put(args.np);
+ return ret;
+}
+
+/* TODO: Move this. None of it is specific to SMMU */
+static int kvm_arm_probe_power_domain(struct device *dev,
+ struct kvm_power_domain *pd)
+{
+ int ret;
+ struct device_node *parent;
+ struct of_phandle_args args;
+
+ if (!of_get_property(dev->of_node, "power-domains", NULL))
+ return 0;
+
+ ret = of_parse_phandle_with_args(dev->of_node, "power-domains",
+ "#power-domain-cells", 0, &args);
+ if (ret)
+ return ret;
+
+ parent = of_get_parent(args.np);
+ if (parent && of_device_is_compatible(parent, "arm,scmi-smc") &&
+ args.args_count > 0) {
+ pd->arm_scmi.domain_id = args.args[0];
+ ret = kvm_arm_probe_scmi_pd(parent, pd);
+ } else {
+ dev_warn(dev, "Unknown PM method for %pOF, using HVC\n",
+ args.np);
+ pd->type = KVM_POWER_DOMAIN_HOST_HVC;
+ pd->device_id = kvm_arm_smmu_cur;
+ }
+ of_node_put(parent);
+ of_node_put(args.np);
+ return ret;
+}
+
static int kvm_arm_smmu_probe(struct platform_device *pdev)
{
int ret;
@@ -681,6 +754,10 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
if (ret)
return ret;
+ ret = kvm_arm_probe_power_domain(dev, &host_smmu->power_domain);
+ if (ret)
+ return ret;
+
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
size = resource_size(res);
if (size < SZ_128K) {
@@ -738,6 +815,7 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
hyp_smmu->mmio_addr = ioaddr;
hyp_smmu->mmio_size = size;
hyp_smmu->features = smmu->features;
+ hyp_smmu->iommu.power_domain = host_smmu->power_domain;
kvm_arm_smmu_cur++;
return arm_smmu_register_iommu(smmu, &kvm_arm_smmu_ops, ioaddr);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 54/58] iommu/arm-smmu-v3-kvm: Enable runtime PM
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (52 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 53/58] iommu/arm-smmu-v3-kvm: Probe power domains Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations Mostafa Saleh
` (4 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Enable runtime PM for the KVM SMMUv3 driver. The PM link to DMA masters
dictates when the SMMU should be powered on.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 66 +++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index deeed994a131..e987c273ff3c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -11,6 +11,7 @@
#include <linux/of_address.h>
#include <linux/of_platform.h>
#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
#include <kvm/arm_smmu_v3.h>
@@ -127,6 +128,12 @@ static struct iommu_device *kvm_arm_smmu_probe_device(struct device *dev)
master->ssid_bits = min(smmu->ssid_bits, master->ssid_bits);
xa_init(&master->domains);
dev_iommu_priv_set(dev, master);
+ if (!device_link_add(dev, smmu->dev,
+ DL_FLAG_PM_RUNTIME |
+ DL_FLAG_AUTOREMOVE_SUPPLIER)) {
+ kfree(master);
+ return ERR_PTR(-ENOLINK);
+ }
return &smmu->iommu;
}
@@ -818,6 +825,14 @@ static int kvm_arm_smmu_probe(struct platform_device *pdev)
hyp_smmu->iommu.power_domain = host_smmu->power_domain;
kvm_arm_smmu_cur++;
+ pm_runtime_set_active(dev);
+ pm_runtime_enable(dev);
+ /*
+ * Take a reference to keep the SMMU powered on while the hypervisor
+ * initializes it.
+ */
+ pm_runtime_resume_and_get(dev);
+
return arm_smmu_register_iommu(smmu, &kvm_arm_smmu_ops, ioaddr);
}
@@ -826,6 +841,8 @@ static void kvm_arm_smmu_remove(struct platform_device *pdev)
struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);
+ pm_runtime_disable(&pdev->dev);
+ pm_runtime_set_suspended(&pdev->dev);
/*
* There was an error during hypervisor setup. The hyp driver may
* have already enabled the device, so disable it.
@@ -834,6 +851,30 @@ static void kvm_arm_smmu_remove(struct platform_device *pdev)
arm_smmu_update_gbpa(smmu, host_smmu->boot_gbpa, GBPA_ABORT);
}
+static int kvm_arm_smmu_suspend(struct device *dev)
+{
+ struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+ struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);
+
+ if (host_smmu->power_domain.type == KVM_POWER_DOMAIN_HOST_HVC)
+ return kvm_call_hyp_nvhe(__pkvm_host_hvc_pd, host_smmu->id, 0);
+ return 0;
+}
+
+static int kvm_arm_smmu_resume(struct device *dev)
+{
+ struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+ struct host_arm_smmu_device *host_smmu = smmu_to_host(smmu);
+
+ if (host_smmu->power_domain.type == KVM_POWER_DOMAIN_HOST_HVC)
+ return kvm_call_hyp_nvhe(__pkvm_host_hvc_pd, host_smmu->id, 1);
+ return 0;
+}
+
+static const struct dev_pm_ops kvm_arm_smmu_pm_ops = {
+ SET_RUNTIME_PM_OPS(kvm_arm_smmu_suspend, kvm_arm_smmu_resume, NULL)
+};
+
static const struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v3", },
{ },
@@ -843,6 +884,7 @@ static struct platform_driver kvm_arm_smmu_driver = {
.driver = {
.name = "kvm-arm-smmu-v3",
.of_match_table = arm_smmu_of_match,
+ .pm = &kvm_arm_smmu_pm_ops,
},
.remove = kvm_arm_smmu_remove,
};
@@ -877,6 +919,12 @@ static void kvm_arm_smmu_array_free(void)
free_pages((unsigned long)kvm_arm_smmu_array, order);
}
+static int smmu_put_device(struct device *dev, void *data)
+{
+ pm_runtime_put(dev);
+ return 0;
+}
+
static int kvm_arm_smmu_v3_init_drv(void)
{
int ret;
@@ -905,6 +953,7 @@ static int kvm_arm_smmu_v3_init_drv(void)
*/
kvm_hyp_arm_smmu_v3_smmus = kvm_arm_smmu_array;
kvm_hyp_arm_smmu_v3_count = kvm_arm_smmu_count;
+
return 0;
err_free:
@@ -931,4 +980,21 @@ static int kvm_arm_smmu_v3_register(void)
kern_hyp_va(lm_alias(&kvm_nvhe_sym(smmu_ops))));
};
+/*
+ * KVM initializes the hypervisor at the device_sync initcall level, so
+ * drop the PM references taken at probe in a late initcall, where the
+ * hypervisor is guaranteed to have already initialized the SMMUs.
+ */
+static int kvm_arm_smmu_v3_post_init(void)
+{
+ if (!kvm_arm_smmu_count)
+ return 0;
+
+ WARN_ON(driver_for_each_device(&kvm_arm_smmu_driver.driver, NULL,
+ NULL, smmu_put_device));
+ return 0;
+}
+
core_initcall(kvm_arm_smmu_v3_register);
+late_initcall(kvm_arm_smmu_v3_post_init);
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (53 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 54/58] iommu/arm-smmu-v3-kvm: Enable runtime PM Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-19 12:48 ` Robin Murphy
2024-12-12 18:04 ` [RFC PATCH v2 56/58] KVM: arm64: iommu: Add hypercall for map_sg Mostafa Saleh
` (3 subsequent siblings)
58 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
The pKVM SMMUv3 driver, which para-virtualizes the IOMMU in the
hypervisor, has extra overhead in map_sg: it loops over iommu_map, and
each call requires a context switch to the hypervisor, disabling
interrupts...
Instead, add new domain operations:
- alloc_cookie_sg: Allocate a new deferred sg cookie
- add_deferred_map_sg: Add a mapping to the cookie
- consume_deferred_map_sg: Consume and release the cookie
Alternatively, we could pass the sg list as is. However, this would
duplicate some of the logic, and it makes more sense to consolidate
all the sg list parsing for IOMMU drivers in one place.
virtio-iommu is another IOMMU that can benefit from this, but it would
need a new operation that standardizes passing an sglist based on
these ops.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++--
include/linux/iommu.h | 19 ++++++++++++++++
2 files changed, 70 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 83c8e617a2c5..3a3c48631dd6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2608,6 +2608,37 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
}
EXPORT_SYMBOL_GPL(iommu_unmap_fast);
+static int __iommu_add_sg(struct iommu_map_cookie_sg *cookie_sg,
+ unsigned long iova, phys_addr_t paddr, size_t size)
+{
+ struct iommu_domain *domain = cookie_sg->domain;
+ const struct iommu_domain_ops *ops = domain->ops;
+ unsigned int min_pagesz;
+ size_t pgsize, count;
+
+ if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
+ return -EINVAL;
+
+ if (WARN_ON(domain->pgsize_bitmap == 0UL))
+ return -ENODEV;
+
+ /* find out the minimum page size supported */
+ min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+
+ /*
+ * both the virtual address and the physical one, as well as
+ * the size of the mapping, must be aligned (at least) to the
+ * size of the smallest page supported by the hardware
+ */
+ if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
+ pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
+ iova, &paddr, size, min_pagesz);
+ return -EINVAL;
+ }
+ pgsize = iommu_pgsize(domain, iova, paddr, size, &count);
+ return ops->add_deferred_map_sg(cookie_sg, paddr, pgsize, count);
+}
+
ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents, int prot,
gfp_t gfp)
@@ -2617,6 +2648,9 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
phys_addr_t start;
unsigned int i = 0;
int ret;
+ bool deferred_sg = ops->alloc_cookie_sg && ops->add_deferred_map_sg &&
+ ops->consume_deferred_map_sg;
+ struct iommu_map_cookie_sg *cookie_sg;
might_sleep_if(gfpflags_allow_blocking(gfp));
@@ -2625,12 +2659,24 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
__GFP_HIGHMEM)))
return -EINVAL;
+ if (deferred_sg) {
+ cookie_sg = ops->alloc_cookie_sg(iova, prot, nents, gfp);
+ if (!cookie_sg) {
+ pr_err("iommu: failed alloc cookie\n");
+ return -ENOMEM;
+ }
+ cookie_sg->domain = domain;
+ }
+
while (i <= nents) {
phys_addr_t s_phys = sg_phys(sg);
if (len && s_phys != start + len) {
- ret = __iommu_map(domain, iova + mapped, start,
- len, prot, gfp);
+ if (deferred_sg)
+ ret = __iommu_add_sg(cookie_sg, iova + mapped, start, len);
+ else
+ ret = __iommu_map(domain, iova + mapped, start,
+ len, prot, gfp);
if (ret)
goto out_err;
@@ -2654,6 +2700,9 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
sg = sg_next(sg);
}
+ if (deferred_sg)
+ ops->consume_deferred_map_sg(cookie_sg);
+
if (ops->iotlb_sync_map) {
ret = ops->iotlb_sync_map(domain, iova, mapped);
if (ret)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c75877044185..5e60ac349228 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -601,6 +601,14 @@ struct iommu_ops {
u8 user_pasid_table:1;
};
+/**
+ * struct iommu_map_cookie_sg - Cookie for a deferred map sg
+ * @domain: Domain for the sg list
+ */
+struct iommu_map_cookie_sg {
+ struct iommu_domain *domain;
+};
+
/**
* struct iommu_domain_ops - domain specific operations
* @attach_dev: attach an iommu domain to a device
@@ -638,6 +646,11 @@ struct iommu_ops {
* @enable_nesting: Enable nesting
* @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
* @free: Release the domain after use.
+ * @alloc_cookie_sg: Allocate a cookie used to build an sg list,
+ * filled by the following functions
+ * @add_deferred_map_sg: Add a mapping to the cookie of an sg list.
+ * @consume_deferred_map_sg: Consume the sg list now that all mappings are added;
+ * it should also release the cookie as it's no longer needed.
*/
struct iommu_domain_ops {
int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
@@ -668,6 +681,12 @@ struct iommu_domain_ops {
unsigned long quirks);
void (*free)(struct iommu_domain *domain);
+
+ struct iommu_map_cookie_sg *(*alloc_cookie_sg)(unsigned long iova, int prot,
+ unsigned int nents, gfp_t gfp);
+ int (*add_deferred_map_sg)(struct iommu_map_cookie_sg *cookie,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount);
+ int (*consume_deferred_map_sg)(struct iommu_map_cookie_sg *cookie);
};
/**
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 56/58] KVM: arm64: iommu: Add hypercall for map_sg
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (54 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 57/58] iommu/arm-smmu-v3-kvm: Implement sg operations Mostafa Saleh
` (2 subsequent siblings)
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Add a new type, struct kvm_iommu_sg, which describes a simple sglist,
and a hypercall that consumes it while calling the map_pages ops.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/include/asm/kvm_host.h | 19 ++++++++
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 2 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 14 ++++++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 58 +++++++++++++++++++++++++
arch/arm64/kvm/iommu.c | 32 ++++++++++++++
6 files changed, 126 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 3dbf30cd10f3..f2b86d1a62ed 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -115,6 +115,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_unmap_pages,
__KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_iova_to_phys,
__KVM_HOST_SMCCC_FUNC___pkvm_host_hvc_pd,
+ __KVM_HOST_SMCCC_FUNC___pkvm_host_iommu_map_sg,
/*
* Start of the dynamically registered hypercalls. Start a bit
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3cdc99ebdd0d..704648619d28 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1655,4 +1655,23 @@ int kvm_iommu_register_driver(struct kvm_iommu_driver *kern_ops,
int kvm_iommu_init_driver(void);
void kvm_iommu_remove_driver(void);
+struct kvm_iommu_sg {
+ phys_addr_t phys;
+ size_t pgsize;
+ unsigned int pgcount;
+};
+
+static inline struct kvm_iommu_sg *kvm_iommu_sg_alloc(unsigned int nents, gfp_t gfp)
+{
+ return alloc_pages_exact(PAGE_ALIGN(nents * sizeof(struct kvm_iommu_sg)), gfp);
+}
+
+static inline void kvm_iommu_sg_free(struct kvm_iommu_sg *sg, unsigned int nents)
+{
+ free_pages_exact(sg, PAGE_ALIGN(nents * sizeof(struct kvm_iommu_sg)));
+}
+
+int kvm_iommu_share_hyp_sg(struct kvm_iommu_sg *sg, unsigned int nents);
+int kvm_iommu_unshare_hyp_sg(struct kvm_iommu_sg *sg, unsigned int nents);
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/iommu.h b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
index cff75d67d807..1004465b680a 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/iommu.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/iommu.h
@@ -22,6 +22,8 @@ size_t kvm_iommu_unmap_pages(pkvm_handle_t domain_id, unsigned long iova,
size_t pgsize, size_t pgcount);
phys_addr_t kvm_iommu_iova_to_phys(pkvm_handle_t domain_id, unsigned long iova);
bool kvm_iommu_host_dabt_handler(struct kvm_cpu_context *host_ctxt, u64 esr, u64 addr);
+size_t kvm_iommu_map_sg(pkvm_handle_t domain, unsigned long iova, struct kvm_iommu_sg *sg,
+ unsigned int nent, unsigned int prot);
/* Flags for memory allocation for IOMMU drivers */
#define IOMMU_PAGE_NOCACHE BIT(0)
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 1ab8e5507825..5659aae0c758 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -1682,6 +1682,19 @@ static void handle___pkvm_host_hvc_pd(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = pkvm_host_hvc_pd(device_id, on);
}
+static void handle___pkvm_host_iommu_map_sg(struct kvm_cpu_context *host_ctxt)
+{
+ unsigned long ret;
+ DECLARE_REG(pkvm_handle_t, domain, host_ctxt, 1);
+ DECLARE_REG(unsigned long, iova, host_ctxt, 2);
+ DECLARE_REG(struct kvm_iommu_sg *, sg, host_ctxt, 3);
+ DECLARE_REG(unsigned int, nent, host_ctxt, 4);
+ DECLARE_REG(unsigned int, prot, host_ctxt, 5);
+
+ ret = kvm_iommu_map_sg(domain, iova, kern_hyp_va(sg), nent, prot);
+ hyp_reqs_smccc_encode(ret, host_ctxt, this_cpu_ptr(&host_hyp_reqs));
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -1747,6 +1760,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_host_iommu_unmap_pages),
HANDLE_FUNC(__pkvm_host_iommu_iova_to_phys),
HANDLE_FUNC(__pkvm_host_hvc_pd),
+ HANDLE_FUNC(__pkvm_host_iommu_map_sg),
};
static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
index e45dadd0c4aa..b0c9b9086fd1 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
@@ -392,6 +392,64 @@ bool kvm_iommu_host_dabt_handler(struct kvm_cpu_context *host_ctxt, u64 esr, u64
return ret;
}
+size_t kvm_iommu_map_sg(pkvm_handle_t domain_id, unsigned long iova, struct kvm_iommu_sg *sg,
+ unsigned int nent, unsigned int prot)
+{
+ int ret;
+ size_t total_mapped = 0, mapped;
+ struct kvm_hyp_iommu_domain *domain;
+ phys_addr_t phys;
+ size_t size, pgsize, pgcount;
+ unsigned int orig_nent = nent;
+ struct kvm_iommu_sg *orig_sg = sg;
+
+ if (!kvm_iommu_ops || !kvm_iommu_ops->map_pages)
+ return 0;
+
+ if (prot & ~IOMMU_PROT_MASK)
+ return 0;
+
+ domain = handle_to_domain(domain_id);
+ if (!domain || domain_get(domain))
+ return 0;
+
+ ret = hyp_pin_shared_mem(sg, sg + nent);
+ if (ret)
+ goto out_put_domain;
+
+ while (nent--) {
+ phys = sg->phys;
+ pgsize = sg->pgsize;
+ pgcount = sg->pgcount;
+
+ if (__builtin_mul_overflow(pgsize, pgcount, &size) ||
+ iova + size < iova)
+ goto out_unpin_sg;
+
+ ret = __pkvm_host_use_dma(phys, size);
+ if (ret)
+ goto out_unpin_sg;
+
+ mapped = 0;
+ kvm_iommu_ops->map_pages(domain, iova, phys, pgsize, pgcount, prot, &mapped);
+ total_mapped += mapped;
+ phys += mapped;
+ iova += mapped;
+ /* Might need memory */
+ if (mapped != size) {
+ __pkvm_host_unuse_dma(phys, size - mapped);
+ break;
+ }
+ sg++;
+ }
+
+out_unpin_sg:
+ hyp_unpin_shared_mem(orig_sg, orig_sg + orig_nent);
+out_put_domain:
+ domain_put(domain);
+ return total_mapped;
+}
+
static int iommu_power_on(struct kvm_power_domain *pd)
{
struct kvm_hyp_iommu *iommu = container_of(pd, struct kvm_hyp_iommu,
diff --git a/arch/arm64/kvm/iommu.c b/arch/arm64/kvm/iommu.c
index af3417e6259d..99718af0cba6 100644
--- a/arch/arm64/kvm/iommu.c
+++ b/arch/arm64/kvm/iommu.c
@@ -55,3 +55,35 @@ void kvm_iommu_remove_driver(void)
if (smp_load_acquire(&iommu_driver))
iommu_driver->remove_driver();
}
+
+int kvm_iommu_share_hyp_sg(struct kvm_iommu_sg *sg, unsigned int nents)
+{
+ size_t nr_pages = PAGE_ALIGN(sizeof(*sg) * nents) >> PAGE_SHIFT;
+ phys_addr_t sg_pfn = virt_to_phys(sg) >> PAGE_SHIFT;
+ int i;
+ int ret;
+
+ for (i = 0 ; i < nr_pages ; ++i) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp, sg_pfn + i);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+int kvm_iommu_unshare_hyp_sg(struct kvm_iommu_sg *sg, unsigned int nents)
+{
+ size_t nr_pages = PAGE_ALIGN(sizeof(*sg) * nents) >> PAGE_SHIFT;
+ phys_addr_t sg_pfn = virt_to_phys(sg) >> PAGE_SHIFT;
+ int i;
+ int ret;
+
+ for (i = 0 ; i < nr_pages ; ++i) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp, sg_pfn + i);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 57/58] iommu/arm-smmu-v3-kvm: Implement sg operations
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (55 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 56/58] KVM: arm64: iommu: Add hypercall for map_sg Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 58/58] iommu/arm-smmu-v3-kvm: Support command queue batching Mostafa Saleh
2024-12-12 19:41 ` [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Jason Gunthorpe
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Implement the new map_sg ops, which mainly populate the kvm_iommu_sg
and pass it to the hypervisor.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 93 +++++++++++++++++++
1 file changed, 93 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
index e987c273ff3c..ac45455b384d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
@@ -445,6 +445,96 @@ static phys_addr_t kvm_arm_smmu_iova_to_phys(struct iommu_domain *domain,
return kvm_call_hyp_nvhe(__pkvm_host_iommu_iova_to_phys, kvm_smmu_domain->id, iova);
}
+struct kvm_arm_smmu_map_sg {
+ struct iommu_map_cookie_sg cookie;
+ struct kvm_iommu_sg *sg;
+ unsigned int ptr;
+ unsigned long iova;
+ int prot;
+ gfp_t gfp;
+ unsigned int nents;
+};
+
+static struct iommu_map_cookie_sg *kvm_arm_smmu_alloc_cookie_sg(unsigned long iova,
+ int prot,
+ unsigned int nents,
+ gfp_t gfp)
+{
+ int ret;
+ struct kvm_arm_smmu_map_sg *map_sg = kzalloc(sizeof(*map_sg), gfp);
+
+ if (!map_sg)
+ return NULL;
+
+ map_sg->sg = kvm_iommu_sg_alloc(nents, gfp);
+ if (!map_sg->sg) {
+ kfree(map_sg);
+ return NULL;
+ }
+ map_sg->iova = iova;
+ map_sg->prot = prot;
+ map_sg->gfp = gfp;
+ map_sg->nents = nents;
+ ret = kvm_iommu_share_hyp_sg(map_sg->sg, nents);
+ if (ret) {
+ kvm_iommu_sg_free(map_sg->sg, nents);
+ kfree(map_sg);
+ return NULL;
+ }
+
+ return &map_sg->cookie;
+}
+
+static int kvm_arm_smmu_add_deferred_map_sg(struct iommu_map_cookie_sg *cookie,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount)
+{
+ struct kvm_arm_smmu_map_sg *map_sg = container_of(cookie, struct kvm_arm_smmu_map_sg,
+ cookie);
+ struct kvm_iommu_sg *sg = map_sg->sg;
+
+ sg[map_sg->ptr].phys = paddr;
+ sg[map_sg->ptr].pgsize = pgsize;
+ sg[map_sg->ptr].pgcount = pgcount;
+ map_sg->ptr++;
+ return 0;
+}
+
+static int kvm_arm_smmu_consume_deferred_map_sg(struct iommu_map_cookie_sg *cookie)
+{
+ struct kvm_arm_smmu_map_sg *map_sg = container_of(cookie, struct kvm_arm_smmu_map_sg,
+ cookie);
+ struct kvm_iommu_sg *sg = map_sg->sg;
+ size_t mapped, total_mapped = 0;
+ struct arm_smccc_res res;
+ struct kvm_arm_smmu_domain *kvm_smmu_domain = to_kvm_smmu_domain(map_sg->cookie.domain);
+
+ do {
+ res = kvm_call_hyp_nvhe_smccc(__pkvm_host_iommu_map_sg,
+ kvm_smmu_domain->id,
+ map_sg->iova, sg, map_sg->ptr, map_sg->prot);
+ mapped = res.a1;
+ map_sg->iova += mapped;
+ total_mapped += mapped;
+ /* Skip mapped */
+ while (mapped) {
+ if (mapped < (sg->pgsize * sg->pgcount)) {
+ sg->phys += mapped;
+ sg->pgcount -= mapped / sg->pgsize;
+ mapped = 0;
+ } else {
+ mapped -= sg->pgsize * sg->pgcount;
+ sg++;
+ map_sg->ptr--;
+ }
+ }
+
+ kvm_arm_smmu_topup_memcache(&res, map_sg->gfp);
+ } while (map_sg->ptr);
+
+ kvm_iommu_unshare_hyp_sg(sg, map_sg->nents);
+ kvm_iommu_sg_free(sg, map_sg->nents);
+ kfree(map_sg);
+ return 0;
+}
+
static struct iommu_ops kvm_arm_smmu_ops = {
.capable = kvm_arm_smmu_capable,
.device_group = arm_smmu_device_group,
@@ -463,6 +553,9 @@ static struct iommu_ops kvm_arm_smmu_ops = {
.unmap_pages = kvm_arm_smmu_unmap_pages,
.iova_to_phys = kvm_arm_smmu_iova_to_phys,
.set_dev_pasid = kvm_arm_smmu_set_dev_pasid,
+ .alloc_cookie_sg = kvm_arm_smmu_alloc_cookie_sg,
+ .add_deferred_map_sg = kvm_arm_smmu_add_deferred_map_sg,
+ .consume_deferred_map_sg = kvm_arm_smmu_consume_deferred_map_sg,
}
};
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 58/58] iommu/arm-smmu-v3-kvm: Support command queue batching
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (56 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 57/58] iommu/arm-smmu-v3-kvm: Implement sg operations Mostafa Saleh
@ 2024-12-12 18:04 ` Mostafa Saleh
2024-12-12 19:41 ` [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Jason Gunthorpe
58 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-12 18:04 UTC (permalink / raw)
To: iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, robin.murphy,
jean-philippe, jgg, nicolinc, vdonnefort, qperret, tabba,
danielmentz, tzukui, Mostafa Saleh
Similar to the kernel driver, we can batch commands at EL2 to avoid
writes to MMIO space. This is quite noticeable if the SMMU doesn't
support range invalidation, as it then has to invalidate page by page.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
arch/arm64/include/asm/arm-smmu-v3-common.h | 16 ++++
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 95 ++++++++++++++++-----
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 15 ----
3 files changed, 88 insertions(+), 38 deletions(-)
diff --git a/arch/arm64/include/asm/arm-smmu-v3-common.h b/arch/arm64/include/asm/arm-smmu-v3-common.h
index f2fbd286f674..2578c8e9202e 100644
--- a/arch/arm64/include/asm/arm-smmu-v3-common.h
+++ b/arch/arm64/include/asm/arm-smmu-v3-common.h
@@ -573,4 +573,20 @@ struct arm_smmu_cmdq_ent {
};
};
+#define Q_OVERFLOW_FLAG (1U << 31)
+#define Q_OVF(p) ((p) & Q_OVERFLOW_FLAG)
+
+/*
+ * This is used to size the command queue and therefore must be at least
+ * BITS_PER_LONG so that the valid_map works correctly (it relies on the
+ * total number of queue entries being a multiple of BITS_PER_LONG).
+ */
+#define CMDQ_BATCH_ENTRIES BITS_PER_LONG
+
+struct arm_smmu_cmdq_batch {
+ u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
+ struct arm_smmu_cmdq *cmdq;
+ int num;
+};
+
#endif /* _ARM_SMMU_V3_COMMON_H */
diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
index 60f0760f49eb..62760136c6fb 100644
--- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
+++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
@@ -96,12 +96,20 @@ static void smmu_reclaim_pages(u64 phys, size_t size)
#define Q_WRAP(smmu, reg) ((reg) & (1 << (smmu)->cmdq_log2size))
#define Q_IDX(smmu, reg) ((reg) & ((1 << (smmu)->cmdq_log2size) - 1))
-static bool smmu_cmdq_full(struct hyp_arm_smmu_v3_device *smmu)
+static bool smmu_cmdq_has_space(struct hyp_arm_smmu_v3_device *smmu, u32 n)
{
- u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
+ u64 smmu_cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
+ u32 space, prod, cons;
- return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
- Q_WRAP(smmu, smmu->cmdq_prod) != Q_WRAP(smmu, cons);
+ prod = Q_IDX(smmu, smmu->cmdq_prod);
+ cons = Q_IDX(smmu, smmu_cons);
+
+ if (Q_WRAP(smmu, smmu->cmdq_prod) == Q_WRAP(smmu, smmu_cons))
+ space = (1 << smmu->cmdq_log2size) - (prod - cons);
+ else
+ space = cons - prod;
+
+ return space >= n;
}
static bool smmu_cmdq_empty(struct hyp_arm_smmu_v3_device *smmu)
@@ -112,22 +120,8 @@ static bool smmu_cmdq_empty(struct hyp_arm_smmu_v3_device *smmu)
Q_WRAP(smmu, smmu->cmdq_prod) == Q_WRAP(smmu, cons);
}
-static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
- struct arm_smmu_cmdq_ent *ent)
+static int smmu_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
- int i;
- int ret;
- u64 cmd[CMDQ_ENT_DWORDS] = {};
- int idx = Q_IDX(smmu, smmu->cmdq_prod);
- u64 *slot = smmu->cmdq_base + idx * CMDQ_ENT_DWORDS;
-
- if (smmu->iommu.power_is_off)
- return -EPIPE;
-
- ret = smmu_wait_event(smmu, !smmu_cmdq_full(smmu));
- if (ret)
- return ret;
-
cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
switch (ent->opcode) {
@@ -175,15 +169,49 @@ static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
return -EINVAL;
}
- for (i = 0; i < CMDQ_ENT_DWORDS; i++)
- slot[i] = cpu_to_le64(cmd[i]);
+ return 0;
+}
+
+static int smmu_issue_cmds(struct hyp_arm_smmu_v3_device *smmu,
+ u64 *cmds, int n)
+{
+ int idx = Q_IDX(smmu, smmu->cmdq_prod);
+ u64 *slot = smmu->cmdq_base + idx * CMDQ_ENT_DWORDS;
+ int i;
+ int ret;
+ u32 prod;
+
+ if (smmu->iommu.power_is_off)
+ return -EPIPE;
+
+ ret = smmu_wait_event(smmu, smmu_cmdq_has_space(smmu, n));
+ if (ret)
+ return ret;
+
+ for (i = 0; i < CMDQ_ENT_DWORDS * n; i++)
+ slot[i] = cpu_to_le64(cmds[i]);
+
+ prod = (Q_WRAP(smmu, smmu->cmdq_prod) | Q_IDX(smmu, smmu->cmdq_prod)) + n;
+ smmu->cmdq_prod = Q_OVF(smmu->cmdq_prod) | Q_WRAP(smmu, prod) | Q_IDX(smmu, prod);
- smmu->cmdq_prod++;
writel(Q_IDX(smmu, smmu->cmdq_prod) | Q_WRAP(smmu, smmu->cmdq_prod),
smmu->base + ARM_SMMU_CMDQ_PROD);
return 0;
}
+static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_ent *ent)
+{
+ u64 cmd[CMDQ_ENT_DWORDS] = {};
+ int ret;
+
+ ret = smmu_build_cmd(cmd, ent);
+ if (ret)
+ return ret;
+
+ return smmu_issue_cmds(smmu, cmd, 1);
+}
+
static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
{
int ret;
@@ -685,6 +713,23 @@ static void smmu_tlb_flush_all(void *cookie)
kvm_iommu_unlock(&smmu->iommu);
}
+static void smmu_cmdq_batch_add(struct hyp_arm_smmu_v3_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds,
+ struct arm_smmu_cmdq_ent *cmd)
+{
+ int index;
+
+ if (cmds->num == CMDQ_BATCH_ENTRIES) {
+ smmu_issue_cmds(smmu, cmds->cmds, cmds->num);
+ cmds->num = 0;
+ }
+
+ index = cmds->num * CMDQ_ENT_DWORDS;
+ smmu_build_cmd(&cmds->cmds[index], cmd);
+
+ cmds->num++;
+}
+
static int smmu_tlb_inv_range_smmu(struct hyp_arm_smmu_v3_device *smmu,
struct kvm_hyp_iommu_domain *domain,
struct arm_smmu_cmdq_ent *cmd,
@@ -694,6 +739,7 @@ static int smmu_tlb_inv_range_smmu(struct hyp_arm_smmu_v3_device *smmu,
unsigned long end = iova + size, num_pages = 0, tg = 0;
size_t inv_range = granule;
struct hyp_arm_smmu_v3_domain *smmu_domain = domain->priv;
+ struct arm_smmu_cmdq_batch cmds;
kvm_iommu_lock(&smmu->iommu);
if (smmu->iommu.power_is_off)
@@ -723,6 +769,8 @@ static int smmu_tlb_inv_range_smmu(struct hyp_arm_smmu_v3_device *smmu,
num_pages++;
}
+ cmds.num = 0;
+
while (iova < end) {
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
/*
@@ -749,11 +797,12 @@ static int smmu_tlb_inv_range_smmu(struct hyp_arm_smmu_v3_device *smmu,
num_pages -= num << scale;
}
cmd->tlbi.addr = iova;
- WARN_ON(smmu_add_cmd(smmu, cmd));
+ smmu_cmdq_batch_add(smmu, &cmds, cmd);
BUG_ON(iova + inv_range < iova);
iova += inv_range;
}
+ WARN_ON(smmu_issue_cmds(smmu, cmds.cmds, cmds.num));
ret = smmu_sync_cmd(smmu);
out_ret:
kvm_iommu_unlock(&smmu->iommu);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index d91dfe55835d..18f878bb7f98 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -20,8 +20,6 @@ struct arm_smmu_device;
#define Q_IDX(llq, p) ((p) & ((1 << (llq)->max_n_shift) - 1))
#define Q_WRP(llq, p) ((p) & (1 << (llq)->max_n_shift))
-#define Q_OVERFLOW_FLAG (1U << 31)
-#define Q_OVF(p) ((p) & Q_OVERFLOW_FLAG)
#define Q_ENT(q, p) ((q)->base + \
Q_IDX(&((q)->llq), p) * \
(q)->ent_dwords)
@@ -35,13 +33,6 @@ struct arm_smmu_device;
#define CMDQ_PROD_OWNED_FLAG Q_OVERFLOW_FLAG
-/*
- * This is used to size the command queue and therefore must be at least
- * BITS_PER_LONG so that the valid_map works correctly (it relies on the
- * total number of queue entries being a multiple of BITS_PER_LONG).
- */
-#define CMDQ_BATCH_ENTRIES BITS_PER_LONG
-
/* High-level queue structures */
#define ARM_SMMU_POLL_TIMEOUT_US 1000000 /* 1s! */
#define ARM_SMMU_POLL_SPIN_COUNT 10
@@ -100,12 +91,6 @@ static inline bool arm_smmu_cmdq_supports_cmd(struct arm_smmu_cmdq *cmdq,
return cmdq->supports_cmd ? cmdq->supports_cmd(ent) : true;
}
-struct arm_smmu_cmdq_batch {
- u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
- struct arm_smmu_cmdq *cmdq;
- int num;
-};
-
struct arm_smmu_evtq {
struct arm_smmu_queue q;
struct iopf_queue *iopf;
--
2.47.0.338.g60cca15819-goog
^ permalink raw reply related [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
` (57 preceding siblings ...)
2024-12-12 18:04 ` [RFC PATCH v2 58/58] iommu/arm-smmu-v3-kvm: Support command queue batching Mostafa Saleh
@ 2024-12-12 19:41 ` Jason Gunthorpe
2024-12-13 19:39 ` Mostafa Saleh
58 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2024-12-12 19:41 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:
> This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> host SMMU driver into nVHE isn't really an option. It is too large and
> complex and requires infrastructure from all over the kernel. We add a
> reduced nVHE driver that deals with populating the SMMU tables and the
> command queue, and the host driver still deals with probing and some
> initialization.
The cover letter doesn't explain why someone needs page tables in the
guest at all?
If you are able to implement nested support then you can boot the
guest with no-iommu and an effective identity translation through a
hypervisor controlled S2. ie no guest map/unmap. Great DMA
performance.
I thought the point of doing the paravirt here was to allow dynamic
pinning of the guest memory? This is the primary downside with nested.
The entire guest memory has to be pinned down at guest boot.
> 1. Paravirtual I/O page tables
> This is the solution implemented in this series. The host creates
> IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> the hypervisor populates the page tables. Page tables are abstracted into
> IOMMU domains, which allow multiple devices to share the same address
> space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> and free_domain(), manage the domains, the semantics of those hypercalls
> are almost identical to the IOMMU ops which make the kernel driver part
> simpler.
That is re-inventing virtio-iommu. I don't really understand why this
series is hacking up arm-smmuv3 so much, that is not, and should not,
be a paravirt driver. Why not create a clean new pkvm specific driver
for the paravirt?? Or find a way to re-use parts of virtio-iommu?
Shouldn't other arch versions of pkvm be able to re-use the same guest
iommu driver?
> b- Locking: The io-pgtable-arm is lockless under some guarantees of how
> the IOMMU code behaves. However with pKVM, the kernel is not trusted
> and a malicious kernel can issue concurrent requests causing memory
> corruption or UAF, so that it has to be locked in the hypervisor.
? I don't get it, the hypervisor page table has to be private to the
hypervisor. It is not that io-pgtable-arm is lockless, it is that it
relies on a particular kind of caller supplied locking. pkvm's calls
into its private io-pgtable-arm would need pkvm specific locking that
makes sense for it. Where does a malicious guest kernel get into this?
> 2. Nested SMMUv3 translation (with emulation)
> Another approach is to rely on nested translation support which is
> optional in SMMUv3, that requires an architecturally accurate emulation
> of SMMUv3 which can be complicated including cmdq emulation.
The confidential compute folks are going in this direction.
> The trade off between the 2 approaches can be roughly summarised as:
> Paravirtualization:
> - Compatible with more HW (and IOMMUs).
> - Better DMA performance due to shorter table walks/less TLB pressure
> - Needs extra complexity to squeeze the last bit of optimization (around
> unmap, and map_sg).
It has better straight line DMA performance if the DMAs are all
static. Generally much, much worse performance if the DMAs are
dynamically mapped as you have to trap so much stuff.
The other negative is there is no way to get SVA support with
para-virtualization.
The positive is you don't have to pin the VM's memory.
> Nested Emulation
> - Faster map_pages (not sure about unmap because it requires cmdq
> emulation for TLB invalidation if DVM not used).
If you can do nested then you can run in identity mode and then you
don't have any performance down side. It is a complete win.
Even if you do non-identity, nested is still likely faster for changing
translation than paravirt approaches. A single cmdq range invalidate
should be about the same broad overhead as a single paravirt call to
unmap except they can be batched under load.
Things like vCMDQ eliminate this overhead entirely, to my mind that is
the future direction of this HW as you obviously need to HW optimize
invalidation...
> - Needs extra complexity for architecturally emulating SMMUv3.
Lots of people have now done this, it is not really so bad. In
exchange you get a full architected feature set, better performance,
and are ready for HW optimizations.
> - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> didn’t want to complicate this series, I can send them separately.
This seems kind of pointless to me. If you can tolerate identity (ie
pin all memory) then do nested, and maybe don't even bother with a
guest iommu.
If you want most of the guest memory to be swappable/movable/whatever
then paravirt is the only choice, and you really don't want the guest
to have any identity support at all.
Really, I think you'd want to have both options, there is no "best"
here. It depends what people want to use the VM for.
My advice for merging would be to start with the pkvm side setting up
a fully pinned S2 and do not have a guest driver. Nesting without
emulating smmuv3. Basically you get protected identity DMA support. I
think that would be a much less sprawling patch series. From there it
would be well positioned to add both smmuv3 emulation and a paravirt
iommu flow.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
2024-12-12 18:04 ` [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys Mostafa Saleh
@ 2024-12-12 19:44 ` Jason Gunthorpe
2024-12-13 19:48 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2024-12-12 19:44 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
On Thu, Dec 12, 2024 at 06:04:04PM +0000, Mostafa Saleh wrote:
> Add map_pages and iova_to_phys HVC code, which
> mainly calls the io-pgtable.
>
> For unmap_pages, we rely on IO_PGTABLE_QUIRK_UNMAP_INVAL, where the
> driver first calls unmap_pages which invalidate all the pages as
> a typical unmap, issuing all the necessary TLB invalidations.
> Then, we will start a page table with 2 callbacks:
> - visit_leaf: for each unmapped leaf, it would decrement the refcount
> of the page using __pkvm_host_unuse_dma(), reversing the what IOMMU
> core does in map.
> - visit_post_table: this would free any invalidated tables as they
> wouldn't be freed because of the quirk.
I don't know if the timelines will work out, but the pagetable stuff
I'm working on will let you write a much more appropriate
implementation for pkvm's usage than trying to hack it into the
iopgtable code like this.
Even the iommu focused routines I have got now would solve this
problem because they always spit out a linked list of all the memory
to free after map/unmap and never internally free it.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2024-12-12 19:41 ` [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Jason Gunthorpe
@ 2024-12-13 19:39 ` Mostafa Saleh
2025-01-02 20:16 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-13 19:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
Hi Jason,
Thanks a lot for taking the time to review this; I tried to reply to all
points. However, I think the main source of confusion is that this is only
for the host kernel, not guests: with this series, guests still have no
access to DMA under pKVM. I hope that clarifies some of the points.
On Thu, Dec 12, 2024 at 03:41:19PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:
>
> > This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> > hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> > host SMMU driver into nVHE isn't really an option. It is too large and
> > complex and requires infrastructure from all over the kernel. We add a
> > reduced nVHE driver that deals with populating the SMMU tables and the
> > command queue, and the host driver still deals with probing and some
> > initialization.
>
> The cover letter doesn't explain why someone needs page tables in the
> guest at all?
This is not for guests but for the host: the hypervisor needs to
establish DMA isolation between the host and the hypervisor/guests.
Before these patches, as mentioned, the host can program a DMA-capable
device to read/write any memory (which has nothing to do with whether a
guest has DMA access or not).
So it's mandatory for pKVM to establish DMA isolation, otherwise
it can be easily defeated.
However, guest DMA support is optional and only needed for device
passthrough. I have some patches to support that in pKVM as well (only
with vfio-platform), but they are unlikely to be posted upstream before
a host DMA isolation solution is merged, as that part is mandatory.
>
> If you are able to implement nested support then you can boot the
> guest with no-iommu and an effective identity translation through a
> hypervisor controlled S2. ie no guest map/unmap. Great DMA
> performance.
We can do that for the host also, which is discussed in the v1 cover
letter. However, we try to keep feature parity with the normal (VHE)
KVM arm64 support, so constraining KVM support to exclude IOVA spaces
for devices seems too restrictive and impractical on modern systems
(phones, for example).
>
> I thought the point of doing the paravirt here was to allow dynamic
> pinning of the guest memory? This is the primary downside with nested.
> The entire guest memory has to be pinned down at guest boot.
As this is for the host, memory pinning is not really an issue (however,
with nesting and a shared CPU stage-2 there are other challenges, as
mentioned).
>
> > 1. Paravirtual I/O page tables
> > This is the solution implemented in this series. The host creates
> > IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> > the hypervisor populates the page tables. Page tables are abstracted into
> > IOMMU domains, which allow multiple devices to share the same address
> > space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> > and free_domain(), manage the domains, the semantics of those hypercalls
> > are almost identical to the IOMMU ops which make the kernel driver part
> > simpler.
>
> That is re-inventing virtio-iommu. I don't really understand why this
> series is hacking up arm-smmuv3 so much, that is not, and should not,
> be a paravirt driver. Why not create a clean new pkvm specific driver
> for the paravirt?? Or find a way to re-use parts of virtio-iommu?
>
> Shouldn't other arch versions of pkvm be able to re-use the same guest
> iommu driver?
As mentioned, this is for the host kernel, not the guest. However, the
hypervisor/kernel interface is not IOMMU specific, and it can be extended
to other IOMMUs/archs.
There is no hacking of the arm-smmu-v3 driver; it is mostly a split of
the driver so it can be re-used, plus the introduction of a separate
hypervisor driver. It's similar to how SVA re-uses part of the driver,
just on a bigger scale.
>
> > b- Locking: The io-pgtable-arm is lockless under some guarantees of how
> > the IOMMU code behaves. However with pKVM, the kernel is not trusted
> > and a malicious kernel can issue concurrent requests causing memory
> > corruption or UAF, so that it has to be locked in the hypervisor.
>
> ? I don't get it, the hypervisor page table has to be private to the
> hypervisor. It is not that io-pgtable-arm is lockless, it is that it
> relies on a particular kind of caller supplied locking. pkvm's calls
> into its private io-pgtable-arm would need pkvm specific locking that
> makes sense for it. Where does a malicious guest kernel get into this?
At the moment, when the kernel driver uses io-pgtable-arm, it doesn't
protect it with any locks, relying on assumptions about the callers. For
example, concurrently unmapping a block-sized table and a page-sized
leaf inside it can cause a UAF, but the DMA API never does that.
With pKVM, the host kernel is not trusted, and if compromised it can
mount such attacks to corrupt hypervisor memory, so the hypervisor has
to lock io-pgtable-arm operations at EL2 to avoid that.
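To illustrate the point, a minimal hypothetical sketch (not code from
this series; the per-domain 'lock' and 'pgtable' fields are assumed):

/*
 * Hypothetical sketch: the hypervisor serializes io-pgtable-arm walks
 * itself, since the host cannot be trusted to provide the locking that
 * the DMA API normally guarantees.
 */
static size_t hyp_smmu_unmap_pages(struct kvm_hyp_iommu_domain *domain,
				   unsigned long iova, size_t pgsize,
				   size_t pgcount)
{
	struct io_pgtable_ops *ops = domain->pgtable;
	size_t unmapped;

	hyp_spin_lock(&domain->lock);
	unmapped = ops->unmap_pages(ops, iova, pgsize, pgcount, NULL);
	hyp_spin_unlock(&domain->lock);

	return unmapped;
}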
>
> > 2. Nested SMMUv3 translation (with emulation)
> > Another approach is to rely on nested translation support which is
> > optional in SMMUv3, that requires an architecturally accurate emulation
> > of SMMUv3 which can be complicated including cmdq emulation.
>
> The confidential compute folks are going in this direction.
I see, but one key advantage for pKVM is that it requires minimal
hardware: with the paravirtual approach we can support single-stage
SMMUv3 or even non-architected IOMMUs. That, plus the DMA performance,
might give it a slight edge, but as I mentioned I plan to do a more
thorough comparison with nesting and maybe discuss it at a conference
this year.
>
> > The trade off between the 2 approaches can be roughly summarised as:
> > Paravirtualization:
> > - Compatible with more HW (and IOMMUs).
> > - Better DMA performance due to shorter table walks/less TLB pressure
> > - Needs extra complexity to squeeze the last bit of optimization (around
> > unmap, and map_sg).
>
> It has better straight line DMA performance if the DMAs are all
> static. Generally much, much worse performance if the DMAs are
> dynamically mapped as you have to trap so much stuff.
I agree it's not that clear; I will finish the nested implementation
and run some standard IO benchmarks.
>
> The other negative is there is no way to get SVA support with
> para-virtualization.
>
Yeah, SVA is tricky, I guess for that we would have to use nesting,
but tbh, I don’t think it’s a deal breaker for now.
> The positive is you don't have to pin the VM's memory.
>
> > Nested Emulation
> > - Faster map_pages (not sure about unmap because it requires cmdq
> > emulation for TLB invalidation if DVM not used).
>
> If you can do nested then you can run in identity mode and then you
> don't have any performance down side. It is a complete win.
Unfortunately, as mentioned above, it's not that practical: many devices
in the mobile space expect IO translation capability.
>
> Even if you do non-identity, nested is still likely faster for changing
> translation than paravirt approaches. A single cmdq range invalidate
> should be about the same broad overhead as a single paravirt call to
> unmap except they can be batched under load.
>
> Things like vCMDQ eliminate this overhead entirely, to my mind that is
> the future direction of this HW as you obviously need to HW optimize
> invalidation...
>
> > - Needs extra complexity for architecturally emulating SMMUv3.
>
> Lots of people have now done this, it is not really so bad. In
> exchange you get a full architected feature set, better performance,
> and are ready for HW optimizations.
It's not impossible, it's just more complicated to do in the
hypervisor, which has limited features compared to the kernel. Also, I
haven't seen any open-source implementation of that except for QEMU,
which is in userspace.
>
> > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > didn’t want to complicate this series, I can send them separately.
>
> This seems kind of pointless to me. If you can tolerate identity (ie
> pin all memory) then do nested, and maybe don't even bother with a
> guest iommu.
As mentioned, the choice of para-virt was not only to avoid pinning.
As this is the host, for IDENTITY_DOMAIN we either share the CPU page
table, and then have to deal with lazy mapping (SMMU features, BBM...),
or mirror it in a shadow, SMMU-only identity page table.
>
> If you want most of the guest memory to be swappable/movable/whatever
> then paravirt is the only choice, and you really don't want the guest
> to have any identity support at all.
>
> Really, I think you'd want to have both options, there is no "best"
> here. It depends what people want to use the VM for.
>
> My advice for merging would be to start with the pkvm side setting up
> a fully pinned S2 and do not have a guest driver. Nesting without
> emulating smmuv3. Basically you get protected identity DMA support. I
> think that would be a much less sprawling patch series. From there it
> would be well positioned to add both smmuv3 emulation and a paravirt
> iommu flow.
>
I am open to any suggestions, but I believe any solution considered for
merge, should have enough features to be usable on actual systems (translating
IOMMU can be used for example) so either para-virt as this series or full
nesting as the PoC above (or maybe both?), which IMO comes down to the
trade-off mentioned above.
Thanks,
Mostafa
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
2024-12-12 19:44 ` Jason Gunthorpe
@ 2024-12-13 19:48 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-13 19:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
Hi Jason,
On Thu, Dec 12, 2024 at 03:44:35PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 12, 2024 at 06:04:04PM +0000, Mostafa Saleh wrote:
> > Add map_pages and iova_to_phys HVC code, which
> > mainly calls the io-pgtable.
> >
> > For unmap_pages, we rely on IO_PGTABLE_QUIRK_UNMAP_INVAL, where the
> > driver first calls unmap_pages which invalidate all the pages as
> > a typical unmap, issuing all the necessary TLB invalidations.
> > Then, we will start a page table with 2 callbacks:
> > - visit_leaf: for each unmapped leaf, it would decrement the refcount
> > of the page using __pkvm_host_unuse_dma(), reversing the what IOMMU
> > core does in map.
> > - visit_post_table: this would free any invalidated tables as they
> > wouldn't be freed because of the quirk.
>
> I don't know if the timelines will work out, but the pagetable stuff
> I'm working on will let you write a much more appropriate
> implementation for pkvm's usage than trying to hack it into the
> iopgtable code like this.
I haven't checked your new page table patches yet, but I would say it's
more likely your patches will land first because, as mentioned in the
cover letter, there are still many dependencies for pKVM before IOMMU
support lands, so I don't mind converging if possible.
Thanks,
Mostafa
>
> Even the iommu focused routines I have got now would solve this
> problem because they always spit out a linked list of all the memory
> to free after map/unmap and never internally free it.
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
2024-12-12 18:03 ` [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page() Mostafa Saleh
@ 2024-12-19 11:10 ` Quentin Perret
2024-12-19 11:19 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Quentin Perret @ 2024-12-19 11:10 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, jgg, nicolinc,
vdonnefort, tabba, danielmentz, tzukui
On Thursday 12 Dec 2024 at 18:03:45 (+0000), Mostafa Saleh wrote:
> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> +/*
> + * Temporarily unmap a page from the host stage-2, if @remove is true, or put it
> + * back. After restoring the ownership to host, the page will be lazy-mapped.
> + */
> +int __pkvm_host_add_remove_page(u64 pfn, bool remove)
> +{
> + int ret;
> + u64 host_addr = hyp_pfn_to_phys(pfn);
> + u8 owner = remove ? PKVM_ID_HYP : PKVM_ID_HOST;
> +
> + host_lock_component();
> + ret = host_stage2_set_owner_locked(host_addr, PAGE_SIZE, owner);
Any reason why this can't be expressed using __pkvm_host_donate_hyp()
and __pkvm_hyp_donate_host()?
This doesn't check any state, so it feels like a dangerous primitive to
have. Is the issue the overhead of mapping/unmapping into EL2 stage-1?
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay()
2024-12-12 18:03 ` [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay() Mostafa Saleh
@ 2024-12-19 11:14 ` Quentin Perret
2024-12-19 11:21 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Quentin Perret @ 2024-12-19 11:14 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, jgg, nicolinc,
vdonnefort, tabba, danielmentz, tzukui
On Thursday 12 Dec 2024 at 18:03:35 (+0000), Mostafa Saleh wrote:
> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>
> Add a simple delay loop for drivers.
>
> This could use more work. It should be possible to insert a wfe and save
> power, but I haven't studied whether it is safe to do so with the host
> in control of the event stream. The SMMU driver will use wfe anyway for
> frequent waits (provided the implementation can send command queue
> events).
Mooh, I'm thoroughly hating that we need this -- pKVM is non preemptible
so we better not wait for too long.
I can surely figure it out from the following patches, but could you
please expand on the use-case?
On a side note I'm not too worried about the power impact of not having
a wfe in there, again we better not be spinning for long enough that
power starts to be noticeable.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
2024-12-19 11:10 ` Quentin Perret
@ 2024-12-19 11:19 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-19 11:19 UTC (permalink / raw)
To: Quentin Perret
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, jgg, nicolinc,
vdonnefort, tabba, danielmentz, tzukui
On Thu, Dec 19, 2024 at 11:10:23AM +0000, Quentin Perret wrote:
> On Thursday 12 Dec 2024 at 18:03:45 (+0000), Mostafa Saleh wrote:
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > +/*
> > + * Temporarily unmap a page from the host stage-2, if @remove is true, or put it
> > + * back. After restoring the ownership to host, the page will be lazy-mapped.
> > + */
> > +int __pkvm_host_add_remove_page(u64 pfn, bool remove)
> > +{
> > + int ret;
> > + u64 host_addr = hyp_pfn_to_phys(pfn);
> > + u8 owner = remove ? PKVM_ID_HYP : PKVM_ID_HOST;
> > +
> > + host_lock_component();
> > + ret = host_stage2_set_owner_locked(host_addr, PAGE_SIZE, owner);
>
> Any reason why this can't be expressed using __pkvm_host_donate_hyp()
> and __pkvm_hyp_donate_host()?
That makes more sense, I will fix it.
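For reference, something along these lines (a rough sketch only, not the
final patch, reusing the existing donate helpers):

/*
 * Rough sketch of the suggested rework: reuse the existing ownership
 * transitions instead of setting the owner directly. Note that the
 * donate helpers also map/unmap the page in the EL2 stage-1.
 */
int __pkvm_host_add_remove_page(u64 pfn, bool remove)
{
	if (remove)
		return __pkvm_host_donate_hyp(pfn, 1);

	return __pkvm_hyp_donate_host(pfn, 1);
}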
>
> This doesn't check any state, so it feels like a dangerous primitive to
> have. Is the issue the overhead of mapping/unmapping into EL2 stage-1?
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay()
2024-12-19 11:14 ` Quentin Perret
@ 2024-12-19 11:21 ` Mostafa Saleh
2024-12-19 11:28 ` Quentin Perret
0 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-19 11:21 UTC (permalink / raw)
To: Quentin Perret
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, jgg, nicolinc,
vdonnefort, tabba, danielmentz, tzukui
On Thu, Dec 19, 2024 at 11:14:23AM +0000, Quentin Perret wrote:
> On Thursday 12 Dec 2024 at 18:03:35 (+0000), Mostafa Saleh wrote:
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >
> > Add a simple delay loop for drivers.
> >
> > This could use more work. It should be possible to insert a wfe and save
> > power, but I haven't studied whether it is safe to do so with the host
> > in control of the event stream. The SMMU driver will use wfe anyway for
> > frequent waits (provided the implementation can send command queue
> > events).
>
> Mooh, I'm thoroughly hating that we need this -- pKVM is non preemptible
> so we better not wait for too long.
>
> I can surely figure it out from the following patches, but could you
> please expand on the use-case?
The driver needs to poll some SMMU MMIO, so it needs to measure time
in terms of udelay to time out; at the moment it's arbitrarily set to 100ms.
Thanks,
Mostafa
>
> On a side note I'm not too worried about the power impact of not having
> a wfe in there, again we better not be spinning for long enough that
> power starts to be noticeable.
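To illustrate the use-case discussed above, a minimal sketch of the kind of
MMIO poll loop the hypervisor SMMU driver needs, assuming a
pkvm_udelay(unsigned long usecs) prototype; the helper shape and register
access are made up, only the 100ms budget comes from the discussion:

    /* Hypothetical: wait for @mask bits to clear in an SMMU register. */
    static int smmu_wait_for_clear(void __iomem *reg, u32 mask)
    {
            unsigned long remaining_us = 100 * 1000;    /* 100ms budget */

            while (readl_relaxed(reg) & mask) {
                    if (!remaining_us--)
                            return -ETIMEDOUT;
                    pkvm_udelay(1);
            }
            return 0;
    }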
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 12/58] KVM: arm64: Add __pkvm_{use, unuse}_dma()
2024-12-12 18:03 ` [RFC PATCH v2 12/58] KVM: arm64: Add __pkvm_{use, unuse}_dma() Mostafa Saleh
@ 2024-12-19 11:23 ` Quentin Perret
0 siblings, 0 replies; 97+ messages in thread
From: Quentin Perret @ 2024-12-19 11:23 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, jgg, nicolinc,
vdonnefort, tabba, danielmentz, tzukui
On Thursday 12 Dec 2024 at 18:03:36 (+0000), Mostafa Saleh wrote:
> +int __pkvm_host_unuse_dma(phys_addr_t phys_addr, size_t size)
> +{
> + int i;
> + size_t nr_pages = size >> PAGE_SHIFT;
> +
> + if (WARN_ON(!PAGE_ALIGNED(phys_addr | size)))
> + return -EINVAL;
> +
> + host_lock_component();
> + if (!range_is_memory(phys_addr, phys_addr + size))
> + goto out_ret;
Nit: I think you can do this out of the critical section -- the memblock
list is guaranteed stable by construction.
> + /*
> + * We end up here after the caller successfully unmapped the page from
> + * the IOMMU table. Which means that a ref is held, the page is shared
> + * in the host s2, there can be no failure.
> + */
> + for (i = 0; i < nr_pages; i++)
> + __pkvm_host_unuse_dma_page(phys_addr + i * PAGE_SIZE);
> +
> +out_ret:
> + host_unlock_component();
> + return 0;
> +}
> +
> int __pkvm_host_share_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu,
> enum kvm_pgtable_prot prot)
> {
> --
> 2.47.0.338.g60cca15819-goog
>
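For illustration, a sketch of the function with the nit above applied, i.e.
the memblock check done before taking the host lock (based only on the hunk
quoted above, not the actual respin):

    int __pkvm_host_unuse_dma(phys_addr_t phys_addr, size_t size)
    {
            int i;
            size_t nr_pages = size >> PAGE_SHIFT;

            if (WARN_ON(!PAGE_ALIGNED(phys_addr | size)))
                    return -EINVAL;

            /* The memblock list is stable, so no need for the host lock here. */
            if (!range_is_memory(phys_addr, phys_addr + size))
                    return 0;

            host_lock_component();
            /*
             * The caller successfully unmapped the pages from the IOMMU table,
             * so a ref is held and the pages are shared in the host stage-2;
             * dropping the DMA refcounts cannot fail.
             */
            for (i = 0; i < nr_pages; i++)
                    __pkvm_host_unuse_dma_page(phys_addr + i * PAGE_SIZE);
            host_unlock_component();

            return 0;
    }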
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay()
2024-12-19 11:21 ` Mostafa Saleh
@ 2024-12-19 11:28 ` Quentin Perret
0 siblings, 0 replies; 97+ messages in thread
From: Quentin Perret @ 2024-12-19 11:28 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, jgg, nicolinc,
vdonnefort, tabba, danielmentz, tzukui
On Thursday 19 Dec 2024 at 11:21:40 (+0000), Mostafa Saleh wrote:
> On Thu, Dec 19, 2024 at 11:14:23AM +0000, Quentin Perret wrote:
> > On Thursday 12 Dec 2024 at 18:03:35 (+0000), Mostafa Saleh wrote:
> > > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > >
> > > Add a simple delay loop for drivers.
> > >
> > > This could use more work. It should be possible to insert a wfe and save
> > > power, but I haven't studied whether it is safe to do so with the host
> > > in control of the event stream. The SMMU driver will use wfe anyway for
> > > frequent waits (provided the implementation can send command queue
> > > events).
> >
> > Mooh, I'm thoroughly hating that we need this -- pKVM is non preemptible
> > so we better not wait for too long.
> >
> > I can surely figure it out from the following patches, but could you
> > please expand on the use-case?
>
> The driver needs to poll some SMMU MMIO, so it needs to measure time
> in terms of udelay to time out; at the moment it's arbitrarily set to 100ms.
OK, I'll look at the patches to see how hard it'd be to return to the
host for scheduling while this is happening. It's probably hard because
we're presumably in a funky state, but it's worth trying to figure it
out. Any EL2 section longer than 100us or so is problematic in my view,
so a 100*ms* timeout is scary!
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2024-12-12 18:04 ` [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations Mostafa Saleh
@ 2024-12-19 12:48 ` Robin Murphy
2024-12-19 14:24 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Robin Murphy @ 2024-12-19 12:48 UTC (permalink / raw)
To: Mostafa Saleh, iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, jean-philippe, jgg,
nicolinc, vdonnefort, qperret, tabba, danielmentz, tzukui
On 2024-12-12 6:04 pm, Mostafa Saleh wrote:
> The pKVM SMMUv3 driver, which para-virtualizes the IOMMU in the
> hypervisor, has extra overhead with map_sg, as it loops over
> iommu_map, and each map requires context switching, disabling
> interrupts...
>
> Instead, add new domain operations:
> - alloc_cookie_sg: Allocate a new sg deferred cookie
> - add_deferred_map_sg: Add a mapping to the cookie
> - consume_deferred_map_sg: Consume and release the cookie
>
> Alternatively, we can pass the sg list as is. However, this would
> duplicate some of the logic and it would make more sense to
> consolidate all the sg list parsing for IOMMU drivers in one place.
But why bother with fiddly overly-specific machinery at all when you can
already make ->map_pages asynchronous and consolidate the expensive part
into ->iotlb_sync_map in general, like s390 does?
Thanks,
Robin.
> virtio-iommu is another IOMMU that can benefit from this, but it
> would need to have a new operation that standardizes passing
> an sglist based on these ops.
>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++--
> include/linux/iommu.h | 19 ++++++++++++++++
> 2 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 83c8e617a2c5..3a3c48631dd6 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2608,6 +2608,37 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
> }
> EXPORT_SYMBOL_GPL(iommu_unmap_fast);
>
> +static int __iommu_add_sg(struct iommu_map_cookie_sg *cookie_sg,
> + unsigned long iova, phys_addr_t paddr, size_t size)
> +{
> + struct iommu_domain *domain = cookie_sg->domain;
> + const struct iommu_domain_ops *ops = domain->ops;
> + unsigned int min_pagesz;
> + size_t pgsize, count;
> +
> + if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
> + return -EINVAL;
> +
> + if (WARN_ON(domain->pgsize_bitmap == 0UL))
> + return -ENODEV;
> +
> + /* find out the minimum page size supported */
> + min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
> +
> + /*
> + * both the virtual address and the physical one, as well as
> + * the size of the mapping, must be aligned (at least) to the
> + * size of the smallest page supported by the hardware
> + */
> + if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
> + pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
> + iova, &paddr, size, min_pagesz);
> + return -EINVAL;
> + }
> + pgsize = iommu_pgsize(domain, iova, paddr, size, &count);
> + return ops->add_deferred_map_sg(cookie_sg, paddr, pgsize, count);
> +}
> +
> ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> struct scatterlist *sg, unsigned int nents, int prot,
> gfp_t gfp)
> @@ -2617,6 +2648,9 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> phys_addr_t start;
> unsigned int i = 0;
> int ret;
> + bool deferred_sg = ops->alloc_cookie_sg && ops->add_deferred_map_sg &&
> + ops->consume_deferred_map_sg;
> + struct iommu_map_cookie_sg *cookie_sg;
>
> might_sleep_if(gfpflags_allow_blocking(gfp));
>
> @@ -2625,12 +2659,24 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> __GFP_HIGHMEM)))
> return -EINVAL;
>
> + if (deferred_sg) {
> + cookie_sg = ops->alloc_cookie_sg(iova, prot, nents, gfp);
> + if (!cookie_sg) {
> + pr_err("iommu: failed alloc cookie\n");
> + return -ENOMEM;
> + }
> + cookie_sg->domain = domain;
> + }
> +
> while (i <= nents) {
> phys_addr_t s_phys = sg_phys(sg);
>
> if (len && s_phys != start + len) {
> - ret = __iommu_map(domain, iova + mapped, start,
> - len, prot, gfp);
> + if (deferred_sg)
> + ret = __iommu_add_sg(cookie_sg, iova + mapped, start, len);
> + else
> + ret = __iommu_map(domain, iova + mapped, start,
> + len, prot, gfp);
>
> if (ret)
> goto out_err;
> @@ -2654,6 +2700,9 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> sg = sg_next(sg);
> }
>
> + if (deferred_sg)
> + ops->consume_deferred_map_sg(cookie_sg);
> +
> if (ops->iotlb_sync_map) {
> ret = ops->iotlb_sync_map(domain, iova, mapped);
> if (ret)
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index c75877044185..5e60ac349228 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -601,6 +601,14 @@ struct iommu_ops {
> u8 user_pasid_table:1;
> };
>
> +/**
> + * struct iommu_map_cookie_sg - Cookie for a deferred map sg
> + * @domain: Domain for the sg list
> + */
> +struct iommu_map_cookie_sg {
> + struct iommu_domain *domain;
> +};
> +
> /**
> * struct iommu_domain_ops - domain specific operations
> * @attach_dev: attach an iommu domain to a device
> @@ -638,6 +646,11 @@ struct iommu_ops {
> * @enable_nesting: Enable nesting
> * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
> * @free: Release the domain after use.
> + * @alloc_cookie_sg: Allocate a cookie that would be used to create
> + * a sg list, filled from the next functions
> + * @add_deferred_map_sg: Add a mapping to a cookie of a sg list.
> + * @consume_deferred_map_sg: Consume the sg list as now all mappings are added,
> + * it should also release the cookie as it's not used.
> */
> struct iommu_domain_ops {
> int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> @@ -668,6 +681,12 @@ struct iommu_domain_ops {
> unsigned long quirks);
>
> void (*free)(struct iommu_domain *domain);
> +
> + struct iommu_map_cookie_sg *(*alloc_cookie_sg)(unsigned long iova, int prot,
> + unsigned int nents, gfp_t gfp);
> + int (*add_deferred_map_sg)(struct iommu_map_cookie_sg *cookie,
> + phys_addr_t paddr, size_t pgsize, size_t pgcount);
> + int (*consume_deferred_map_sg)(struct iommu_map_cookie_sg *cookie);
> };
>
> /**
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2024-12-19 12:48 ` Robin Murphy
@ 2024-12-19 14:24 ` Mostafa Saleh
2025-01-02 20:18 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2024-12-19 14:24 UTC (permalink / raw)
To: Robin Murphy
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, jean-philippe, jgg, nicolinc, vdonnefort,
qperret, tabba, danielmentz, tzukui
Hi Robin,
On Thu, Dec 19, 2024 at 12:48:27PM +0000, Robin Murphy wrote:
> On 2024-12-12 6:04 pm, Mostafa Saleh wrote:
> > The pKVM SMMUv3 driver, which para-virtualizes the IOMMU in the
> > hypervisor, has extra overhead with map_sg, as it loops over
> > iommu_map, and each map requires context switching, disabling
> > interrupts...
> >
> > Instead, add new domain operations:
> > - alloc_cookie_sg: Allocate a new sg deferred cookie
> > - add_deferred_map_sg: Add a mapping to the cookie
> > - consume_deferred_map_sg: Consume and release the cookie
> >
> > Alternatively, we can pass the sg list as is. However, this would
> > duplicate some of the logic and it would make more sense to
> > consolidate all the sg list parsing for IOMMU drivers in one place.
>
> But why bother with fiddly overly-specific machinery at all when you can
> already make ->map_pages asynchronous and consolidate the expensive part
> into ->iotlb_sync_map in general, like s390 does?
This was my initial idea too. But I believe there is not enough context in
iotlb_sync_map, so we either have to create a per-domain deferred_map list
which is synced on any iotlb_sync_map, but that would require locking the
map operation, hence impacting concurrency; or we have to use some complex
logic to extract context from iotlb_sync_map (something like an IOVA range
tree populated at map time and looked up on sync).
That’s why I proposed this approach, where the IOMMU subsystem by design is
aware of the semantics and “helps” by providing the right data structures/calls.
I had a quick look now at s390, and it seems a bit different as they only
notify the hypervisor about the IOVA range being changed, and don’t need
to provide the iova->paddr mapping, which pKVM does.
Thanks,
Mostafa
>
> Thanks,
> Robin.
>
> > virtio-iommu is another IOMMU that can benefit from this, but it
> > would need to have a new operation that standardizes passing
> > an sglist based on these ops.
> >
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > drivers/iommu/iommu.c | 53 +++++++++++++++++++++++++++++++++++++++++--
> > include/linux/iommu.h | 19 ++++++++++++++++
> > 2 files changed, 70 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 83c8e617a2c5..3a3c48631dd6 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -2608,6 +2608,37 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
> > }
> > EXPORT_SYMBOL_GPL(iommu_unmap_fast);
> > +static int __iommu_add_sg(struct iommu_map_cookie_sg *cookie_sg,
> > + unsigned long iova, phys_addr_t paddr, size_t size)
> > +{
> > + struct iommu_domain *domain = cookie_sg->domain;
> > + const struct iommu_domain_ops *ops = domain->ops;
> > + unsigned int min_pagesz;
> > + size_t pgsize, count;
> > +
> > + if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
> > + return -EINVAL;
> > +
> > + if (WARN_ON(domain->pgsize_bitmap == 0UL))
> > + return -ENODEV;
> > +
> > + /* find out the minimum page size supported */
> > + min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
> > +
> > + /*
> > + * both the virtual address and the physical one, as well as
> > + * the size of the mapping, must be aligned (at least) to the
> > + * size of the smallest page supported by the hardware
> > + */
> > + if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
> > + pr_err("unaligned: iova 0x%lx pa %pa size 0x%zx min_pagesz 0x%x\n",
> > + iova, &paddr, size, min_pagesz);
> > + return -EINVAL;
> > + }
> > + pgsize = iommu_pgsize(domain, iova, paddr, size, &count);
> > + return ops->add_deferred_map_sg(cookie_sg, paddr, pgsize, count);
> > +}
> > +
> > ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> > struct scatterlist *sg, unsigned int nents, int prot,
> > gfp_t gfp)
> > @@ -2617,6 +2648,9 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> > phys_addr_t start;
> > unsigned int i = 0;
> > int ret;
> > + bool deferred_sg = ops->alloc_cookie_sg && ops->add_deferred_map_sg &&
> > + ops->consume_deferred_map_sg;
> > + struct iommu_map_cookie_sg *cookie_sg;
> > might_sleep_if(gfpflags_allow_blocking(gfp));
> > @@ -2625,12 +2659,24 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> > __GFP_HIGHMEM)))
> > return -EINVAL;
> > + if (deferred_sg) {
> > + cookie_sg = ops->alloc_cookie_sg(iova, prot, nents, gfp);
> > + if (!cookie_sg) {
> > + pr_err("iommu: failed alloc cookie\n");
> > + return -ENOMEM;
> > + }
> > + cookie_sg->domain = domain;
> > + }
> > +
> > while (i <= nents) {
> > phys_addr_t s_phys = sg_phys(sg);
> > if (len && s_phys != start + len) {
> > - ret = __iommu_map(domain, iova + mapped, start,
> > - len, prot, gfp);
> > + if (deferred_sg)
> > + ret = __iommu_add_sg(cookie_sg, iova + mapped, start, len);
> > + else
> > + ret = __iommu_map(domain, iova + mapped, start,
> > + len, prot, gfp);
> > if (ret)
> > goto out_err;
> > @@ -2654,6 +2700,9 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> > sg = sg_next(sg);
> > }
> > + if (deferred_sg)
> > + ops->consume_deferred_map_sg(cookie_sg);
> > +
> > if (ops->iotlb_sync_map) {
> > ret = ops->iotlb_sync_map(domain, iova, mapped);
> > if (ret)
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index c75877044185..5e60ac349228 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -601,6 +601,14 @@ struct iommu_ops {
> > u8 user_pasid_table:1;
> > };
> > +/**
> > + * struct iommu_map_cookie_sg - Cookie for a deferred map sg
> > + * @domain: Domain for the sg list
> > + */
> > +struct iommu_map_cookie_sg {
> > + struct iommu_domain *domain;
> > +};
> > +
> > /**
> > * struct iommu_domain_ops - domain specific operations
> > * @attach_dev: attach an iommu domain to a device
> > @@ -638,6 +646,11 @@ struct iommu_ops {
> > * @enable_nesting: Enable nesting
> > * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
> > * @free: Release the domain after use.
> > + * @alloc_cookie_sg: Allocate a cookie that would be used to create
> > + * a sg list, filled from the next functions
> > + * @add_deferred_map_sg: Add a mapping to a cookie of a sg list.
> > + * @consume_deferred_map_sg: Consume the sg list as now all mappings are added,
> > + * it should also release the cookie as it's not used.
> > */
> > struct iommu_domain_ops {
> > int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> > @@ -668,6 +681,12 @@ struct iommu_domain_ops {
> > unsigned long quirks);
> > void (*free)(struct iommu_domain *domain);
> > +
> > + struct iommu_map_cookie_sg *(*alloc_cookie_sg)(unsigned long iova, int prot,
> > + unsigned int nents, gfp_t gfp);
> > + int (*add_deferred_map_sg)(struct iommu_map_cookie_sg *cookie,
> > + phys_addr_t paddr, size_t pgsize, size_t pgcount);
> > + int (*consume_deferred_map_sg)(struct iommu_map_cookie_sg *cookie);
> > };
> > /**
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2024-12-13 19:39 ` Mostafa Saleh
@ 2025-01-02 20:16 ` Jason Gunthorpe
2025-01-08 12:09 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-02 20:16 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> Thanks a lot for taking the time to review this, I tried to reply to all
> points. However I think a main source of confusion was that this is only
> for the host kernel not guests, with this series guests still have no
> access to DMA under pKVM. I hope that clarifies some of the points.
I think I just used different words, I meant the direct guest of pKVM,
including what you are calling the host kernel.
> > The cover letter doesn't explain why someone needs page tables in the
> > guest at all?
>
> This is not for guests but for the host, the hypervisor needs to
> establish DMA isolation between the host and the hypervisor/guests.
Why isn't this done directly in pkvm by setting up IOMMU tables that
identity map the host/guest's CPU mapping? Why does the host kernel or
guest kernel need to have page tables?
> However, guest DMA support is optional and only needed for device
> passthrough,
Why? The CC cases are having the pkvm layer control the translation,
so when the host spawns a guest the pkvm will setup a contained IOMMU
translation for that guest as well.
Don't you also want to protect the guests from the host in this model?
> We can do that for the host also, which is discussed in the v1 cover
> letter. However, we try to keep feature parity with the normal (VHE)
> KVM arm64 support, so constraining KVM support to not have IOVA spaces
> for devices seems too much and impractical on modern systems (phones for
> example).
But why? Do you have current use cases on phone where you need to have
device-specific iommu_domains? What are they? Answering this goes a
long way to understanding the real performance of a para virt approach.
> There is no hacking for the arm-smmu-v3 driver, but mostly splitting
> the driver so it can be re-used + introduction for a separate
> hypervisor
I understood splitting some of it so you could share code with the
pkvm side, but I don't see that it should be connected to the
host/guest driver. Surely that should be a generic pkvm-iommu driver
that is arch neutral, like virtio-iommu.
> With pKVM, the host kernel is not trusted, and if compromised it can
> instrument such attacks to corrupt hypervisor memory, so the hypervisor
> would lock io-pgtable-arm operations in EL2 to avoid that.
io-pgtable-arm has a particular set of locking assumptions, the caller
has to follow it. When pkvm converts the hypercalls for the
para-virtualization into io-pgtable-arm calls it has to also ensure it
follows io-pgtable-arm's locking model if it is going to use that as
its code base. This has nothing to do with the guest or trust, it is
just implementing concurrency correctly in pkvm..
> Yeah, SVA is tricky, I guess for that we would have to use nesting,
> but tbh, I don’t think it’s a deal breaker for now.
Again, it depends what your actual use case for translation is inside
the host/guest environments. It would be good to clearly spell this out..
There are few drivers that directly manipulate the iommu_domains of a
device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
of those are you targeting?
> > Lots of people have now done this, it is not really so bad. In
> > exchange you get a full architected feature set, better performance,
> > and are ready for HW optimizations.
>
> It’s not impossible, it’s just more complicated doing it in the
> hypervisor which has limited features compared to the kernel + I haven’t
> seen any open source implementation for that except for Qemu which is in
> userspace.
People are doing it in their CC stuff, which is about the same as
pkvm. I'm not sure if it will be open source, I hope so since it needs
security auditing..
> > > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > > didn’t want to complicate this series, I can send them separately.
> >
> > This seems kind of pointless to me. If you can tolerate identity (ie
> > pin all memory) then do nested, and maybe don't even bother with a
> > guest iommu.
>
> As mentioned, the choice for para-virt was not only to avoid pinning,
> as this is the host, for IDENTITY_DOMAIN we either share the page table,
> then we have to deal with lazy mapping (SMMU features, BBM...) or mirror
> the table in a shadow SMMU only identity page table.
AFAIK you always have to mirror unless you significantly change how
the KVM S1 page table stuff is working. The CC people have made those
changes and won't mirror, so it is doable..
> > My advice for merging would be to start with the pkvm side setting up
> > a fully pinned S2 and do not have a guest driver. Nesting without
> > emulating smmuv3. Basically you get protected identity DMA support. I
> > think that would be a much less sprawling patch series. From there it
> > would be well positioned to add both smmuv3 emulation and a paravirt
> > iommu flow.
>
> I am open to any suggestions, but I believe any solution considered for
> merge, should have enough features to be usable on actual systems (translating
> IOMMU can be used for example) so either para-virt as this series or full
> nesting as the PoC above (or maybe both?), which IMO comes down to the
> trade-off mentioned above.
IMHO no, you can have a completely usable solution without host/guest
controlled translation. This is equivalent to a bare metal system with
no IOMMU HW. This exists and is still broadly useful. The majority of
cloud VMs out there are in this configuration.
That is the simplest/smallest thing to start with. Adding host/guest
controlled translation is a build-on-top exercise that seems to have
a lot of options and people may end up wanting to do all of them.
I don't think you need to show that host/guest controlled translation
is possible to make progress, of course it is possible. Just getting
to the point where pkvm can own the SMMU HW and provide DMA isolation
between all of it's direct host/guest is a good step.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2024-12-19 14:24 ` Mostafa Saleh
@ 2025-01-02 20:18 ` Jason Gunthorpe
2025-01-03 15:35 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-02 20:18 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Robin Murphy, iommu, kvmarm, linux-kernel, linux-arm-kernel,
catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, jean-philippe,
nicolinc, vdonnefort, qperret, tabba, danielmentz, tzukui
On Thu, Dec 19, 2024 at 02:24:05PM +0000, Mostafa Saleh wrote:
> I had a quick look now at s390, and it seems a bit different as they only
> notify the hypervisor about the iova range being changed, and don’t need
> to provide iova->paddr mapping which pKVM does.
Can you explain this statement some more? It seems strange to me: why
would the pkvm side, which has its own page table, need to be told
about the iova->paddr during range unmapping requests - and why would
the host/guest side have to unnecessarily store this information?
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2025-01-02 20:18 ` Jason Gunthorpe
@ 2025-01-03 15:35 ` Mostafa Saleh
2025-01-03 15:47 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-03 15:35 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Robin Murphy, iommu, kvmarm, linux-kernel, linux-arm-kernel,
catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, jean-philippe,
nicolinc, vdonnefort, qperret, tabba, danielmentz, tzukui
On Thu, Jan 02, 2025 at 04:18:31PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2024 at 02:24:05PM +0000, Mostafa Saleh wrote:
> > I had a quick look now at s390, and it seems a bit different as they only
> > notify the hypervisor about the iova range being changed, and don’t need
> > to provide iova->paddr mapping which pKVM does.
>
> Can you explain this statement some more. It seems strange to me, why
> would the pkvm side, which has its own page table, need to be told
> about the iova->paddr during range unmapping requests - and why would
> the host/guest side have to unnecessarily store this information?
No, it doesn’t need to be told about the paddr on unmapping.
The problem is as follows:
- With the current IOMMU API, iommu_map_sg() ends up looping over
iommu_map(), which is very slow as each map is a hypercall and there is a
lot of context switching.
- So, we add a map_sg hypercall. Instead of consuming the kernel scatterlist
directly, which duplicates logic and can be complicated to do, a new iommu
operation add_deferred_map_sg() is added: in iommu_map_sg(), instead of
calling iommu_map(), it calls add_deferred_map_sg() to add a mapping, and 2
other new ops are added to create and consume an sg request which comes from
a single iommu_map_sg().
The drawback of this approach is that it adds 3 new iommu_ops with a bit of
niche semantics, which for now are only used by this driver.
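As a rough sketch of what the driver side of these ops could look like (the
op signatures are the ones from the patch; the cookie layout, the batch entry
and the kvm_iommu_hyp_map_sg() hypercall wrapper are hypothetical):

    /* Hypothetical batch entry and per-call cookie, not from the series. */
    struct kvm_iommu_sg_entry {
            phys_addr_t paddr;
            size_t pgsize;
            size_t pgcount;
    };

    struct kvm_iommu_map_cookie {
            struct iommu_map_cookie_sg core;     /* generic part, embedded */
            unsigned long iova;
            int prot;
            unsigned int nr, max;
            struct kvm_iommu_sg_entry *entries;
    };

    static struct iommu_map_cookie_sg *
    kvm_iommu_alloc_cookie_sg(unsigned long iova, int prot, unsigned int nents,
                              gfp_t gfp)
    {
            struct kvm_iommu_map_cookie *c = kzalloc(sizeof(*c), gfp);

            if (!c)
                    return NULL;
            c->entries = kcalloc(nents, sizeof(*c->entries), gfp);
            if (!c->entries) {
                    kfree(c);
                    return NULL;
            }
            c->iova = iova;
            c->prot = prot;
            c->max = nents;
            return &c->core;
    }

    static int kvm_iommu_add_deferred_map_sg(struct iommu_map_cookie_sg *cookie,
                                             phys_addr_t paddr, size_t pgsize,
                                             size_t pgcount)
    {
            struct kvm_iommu_map_cookie *c =
                    container_of(cookie, struct kvm_iommu_map_cookie, core);

            if (c->nr == c->max)
                    return -ENOSPC;
            c->entries[c->nr++] = (struct kvm_iommu_sg_entry) {
                    .paddr = paddr, .pgsize = pgsize, .pgcount = pgcount,
            };
            return 0;
    }

    static int kvm_iommu_consume_deferred_map_sg(struct iommu_map_cookie_sg *cookie)
    {
            struct kvm_iommu_map_cookie *c =
                    container_of(cookie, struct kvm_iommu_map_cookie, core);
            /* One hypercall maps the whole batch starting at c->iova. */
            int ret = kvm_iommu_hyp_map_sg(c->iova, c->entries, c->nr, c->prot);

            kfree(c->entries);
            kfree(c);
            return ret;
    }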
An alternative approach, as Robin suggested, is to treat every iommu_map as
an sg map and issue the hypercall only when the driver gets the
iotlb_sync_map() call. However, that call only provides the IOVA range, which
requires extra work or locking as mentioned in the thread. And while Robin
mentioned s390 doing something similar, I was highlighting that in their
driver this call only notifies the hypervisor about an IOVA range and not an
actual pv map as pKVM needs, so it is much simpler in their case.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2025-01-03 15:35 ` Mostafa Saleh
@ 2025-01-03 15:47 ` Jason Gunthorpe
2025-01-08 12:13 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-03 15:47 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Robin Murphy, iommu, kvmarm, linux-kernel, linux-arm-kernel,
catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, jean-philippe,
nicolinc, vdonnefort, qperret, tabba, danielmentz, tzukui
On Fri, Jan 03, 2025 at 03:35:20PM +0000, Mostafa Saleh wrote:
> An alternative approach as Robin suggested, is to treat all iommu_map as sg map,
> and when the driver gets the iotlb_sync_map() call it can just issue the hypercall,
> however this call only provides the IOVA range, which requires extra work or locking
> as mentioned in the thread, and as Robin mentioned s390 doing something similar,
> I was highlighting that in their driver, this call only notifies the hypervisor
> about an IOVA range and not an actual pv map as pKVM needs, so it is much simpler in their case.
Oh, that is much clearer, maybe incorporate some of that into the
commit message.
We are going in a general direction of trying to make the fast dma
mapping path not require sg, so adding sg specific optimizations to
the low level driver is not good.
Batching the hypercalls in the gather and then flushing to execute the
batch seems more reasonable.
You could probably work on advancing this infrastructure separately
via virtio-iommu..
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-02 20:16 ` Jason Gunthorpe
@ 2025-01-08 12:09 ` Mostafa Saleh
2025-01-16 6:39 ` Tian, Kevin
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-08 12:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > Thanks a lot for taking the time to review this, I tried to reply to all
> > points. However I think a main source of confusion was that this is only
> > for the host kernel not guests, with this series guests still have no
> > access to DMA under pKVM. I hope that clarifies some of the points.
>
> I think I just used different words, I meant the direct guest of pKVM,
> including what you are calling the host kernel.
>
KVM treats host/guests very differently, so I think the distinction
between both in this context is important, as this driver is for the
host only; guests are another story.
> > > The cover letter doesn't explain why someone needs page tables in the
> > > guest at all?
> >
> > This is not for guests but for the host, the hypervisor needs to
> > establish DMA isolation between the host and the hypervisor/guests.
>
> Why isn't this done directly in pkvm by setting up IOMMU tables that
> identity map the host/guest's CPU mapping? Why does the host kernel or
> guest kernel need to have page tables?
>
If we set up identity tables, that either means there is no translation
capability for the guest (or host here) or nesting should be used,
which is discussed later in this cover letter.
> > However, guest DMA support is optional and only needed for device
> > passthrough,
>
> Why? The CC cases are having the pkvm layer control the translation,
> so when the host spawns a guest the pkvm will setup a contained IOMMU
> translation for that guest as well.
>
> Don't you also want to protect the guests from the host in this model?
>
We do protect the guests from the host: in the proposed approach, by
preventing the host from mapping guest (or hypervisor) memory in the IOMMU
or donating memory currently mapped in the IOMMU.
However, at the moment pKVM doesn’t support device passthrough, so
guests don't need IOMMU page tables as they can’t use any device or issue
DMA directly.
I have some patches to support device passthrough in guests + guest
IOMMU page tables, which are not part of this series; as mentioned, host
DMA isolation is critical for the pKVM model, while guest device passthrough
is an optional feature (but we plan to upstream that later).
> > We can do that for the host also, which is discussed in the v1 cover
> > letter. However, we try to keep feature parity with the normal (VHE)
> > KVM arm64 support, so constraining KVM support to not have IOVA spaces
> > for devices seems too much and impractical on modern systems (phones for
> > example).
>
> But why? Do you have current use cases on phone where you need to have
> device-specific iommu_domains? What are they? Answering this goes a
> long way to understanding the real performance of a para virt approach.
>
I don’t think having one domain for all devices fits most cases; SoCs can have
heterogeneous SMMUs, different address sizes, coherency...
There is also the basic idea of isolation between devices, where some can be
controlled from userspace or influenced by external entities and should be
isolated (we wouldn't want USB/network devices to have access to other
devices' memory, for example).
Another example would be accelerators, which only operate on contiguous
memory; having such large physically contiguous buffers on phones is almost
impossible.
I don’t think having a single domain is practical (nor does it help in this case).
> > There is no hacking for the arm-smmu-v3 driver, but mostly splitting
> > the driver so it can be re-used + introduction for a separate
> > hypervisor
>
> I understood splitting some of it so you could share code with the
> pkvm side, but I don't see that it should be connected to the
> host/guest driver. Surely that should be a generic pkvm-iommu driver
> that is arch neutral, like virtio-iommu.
>
The host driver follows the KVM (nvhe/hvhe) model, where at boot the
kernel (EL1) does a lot of the initialization and then it becomes untrusted
and the hypervisor manages everything after.
Similarly, the driver first probes in EL1 and does much of the complicated
work that is not supported at the hypervisor (EL2), such as parsing firmware
tables, and ends up populating a simplified description of the SMMU topology.
The KVM <-> SMMU interface is not arch specific; you can check that
in hyp_main.c or nvhe/iommu.c, where there is no reference to the SMMU and all
hypercalls are abstracted so other IOMMUs can be supported under pKVM
(that’s the case in Android).
Maybe the EL1 driver can also be further split into a standard
part for the hypercall interface and an init part which is SMMUv3 specific,
but I’d rather not complicate things until we have other users upstream.
For guest VMs (not part of this series), the interface and the kernel driver
are completely arch agnostic, similarly to virtio-iommu.
> > With pKVM, the host kernel is not trusted, and if compromised it can
> > instrument such attacks to corrupt hypervisor memory, so the hypervisor
> > would lock io-pgtable-arm operations in EL2 to avoid that.
>
> io-pgtable-arm has a particular set of locking assumptions, the caller
> has to follow it. When pkvm converts the hypercalls for the
> para-virtualization into io-pgtable-arm calls it has to also ensure it
> follows io-pgtable-arm's locking model if it is going to use that as
> its code base. This has nothing to do with the guest or trust, it is
> just implementing concurrency correctly in pkvm..
>
AFAICT, io-pgtable-arm has a set of assumptions about how it's called;
that’s why it’s lockless, as the DMA API follows those assumptions.
For example, you can’t unmap a table and an entry inside that table
concurrently; that can lead to UAF/memory corruption, and this never happens
at the moment as the kernel has no bugs :)
However, pKVM always assumes that the kernel can be malicious, so a bad kernel
can issue such calls, breaking those assumptions and leading to UAF/memory
corruption inside the hypervisor. That is not acceptable, so the solution is
to use a lock to prevent such issues from concurrent requests.
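As a concrete illustration, a hedged sketch of what the EL2 map path could
look like with such a lock (the domain structure and names are illustrative,
not the series' actual code; only the io-pgtable-arm map_pages signature is
real):

    struct kvm_hyp_iommu_domain {              /* illustrative */
            hyp_spinlock_t lock;
            struct io_pgtable_ops *pgtbl_ops;
    };

    static int smmu_domain_map_pages(struct kvm_hyp_iommu_domain *domain,
                                     unsigned long iova, phys_addr_t paddr,
                                     size_t pgsize, size_t pgcount, int prot)
    {
            size_t mapped = 0;
            int ret;

            /*
             * The host may issue concurrent, even malicious, map/unmap
             * hypercalls; a per-domain lock restores the serialization
             * io-pgtable-arm expects from well-behaved callers.
             */
            hyp_spin_lock(&domain->lock);
            ret = domain->pgtbl_ops->map_pages(domain->pgtbl_ops, iova, paddr,
                                               pgsize, pgcount, prot, 0, &mapped);
            hyp_spin_unlock(&domain->lock);

            return ret;
    }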
> > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > but tbh, I don’t think it’s a deal breaker for now.
>
> Again, it depends what your actual use case for translation is inside
> the host/guest environments. It would be good to clearly spell this out..
> There are few drivers that directly manipulate the iommu_domains of a
> device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
> of those are you targeting?
>
Not sure I understand this point about manipulating domains.
AFAIK, SVA is not that common, including in mobile spaces, but I could be
wrong; that’s why it’s not a priority here.
> > > Lots of people have now done this, it is not really so bad. In
> > > exchange you get a full architected feature set, better performance,
> > > and are ready for HW optimizations.
> >
> > It’s not impossible, it’s just more complicated doing it in the
> > hypervisor which has limited features compared to the kernel + I haven’t
> > seen any open source implementation for that except for Qemu which is in
> > userspace.
>
> People are doing it in their CC stuff, which is about the same as
> pkvm. I'm not sure if it will be open source, I hope so since it needs
> security auditing..
>
Yes, as mentioned later I also have a WIP implementation for KVM (which is
open source[1] :)) that I plan to send to the list (maybe in 3-4 months when
ready) as an alternative approach.
> > > > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > > > didn’t want to complicate this series, I can send them separately.
> > >
> > > This seems kind of pointless to me. If you can tolerate identity (ie
> > > pin all memory) then do nested, and maybe don't even bother with a
> > > guest iommu.
> >
> > As mentioned, the choice for para-virt was not only to avoid pinning,
> > as this is the host, for IDENTITY_DOMAIN we either share the page table,
> > then we have to deal with lazy mapping (SMMU features, BBM...) or mirror
> > the table in a shadow SMMU only identity page table.
>
> AFAIK you always have to mirror unless you significantly change how
> the KVM S1 page table stuff is working. The CC people have made those
> changes and won't mirror, so it is doable..
>
Yes, I agree, AFAIK, the current KVM pgtable code is not ready for shared
page tables with the IOMMU.
> > > My advice for merging would be to start with the pkvm side setting up
> > > a fully pinned S2 and do not have a guest driver. Nesting without
> > > emulating smmuv3. Basically you get protected identity DMA support. I
> > > think that would be a much less sprawling patch series. From there it
> > > would be well positioned to add both smmuv3 emulation and a paravirt
> > > iommu flow.
> >
> > I am open to any suggestions, but I believe any solution considered for
> > merge, should have enough features to be usable on actual systems (translating
> > IOMMU can be used for example) so either para-virt as this series or full
> > nesting as the PoC above (or maybe both?), which IMO comes down to the
> > trade-off mentioned above.
>
> IMHO no, you can have a completely usable solution without host/guest
> controlled translation. This is equivalent to a bare metal system with
> no IOMMU HW. This exists and is still broadly useful. The majority of
> cloud VMs out there are in this configuration.
>
> That is the simplest/smallest thing to start with. Adding host/guest
> controlled translation is a build-on-top exercise that seems to have
> a lot of options and people may end up wanting to do all of them.
>
> I don't think you need to show that host/guest controlled translation
> is possible to make progress, of course it is possible. Just getting
> to the point where pkvm can own the SMMU HW and provide DMA isolation
> between all of it's direct host/guest is a good step.
My plan was basically:
1) Finish and send nested SMMUv3 as RFC, with more insights about
performance and complexity trade-offs of both approaches.
2) Discuss next steps for the upstream solution in an upcoming conference
(like LPC or earlier if possible) and work on upstreaming it.
3) Work on guest device passthrough and IOMMU support.
I am open to gradually upstreaming this as you mentioned, where as a first
step pKVM would establish DMA isolation without translation for the host;
that should be enough to have a functional pKVM and run protected workloads.
But although that might be usable on some systems, I don’t think that’s
practical in the long term as it limits the amount of HW that can run pKVM.
[1] https://android-kvm.googlesource.com/linux/+/refs/heads/smostafa/android15-6.6-smmu-nesting-wip
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations
2025-01-03 15:47 ` Jason Gunthorpe
@ 2025-01-08 12:13 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-08 12:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Robin Murphy, iommu, kvmarm, linux-kernel, linux-arm-kernel,
catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, jean-philippe,
nicolinc, vdonnefort, qperret, tabba, danielmentz, tzukui
On Fri, Jan 03, 2025 at 11:47:57AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 03, 2025 at 03:35:20PM +0000, Mostafa Saleh wrote:
>
> > An alternative approach as Robin suggested, is to treat all iommu_map as sg map,
> > and when the driver gets the iotlb_sync_map() call it can just issue the hypercall,
> > however this call only provides the IOVA range, which requires extra work or locking
> > as mentioned in the thread, and as Robin mentioned s390 doing something similar,
> > I was highlighting that in their driver, this call only notifies the hypervisor
> > about an IOVA range and not an actual pv map as pKVM needs, so it is much simpler in their case.
>
> Oh, that is much clearer, maybe incorporate some of that into the
> commit message.
Sure.
>
> We are going in a general direction of trying to make the fast dma
> mapping path not require sg, so adding sg specific optimizations to
> the low level driver is not good.
>
> Batching the hypercalls in the gather and then flushing to execute the
> batch seems more reasonable.
>
> You could probably work on advancing this infrastructure separately
> via virtio-iommu..
That was my impression also, I mentioned it in the commit message,
but for virtio-iommu we need a change in the standard to define this
operation, I can look into that.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-08 12:09 ` Mostafa Saleh
@ 2025-01-16 6:39 ` Tian, Kevin
2025-01-16 19:14 ` Jason Gunthorpe
2025-01-16 8:51 ` Tian, Kevin
2025-01-16 19:19 ` Jason Gunthorpe
2 siblings, 1 reply; 97+ messages in thread
From: Tian, Kevin @ 2025-01-16 6:39 UTC (permalink / raw)
To: Mostafa Saleh, Jason Gunthorpe
Cc: iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 8, 2025 8:10 PM
>
> On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > but tbh, I don’t think it’s a deal breaker for now.
> >
> > Again, it depends what your actual use case for translation is inside
> > the host/guest environments. It would be good to clearly spell this out..
> > There are few drivers that directly manipulate the iommu_domains of a
> > device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
> > of those are you targeting?
> >
>
> Not sure I understand this point about manipulating domains.
> AFAIK, SVA is not that common, including mobile spaces but I can be wrong,
> that’s why it’s not a priority here.
Nested translation is required beyond SVA. A scenario which requires
a vIOMMU and multiple device domains within the guest would want to
embrace nesting. Especially for an ARM vSMMU, nesting is a must.
But I'm not sure that I got Jason's point about "there is no way to get
SVA support with para-virtualization." virtio-iommu is a para-virtualized
model and SVA support is in its plan. The main requirement is to pass
the base pointer of the guest CPU page table to the backend and relay PRI
faults/responses back and forth.
^ permalink raw reply [flat|nested] 97+ messages in thread
* RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-08 12:09 ` Mostafa Saleh
2025-01-16 6:39 ` Tian, Kevin
@ 2025-01-16 8:51 ` Tian, Kevin
2025-01-22 11:28 ` Mostafa Saleh
2025-01-16 19:19 ` Jason Gunthorpe
2 siblings, 1 reply; 97+ messages in thread
From: Tian, Kevin @ 2025-01-16 8:51 UTC (permalink / raw)
To: Mostafa Saleh, Jason Gunthorpe
Cc: iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 8, 2025 8:10 PM
>
> On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > I am open to any suggestions, but I believe any solution considered for
> > > merge, should have enough features to be usable on actual systems
> (translating
> > > IOMMU can be used for example) so either para-virt as this series or full
> > > nesting as the PoC above (or maybe both?), which IMO comes down to
> the
> > > trade-off mentioned above.
> >
> > IMHO no, you can have a completely usable solution without host/guest
> > controlled translation. This is equivalent to a bare metal system with
> > no IOMMU HW. This exists and is still broadly useful. The majority of
> > cloud VMs out there are in this configuration.
> >
> > That is the simplest/smallest thing to start with. Adding host/guest
> > controlled translation is a build-on-top exercise that seems to have
> > a lot of options and people may end up wanting to do all of them.
> >
> > I don't think you need to show that host/guest controlled translation
> > is possible to make progress, of course it is possible. Just getting
> > to the point where pkvm can own the SMMU HW and provide DMA
> isolation
> > between all of it's direct host/guest is a good step.
>
> My plan was basically:
> 1) Finish and send nested SMMUv3 as RFC, with more insights about
> performance and complexity trade-offs of both approaches.
>
> 2) Discuss next steps for the upstream solution in an upcoming conference
> (like LPC or earlier if possible) and work on upstreaming it.
>
> 3) Work on guest device passthrough and IOMMU support.
>
> I am open to gradually upstream this as you mentioned where as a first
> step pKVM would establish DMA isolation without translation for host,
> that should be enough to have functional pKVM and run protected
> workloads.
Does that approach assume starting from a full-fledged SMMU driver
inside pKVM or do we still expect the host to enumerate/initialize
the hw (but skip any translation) so the pKVM part can focus only
on managing translation?
I'm curious about the burden of maintaining another IOMMU
subsystem under the KVM directory. It's not built into the host kernel
image, but hosted in the same kernel repo. This series tried to
reduce the duplication via io-pgtable-arm but still considerable
duplication exists (~2000LOC in pKVM). This would be very confusing
moving forward and hard to maintain, e.g. ensuring bugs are fixed on
both sides.
The CPU side is a different story. IIUC, KVM-ARM has been a split driver
model from day one for nVHE. It's kept even for VHE, with the only difference
being the use of a hypercall vs a direct function call. pKVM is added
incrementally on top of nVHE, hence it's natural to maintain the
pKVM logic in the kernel repo. No duplication.
But there is no such thing on the IOMMU side. Probably we'd want to
try reusing the entire IOMMU subsystem in pKVM if it's agreed
to use full-fledged drivers in pKVM. Or, if continuing the split-driver
model, should we try splitting the existing drivers into two parts and then
connecting the two via function call on native and via hypercall
in pKVM (similar to how KVM-ARM does it)?
Thanks
Kevin
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-16 6:39 ` Tian, Kevin
@ 2025-01-16 19:14 ` Jason Gunthorpe
2025-01-17 6:57 ` Tian, Kevin
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-16 19:14 UTC (permalink / raw)
To: Tian, Kevin
Cc: Mostafa Saleh, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 8, 2025 8:10 PM
> >
> > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > but tbh, I don’t think it’s a deal breaker for now.
> > >
> > > Again, it depends what your actual use case for translation is inside
> > > the host/guest environments. It would be good to clearly spell this out..
> > > There are few drivers that directly manipulate the iommu_domains of a
> > > device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
> > > of those are you targeting?
> > >
> >
> > Not sure I understand this point about manipulating domains.
> > AFAIK, SVA is not that common, including mobile spaces but I can be wrong,
> > that’s why it’s not a priority here.
>
> Nested translation is required beyond SVA. A scenario which requires
> a vIOMMU and multiple device domains within the guest would like to
> embrace nesting. Especially for ARM vSMMU nesting is a must.
Right, if you need an iommu domain in the guest there are only three
mainstream ways to get this in Linux:
1) Use the DMA API and have the iommu group be translating. This is
optional in that the DMA API usually supports identity as an option.
2) A driver directly calls iommu_paging_domain_alloc() and manually
attaches it to some device, and does not use the DMA API (see the
sketch after this list). My list above of ath1x/etc are examples doing this
3) Use VFIO
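A minimal sketch of option 2, i.e. a driver managing its own domain outside
the DMA API (error unwinding trimmed; dev, iova and paddr are placeholders):

    struct iommu_domain *domain;
    int ret;

    domain = iommu_paging_domain_alloc(dev);
    if (IS_ERR(domain))
            return PTR_ERR(domain);

    /* Detach from the default (DMA API) domain and use ours instead. */
    ret = iommu_attach_device(domain, dev);
    if (ret) {
            iommu_domain_free(domain);
            return ret;
    }

    /* The driver now manages its IOVA space explicitly. */
    ret = iommu_map(domain, iova, paddr, SZ_2M,
                    IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);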
My remark to Mostafa is to be specific, which of the above do you want
to do in your mobile guest (and what driver exactly if #2) and why.
This will help inform what the performance profile looks like and
guide if nesting/para virt is appropriate.
> But I'm not sure that I got Jason's point about " there is no way to get
> SVA support with para-virtualization." virtio-iommu is a para-virtualized
> model and SVA support is in its plan. The main requirement is to pass
> the base pointer of the guest CPU page table to backend and PRI faults/
> responses back forth.
That's nesting, you have a full page table under the control of the
guest, and the guest needs to have a level of HW-specific
knowledge. It is just an alternative to using the native nesting
vIOMMU.
What I mean by "para-virtualization" is the guest does map/unmap calls
to the hypervisor and has no page table.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-08 12:09 ` Mostafa Saleh
2025-01-16 6:39 ` Tian, Kevin
2025-01-16 8:51 ` Tian, Kevin
@ 2025-01-16 19:19 ` Jason Gunthorpe
2025-01-22 11:46 ` Mostafa Saleh
2 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-16 19:19 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
On Wed, Jan 08, 2025 at 12:09:53PM +0000, Mostafa Saleh wrote:
> I am open to gradually upstream this as you mentioned where as a first
> step pKVM would establish DMA isolation without translation for host,
> that should be enough to have functional pKVM and run protected workloads.
Personally I hate these giant patch series, you should strip it down
to small meaningful steps and try to stay below 20 per series.
I think getting pkvm to own the SMMU HW is a great first step that
everything else can build on
> But although that might be usable on some systems, I don’t think that’s
> practical in the long term as it limits the amount of HW that can run pKVM.
I suspect you will end up doing everything. Old HW needs paravirt, new
HW will want nesting and its performance. Users other than mobile will
come. If we were to use pKVM on server workloads we need nesting for
performance.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-16 19:14 ` Jason Gunthorpe
@ 2025-01-17 6:57 ` Tian, Kevin
2025-01-22 11:04 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Tian, Kevin @ 2025-01-17 6:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, January 17, 2025 3:15 AM
>
> On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > From: Mostafa Saleh <smostafa@google.com>
> > > Sent: Wednesday, January 8, 2025 8:10 PM
> > >
> > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > >
> > > > Again, it depends what your actual use case for translation is inside
> > > > the host/guest environments. It would be good to clearly spell this out..
> > > > There are few drivers that directly manipulate the iommu_domains of a
> > > > device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
> > > > of those are you targeting?
> > > >
> > >
> > > Not sure I understand this point about manipulating domains.
> > > AFAIK, SVA is not that common, including mobile spaces but I can be
> wrong,
> > > that’s why it’s not a priority here.
> >
> > Nested translation is required beyond SVA. A scenario which requires
> > a vIOMMU and multiple device domains within the guest would like to
> > embrace nesting. Especially for ARM vSMMU nesting is a must.
>
> Right, if you need an iommu domain in the guest there are only three
> mainstream ways to get this in Linux:
> 1) Use the DMA API and have the iommu group be translating. This is
> optional in that the DMA API usually supports identity as an option.
> 2) A driver directly calls iommu_paging_domain_alloc() and manually
> attaches it to some device, and does not use the DMA API. My list
> above of ath1x/etc are examples doing this
> 3) Use VFIO
>
> My remark to Mostafa is to be specific, which of the above do you want
> to do in your mobile guest (and what driver exactly if #2) and why.
>
> This will help inform what the performance profile looks like and
> guide if nesting/para virt is appropriate.
Yeah, that part would be critical to help decide which route to pursue
first. Even if all options might be required in the end, when pKVM
is scaled to more scenarios, as you mentioned in another mail, a staging
approach would be much preferable to evolve.
The pros/cons between nesting/para-virt are clear - the more static the
mapping is, the more the para approach gains due to less page-table
walking and a smaller TLB footprint, while vice versa nesting performs
much better by avoiding frequent para calls on page table mgmt. 😊
>
> > But I'm not sure that I got Jason's point about " there is no way to get
> > SVA support with para-virtualization." virtio-iommu is a para-virtualized
> > model and SVA support is in its plan. The main requirement is to pass
> > the base pointer of the guest CPU page table to backend and PRI faults/
> > responses back forth.
>
> That's nesting, you have a full page table under the control of the
> guest, and the guest needs to have a level of HW-specific
> knowledge. It is just an alternative to using the native nesting
> vIOMMU.
>
> What I mean by "para-virtualization" is the guest does map/unmap calls
> to the hypervisor and has no page table.
>
Yes, that should never happen for SVA.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-17 6:57 ` Tian, Kevin
@ 2025-01-22 11:04 ` Mostafa Saleh
2025-01-22 16:20 ` Jason Gunthorpe
2025-01-23 8:13 ` Tian, Kevin
0 siblings, 2 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-22 11:04 UTC (permalink / raw)
To: Tian, Kevin
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Fri, Jan 17, 2025 at 06:57:12AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Friday, January 17, 2025 3:15 AM
> >
> > On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > > From: Mostafa Saleh <smostafa@google.com>
> > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > >
> > > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > > >
> > > > > Again, it depends what your actual use case for translation is inside
> > > > > the host/guest environments. It would be good to clearly spell this out..
> > > > > There are few drivers that directly manipulate the iommu_domains of a
> > > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > > of those are you targeting?
> > > > >
> > > >
> > > > Not sure I understand this point about manipulating domains.
> > > > AFAIK, SVA is not that common, including mobile spaces but I can be
> > wrong,
> > > > that’s why it’s not a priority here.
> > >
> > > Nested translation is required beyond SVA. A scenario which requires
> > > a vIOMMU and multiple device domains within the guest would like to
> > > embrace nesting. Especially for ARM vSMMU nesting is a must.
We can still do para-virtualization for guests the same way we do for the
host and use a single stage IOMMU.
> >
> > Right, if you need an iommu domain in the guest there are only three
> > mainstream ways to get this in Linux:
> > 1) Use the DMA API and have the iommu group be translating. This is
> > optional in that the DMA API usually supports identity as an option.
> > 2) A driver directly calls iommu_paging_domain_alloc() and manually
> > attaches it to some device, and does not use the DMA API. My list
> > above of ath1x/etc are examples doing this
> > 3) Use VFIO
> >
> > My remark to Mostafa is to be specific, which of the above do you want
> > to do in your mobile guest (and what driver exactly if #2) and why.
> >
> > This will help inform what the performance profile looks like and
> > guide if nesting/para virt is appropriate.
>
AFAIK, the most common use cases would be:
- Devices using DMA API because it requires a lot of memory to be
contiguous in IOVA, which is hard to do with identity
- Devices with security requirements/constraints to be isolated from the
rest of the system, also using DMA API
- VFIO is something we are looking at the moment and have prototyped with
pKVM, and it should be supported soon in Android (only for platform
devices for now)
> Yeah that part would be critical to help decide which route to pursue
> first. Even when all options might be required in the end when pKVM
> is scaled to more scenarios, as you mentioned in another mail, a staging
> approach would be much preferable to evolve.
I agree that would probably be the case. I will work on a more staged
approach for v3, mostly without the pv part as Jason suggested.
>
> The pros/cons between nesting/para virt is clear - more static the
> mapping is, more gain from the para approach due to less paging
> walking and smaller tlb footprint, while vice versa nesting performs
> much better by avoiding frequent para calls on page table mgmt. 😊
I am also working to get the numbers for both cases so we know
the order of magnitude of each, as I guess it won't be as clear
which approach is best for large systems with many DMA initiators.
Thanks,
Mostafa
>
> >
> > > But I'm not sure that I got Jason's point about " there is no way to get
> > > SVA support with para-virtualization." virtio-iommu is a para-virtualized
> > > model and SVA support is in its plan. The main requirement is to pass
> > > the base pointer of the guest CPU page table to backend and PRI faults/
> > > responses back forth.
> >
> > That's nesting, you have a full page table under the control of the
> > guest, and the guest needs to have a level of HW-specific
> > knowledge. It is just an alternative to using the native nesting
> > vIOMMU.
> >
> > What I mean by "para-virtualization" is the guest does map/unmap calls
> > to the hypervisor and has no page table.
> >
>
> Yes, that should never happen for SVA.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-16 8:51 ` Tian, Kevin
@ 2025-01-22 11:28 ` Mostafa Saleh
2025-01-23 8:25 ` Tian, Kevin
0 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-22 11:28 UTC (permalink / raw)
To: Tian, Kevin
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
Hi Kevin,
On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 8, 2025 8:10 PM
> >
> > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > I am open to any suggestions, but I believe any solution considered for
> > > > merge, should have enough features to be usable on actual systems
> > (translating
> > > > IOMMU can be used for example) so either para-virt as this series or full
> > > > nesting as the PoC above (or maybe both?), which IMO comes down to
> > the
> > > > trade-off mentioned above.
> > >
> > > IMHO no, you can have a completely usable solution without host/guest
> > > controlled translation. This is equivalent to a bare metal system with
> > > no IOMMU HW. This exists and is still broadly useful. The majority of
> > > cloud VMs out there are in this configuration.
> > >
> > > That is the simplest/smallest thing to start with. Adding host/guest
> > > controlled translation is a build-on-top exercise that seems to have
> > > a lot of options and people may end up wanting to do all of them.
> > >
> > > I don't think you need to show that host/guest controlled translation
> > > is possible to make progress, of course it is possible. Just getting
> > > to the point where pkvm can own the SMMU HW and provide DMA
> > isolation
> > > between all of its direct host/guest is a good step.
> >
> > My plan was basically:
> > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > performance and complexity trade-offs of both approaches.
> >
> > 2) Discuss next steps for the upstream solution in an upcoming conference
> > (like LPC or earlier if possible) and work on upstreaming it.
> >
> > 3) Work on guest device passthrough and IOMMU support.
> >
> > I am open to gradually upstream this as you mentioned where as a first
> > step pKVM would establish DMA isolation without translation for host,
> > that should be enough to have functional pKVM and run protected
> > workloads.
>
> Does that approach assume starting from a full-fledged SMMU driver
> inside pKVM or do we still expect the host to enumerate/initialize
> the hw (but skip any translation) so the pKVM part can focus only
> on managing translation?
I have been thinking about this, and I think most of the initialization
won’t change: we would do any possible initialization in the
kernel, avoiding complexity in the hypervisor (parsing
device-tree/ACPI...); that also makes code re-use easier if both drivers
do it in kernel space.
>
> I'm curious about the burden of maintaining another IOMMU
> subsystem under the KVM directory. It's not built into the host kernel
> image, but hosted in the same kernel repo. This series tried to
> reduce the duplication via io-pgtable-arm but still considerable
duplication exists (~2000LOC in pKVM). That would be very confusing
moving forward and hard to maintain, e.g. ensuring bugs are fixed on
both sides.
The KVM IOMMU subsystem is very different from the kernel one; it’s about
paravirtualization and abstraction. I tried my best to make sure all
possible code can be re-used by splitting out arm-smmu-v3-common.c and
io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
iommu code.
So my guess is there won't be much of that effort, as there is no
duplication in logic.
I am still thinking about what v3 will look like, but as mentioned I am
inclined towards Jason’s suggestion to reduce the series and remove the
paravirtualization stuff and only establish DMA isolation as a starting point.
That will remove a lot of code from the KVM IOMMU for now, but we'd
need to address that later.
And we can build on top of this code either a para-virtual approach or a
nested-emulation one.
>
> The CPU side is a different story. iiuc KVM-ARM is a split driver model
> from day one for nVHE. It's kept even for VHE with difference only
> on using hypercall vs using direct function call. pKVM is added
> incrementally on top of nVHE hence it's natural to maintain the
> pKVM logic in the kernel repo. No duplication.
>
> But there is no such thing in the IOMMU side. Probably we'd want to
> try reusing the entire IOMMU sub-system in pKVM if it's agreed
> to use full-fledged drivers in pKVM. Or if continuing the split-driver
> model should we try splitting the existing drivers into two parts then
> connecting two together via function call on native and via hypercall
> in pKVM (similar to how KVM-ARM does)?
For the IOMMU KVM code, it’s quite different from the kernel one and serves
different purposes, so there is no logic duplication there.
The idea of using hypercalls/function calls in some places for VHE/nVHE
doesn’t really translate here, as the driver is already abstracted by
iommu_ops, unlike KVM which has one code base for everything. As I
mentioned in another reply, we can standardize the hypercall part of the
kernel driver into an IOMMU-agnostic file (as virtio-iommu does) and the
KVM SMMUv3 kernel driver would only be responsible for initialization;
that should be the closest to the split model in nVHE.
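To make that concrete, here is a minimal sketch of what such an IOMMU-agnostic
kernel layer could look like. This is not code from the series: struct
kvm_iommu_domain, to_kvm_iommu_domain(), kvm_dom->id and the
kvm_iommu_hyp_map()/kvm_iommu_hyp_unmap() wrappers are hypothetical stand-ins
for the actual hypercall plumbing; only the map_pages/unmap_pages callback
shapes come from the kernel's IOMMU API.

/* Sketch only: kernel side forwards page-table work to the hypervisor. */
static int kvm_iommu_map_pages(struct iommu_domain *domain, unsigned long iova,
			       phys_addr_t paddr, size_t pgsize, size_t pgcount,
			       int prot, gfp_t gfp, size_t *mapped)
{
	struct kvm_iommu_domain *kvm_dom = to_kvm_iommu_domain(domain);

	/* The hypervisor owns the page tables; the kernel only requests changes. */
	return kvm_iommu_hyp_map(kvm_dom->id, iova, paddr,
				 pgsize * pgcount, prot, mapped);
}

static size_t kvm_iommu_unmap_pages(struct iommu_domain *domain,
				    unsigned long iova, size_t pgsize,
				    size_t pgcount,
				    struct iommu_iotlb_gather *gather)
{
	struct kvm_iommu_domain *kvm_dom = to_kvm_iommu_domain(domain);

	/* Unmap via hypercall; TLB invalidations can be gathered and batched. */
	return kvm_iommu_hyp_unmap(kvm_dom->id, iova, pgsize * pgcount);
}

With that shape, the SMMUv3-specific kernel code would mostly be left with
probing and initialization, which matches the split described above.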
Also, pKVM has some different code paths in the kernel; for example, pKVM
has a different mem abort handler and different initialization (pkvm.c).
Thanks,
Mostafa
>
> Thanks
> Kevin
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-16 19:19 ` Jason Gunthorpe
@ 2025-01-22 11:46 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-22 11:46 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, robin.murphy, jean-philippe, nicolinc,
vdonnefort, qperret, tabba, danielmentz, tzukui
On Thu, Jan 16, 2025 at 03:19:52PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 08, 2025 at 12:09:53PM +0000, Mostafa Saleh wrote:
>
> > I am open to gradually upstream this as you mentioned where as a first
> > step pKVM would establish DMA isolation without translation for host,
> > that should be enough to have functional pKVM and run protected workloads.
>
> Personally I hate these giant patch series, you should strip it down
> to small meaningful steps and try to stay below 20 per series.
>
> I think getting pkvm to own the SMMU HW is a great first step that
> everything else can build on
I plan to do that for v3; I think that also removes the out-of-tree
dependencies, so the code applies directly on top of upstream.
Thanks for the feedback!
>
> > But although that might be usable on some systems, I don’t think that’s
> > practical in the long term as it limits the amount of HW that can run pKVM.
>
> I suspect you will end up doing everything. Old HW needs paravirt, new
> HW will want nesting and its performance. Users other than mobile will
> come. If we were to use pKVM on server workloads we need nesting for
> performance.
Yes, I guess that would be the case. As I mentioned in another reply,
it would be interesting to get the order of magnitude of both, which I am
looking into; I hope it'd help with which direction we should
prioritize upstream.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-22 11:04 ` Mostafa Saleh
@ 2025-01-22 16:20 ` Jason Gunthorpe
2025-01-22 17:17 ` Mostafa Saleh
2025-01-23 8:13 ` Tian, Kevin
1 sibling, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-22 16:20 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Tian, Kevin, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Wed, Jan 22, 2025 at 11:04:24AM +0000, Mostafa Saleh wrote:
> AFAIK, the most common use cases would be:
> - Devices using DMA API because it requires a lot of memory to be
> contiguous in IOVA, which is hard to do with identity
This is not a feature of the DMA API any driver should rely on .. Are
you aware of one that does?
> - Devices with security requirements/constraints to be isolated from the
> rest of the system, also using DMA API
This is real, but again, in a mobile context does this even exist? It isn't
like there are external PCIe ports that need securing on a phone?
> - VFIO is something we are looking at the moment and have prototyped with
> pKVM, and it should be supported soon in Android (only for platform
> devices for now)
Yes, this makes sense
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-22 16:20 ` Jason Gunthorpe
@ 2025-01-22 17:17 ` Mostafa Saleh
2025-01-22 19:16 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-22 17:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Wed, Jan 22, 2025 at 12:20:55PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 22, 2025 at 11:04:24AM +0000, Mostafa Saleh wrote:
> > AFAIK, the most common use cases would be:
> > - Devices using DMA API because it requires a lot of memory to be
> > contiguous in IOVA, which is hard to do with identity
>
> This is not a feature of the DMA API any driver should rely on .. Are
> you aware of one that does?
>
I’d guess one example is media drivers; they usually need large contiguous
buffers and would use, for example, dma_alloc_coherent(). If the IOMMU is
disabled or bypassed, that means the kernel has to find such a contiguous
range in the physical address space, which can be impossible on devices with
small memory such as mobile devices.
I will look more into this while working on the patches to identity-map
everything for v3, and I’ll see what kind of issues I hit.
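For concreteness on the contiguous-IOVA point above, the pattern I mean is the
generic one below (illustrative only, not taken from any specific driver; the
function name and 16M size are made up). With a translating IOMMU behind the
DMA API the returned DMA address only has to be contiguous in IOVA space, so
the backing pages can be scattered; with the IOMMU bypassed the same
allocation must be physically contiguous.

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

/* Illustrative only: a large buffer a media-style driver might need. */
static int example_alloc_frame_buffer(struct device *dev, void **cpu_addr,
				      dma_addr_t *dma_addr)
{
	/*
	 * With a translating IOMMU, the 16M only has to be contiguous in
	 * IOVA space. Without one, it must be physically contiguous, which
	 * may fail on memory-constrained mobile devices.
	 */
	*cpu_addr = dma_alloc_coherent(dev, SZ_16M, dma_addr, GFP_KERNEL);
	return *cpu_addr ? 0 : -ENOMEM;
}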
> > - Devices with security requirements/constraints to be isolated from the
> > rest of the system, also using DMA API
>
> This is real, but again, in a mobile context does this even exist? It isn't
> like there are external PCIe ports that need securing on a phone?
It’s not just about completely external devices; it’s a defence-in-depth
measure where, for example, network devices can be poked externally and
there have been cases in the past where exploits were found[1], so some vendors
might have a policy to isolate such devices, which I believe is valid.
[1] https://lwn.net/ml/oss-security/20221013101046.GB20615@suse.de/
Thanks,
Mostafa
>
> > - VFIO is something we are looking at the moment and have prototyped with
> > pKVM, and it should be supported soon in Android (only for platform
> > devices for now)
>
> Yes, this makes sense
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-22 17:17 ` Mostafa Saleh
@ 2025-01-22 19:16 ` Jason Gunthorpe
0 siblings, 0 replies; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-22 19:16 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Tian, Kevin, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Wed, Jan 22, 2025 at 05:17:50PM +0000, Mostafa Saleh wrote:
> On Wed, Jan 22, 2025 at 12:20:55PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 22, 2025 at 11:04:24AM +0000, Mostafa Saleh wrote:
> > > AFAIK, the most common use cases would be:
> > > - Devices using DMA API because it requires a lot of memory to be
> > > contiguous in IOVA, which is hard to do with identity
> >
> > This is not a feature of the DMA API any driver should rely on .. Are
> > you aware of one that does?
> >
>
> I’d guess one example is media drivers, they usually need large contiguous
> buffers, and would use for ex dma_alloc_coherent(), if the IOMMU is disabled or
> bypassed, that means that the kernel has to find such contiguous size in the
> physical address which can be impossible on devices with small memory as
> mobile devices. Similarly.
I see, that makes sense
> It’s not just about completely external devices, it’s a defence in depth
> measure, where for example, network devices can be poked externally and
> there have been cases in the past where exploits were found[1], so some vendors
> might have a policy to isolate such devices. Which I believe is a valid.
The performance cost of doing isolation like that with networking is
probably prohibitive with paravirt..
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-22 11:04 ` Mostafa Saleh
2025-01-22 16:20 ` Jason Gunthorpe
@ 2025-01-23 8:13 ` Tian, Kevin
2025-01-29 12:16 ` Mostafa Saleh
1 sibling, 1 reply; 97+ messages in thread
From: Tian, Kevin @ 2025-01-23 8:13 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 22, 2025 7:04 PM
>
> On Fri, Jan 17, 2025 at 06:57:12AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Friday, January 17, 2025 3:15 AM
> > >
> > > On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > > > From: Mostafa Saleh <smostafa@google.com>
> > > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > > >
> > > > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > > > >
> > > > > > Again, it depends what your actual use case for translation is inside
> > > > > > the host/guest environments. It would be good to clearly spell this
> out..
> > > > > > There are few drivers that directly manipulate the iommu_domains
> of a
> > > > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > > > of those are you targeting?
> > > > > >
> > > > >
> > > > > Not sure I understand this point about manipulating domains.
> > > > > AFAIK, SVA is not that common, including mobile spaces but I can be
> > > wrong,
> > > > > that’s why it’s not a priority here.
> > > >
> > > > Nested translation is required beyond SVA. A scenario which requires
> > > > a vIOMMU and multiple device domains within the guest would like to
> > > > embrace nesting. Especially for ARM vSMMU nesting is a must.
>
> We can still do para-virtualization for guests the same way we do for the
> host and use a single stage IOMMU.
Same way, but both require a nested setup.
In concept there are two layers of address translation: GVA->GPA via the
guest page table, and GPA->HPA via the pKVM page table.
The difference between host/guest is just in the GPA mapping. For the host
it's 1:1, with additional hardening for which portions can be mapped and
which cannot. For the guest it's non-identical, with the mapping established
from the host.
A nested translation naturally fits those conceptual layers.
Using a single-stage IOMMU means you need to combine the two layers
into one, i.e. GVA->HPA, by removing the GPA. Then you have to
para-virtualize the guest page table so every guest PTE change is intercepted
to replace GPA with HPA.
Doing so completely kills the benefit of SVA, which is why Jason said
it's a no-go.
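To spell out the collapse, here is a rough sketch of what a para-virtualized
map call has to do when the IOMMU only has a single stage. It is purely
illustrative: struct pkvm_guest, pkvm_gpa_to_hpa(), INVALID_PHYS_ADDR,
pkvm_guest_owns_range() and kvm_iommu_map_single_stage() are hypothetical
names, not from the series.

static int handle_pv_iommu_map(struct pkvm_guest *guest, unsigned long iova,
			       unsigned long gpa, size_t size, int prot)
{
	phys_addr_t hpa;

	/* Collapse GPA->HPA using the guest's pKVM stage-2 mapping. */
	hpa = pkvm_gpa_to_hpa(guest, gpa);
	if (hpa == INVALID_PHYS_ADDR)
		return -EFAULT;

	/* Refuse memory the guest is not allowed to expose to DMA. */
	if (!pkvm_guest_owns_range(guest, hpa, size))
		return -EPERM;

	/*
	 * Install IOVA->HPA directly in the single-stage IOMMU page table;
	 * the two conceptual layers (IOVA->GPA, GPA->HPA) are merged here,
	 * so every guest mapping change must trap to the hypervisor - which
	 * a guest-owned SVA page table never does.
	 */
	return kvm_iommu_map_single_stage(guest->domain, iova, hpa, size, prot);
}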
>
> > >
> > > Right, if you need an iommu domain in the guest there are only three
> > > mainstream ways to get this in Linux:
> > > 1) Use the DMA API and have the iommu group be translating. This is
> > > optional in that the DMA API usually supports identity as an option.
> > > 2) A driver directly calls iommu_paging_domain_alloc() and manually
> > > attaches it to some device, and does not use the DMA API. My list
> > > above of ath1x/etc are examples doing this
> > > 3) Use VFIO
> > >
> > > My remark to Mostafa is to be specific, which of the above do you want
> > > to do in your mobile guest (and what driver exactly if #2) and why.
> > >
> > > This will help inform what the performance profile looks like and
> > > guide if nesting/para virt is appropriate.
> >
>
> AFAIK, the most common use cases would be:
> - Devices using DMA API because it requires a lot of memory to be
> contiguous in IOVA, which is hard to do with identity
> - Devices with security requirements/constraints to be isolated from the
> rest of the system, also using DMA API
> - VFIO is something we are looking at the moment and have prototyped with
> pKVM, and it should be supported soon in Android (only for platform
> devices for now)
what really matters is the frequency of map/unmap.
>
> > Yeah that part would be critical to help decide which route to pursue
> > first. Even when all options might be required in the end when pKVM
> > is scaled to more scenarios, as you mentioned in another mail, a staging
> > approach would be much preferable to evolve.
>
> I agree that would probably be the case. I will work on more staging
> approach for v3, mostly without the pv part as Jason suggested.
>
> >
> > The pros/cons between nesting/para virt is clear - more static the
> > mapping is, more gain from the para approach due to less paging
> > walking and smaller tlb footprint, while vice versa nesting performs
> > much better by avoiding frequent para calls on page table mgmt. 😊
>
> I am also working to get the numbers for both cases so we know
> the order of magnitude of each case, as I guess it won't be as clear
> for large systems with many DMA initiators what approach is best.
>
>
That'd be great!
^ permalink raw reply [flat|nested] 97+ messages in thread
* RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-22 11:28 ` Mostafa Saleh
@ 2025-01-23 8:25 ` Tian, Kevin
2025-01-29 12:21 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Tian, Kevin @ 2025-01-23 8:25 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 22, 2025 7:29 PM
>
> Hi Kevin,
>
> On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > > From: Mostafa Saleh <smostafa@google.com>
> > > Sent: Wednesday, January 8, 2025 8:10 PM
> > >
> > > My plan was basically:
> > > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > > performance and complexity trade-offs of both approaches.
> > >
> > > 2) Discuss next steps for the upstream solution in an upcoming
> conference
> > > (like LPC or earlier if possible) and work on upstreaming it.
> > >
> > > 3) Work on guest device passthrough and IOMMU support.
> > >
> > > I am open to gradually upstream this as you mentioned where as a first
> > > step pKVM would establish DMA isolation without translation for host,
> > > that should be enough to have functional pKVM and run protected
> > > workloads.
> >
> > Does that approach assume starting from a full-fledged SMMU driver
> > inside pKVM or do we still expect the host to enumerate/initialize
> > the hw (but skip any translation) so the pKVM part can focus only
> > on managing translation?
>
> I have been thinking about this, and I think most of the initialization
> won’t be changed, and we would do any possible initialization in the
> kernel avoiding complexity in the hypervisor (parsing
> device-tree/acpi...) also that makes code re-use easier if both drivers
> do that in the kernel space.
yeah that'd make sense for now.
>
> >
> > I'm curious about the burden of maintaining another IOMMU
> > subsystem under the KVM directory. It's not built into the host kernel
> > image, but hosted in the same kernel repo. This series tried to
> > reduce the duplication via io-pgtable-arm but still considerable
> > duplication exists (~2000LOC in pKVM). That would be very confusing
> > moving forward and hard to maintain e.g. ensure bugs fixed in
> > both sides.
>
> KVM IOMMU subsystem is very different from the one kernel, it’s about
> paravirtualization and abstraction, I tried my best to make sure all
> possible code can be re-used by splitting arm-smmu-v3-common.c and
> io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
> iommu code.
> So my guess, there won't be much of that effort as there is no
> duplication in logic.
I'm not sure how different it is. In concept it still manages IOMMU
mappings, just with additional restrictions. Bear with me, I haven't
looked into the details of the 2000LOC pKVM SMMU driver,
but the size does scare me, especially considering the case where
other vendors are supported later.
Let's keep it in mind and re-check after you have v3. It's simpler, hence
I suppose the actual difference between a pKVM IOMMU driver and
a normal kernel IOMMU driver can be judged more easily than now.
The learning here would be beneficial to the design of other pKVM
components, e.g. when porting pKVM to x86. Currently KVM x86 is
monolithic. Maintaining pKVM under KVM/x86 would be a much
bigger challenge than doing it under KVM/arm. There will also be
questions about what can be shared and how to better maintain
the pKVM-specific logic in KVM/x86.
Overall my gut feeling is that the pKVM-specific code must be small
enough, otherwise maintaining a run-time-irrelevant project in the
kernel repo would be questionable. 😊
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue
2024-12-12 18:03 ` [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue Mostafa Saleh
@ 2025-01-23 13:01 ` Robin Murphy
2025-01-29 11:15 ` Mostafa Saleh
0 siblings, 1 reply; 97+ messages in thread
From: Robin Murphy @ 2025-01-23 13:01 UTC (permalink / raw)
To: Mostafa Saleh, iommu, kvmarm, linux-kernel, linux-arm-kernel
Cc: catalin.marinas, will, maz, oliver.upton, joey.gouly,
suzuki.poulose, yuzenghui, robdclark, joro, jean-philippe, jgg,
nicolinc, vdonnefort, qperret, tabba, danielmentz, tzukui
On 2024-12-12 6:03 pm, Mostafa Saleh wrote:
> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>
> Map the command queue allocated by the host into the hypervisor address
> space. When the host mappings are finalized, the queue is unmapped from
> the host.
Don't forget the fun of reimplementing the errata workarounds to avoid
generating certain problematic command sequences - beware it's mostly
implicit in the current kernel driver :)
Thanks,
Robin.
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Mostafa Saleh <smostafa@google.com>
> ---
> arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 165 ++++++++++++++++++++
> include/kvm/arm_smmu_v3.h | 4 +
> 2 files changed, 169 insertions(+)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
> index f7e60c188cb0..e15356509424 100644
> --- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
> +++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
> @@ -41,6 +41,15 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
> __ret; \
> })
>
> +#define smmu_wait_event(_smmu, _cond) \
> +({ \
> + if ((_smmu)->features & ARM_SMMU_FEAT_SEV) { \
> + while (!(_cond)) \
> + wfe(); \
> + } \
> + smmu_wait(_cond); \
> +})
> +
> static int smmu_write_cr0(struct hyp_arm_smmu_v3_device *smmu, u32 val)
> {
> writel_relaxed(val, smmu->base + ARM_SMMU_CR0);
> @@ -60,6 +69,123 @@ static void smmu_reclaim_pages(u64 phys, size_t size)
> WARN_ON(__pkvm_hyp_donate_host(phys >> PAGE_SHIFT, size >> PAGE_SHIFT));
> }
>
> +#define Q_WRAP(smmu, reg) ((reg) & (1 << (smmu)->cmdq_log2size))
> +#define Q_IDX(smmu, reg) ((reg) & ((1 << (smmu)->cmdq_log2size) - 1))
> +
> +static bool smmu_cmdq_full(struct hyp_arm_smmu_v3_device *smmu)
> +{
> + u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
> +
> + return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
> + Q_WRAP(smmu, smmu->cmdq_prod) != Q_WRAP(smmu, cons);
> +}
> +
> +static bool smmu_cmdq_empty(struct hyp_arm_smmu_v3_device *smmu)
> +{
> + u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
> +
> + return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
> + Q_WRAP(smmu, smmu->cmdq_prod) == Q_WRAP(smmu, cons);
> +}
> +
> +static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
> + struct arm_smmu_cmdq_ent *ent)
> +{
> + int i;
> + int ret;
> + u64 cmd[CMDQ_ENT_DWORDS] = {};
> + int idx = Q_IDX(smmu, smmu->cmdq_prod);
> + u64 *slot = smmu->cmdq_base + idx * CMDQ_ENT_DWORDS;
> +
> + if (smmu->iommu.power_is_off)
> + return -EPIPE;
> +
> + ret = smmu_wait_event(smmu, !smmu_cmdq_full(smmu));
> + if (ret)
> + return ret;
> +
> + cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
> +
> + switch (ent->opcode) {
> + case CMDQ_OP_CFGI_ALL:
> + cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
> + break;
> + case CMDQ_OP_CFGI_CD:
> + cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SSID, ent->cfgi.ssid);
> + fallthrough;
> + case CMDQ_OP_CFGI_STE:
> + cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
> + cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_LEAF, ent->cfgi.leaf);
> + break;
> + case CMDQ_OP_TLBI_NH_VA:
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
> + cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
> + break;
> + case CMDQ_OP_TLBI_NSNH_ALL:
> + break;
> + case CMDQ_OP_TLBI_NH_ASID:
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
> + fallthrough;
> + case CMDQ_OP_TLBI_S12_VMALL:
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> + break;
> + case CMDQ_OP_TLBI_S2_IPA:
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
> + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
> + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
> + cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
> + break;
> + case CMDQ_OP_CMD_SYNC:
> + cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + for (i = 0; i < CMDQ_ENT_DWORDS; i++)
> + slot[i] = cpu_to_le64(cmd[i]);
> +
> + smmu->cmdq_prod++;
> + writel(Q_IDX(smmu, smmu->cmdq_prod) | Q_WRAP(smmu, smmu->cmdq_prod),
> + smmu->base + ARM_SMMU_CMDQ_PROD);
> + return 0;
> +}
> +
> +static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
> +{
> + int ret;
> + struct arm_smmu_cmdq_ent cmd = {
> + .opcode = CMDQ_OP_CMD_SYNC,
> + };
> +
> + ret = smmu_add_cmd(smmu, &cmd);
> + if (ret)
> + return ret;
> +
> + return smmu_wait_event(smmu, smmu_cmdq_empty(smmu));
> +}
> +
> +__maybe_unused
> +static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
> + struct arm_smmu_cmdq_ent *cmd)
> +{
> + int ret = smmu_add_cmd(smmu, cmd);
> +
> + if (ret)
> + return ret;
> +
> + return smmu_sync_cmd(smmu);
> +}
> +
> static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
> {
> u64 val, old;
> @@ -94,6 +220,41 @@ static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
> return 0;
> }
>
> +static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
> +{
> + u64 cmdq_base;
> + size_t cmdq_nr_entries, cmdq_size;
> + int ret;
> + enum kvm_pgtable_prot prot = PAGE_HYP;
> +
> + cmdq_base = readq_relaxed(smmu->base + ARM_SMMU_CMDQ_BASE);
> + if (cmdq_base & ~(Q_BASE_RWA | Q_BASE_ADDR_MASK | Q_BASE_LOG2SIZE))
> + return -EINVAL;
> +
> + smmu->cmdq_log2size = cmdq_base & Q_BASE_LOG2SIZE;
> + cmdq_nr_entries = 1 << smmu->cmdq_log2size;
> + cmdq_size = cmdq_nr_entries * CMDQ_ENT_DWORDS * 8;
> +
> + cmdq_base &= Q_BASE_ADDR_MASK;
> +
> + if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
> + prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> +
> + ret = ___pkvm_host_donate_hyp_prot(cmdq_base >> PAGE_SHIFT,
> + PAGE_ALIGN(cmdq_size) >> PAGE_SHIFT,
> + false, prot);
> + if (ret)
> + return ret;
> +
> + smmu->cmdq_base = hyp_phys_to_virt(cmdq_base);
> +
> + memset(smmu->cmdq_base, 0, cmdq_size);
> + writel_relaxed(0, smmu->base + ARM_SMMU_CMDQ_PROD);
> + writel_relaxed(0, smmu->base + ARM_SMMU_CMDQ_CONS);
> +
> + return 0;
> +}
> +
> static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
> {
> int ret;
> @@ -113,6 +274,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
> if (ret)
> return ret;
>
> + ret = smmu_init_cmdq(smmu);
> + if (ret)
> + return ret;
> +
> return kvm_iommu_init_device(&smmu->iommu);
> }
>
> diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
> index fb24bcef1624..393a1a04edba 100644
> --- a/include/kvm/arm_smmu_v3.h
> +++ b/include/kvm/arm_smmu_v3.h
> @@ -16,8 +16,12 @@ struct hyp_arm_smmu_v3_device {
> struct kvm_hyp_iommu iommu;
> phys_addr_t mmio_addr;
> size_t mmio_size;
> + unsigned long features;
>
> void __iomem *base;
> + u32 cmdq_prod;
> + u64 *cmdq_base;
> + size_t cmdq_log2size;
> };
>
> extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue
2025-01-23 13:01 ` Robin Murphy
@ 2025-01-29 11:15 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-29 11:15 UTC (permalink / raw)
To: Robin Murphy
Cc: iommu, kvmarm, linux-kernel, linux-arm-kernel, catalin.marinas,
will, maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui,
robdclark, joro, jean-philippe, jgg, nicolinc, vdonnefort,
qperret, tabba, danielmentz, tzukui
On Thu, Jan 23, 2025 at 01:01:55PM +0000, Robin Murphy wrote:
> On 2024-12-12 6:03 pm, Mostafa Saleh wrote:
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >
> > Map the command queue allocated by the host into the hypervisor address
> > space. When the host mappings are finalized, the queue is unmapped from
> > the host.
>
> Don't forget the fun of reimplementing the errata workarounds to avoid
> generating certain problematic command sequences - beware it's mostly
> implicit in the current kernel driver :)
Thanks, I see I missed “ARM_SMMU_OPT_CMDQ_FORCE_SYNC”. I will try to re-use
as much of the command queue code as possible in v3; although it’s unlikely
the hypervisor will have the same insertion algorithm as the host, we can at
least share the command population.
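As a rough sketch of what honouring that quirk could look like on the
hypervisor side (illustrative only: the posted hyp struct has no options
field, so smmu->options and the reuse of ARM_SMMU_OPT_CMDQ_FORCE_SYNC here are
assumptions), the issue path would simply force a CMD_SYNC and drain the queue
after every command when the quirk is set, building on the smmu_add_cmd() and
smmu_sync_cmd() helpers from this patch:

static int smmu_issue_cmd(struct hyp_arm_smmu_v3_device *smmu,
			  struct arm_smmu_cmdq_ent *cmd)
{
	int ret = smmu_add_cmd(smmu, cmd);

	if (ret)
		return ret;

	/*
	 * Hypothetical quirk handling: affected implementations need a
	 * CMD_SYNC (and a drained queue) after every command, so never
	 * leave commands pending in the queue.
	 */
	if (smmu->options & ARM_SMMU_OPT_CMDQ_FORCE_SYNC)
		return smmu_sync_cmd(smmu);

	return 0;
}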
Thanks,
Mostafa
>
> Thanks,
> Robin.
>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Mostafa Saleh <smostafa@google.com>
> > ---
> > arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 165 ++++++++++++++++++++
> > include/kvm/arm_smmu_v3.h | 4 +
> > 2 files changed, 169 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
> > index f7e60c188cb0..e15356509424 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
> > @@ -41,6 +41,15 @@ struct hyp_arm_smmu_v3_device *kvm_hyp_arm_smmu_v3_smmus;
> > __ret; \
> > })
> > +#define smmu_wait_event(_smmu, _cond) \
> > +({ \
> > + if ((_smmu)->features & ARM_SMMU_FEAT_SEV) { \
> > + while (!(_cond)) \
> > + wfe(); \
> > + } \
> > + smmu_wait(_cond); \
> > +})
> > +
> > static int smmu_write_cr0(struct hyp_arm_smmu_v3_device *smmu, u32 val)
> > {
> > writel_relaxed(val, smmu->base + ARM_SMMU_CR0);
> > @@ -60,6 +69,123 @@ static void smmu_reclaim_pages(u64 phys, size_t size)
> > WARN_ON(__pkvm_hyp_donate_host(phys >> PAGE_SHIFT, size >> PAGE_SHIFT));
> > }
> > +#define Q_WRAP(smmu, reg) ((reg) & (1 << (smmu)->cmdq_log2size))
> > +#define Q_IDX(smmu, reg) ((reg) & ((1 << (smmu)->cmdq_log2size) - 1))
> > +
> > +static bool smmu_cmdq_full(struct hyp_arm_smmu_v3_device *smmu)
> > +{
> > + u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
> > +
> > + return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
> > + Q_WRAP(smmu, smmu->cmdq_prod) != Q_WRAP(smmu, cons);
> > +}
> > +
> > +static bool smmu_cmdq_empty(struct hyp_arm_smmu_v3_device *smmu)
> > +{
> > + u64 cons = readl_relaxed(smmu->base + ARM_SMMU_CMDQ_CONS);
> > +
> > + return Q_IDX(smmu, smmu->cmdq_prod) == Q_IDX(smmu, cons) &&
> > + Q_WRAP(smmu, smmu->cmdq_prod) == Q_WRAP(smmu, cons);
> > +}
> > +
> > +static int smmu_add_cmd(struct hyp_arm_smmu_v3_device *smmu,
> > + struct arm_smmu_cmdq_ent *ent)
> > +{
> > + int i;
> > + int ret;
> > + u64 cmd[CMDQ_ENT_DWORDS] = {};
> > + int idx = Q_IDX(smmu, smmu->cmdq_prod);
> > + u64 *slot = smmu->cmdq_base + idx * CMDQ_ENT_DWORDS;
> > +
> > + if (smmu->iommu.power_is_off)
> > + return -EPIPE;
> > +
> > + ret = smmu_wait_event(smmu, !smmu_cmdq_full(smmu));
> > + if (ret)
> > + return ret;
> > +
> > + cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
> > +
> > + switch (ent->opcode) {
> > + case CMDQ_OP_CFGI_ALL:
> > + cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_RANGE, 31);
> > + break;
> > + case CMDQ_OP_CFGI_CD:
> > + cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SSID, ent->cfgi.ssid);
> > + fallthrough;
> > + case CMDQ_OP_CFGI_STE:
> > + cmd[0] |= FIELD_PREP(CMDQ_CFGI_0_SID, ent->cfgi.sid);
> > + cmd[1] |= FIELD_PREP(CMDQ_CFGI_1_LEAF, ent->cfgi.leaf);
> > + break;
> > + case CMDQ_OP_TLBI_NH_VA:
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
> > + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
> > + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
> > + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
> > + cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_VA_MASK;
> > + break;
> > + case CMDQ_OP_TLBI_NSNH_ALL:
> > + break;
> > + case CMDQ_OP_TLBI_NH_ASID:
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
> > + fallthrough;
> > + case CMDQ_OP_TLBI_S12_VMALL:
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> > + break;
> > + case CMDQ_OP_TLBI_S2_IPA:
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_NUM, ent->tlbi.num);
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_SCALE, ent->tlbi.scale);
> > + cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
> > + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_LEAF, ent->tlbi.leaf);
> > + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TTL, ent->tlbi.ttl);
> > + cmd[1] |= FIELD_PREP(CMDQ_TLBI_1_TG, ent->tlbi.tg);
> > + cmd[1] |= ent->tlbi.addr & CMDQ_TLBI_1_IPA_MASK;
> > + break;
> > + case CMDQ_OP_CMD_SYNC:
> > + cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + for (i = 0; i < CMDQ_ENT_DWORDS; i++)
> > + slot[i] = cpu_to_le64(cmd[i]);
> > +
> > + smmu->cmdq_prod++;
> > + writel(Q_IDX(smmu, smmu->cmdq_prod) | Q_WRAP(smmu, smmu->cmdq_prod),
> > + smmu->base + ARM_SMMU_CMDQ_PROD);
> > + return 0;
> > +}
> > +
> > +static int smmu_sync_cmd(struct hyp_arm_smmu_v3_device *smmu)
> > +{
> > + int ret;
> > + struct arm_smmu_cmdq_ent cmd = {
> > + .opcode = CMDQ_OP_CMD_SYNC,
> > + };
> > +
> > + ret = smmu_add_cmd(smmu, &cmd);
> > + if (ret)
> > + return ret;
> > +
> > + return smmu_wait_event(smmu, smmu_cmdq_empty(smmu));
> > +}
> > +
> > +__maybe_unused
> > +static int smmu_send_cmd(struct hyp_arm_smmu_v3_device *smmu,
> > + struct arm_smmu_cmdq_ent *cmd)
> > +{
> > + int ret = smmu_add_cmd(smmu, cmd);
> > +
> > + if (ret)
> > + return ret;
> > +
> > + return smmu_sync_cmd(smmu);
> > +}
> > +
> > static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
> > {
> > u64 val, old;
> > @@ -94,6 +220,41 @@ static int smmu_init_registers(struct hyp_arm_smmu_v3_device *smmu)
> > return 0;
> > }
> > +static int smmu_init_cmdq(struct hyp_arm_smmu_v3_device *smmu)
> > +{
> > + u64 cmdq_base;
> > + size_t cmdq_nr_entries, cmdq_size;
> > + int ret;
> > + enum kvm_pgtable_prot prot = PAGE_HYP;
> > +
> > + cmdq_base = readq_relaxed(smmu->base + ARM_SMMU_CMDQ_BASE);
> > + if (cmdq_base & ~(Q_BASE_RWA | Q_BASE_ADDR_MASK | Q_BASE_LOG2SIZE))
> > + return -EINVAL;
> > +
> > + smmu->cmdq_log2size = cmdq_base & Q_BASE_LOG2SIZE;
> > + cmdq_nr_entries = 1 << smmu->cmdq_log2size;
> > + cmdq_size = cmdq_nr_entries * CMDQ_ENT_DWORDS * 8;
> > +
> > + cmdq_base &= Q_BASE_ADDR_MASK;
> > +
> > + if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > + prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> > +
> > + ret = ___pkvm_host_donate_hyp_prot(cmdq_base >> PAGE_SHIFT,
> > + PAGE_ALIGN(cmdq_size) >> PAGE_SHIFT,
> > + false, prot);
> > + if (ret)
> > + return ret;
> > +
> > + smmu->cmdq_base = hyp_phys_to_virt(cmdq_base);
> > +
> > + memset(smmu->cmdq_base, 0, cmdq_size);
> > + writel_relaxed(0, smmu->base + ARM_SMMU_CMDQ_PROD);
> > + writel_relaxed(0, smmu->base + ARM_SMMU_CMDQ_CONS);
> > +
> > + return 0;
> > +}
> > +
> > static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
> > {
> > int ret;
> > @@ -113,6 +274,10 @@ static int smmu_init_device(struct hyp_arm_smmu_v3_device *smmu)
> > if (ret)
> > return ret;
> > + ret = smmu_init_cmdq(smmu);
> > + if (ret)
> > + return ret;
> > +
> > return kvm_iommu_init_device(&smmu->iommu);
> > }
> > diff --git a/include/kvm/arm_smmu_v3.h b/include/kvm/arm_smmu_v3.h
> > index fb24bcef1624..393a1a04edba 100644
> > --- a/include/kvm/arm_smmu_v3.h
> > +++ b/include/kvm/arm_smmu_v3.h
> > @@ -16,8 +16,12 @@ struct hyp_arm_smmu_v3_device {
> > struct kvm_hyp_iommu iommu;
> > phys_addr_t mmio_addr;
> > size_t mmio_size;
> > + unsigned long features;
> > void __iomem *base;
> > + u32 cmdq_prod;
> > + u64 *cmdq_base;
> > + size_t cmdq_log2size;
> > };
> > extern size_t kvm_nvhe_sym(kvm_hyp_arm_smmu_v3_count);
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-23 8:13 ` Tian, Kevin
@ 2025-01-29 12:16 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-29 12:16 UTC (permalink / raw)
To: Tian, Kevin
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Thu, Jan 23, 2025 at 08:13:34AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 22, 2025 7:04 PM
> >
> > On Fri, Jan 17, 2025 at 06:57:12AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > > Sent: Friday, January 17, 2025 3:15 AM
> > > >
> > > > On Thu, Jan 16, 2025 at 06:39:31AM +0000, Tian, Kevin wrote:
> > > > > > From: Mostafa Saleh <smostafa@google.com>
> > > > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > > > >
> > > > > > On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> > > > > > > On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > > > > > > > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > > > > > > > but tbh, I don’t think it’s a deal breaker for now.
> > > > > > >
> > > > > > > Again, it depends what your actual use case for translation is inside
> > > > > > > the host/guest environments. It would be good to clearly spell this
> > out..
> > > > > > > There are few drivers that directly manipulate the iommu_domains
> > of a
> > > > > > > device. a few gpus, ath1x wireless, some tegra stuff, "venus". Which
> > > > > > > of those are you targeting?
> > > > > > >
> > > > > >
> > > > > > Not sure I understand this point about manipulating domains.
> > > > > > AFAIK, SVA is not that common, including mobile spaces but I can be
> > > > wrong,
> > > > > > that’s why it’s not a priority here.
> > > > >
> > > > > Nested translation is required beyond SVA. A scenario which requires
> > > > > a vIOMMU and multiple device domains within the guest would like to
> > > > > embrace nesting. Especially for ARM vSMMU nesting is a must.
> >
> > We can still do para-virtualization for guests the same way we do for the
> > host and use a single stage IOMMU.
>
> same way but both require a nested setup.
>
> In concept there are two layers of address translations: GVA->GPA via
> guest page table, and GPA->HPA via pKVM page table.
>
> The difference between host/guest is just on the GPA mapping. For host
> it's 1:1 with additional hardening for which portion can be mapped and
> which cannot. For guest it's non-identical with the mapping established
> from the host.
>
> A nested translation naturally fits that conceptual layers.
>
> Using a single-stage IOMMU means you need to combine two layers
> into one layer i.e. GVA->HPA by removing GPA. Then you have to
> paravirt guest page table so every guest PTE change is intercepted
> to replace GPA with HPA.
>
> Doing so completely kills the benefit of SVA, which is why Jason said
> a no-go.
I agree, this can’t work with SVA; in order to make that work we would
need some new para-virt operation to install the S1 table, and the
hypervisor would have to configure the device in nested translation.
But guests that don’t need SVA can just use single-stage
para-virt (like virtio-iommu).
>
> >
> > > >
> > > > Right, if you need an iommu domain in the guest there are only three
> > > > mainstream ways to get this in Linux:
> > > > 1) Use the DMA API and have the iommu group be translating. This is
> > > > optional in that the DMA API usually supports identity as an option.
> > > > 2) A driver directly calls iommu_paging_domain_alloc() and manually
> > > > attaches it to some device, and does not use the DMA API. My list
> > > > above of ath1x/etc are examples doing this
> > > > 3) Use VFIO
> > > >
> > > > My remark to Mostafa is to be specific, which of the above do you want
> > > > to do in your mobile guest (and what driver exactly if #2) and why.
> > > >
> > > > This will help inform what the performance profile looks like and
> > > > guide if nesting/para virt is appropriate.
> > >
> >
> > AFAIK, the most common use cases would be:
> > - Devices using DMA API because it requires a lot of memory to be
> > contiguous in IOVA, which is hard to do with identity
> > - Devices with security requirements/constraints to be isolated from the
> > rest of the system, also using DMA API
> > - VFIO is something we are looking at the moment and have prototyped with
> > pKVM, and it should be supported soon in Android (only for platform
> > devices for now)
>
> what really matters is the frequency of map/unmap.
Yes, though it differs between devices/systems :/ that’s why I reckon we
would need both in the long term. However, starting with some benchmarks
for these cases can help us understand the magnitude of both solutions
and prioritise which one is more suitable to start with for upstream.
Thanks,
Mostafa
>
> >
> > > Yeah that part would be critical to help decide which route to pursue
> > > first. Even when all options might be required in the end when pKVM
> > > is scaled to more scenarios, as you mentioned in another mail, a staging
> > > approach would be much preferable to evolve.
> >
> > I agree that would probably be the case. I will work on more staging
> > approach for v3, mostly without the pv part as Jason suggested.
> >
> > >
> > > The pros/cons between nesting/para virt is clear - more static the
> > > mapping is, more gain from the para approach due to less paging
> > > walking and smaller tlb footprint, while vice versa nesting performs
> > > much better by avoiding frequent para calls on page table mgmt. 😊
> >
> > I am also working to get the numbers for both cases so we know
> > the order of magnitude of each case, as I guess it won't be as clear
> > for large systems with many DMA initiators what approach is best.
> >
> >
>
> That'd be great!
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-23 8:25 ` Tian, Kevin
@ 2025-01-29 12:21 ` Mostafa Saleh
2025-01-29 13:50 ` Jason Gunthorpe
2025-02-18 9:52 ` Tian, Kevin
0 siblings, 2 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-29 12:21 UTC (permalink / raw)
To: Tian, Kevin
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Thu, Jan 23, 2025 at 08:25:13AM +0000, Tian, Kevin wrote:
> > From: Mostafa Saleh <smostafa@google.com>
> > Sent: Wednesday, January 22, 2025 7:29 PM
> >
> > Hi Kevin,
> >
> > On Thu, Jan 16, 2025 at 08:51:11AM +0000, Tian, Kevin wrote:
> > > > From: Mostafa Saleh <smostafa@google.com>
> > > > Sent: Wednesday, January 8, 2025 8:10 PM
> > > >
> > > > My plan was basically:
> > > > 1) Finish and send nested SMMUv3 as RFC, with more insights about
> > > > performance and complexity trade-offs of both approaches.
> > > >
> > > > 2) Discuss next steps for the upstream solution in an upcoming
> > conference
> > > > (like LPC or earlier if possible) and work on upstreaming it.
> > > >
> > > > 3) Work on guest device passthrough and IOMMU support.
> > > >
> > > > I am open to gradually upstream this as you mentioned where as a first
> > > > step pKVM would establish DMA isolation without translation for host,
> > > > that should be enough to have functional pKVM and run protected
> > > > workloads.
> > >
> > > Does that approach assume starting from a full-fledged SMMU driver
> > > inside pKVM or do we still expect the host to enumerate/initialize
> > > the hw (but skip any translation) so the pKVM part can focus only
> > > on managing translation?
> >
> > I have been thinking about this, and I think most of the initialization
> > won’t be changed, and we would do any possible initialization in the
> > kernel avoiding complexity in the hypervisor (parsing
> > device-tree/acpi...) also that makes code re-use easier if both drivers
> > do that in the kernel space.
>
> yeah that'd make sense for now.
>
> >
> > >
> > > I'm curious about the burden of maintaining another IOMMU
> > > subsystem under the KVM directory. It's not built into the host kernel
> > > image, but hosted in the same kernel repo. This series tried to
> > > reduce the duplication via io-pgtable-arm but still considerable
> > > duplication exists (~2000LOC in pKVM). That would be very confusing
> > > moving forward and hard to maintain e.g. ensure bugs fixed in
> > > both sides.
> >
> > The KVM IOMMU subsystem is very different from the kernel one; it’s about
> > paravirtualization and abstraction. I tried my best to make sure all
> > possible code can be re-used, by splitting out arm-smmu-v3-common.c and
> > io-pgtable-arm-common.c and even re-using iommu_iotlb_gather from the
> > iommu code.
> > So my guess is there won't be much of that effort, as there is no
> > duplication in logic.
>
> I'm not sure how different it is. In concept it still manages IOMMU
> mappings, just with additional restrictions. Bear with me: I haven't
> looked into the details of the ~2000 LOC pKVM SMMU driver, but the
> size does scare me, especially considering the case when other vendors
> are supported later.
>
> Let's keep it in mind and re-check after you have v3. It will be simpler,
> so I suppose the actual difference between a pKVM IOMMU driver and
> a normal kernel IOMMU driver can be judged more easily than now.
I see. I believe we can reduce the size by re-using more data-structure
types, plus more refactoring on the kernel side.
Also, we can make many parts of the code common outside the driver, such
as calling hypercalls, dealing with memory allocation..., so other IOMMUs
will only need to add minimal code.
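(To illustrate the kind of split meant here - a sketch only, with made-up
names rather than the series' actual interface: the common pKVM IOMMU code
would own the hypercall plumbing, donated-memory tracking and page
allocation, and a vendor driver such as SMMUv3 would only fill in a small
ops table.)

/* Illustrative vendor ops consumed by the common pKVM IOMMU core;
 * names and fields are hypothetical, for illustration only. */
struct kvm_hyp_iommu;
struct kvm_hyp_iommu_domain;

struct kvm_iommu_ops {
        int (*init)(void);
        int (*attach_dev)(struct kvm_hyp_iommu *iommu,
                          struct kvm_hyp_iommu_domain *domain,
                          u32 sid, u32 pasid);
        int (*detach_dev)(struct kvm_hyp_iommu *iommu,
                          struct kvm_hyp_iommu_domain *domain,
                          u32 sid, u32 pasid);
        int (*map_pages)(struct kvm_hyp_iommu_domain *domain,
                         unsigned long iova, phys_addr_t paddr,
                         size_t pgsize, size_t pgcount, int prot);
        size_t (*unmap_pages)(struct kvm_hyp_iommu_domain *domain,
                              unsigned long iova, size_t pgsize,
                              size_t pgcount);
};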
>
> The learning here would be beneficial to the design of other pKVM
> components, e.g. when porting pKVM to x86. Currently KVM x86 is
> monolithic. Maintaining pKVM under KVM/x86 would be a much
> bigger challenge than doing it under KVM/arm. There will also be
> questions about what can be shared and how to better maintain
> the pKVM-specific logic in KVM/x86.
>
> Overall my gut feeling is that the pKVM-specific code must be small
> enough, otherwise maintaining a run-time-irrelevant project in the
> kernel repo would be questionable. 😊
>
I am not sure I understand, but I don’t see how pKVM is irrelevant,
it’s a mode in KVM (just like nVHE/hVHE where they run in 2 exception
levels) and can’t be separated from the kernel as that defeats the
point of KVM, that means that all hypercalls have to be stable ABI,
same for the shared data, shared structs, types...
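(For context, a simplified sketch of why these hypercalls need no stable
ABI - not the exact upstream definition: the host-to-EL2 calls are
identified by an enum that both the kernel and the hypervisor object
include at build time, since they are built from the same tree.)

/* Simplified: hypercall IDs are positions in a shared enum, so they can
 * be renumbered or extended freely with each kernel version. */
enum __kvm_host_smccc_func {
        __KVM_HOST_SMCCC_FUNC___kvm_hyp_init,
        __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp,
        /* new entries (e.g. the IOMMU calls in this series) slot in here */
};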
Thanks,
Mostafa
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-29 12:21 ` Mostafa Saleh
@ 2025-01-29 13:50 ` Jason Gunthorpe
2025-01-29 14:08 ` Mostafa Saleh
2025-02-18 9:52 ` Tian, Kevin
1 sibling, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 13:50 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Tian, Kevin, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Wed, Jan 29, 2025 at 12:21:01PM +0000, Mostafa Saleh wrote:
> levels) and can’t be separated from the kernel as that defeats the
> point of KVM, that means that all hypercalls have to be stable ABI,
> same for the shared data, shared structs, types...
Sorry, just trying to understand this sentence: today pKVM has no
stable ABI, right? That is the whole point of building it into the
kernel?
Things like the CC world are creating stable ABIs for their pKVM-like
environments because they are not built into the kernel? And thus they
take the pain of that?
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-29 13:50 ` Jason Gunthorpe
@ 2025-01-29 14:08 ` Mostafa Saleh
0 siblings, 0 replies; 97+ messages in thread
From: Mostafa Saleh @ 2025-01-29 14:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
On Wed, Jan 29, 2025 at 09:50:53AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 12:21:01PM +0000, Mostafa Saleh wrote:
> > levels) and can’t be separated from the kernel as that defeats the
> > point of KVM, that means that all hypercalls have to be stable ABI,
> > same for the shared data, shared structs, types...
>
> Sorry, just trying to understand this sentence: today pKVM has no
> stable ABI, right? That is the whole point of building it into the
> kernel?
Yes.
>
> Things like the CC world are creating stable ABIs for their pKVM-like
> environments because they are not built into the kernel? And thus they
> take the pain of that?
Yes, my point is that we can't just separate pKVM as Kevin was mentioning,
as it has no stable ABI and is tightly coupled with the kernel.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* RE: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
2025-01-29 12:21 ` Mostafa Saleh
2025-01-29 13:50 ` Jason Gunthorpe
@ 2025-02-18 9:52 ` Tian, Kevin
1 sibling, 0 replies; 97+ messages in thread
From: Tian, Kevin @ 2025-02-18 9:52 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Jason Gunthorpe, iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, January 29, 2025 8:21 PM
>
> On Thu, Jan 23, 2025 at 08:25:13AM +0000, Tian, Kevin wrote:
> >
> > The learning here would be beneficial to the design of other pKVM
> > components, e.g. when porting pKVM to x86. Currently KVM x86 is
> > monolithic. Maintaining pKVM under KVM/x86 would be a much
> > bigger challenge than doing it under KVM/arm. There will also be
> > questions about what can be shared and how to better maintain
> > the pKVM-specific logic in KVM/x86.
> >
> > Overall my gut feeling is that the pKVM-specific code must be small
> > enough, otherwise maintaining a run-time-irrelevant project in the
> > kernel repo would be questionable. 😊
> >
>
> I am not sure I understand, but I don’t see how pKVM is irrelevant,
> it’s a mode in KVM (just like nVHE/hVHE where they run in 2 exception
> levels) and can’t be separated from the kernel as that defeats the
> point of KVM, that means that all hypercalls have to be stable ABI,
> same for the shared data, shared structs, types...
>
Yes, pKVM doesn't favor a stable ABI. My point was more that
nVHE is a hardware limitation, so kvm-arm already coped with it
from day one and adding the concept of pKVM on top was relatively
easy, but changing other subsystems to support this split model
just for pKVM adds more maintenance burden. The maintainers
may then challenge the value of supporting pKVM if the cost of
maintaining the split model becomes too large... Anyway, we will
see how it turns out with more discussions on your next version.
^ permalink raw reply [flat|nested] 97+ messages in thread
end of thread, other threads:[~2025-02-18 10:08 UTC | newest]
Thread overview: 97+ messages
-- links below jump to the message on this page --
2024-12-12 18:03 [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 01/58] iommu/io-pgtable-arm: Split the page table driver Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 02/58] iommu/io-pgtable-arm: Split initialization Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 03/58] iommu/io-pgtable: Add configure() operation Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 04/58] iommu/arm-smmu-v3: Move some definitions to arm64 include/ Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 05/58] iommu/arm-smmu-v3: Extract driver-specific bits from probe function Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 06/58] iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 07/58] iommu/arm-smmu-v3: Move queue and table allocation " Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 08/58] iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 09/58] iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 10/58] iommu/arm-smmu-v3: Move common irq code to common file Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 11/58] KVM: arm64: pkvm: Add pkvm_udelay() Mostafa Saleh
2024-12-19 11:14 ` Quentin Perret
2024-12-19 11:21 ` Mostafa Saleh
2024-12-19 11:28 ` Quentin Perret
2024-12-12 18:03 ` [RFC PATCH v2 12/58] KVM: arm64: Add __pkvm_{use, unuse}_dma() Mostafa Saleh
2024-12-19 11:23 ` Quentin Perret
2024-12-12 18:03 ` [RFC PATCH v2 13/58] KVM: arm64: Introduce IOMMU driver infrastructure Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 14/58] KVM: arm64: pkvm: Add IOMMU hypercalls Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 15/58] KVM: arm64: iommu: Add a memory pool for the IOMMU Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 16/58] KVM: arm64: iommu: Add domains Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 17/58] KVM: arm64: iommu: Add {attach, detach}_dev Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 18/58] KVM: arm64: iommu: Add map/unmap() operations Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 19/58] KVM: arm64: iommu: support iommu_iotlb_gather Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 20/58] KVM: arm64: Support power domains Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 21/58] KVM: arm64: pkvm: Add __pkvm_host_add_remove_page() Mostafa Saleh
2024-12-19 11:10 ` Quentin Perret
2024-12-19 11:19 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 22/58] KVM: arm64: pkvm: Support SCMI power domain Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 23/58] KVM: arm64: iommu: Support power management Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 24/58] KVM: arm64: iommu: Support DABT for IOMMU Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 25/58] KVM: arm64: iommu: Add SMMUv3 driver Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 26/58] KVM: arm64: smmu-v3: Initialize registers Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 27/58] KVM: arm64: smmu-v3: Setup command queue Mostafa Saleh
2025-01-23 13:01 ` Robin Murphy
2025-01-29 11:15 ` Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 28/58] KVM: arm64: smmu-v3: Setup stream table Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 29/58] KVM: arm64: smmu-v3: Setup event queue Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 30/58] KVM: arm64: smmu-v3: Reset the device Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 31/58] KVM: arm64: smmu-v3: Support io-pgtable Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 32/58] KVM: arm64: smmu-v3: Add {alloc/free}_domain Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 33/58] KVM: arm64: smmu-v3: Add TLB ops Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 34/58] KVM: arm64: smmu-v3: Add context descriptor functions Mostafa Saleh
2024-12-12 18:03 ` [RFC PATCH v2 35/58] KVM: arm64: smmu-v3: Add attach_dev Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 36/58] KVM: arm64: smmu-v3: Add detach_dev Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 37/58] iommu/io-pgtable: Generalize walker interface Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 38/58] iommu/io-pgtable-arm: Add post table walker callback Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 39/58] drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 40/58] KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys Mostafa Saleh
2024-12-12 19:44 ` Jason Gunthorpe
2024-12-13 19:48 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 41/58] KVM: arm64: smmu-v3: Add DABT handler Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 42/58] iommu/arm-smmu-v3-kvm: Add host driver for pKVM Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 43/58] iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 44/58] iommu/arm-smmu-v3-kvm: Validate device features Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 45/58] iommu/arm-smmu-v3-kvm: Allocate structures and reset device Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 46/58] KVM: arm64: Add function to topup generic allocator Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 47/58] KVM: arm64: Add macro for SMCCC call with all returns Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 48/58] iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 49/58] iommu/arm-smmu-v3-kvm: Add IOMMU ops Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 50/58] iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 51/58] iommu/arm-smmu-v3-kvm: Support PASID operations Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 52/58] iommu/arm-smmu-v3-kvm: Add IRQs for the driver Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 53/58] iommu/arm-smmu-v3-kvm: Probe power domains Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 54/58] iommu/arm-smmu-v3-kvm: Enable runtime PM Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 55/58] drivers/iommu: Add deferred map_sg operations Mostafa Saleh
2024-12-19 12:48 ` Robin Murphy
2024-12-19 14:24 ` Mostafa Saleh
2025-01-02 20:18 ` Jason Gunthorpe
2025-01-03 15:35 ` Mostafa Saleh
2025-01-03 15:47 ` Jason Gunthorpe
2025-01-08 12:13 ` Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 56/58] KVM: arm64: iommu: Add hypercall for map_sg Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 57/58] iommu/arm-smmu-v3-kvm: Implement sg operations Mostafa Saleh
2024-12-12 18:04 ` [RFC PATCH v2 58/58] iommu/arm-smmu-v3-kvm: Support command queue batching Mostafa Saleh
2024-12-12 19:41 ` [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM Jason Gunthorpe
2024-12-13 19:39 ` Mostafa Saleh
2025-01-02 20:16 ` Jason Gunthorpe
2025-01-08 12:09 ` Mostafa Saleh
2025-01-16 6:39 ` Tian, Kevin
2025-01-16 19:14 ` Jason Gunthorpe
2025-01-17 6:57 ` Tian, Kevin
2025-01-22 11:04 ` Mostafa Saleh
2025-01-22 16:20 ` Jason Gunthorpe
2025-01-22 17:17 ` Mostafa Saleh
2025-01-22 19:16 ` Jason Gunthorpe
2025-01-23 8:13 ` Tian, Kevin
2025-01-29 12:16 ` Mostafa Saleh
2025-01-16 8:51 ` Tian, Kevin
2025-01-22 11:28 ` Mostafa Saleh
2025-01-23 8:25 ` Tian, Kevin
2025-01-29 12:21 ` Mostafa Saleh
2025-01-29 13:50 ` Jason Gunthorpe
2025-01-29 14:08 ` Mostafa Saleh
2025-02-18 9:52 ` Tian, Kevin
2025-01-16 19:19 ` Jason Gunthorpe
2025-01-22 11:46 ` Mostafa Saleh