public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains
@ 2026-04-28 13:13 fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support fangyu.yu
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

The RISC-V IOMMU architecture defines an AMO_HWAD capability (Hardware
Access/Dirty update) that allows the IOMMU to atomically set the A/D bits
in second-stage PTEs on DMA access.  When DC.tc.GADE is asserted, the IOMMU
autonomously sets D on the first write to a page mapped by an iohgatp
domain.  This series wires that capability up to the iommufd dirty-tracking
interface (IOMMU_HWPT_SET_DIRTY_TRACKING / IOMMU_HWPT_GET_DIRTY_BITMAP) and
reports IOMMU_CAP_DIRTY_TRACKING.

Design notes
------------

* The feature is scoped to second-stage (iohgatp) domains only; these are
  the domains created for KVM / VFIO device pass-through when userspace
  allocates an HWPT with IOMMU_HWPT_ALLOC_NEST_PARENT or
  IOMMU_HWPT_ALLOC_DIRTY_TRACKING.  First-stage (iosatp) domains are not
  touched by this series.

* The page-table side plugs into the existing generic_pt dirty hook
  framework (amdv1 / vtdss style).  RISC-V adds the three required PTE
  ops – is_write_dirty / make_write_clean / make_write_dirty.
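
As a rough illustration of what the three PTE ops named above amount to,
here is a minimal userspace sketch. It assumes only the standard RISC-V
PTE layout (W is bit 2, D is bit 7). The sketch_* names are hypothetical
and are not the generic_pt API: the real ops work on a struct pt_state
and have to update the PTE atomically against the IOMMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RISC-V PTE bits per the privileged spec: W is bit 2, D is bit 7. */
#define SKETCH_PTE_W (UINT64_C(1) << 2)
#define SKETCH_PTE_D (UINT64_C(1) << 7)

/* An entry counts as dirty once D has been set by hardware or software. */
static bool sketch_is_write_dirty(uint64_t pte)
{
	return (pte & SKETCH_PTE_D) != 0;
}

/* Clear D so that the next DMA write is recorded again. */
static uint64_t sketch_make_write_clean(uint64_t pte)
{
	return pte & ~SKETCH_PTE_D;
}

/* Force D on; the sketch only accepts entries that are writable. */
static uint64_t sketch_make_write_dirty(uint64_t pte)
{
	assert(pte & SKETCH_PTE_W);
	return pte | SKETCH_PTE_D;
}
```

A dirty-bitmap read would test each leaf with sketch_is_write_dirty()
and, unless the caller asked to keep the bits, follow up with
sketch_make_write_clean().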

Testing
-------

* Tested on QEMU RISC-V: a virtio-net and an e1000e device were passed
  through to an L2 guest via vfio-pci + iommufd.

* generic_pt KUnit: the existing test_dirty case now runs and passes for
  the RISC-V 64-bit format.

Follow-up work
--------------
* Build a dedicated end-to-end test case that drives the full flow
  (HWPT_ALLOC with DIRTY_TRACKING -> attach -> IOAS_MAP -> generate real
  DMA -> SET_DIRTY_TRACKING -> GET_DIRTY_BITMAP -> verify bitmap against
  expected IOVA footprint) so that the behaviour can be regression-tested
  beyond the KUnit PTE-level coverage.

* If possible, rebase and retest on top of the updated "iommu irqbypass"
  patchset.
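
The "verify bitmap against expected IOVA footprint" step of such a test
could look roughly like the sketch below. It is an assumption-laden
illustration, not part of this series: the helper names are
hypothetical, and it assumes one bit per page with bit 0 of byte 0
describing the first page of the tracked range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Set one bit for every page touched by [iova, iova + len). */
static void sketch_mark_expected(uint8_t *bitmap, uint64_t base,
				 uint64_t page_size, uint64_t iova,
				 uint64_t len)
{
	uint64_t first, last, i;

	assert(len > 0 && iova >= base);
	first = (iova - base) / page_size;
	last = (iova + len - 1 - base) / page_size;
	for (i = first; i <= last; i++)
		bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Compare the bitmap read back from the kernel with the expected one. */
static int sketch_bitmaps_match(const uint8_t *got, const uint8_t *want,
				size_t nbytes)
{
	return memcmp(got, want, nbytes) == 0;
}
```

The test would call sketch_mark_expected() once per DMA buffer it
programmed into the device, then compare the result with the bitmap
returned by GET_DIRTY_BITMAP.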


Fangyu Yu (6):
  iommupt: Add RISC-V Second-stage (iohgatp) page table support
  iommu/riscv: Add domain_alloc_paging_flags for second-stage domain
  iommupt: Don't preset D when RISC-V IOMMU dirty tracking on
  iommu/riscv: Add dirty tracking support for second-stage domains
  iommu/riscv: Add IOTINVAL.GVMA after updating DDT/PDT entries
  iommupt: Add RISC-V dirty tracking PTE ops

Tomasz Jeznach (2):
  iommu/riscv: report iommu capabilities
  RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch

Zong Li (3):
  iommu/riscv: use data structure instead of individual values
  iommu/riscv: support GSCID and GVMA invalidation command
  iommu/riscv: support nested iommu for getting iommu hardware
    information

 arch/riscv/kvm/Kconfig               |   2 +
 drivers/iommu/generic_pt/fmt/riscv.h | 120 ++++++++++++-
 drivers/iommu/riscv/iommu-bits.h     |   7 +
 drivers/iommu/riscv/iommu.c          | 247 +++++++++++++++++++++++----
 include/linux/generic_pt/common.h    |  13 ++
 include/linux/generic_pt/iommu.h     |  17 +-
 include/uapi/linux/iommufd.h         |  18 ++
 7 files changed, 383 insertions(+), 41 deletions(-)

-- 
2.50.1



* [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:32   ` Jason Gunthorpe
  2026-04-28 13:13 ` [RFC PATCH 02/11] iommu/riscv: report iommu capabilities fangyu.yu
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

Add support for Sv39x4/Sv48x4/Sv57x4 Second-stage page tables used by
the RISC-V IOMMU iohgatp register. The x4 root page table is 16 KiB
instead of the usual 4 KiB, covering 2 extra GPA bits (hw_max_vasz_lg2
= 41/50/59).
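
The numbers above follow from the table geometry. The sketch below is
only a worked check of that arithmetic, with hypothetical sketch_*
helpers: it assumes 4 KiB pages, 8-byte PTEs, 9 index bits per level,
and a top_level that counts from 0 at the leaf table, as in the
riscvpt_iommu_fmt_init() switch in this patch.

```c
#include <assert.h>

enum {
	SKETCH_PAGE_LG2 = 12,	/* 4 KiB pages */
	SKETCH_INDEX_LG2 = 9,	/* 512 entries per regular table */
	SKETCH_PTE_SIZE = 8,	/* bytes per PTE */
};

/* Address bits covered by a table whose top level is top_level. */
static unsigned int sketch_vasz_lg2(unsigned int top_level, int second_stage)
{
	unsigned int bits;

	assert(top_level >= 2 && top_level <= 4);
	bits = SKETCH_PAGE_LG2 + SKETCH_INDEX_LG2 * (top_level + 1);
	/* The x4 root table indexes 2 extra GPA bits. */
	return second_stage ? bits + 2 : bits;
}

/* Size in bytes of the root table. */
static unsigned int sketch_root_table_bytes(int second_stage)
{
	unsigned int index_lg2 = SKETCH_INDEX_LG2 + (second_stage ? 2 : 0);

	return (1u << index_lg2) * SKETCH_PTE_SIZE;
}

/* MODE field value, shared by iosatp and iohgatp: top_level + 6. */
static unsigned int sketch_mode(unsigned int top_level)
{
	return top_level + 6;
}
```

For top levels 2, 3 and 4 this yields 39/48/57 for first-stage tables
and 41/50/59 for second-stage tables, with a root table of 4096 and
16384 bytes respectively.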

Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/generic_pt/fmt/riscv.h | 64 +++++++++++++++++++++++++---
 include/linux/generic_pt/common.h    |  5 +++
 include/linux/generic_pt/iommu.h     | 17 +++++++-
 3 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/generic_pt/fmt/riscv.h b/drivers/iommu/generic_pt/fmt/riscv.h
index a7fef6266a36..4fe645e60375 100644
--- a/drivers/iommu/generic_pt/fmt/riscv.h
+++ b/drivers/iommu/generic_pt/fmt/riscv.h
@@ -37,7 +37,16 @@ enum {
 	PT_MAX_OUTPUT_ADDRESS_LG2 = 34,
 	PT_MAX_TOP_LEVEL = 1,
 #else
-	PT_MAX_VA_ADDRESS_LG2 = 57,
+	/*
+	 * PT_MAX_VA_ADDRESS_LG2 is the upper bound accepted by the generic
+	 * pt_iommu_init() range check.  It must cover both first-stage and
+	 * second-stage (G-stage) modes:
+	 *
+	 *   First-stage  (fsc/iosatp): Sv39=39, Sv48=48, Sv57=57
+	 *   Second-stage (iohgatp):    Sv39x4=41, Sv48x4=50, Sv57x4=59
+	 *
+	 */
+	PT_MAX_VA_ADDRESS_LG2 = 59,
 	PT_MAX_OUTPUT_ADDRESS_LG2 = 56,
 	PT_MAX_TOP_LEVEL = 4,
 #endif
@@ -124,6 +133,14 @@ riscvpt_entry_num_contig_lg2(const struct pt_state *pts)
 
 static inline unsigned int riscvpt_num_items_lg2(const struct pt_state *pts)
 {
+	/*
+	 * Second-stage (iohgatp) root page tables have 4x the usual number of
+	 * entries (2048 = 2^11 instead of 512 = 2^9) to cover the 2 extra GPA
+	 * bits in Sv39x4/Sv48x4/Sv57x4.  Only the root (top) level is
+	 * enlarged; all other levels remain at the standard 9-bit index width.
+	 */
+	if (to_riscvpt(pts)->second_stage && pts->level == pts->range->top_level)
+		return PT_TABLEMEM_LG2SZ - ilog2(sizeof(u64)) + 2;
 	return PT_TABLEMEM_LG2SZ - ilog2(sizeof(u64));
 }
 #define pt_num_items_lg2 riscvpt_num_items_lg2
@@ -254,6 +271,7 @@ riscvpt_iommu_fmt_init(struct pt_iommu_riscv_64 *iommu_table,
 	struct pt_riscv *table = &iommu_table->riscv_64pt;
 
 	switch (cfg->common.hw_max_vasz_lg2) {
+	/* First-stage (fsc/iosatp): Sv39 / Sv48 / Sv57 */
 	case 39:
 		pt_top_set_level(&table->common, 2);
 		break;
@@ -263,6 +281,22 @@ riscvpt_iommu_fmt_init(struct pt_iommu_riscv_64 *iommu_table,
 	case 57:
 		pt_top_set_level(&table->common, 4);
 		break;
+	/*
+	 * Second-stage (iohgatp): Sv39x4 / Sv48x4 / Sv57x4.
+	 * The top level is the same as for the first-stage counterpart.
+	 */
+	case 41:
+		pt_top_set_level(&table->common, 2);
+		table->second_stage = true;
+		break;
+	case 50:
+		pt_top_set_level(&table->common, 3);
+		table->second_stage = true;
+		break;
+	case 59:
+		pt_top_set_level(&table->common, 4);
+		table->second_stage = true;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -283,10 +317,17 @@ riscvpt_iommu_fmt_hw_info(struct pt_iommu_riscv_64 *table,
 	PT_WARN_ON(top_phys & ~PT_TOP_PHYS_MASK);
 
 	/*
-	 * See Table 3. Encodings of iosatp.MODE field" for DC.tx.SXL = 0:
-	 *  8 = Sv39 = top level 2
-	 *  9 = Sv38 = top level 3
-	 *  10 = Sv57 = top level 4
+	 * Both first-stage (fsc/iosatp) and second-stage (iohgatp) share the
+	 * same MODE numeric values for a given top level:
+	 *   top_level 2 -> MODE 8  (Sv39 / Sv39x4)
+	 *   top_level 3 -> MODE 9  (Sv48 / Sv48x4)
+	 *   top_level 4 -> MODE 10 (Sv57 / Sv57x4)
+	 *
+	 * The union members fsc_iosatp_mode and iohgatp_mode occupy the same
+	 * byte; the caller selects the appropriate name based on domain type.
+	 *
+	 * See "Table 3. Encodings of iosatp.MODE field" (DC.tc.SXL = 0) and
+	 * "Table 2. Encoding of iohgatp.MODE field" in the RISC-V IOMMU spec.
 	 */
 	info->fsc_iosatp_mode = top_range->top_level + 6;
 }
@@ -294,6 +335,7 @@ riscvpt_iommu_fmt_hw_info(struct pt_iommu_riscv_64 *table,
 
 #if defined(GENERIC_PT_KUNIT)
 static const struct pt_iommu_riscv_64_cfg riscv_64_kunit_fmt_cfgs[] = {
+	/* First-stage (fsc/iosatp): Sv39 / Sv48 / Sv57 */
 	[0] = { .common.features = BIT(PT_FEAT_RISCV_SVNAPOT_64K),
 		.common.hw_max_oasz_lg2 = 56,
 		.common.hw_max_vasz_lg2 = 39 },
@@ -303,6 +345,18 @@ static const struct pt_iommu_riscv_64_cfg riscv_64_kunit_fmt_cfgs[] = {
 	[2] = { .common.features = BIT(PT_FEAT_RISCV_SVNAPOT_64K),
 		.common.hw_max_oasz_lg2 = 56,
 		.common.hw_max_vasz_lg2 = 57 },
+	/*
+	 * Second-stage (iohgatp): Sv39x4 / Sv48x4 / Sv57x4.
+	 */
+	[3] = { .common.features = BIT(PT_FEAT_RISCV_SVNAPOT_64K),
+		.common.hw_max_oasz_lg2 = 56,
+		.common.hw_max_vasz_lg2 = 41 },
+	[4] = { .common.features = 0,
+		.common.hw_max_oasz_lg2 = 56,
+		.common.hw_max_vasz_lg2 = 50 },
+	[5] = { .common.features = BIT(PT_FEAT_RISCV_SVNAPOT_64K),
+		.common.hw_max_oasz_lg2 = 56,
+		.common.hw_max_vasz_lg2 = 59 },
 };
 #define kunit_fmt_cfgs riscv_64_kunit_fmt_cfgs
 enum {
diff --git a/include/linux/generic_pt/common.h b/include/linux/generic_pt/common.h
index fc5d0b5edadc..e82dff33ece8 100644
--- a/include/linux/generic_pt/common.h
+++ b/include/linux/generic_pt/common.h
@@ -181,6 +181,11 @@ struct pt_riscv_32 {
 
 struct pt_riscv_64 {
 	struct pt_common common;
+	/*
+	 * True when this table is used for second-stage / iohgatp
+	 * address translation.
+	 */
+	bool second_stage;
 };
 
 enum {
diff --git a/include/linux/generic_pt/iommu.h b/include/linux/generic_pt/iommu.h
index dd0edd02a48a..f27d229ff318 100644
--- a/include/linux/generic_pt/iommu.h
+++ b/include/linux/generic_pt/iommu.h
@@ -328,7 +328,22 @@ struct pt_iommu_riscv_64_cfg {
 
 struct pt_iommu_riscv_64_hw_info {
 	u64 ppn;
-	u8 fsc_iosatp_mode;
+	union {
+		/*
+		 * First-stage (fsc/iosatp) MODE encoding:
+		 *   8 = Sv39, 9 = Sv48, 10 = Sv57
+		 * Used to program DC.fsc.iosatp.MODE.
+		 */
+		u8 fsc_iosatp_mode;
+		/*
+		 * Second-stage (iohgatp) MODE encoding:
+		 *   8 = Sv39x4, 9 = Sv48x4, 10 = Sv57x4
+		 * Used to program DC.iohgatp.MODE.
+		 * The numeric values are identical to fsc_iosatp_mode;
+		 * the caller selects the interpretation based on domain type.
+		 */
+		u8 iohgatp_mode;
+	};
 };
 
 IOMMU_FORMAT(riscv_64, riscv_64pt);
-- 
2.50.1



* [RFC PATCH 02/11] iommu/riscv: report iommu capabilities
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:33   ` Jason Gunthorpe
  2026-04-28 13:13 ` [RFC PATCH 03/11] iommu/riscv: use data structure instead of individual values fangyu.yu
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Tomasz Jeznach <tjeznach@rivosinc.com>

Report the RISC-V IOMMU capabilities required by the VFIO subsystem
to enable PCIe device assignment.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index a31f50bbad35..15e2a333f969 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1336,6 +1336,17 @@ static struct iommu_group *riscv_iommu_device_group(struct device *dev)
 	return generic_device_group(dev);
 }
 
+static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+	case IOMMU_CAP_DEFERRED_FLUSH:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int riscv_iommu_of_xlate(struct device *dev, const struct of_phandle_args *args)
 {
 	return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -1397,6 +1408,7 @@ static void riscv_iommu_release_device(struct device *dev)
 
 static const struct iommu_ops riscv_iommu_ops = {
 	.of_xlate = riscv_iommu_of_xlate,
+	.capable = riscv_iommu_capable,
 	.identity_domain = &riscv_iommu_identity_domain,
 	.blocked_domain = &riscv_iommu_blocking_domain,
 	.release_domain = &riscv_iommu_blocking_domain,
-- 
2.50.1



* [RFC PATCH 03/11] iommu/riscv: use data structure instead of individual values
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 02/11] iommu/riscv: report iommu capabilities fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 04/11] iommu/riscv: support GSCID and GVMA invalidation command fangyu.yu
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel, Zong Li,
	Fangyu Yu

From: Zong Li <zong.li@sifive.com>

The parameter list will grow as more bit fields in the device context
need to be set up. Wrap the values in a data structure instead.

Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 15e2a333f969..369c98b7e1e5 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1077,7 +1077,7 @@ static void riscv_iommu_iodir_iotinval(struct riscv_iommu_device *iommu,
  * interim translation faults.
  */
 static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
-				     struct device *dev, u64 fsc, u64 ta)
+				     struct device *dev, struct riscv_iommu_dc *new_dc)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct riscv_iommu_dc *dc;
@@ -1116,10 +1116,10 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
 	for (i = 0; i < fwspec->num_ids; i++) {
 		dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
 		tc = READ_ONCE(dc->tc);
-		tc |= ta & RISCV_IOMMU_DC_TC_V;
+		tc |= new_dc->ta & RISCV_IOMMU_DC_TC_V;
 
-		WRITE_ONCE(dc->fsc, fsc);
-		WRITE_ONCE(dc->ta, ta & RISCV_IOMMU_PC_TA_PSCID);
+		WRITE_ONCE(dc->fsc, new_dc->fsc);
+		WRITE_ONCE(dc->ta, new_dc->ta & RISCV_IOMMU_PC_TA_PSCID);
 		/* Update device context, write TC.V as the last step. */
 		dma_wmb();
 		WRITE_ONCE(dc->tc, tc);
@@ -1205,22 +1205,22 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
 	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
 	struct pt_iommu_riscv_64_hw_info pt_info;
-	u64 fsc, ta;
+	struct riscv_iommu_dc dc = {0};
 
 	pt_iommu_riscv_64_hw_info(&domain->riscvpt, &pt_info);
 
 	if (!riscv_iommu_pt_supported(iommu, pt_info.fsc_iosatp_mode))
 		return -ENODEV;
 
-	fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, pt_info.fsc_iosatp_mode) |
+	dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, pt_info.fsc_iosatp_mode) |
 	      FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, pt_info.ppn);
-	ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
+	dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
 	     RISCV_IOMMU_PC_TA_V;
 
 	if (riscv_iommu_bond_link(domain, dev))
 		return -ENOMEM;
 
-	riscv_iommu_iodir_update(iommu, dev, fsc, ta);
+	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
 	info->domain = domain;
 
@@ -1292,9 +1292,12 @@ static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
 {
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
 	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+	struct riscv_iommu_dc dc = {0};
+
+	dc.fsc = RISCV_IOMMU_FSC_BARE;
 
 	/* Make device context invalid, translation requests will fault w/ #258 */
-	riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
+	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
 	info->domain = NULL;
 
@@ -1314,8 +1317,12 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
 {
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
 	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+	struct riscv_iommu_dc dc = {0};
+
+	dc.fsc = RISCV_IOMMU_FSC_BARE;
+	dc.ta = RISCV_IOMMU_PC_TA_V;
 
-	riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
+	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
 	info->domain = NULL;
 
-- 
2.50.1



* [RFC PATCH 04/11] iommu/riscv: support GSCID and GVMA invalidation command
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (2 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 03/11] iommu/riscv: use data structure instead of individual values fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 05/11] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch fangyu.yu
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel, Zong Li,
	Fangyu Yu

From: Zong Li <zong.li@sifive.com>

Add an ID allocator for GSCID and a wrapper for setting the GSCID in
the IOTLB invalidation command.

Set up iohgatp to enable the second-stage table, and flush the
second-stage table if the GSCID is set.

The GSCID of a domain should be freed when the domain is released. A
GSCID will be allocated for the parent domain in the nested IOMMU flow.

Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu-bits.h |  7 +++++++
 drivers/iommu/riscv/iommu.c      | 32 ++++++++++++++++++++++++++------
 2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index 29a0040b1c32..7c440926fa23 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -716,6 +716,13 @@ static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
 	cmd->dword1 = 0;
 }
 
+static inline void riscv_iommu_cmd_inval_gvma(struct riscv_iommu_command *cmd)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA);
+	cmd->dword1 = 0;
+}
+
 static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
 						  u64 addr)
 {
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 369c98b7e1e5..5dadf6d09139 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -48,6 +48,10 @@
 static DEFINE_IDA(riscv_iommu_pscids);
 #define RISCV_IOMMU_MAX_PSCID		(BIT(20) - 1)
 
+/* IOMMU GSCID allocation namespace. */
+static DEFINE_IDA(riscv_iommu_gscids);
+#define RISCV_IOMMU_MAX_GSCID		(BIT(16) - 1)
+
 /* Device resource-managed allocations */
 struct riscv_iommu_devres {
 	void *addr;
@@ -819,6 +823,7 @@ struct riscv_iommu_domain {
 	struct list_head bonds;
 	spinlock_t lock;		/* protect bonds list updates. */
 	int pscid;
+	int gscid;
 };
 PT_IOMMU_CHECK_DOMAIN(struct riscv_iommu_domain, riscvpt.iommu, domain);
 
@@ -967,15 +972,20 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
 
 		/*
 		 * IOTLB invalidation request can be safely omitted if already sent
-		 * to the IOMMU for the same PSCID, and with domain->bonds list
+		 * to the IOMMU for the same PSCID/GSCID, and with domain->bonds list
 		 * arranged based on the device's IOMMU, it's sufficient to check
 		 * last device the invalidation was sent to.
 		 */
 		if (iommu == prev)
 			continue;
 
-		riscv_iommu_cmd_inval_vma(&cmd);
-		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+		if (domain->gscid) {
+			riscv_iommu_cmd_inval_gvma(&cmd);
+			riscv_iommu_cmd_inval_set_gscid(&cmd, domain->gscid);
+		} else {
+			riscv_iommu_cmd_inval_vma(&cmd);
+			riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+		}
 		if (end - start < RISCV_IOMMU_IOTLB_INVAL_LIMIT - 1) {
 			unsigned long iova = start;
 
@@ -1120,6 +1130,7 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
 
 		WRITE_ONCE(dc->fsc, new_dc->fsc);
 		WRITE_ONCE(dc->ta, new_dc->ta & RISCV_IOMMU_PC_TA_PSCID);
+		WRITE_ONCE(dc->iohgatp, new_dc->iohgatp);
 		/* Update device context, write TC.V as the last step. */
 		dma_wmb();
 		WRITE_ONCE(dc->tc, tc);
@@ -1175,8 +1186,10 @@ static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
 
 	WARN_ON(!list_empty(&domain->bonds));
 
-	if ((int)domain->pscid > 0)
+	if (domain->pscid > 0)
 		ida_free(&riscv_iommu_pscids, domain->pscid);
+	if (domain->gscid > 0)
+		ida_free(&riscv_iommu_gscids, domain->gscid);
 
 	pt_iommu_deinit(&domain->riscvpt.iommu);
 	kfree(domain);
@@ -1212,8 +1225,15 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
 	if (!riscv_iommu_pt_supported(iommu, pt_info.fsc_iosatp_mode))
 		return -ENODEV;
 
-	dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, pt_info.fsc_iosatp_mode) |
-	      FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, pt_info.ppn);
+	if (domain->gscid) {
+		dc.iohgatp = FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_MODE, pt_info.iohgatp_mode) |
+			     FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_GSCID, domain->gscid) |
+			     FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_PPN, pt_info.ppn);
+	} else {
+		dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, pt_info.fsc_iosatp_mode) |
+		      FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, pt_info.ppn);
+	}
+
 	dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
 	     RISCV_IOMMU_PC_TA_V;
 
-- 
2.50.1



* [RFC PATCH 05/11] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (3 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 04/11] iommu/riscv: support GSCID and GVMA invalidation command fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain fangyu.yu
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Andrew Jones, Fangyu Yu

From: Tomasz Jeznach <tjeznach@rivosinc.com>

Enable KVM/VFIO support on the RISC-V architecture.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 arch/riscv/kvm/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index ec2cee0a39e0..54ee90f010ef 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -30,8 +30,10 @@ config KVM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_MMIO
 	select VIRT_XFER_TO_GUEST_WORK
+	select KVM_VFIO
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select SRCU
 	help
 	  Support hosting virtualized guest machines.
 
-- 
2.50.1



* [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (4 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 05/11] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:35   ` Jason Gunthorpe
  2026-04-28 13:13 ` [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on fangyu.yu
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

Replace .domain_alloc_paging with .domain_alloc_paging_flags so callers
can pass allocation flags to select the appropriate page-table type.

When IOMMU_HWPT_ALLOC_NEST_PARENT or IOMMU_HWPT_ALLOC_DIRTY_TRACKING is
set in @flags, allocate a second-stage (iohgatp) domain.

When @flags is 0, the behaviour is identical to the previous
domain_alloc_paging: a first-stage (iosatp) domain is allocated.
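
The selection described above can be sketched in isolation as follows.
This is a simplified userspace model, not the driver code: the SKETCH_*
flag and capability bit values are placeholders chosen for the example,
and the function only returns the hw_max_vasz_lg2 value (or -1 when no
mode is supported).

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag bits standing in for the IOMMU_HWPT_ALLOC_* flags. */
#define SKETCH_ALLOC_NEST_PARENT	(1u << 0)
#define SKETCH_ALLOC_DIRTY_TRACKING	(1u << 1)

/* Placeholder capability bits standing in for RISCV_IOMMU_CAPABILITIES_*. */
#define SKETCH_CAP_SV39		(1u << 0)
#define SKETCH_CAP_SV48		(1u << 1)
#define SKETCH_CAP_SV57		(1u << 2)
#define SKETCH_CAP_SV39X4	(1u << 3)
#define SKETCH_CAP_SV48X4	(1u << 4)
#define SKETCH_CAP_SV57X4	(1u << 5)

/* Pick the widest supported mode; either flag selects second stage. */
static int sketch_pick_vasz_lg2(uint32_t caps, uint32_t flags)
{
	const uint32_t known = SKETCH_ALLOC_NEST_PARENT |
			       SKETCH_ALLOC_DIRTY_TRACKING;

	/* The sketch models only these two flags. */
	assert((flags & ~known) == 0);

	if (flags & known) {
		if (caps & SKETCH_CAP_SV57X4)
			return 59;
		if (caps & SKETCH_CAP_SV48X4)
			return 50;
		if (caps & SKETCH_CAP_SV39X4)
			return 41;
		return -1;
	}
	if (caps & SKETCH_CAP_SV57)
		return 57;
	if (caps & SKETCH_CAP_SV48)
		return 48;
	if (caps & SKETCH_CAP_SV39)
		return 39;
	return -1;
}
```

Note that an IOMMU advertising only first-stage modes yields -1 for a
second-stage request, which corresponds to the -ENODEV path in the patch.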

Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu.c | 66 ++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 5dadf6d09139..0c13430ecc7f 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1255,23 +1255,50 @@ static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
 	.flush_iotlb_all = riscv_iommu_iotlb_flush_all,
 };
 
-static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
+static struct iommu_domain *riscv_iommu_domain_alloc_paging_flags(
+		struct device *dev, u32 flags,
+		const struct iommu_user_data *user_data)
 {
+	const bool second_stage = flags &
+		(IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING);
 	struct pt_iommu_riscv_64_cfg cfg = {};
 	struct riscv_iommu_domain *domain;
 	struct riscv_iommu_device *iommu;
 	int ret;
 
+	if (user_data)
+		return ERR_PTR(-EOPNOTSUPP);
+
 	iommu = dev_to_iommu(dev);
-	if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57) {
-		cfg.common.hw_max_vasz_lg2 = 57;
-	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48) {
-		cfg.common.hw_max_vasz_lg2 = 48;
-	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39) {
-		cfg.common.hw_max_vasz_lg2 = 39;
+
+	if (second_stage) {
+		/*
+		 * Second-stage (iohgatp) page table for KVM VFIO device
+		 * pass-through and dirty tracking. The GPA space is 2 bits
+		 * wider than the corresponding first-stage VA space (x4 root
+		 * page table), so hw_max_vasz_lg2 values are 41/50/59.
+		 */
+		if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57X4) {
+			cfg.common.hw_max_vasz_lg2 = 59;
+		} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48X4) {
+			cfg.common.hw_max_vasz_lg2 = 50;
+		} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39X4) {
+			cfg.common.hw_max_vasz_lg2 = 41;
+		} else {
+			dev_err(dev, "cannot find supported second-stage page table mode\n");
+			return ERR_PTR(-ENODEV);
+		}
 	} else {
-		dev_err(dev, "cannot find supported page table mode\n");
-		return ERR_PTR(-ENODEV);
+		if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57) {
+			cfg.common.hw_max_vasz_lg2 = 57;
+		} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48) {
+			cfg.common.hw_max_vasz_lg2 = 48;
+		} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39) {
+			cfg.common.hw_max_vasz_lg2 = 39;
+		} else {
+			dev_err(dev, "cannot find supported page table mode\n");
+			return ERR_PTR(-ENODEV);
+		}
 	}
 	cfg.common.hw_max_oasz_lg2 = 56;
 
@@ -1291,11 +1318,20 @@ static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
 	domain->riscvpt.iommu.nid = dev_to_node(iommu->dev);
 	domain->domain.ops = &riscv_iommu_paging_domain_ops;
 
-	domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
-					RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
-	if (domain->pscid < 0) {
-		riscv_iommu_free_paging_domain(&domain->domain);
-		return ERR_PTR(-ENOMEM);
+	if (second_stage) {
+		domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
+						RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
+		if (domain->gscid < 0) {
+			riscv_iommu_free_paging_domain(&domain->domain);
+			return ERR_PTR(-ENOMEM);
+		}
+	} else {
+		domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
+						RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
+		if (domain->pscid < 0) {
+			riscv_iommu_free_paging_domain(&domain->domain);
+			return ERR_PTR(-ENOMEM);
+		}
 	}
 
 	ret = pt_iommu_riscv_64_init(&domain->riscvpt, &cfg, GFP_KERNEL);
@@ -1439,7 +1475,7 @@ static const struct iommu_ops riscv_iommu_ops = {
 	.identity_domain = &riscv_iommu_identity_domain,
 	.blocked_domain = &riscv_iommu_blocking_domain,
 	.release_domain = &riscv_iommu_blocking_domain,
-	.domain_alloc_paging = riscv_iommu_alloc_paging_domain,
+	.domain_alloc_paging_flags = riscv_iommu_domain_alloc_paging_flags,
 	.device_group = riscv_iommu_device_group,
 	.probe_device = riscv_iommu_probe_device,
 	.release_device	= riscv_iommu_release_device,
-- 
2.50.1



* [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (5 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:36   ` Jason Gunthorpe
  2026-04-28 13:13 ` [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains fangyu.yu
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

When mapping writable pages, the RISC-V format code currently
pre-sets the PTE D bit unconditionally.

If hardware dirty tracking is active (DC.tc.GADE set), the IOMMU
sets D autonomously on the first write. Pre-setting D makes every
new mapping appear dirty immediately and breaks dirty tracking.

Introduce PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE and, when set, leave
D cleared for new writable mappings so hardware can capture the
first write. Keep pre-setting D when dirty tracking is inactive.

Only meaningful for second-stage (iohgatp) page tables.
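
The effect on a new mapping can be modelled in a few lines. The sketch
below is a simplified stand-in for riscvpt_iommu_set_prot(): the
SKETCH_* names are hypothetical, the feature bit is reduced to a bool,
and execute permission is left out. Only the PTE bit positions are taken
from the RISC-V privileged spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RISC-V PTE bits per the privileged spec. */
#define SKETCH_PTE_R (UINT64_C(1) << 1)
#define SKETCH_PTE_W (UINT64_C(1) << 2)
#define SKETCH_PTE_U (UINT64_C(1) << 4)
#define SKETCH_PTE_A (UINT64_C(1) << 6)
#define SKETCH_PTE_D (UINT64_C(1) << 7)

/* Placeholder protection flags for the sketch. */
#define SKETCH_PROT_READ  (1u << 0)
#define SKETCH_PROT_WRITE (1u << 1)

/* Build the attribute bits for a new leaf PTE. */
static uint64_t sketch_set_prot(unsigned int prot, bool dirty_tracking_active)
{
	uint64_t pte = SKETCH_PTE_A | SKETCH_PTE_U;

	/* The sketch only models mappings with read and/or write access. */
	assert(prot & (SKETCH_PROT_READ | SKETCH_PROT_WRITE));

	if (prot & SKETCH_PROT_WRITE) {
		pte |= SKETCH_PTE_W | SKETCH_PTE_R;
		/* Leave D clear so hardware can record the first write. */
		if (!dirty_tracking_active)
			pte |= SKETCH_PTE_D;
	}
	if (prot & SKETCH_PROT_READ)
		pte |= SKETCH_PTE_R;
	return pte;
}
```

With tracking inactive a writable mapping starts with D set, as before
this patch; with tracking active it starts clean and D appears only
after the device writes to the page.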

Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/generic_pt/fmt/riscv.h | 13 +++++++++++--
 include/linux/generic_pt/common.h    |  8 ++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/generic_pt/fmt/riscv.h b/drivers/iommu/generic_pt/fmt/riscv.h
index 4fe645e60375..0281356cfaf6 100644
--- a/drivers/iommu/generic_pt/fmt/riscv.h
+++ b/drivers/iommu/generic_pt/fmt/riscv.h
@@ -248,8 +248,17 @@ static inline int riscvpt_iommu_set_prot(struct pt_common *common,
 	u64 pte;
 
 	pte = RISCVPT_A | RISCVPT_U;
-	if (iommu_prot & IOMMU_WRITE)
-		pte |= RISCVPT_W | RISCVPT_R | RISCVPT_D;
+	if (iommu_prot & IOMMU_WRITE) {
+		pte |= RISCVPT_W | RISCVPT_R;
+		/*
+		 * When hardware dirty tracking is active (GADE set), the IOMMU
+		 * sets the D bit autonomously on the first write access.
+		 *
+		 */
+		if (!(common->features &
+		      BIT(PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE)))
+			pte |= RISCVPT_D;
+	}
 	if (iommu_prot & IOMMU_READ)
 		pte |= RISCVPT_R;
 	if (!(iommu_prot & IOMMU_NOEXEC))
diff --git a/include/linux/generic_pt/common.h b/include/linux/generic_pt/common.h
index e82dff33ece8..4606c7464c27 100644
--- a/include/linux/generic_pt/common.h
+++ b/include/linux/generic_pt/common.h
@@ -193,6 +193,14 @@ enum {
 	 * Support the 64k contiguous page size following the Svnapot extension.
 	 */
 	PT_FEAT_RISCV_SVNAPOT_64K = PT_FEAT_FMT_START,
+	/*
+	 * Hardware dirty tracking is currently active: DC.tc.GADE is set and
+	 * the IOMMU will set the D bit in PTEs autonomously on write access.
+	 * When this flag is set, new mappings must not pre-set the D bit so
+	 * that every write is correctly captured by hardware.
+	 * Only meaningful for second-stage (iohgatp) page tables.
+	 */
+	PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE,
 
 };
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (6 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:38   ` Jason Gunthorpe
  2026-04-28 13:13 ` [RFC PATCH 09/11] iommu/riscv: Add IOTINVAL.GVMA after updating DDT/PDT entries fangyu.yu
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

Add hardware dirty tracking support for second-stage (iohgatp) domains
used in KVM VFIO device pass-through.

The RISC-V IOMMU can automatically set the dirty bit in PTEs on write
access when DC.tc.GADE is set and the hardware has AMO_HWAD capability.
Wire this up to the iommufd dirty tracking interface:

  - riscv_iommu_set_dirty_tracking(): Walks all bonds of the domain and
    sets or clears DC.tc.GADE in each device context entry.

  - riscv_iommu_dirty_ops: Exposes set_dirty_tracking and the generic
    page-table read_and_clear_dirty via IOMMU_PT_DIRTY_OPS(riscv_64).

  - domain_alloc_paging_flags: Assigns dirty_ops to second-stage domains
    when AMO_HWAD is advertised in hardware capabilities.

  - riscv_iommu_capable: Reports IOMMU_CAP_DIRTY_TRACKING when
    AMO_HWAD is present.

Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu.c | 84 +++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 0c13430ecc7f..1f7967074492 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1247,6 +1247,84 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
 	return 0;
 }
 
+/*
+ * Enable or disable hardware A/D bit updates (GADE) in the device context for
+ * all devices attached to a second-stage domain. When dirty tracking is
+ * enabled the IOMMU hardware will set the dirty bit in PTEs on write access,
+ * making them visible to read_and_clear_dirty().
+ */
+static int riscv_iommu_set_dirty_tracking(struct iommu_domain *iommu_domain,
+					  bool enable)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+	struct riscv_iommu_bond *bond;
+	struct riscv_iommu_device *iommu, *prev;
+	struct riscv_iommu_dc *dc;
+	struct iommu_fwspec *fwspec;
+	struct riscv_iommu_command cmd;
+	u64 tc;
+	int i;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(bond, &domain->bonds, list) {
+		iommu = dev_to_iommu(bond->dev);
+		fwspec = dev_iommu_fwspec_get(bond->dev);
+
+		for (i = 0; i < fwspec->num_ids; i++) {
+			dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
+			tc = READ_ONCE(dc->tc);
+			if (!(tc & RISCV_IOMMU_DC_TC_V))
+				continue;
+
+			if (enable)
+				tc |= RISCV_IOMMU_DC_TC_GADE;
+			else
+				tc &= ~RISCV_IOMMU_DC_TC_GADE;
+			WRITE_ONCE(dc->tc, tc);
+
+			/* Invalidate cached device context entry */
+			riscv_iommu_cmd_iodir_inval_ddt(&cmd);
+			riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
+			riscv_iommu_cmd_send(iommu, &cmd);
+			riscv_iommu_iodir_iotinval(iommu, false, dc->iohgatp, dc, NULL);
+		}
+	}
+
+	prev = NULL;
+	list_for_each_entry_rcu(bond, &domain->bonds, list) {
+		iommu = dev_to_iommu(bond->dev);
+		if (iommu == prev)
+			continue;
+
+		riscv_iommu_cmd_sync(iommu, RISCV_IOMMU_IOTINVAL_TIMEOUT);
+		prev = iommu;
+	}
+
+	rcu_read_unlock();
+
+	/*
+	 * Reflect the active dirty-tracking state in the page table feature
+	 * flags.  When active, riscvpt_iommu_set_prot() will leave D=0 in
+	 * new mappings so that the hardware can set it on the first write,
+	 * providing accurate per-page dirty information.  When inactive,
+	 * new mappings get D=1 to avoid write faults on a D=0 PTE.
+	 */
+	if (enable)
+		domain->riscvpt.riscv_64pt.common.features |=
+			BIT(PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE);
+	else
+		domain->riscvpt.riscv_64pt.common.features &=
+			~BIT(PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE);
+
+	return 0;
+}
+
+static const struct iommu_dirty_ops riscv_iommu_dirty_ops = {
+	IOMMU_PT_DIRTY_OPS(riscv_64),
+	.set_dirty_tracking = riscv_iommu_set_dirty_tracking,
+};
+
 static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
 	IOMMU_PT_DOMAIN_OPS(riscv_64),
 	.attach_dev = riscv_iommu_attach_paging_domain,
@@ -1325,6 +1403,8 @@ static struct iommu_domain *riscv_iommu_domain_alloc_paging_flags(
 			riscv_iommu_free_paging_domain(&domain->domain);
 			return ERR_PTR(-ENOMEM);
 		}
+		if (iommu->caps & RISCV_IOMMU_CAPABILITIES_AMO_HWAD)
+			domain->domain.dirty_ops = &riscv_iommu_dirty_ops;
 	} else {
 		domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
 						RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
@@ -1401,10 +1481,14 @@ static struct iommu_group *riscv_iommu_device_group(struct device *dev)
 
 static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
 {
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+
 	switch (cap) {
 	case IOMMU_CAP_CACHE_COHERENCY:
 	case IOMMU_CAP_DEFERRED_FLUSH:
 		return true;
+	case IOMMU_CAP_DIRTY_TRACKING:
+		return !!(iommu->caps & RISCV_IOMMU_CAPABILITIES_AMO_HWAD);
 	default:
 		return false;
 	}
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 09/11] iommu/riscv: Add IOTINVAL.GVMA after updating DDT/PDT entries
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (7 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:13 ` [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops fangyu.yu
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

Previously, only IOTINVAL.VMA was issued, which is insufficient for
second-stage address translation consistency.

Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 1f7967074492..cb9d315e82ee 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1065,12 +1065,15 @@ static void riscv_iommu_iodir_iotinval(struct riscv_iommu_device *iommu,
 		/*
 		 * else: IOTINVAL.VMA with GV=1,AV=PSCV=0,and
 		 * GSCID=DC.iohgatp.GSCID
-		 *
+		 */
+		riscv_iommu_cmd_send(iommu, &cmd);
+		/*
 		 * IOTINVAL.GVMA with GV=1,AV=0,and
 		 * GSCID=DC.iohgatp.GSCID
-		 * TODO: For now, the Second-Stage feature have not yet been merged,
-		 * also issue IOTINVAL.GVMA once second-stage support is merged.
 		 */
+		riscv_iommu_cmd_inval_gvma(&cmd);
+		riscv_iommu_cmd_inval_set_gscid(&cmd,
+			FIELD_GET(RISCV_IOMMU_DC_IOHGATP_GSCID, iohgatp));
 	}
 	riscv_iommu_cmd_send(iommu, &cmd);
 }
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (8 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 09/11] iommu/riscv: Add IOTINVAL.GVMA after updating DDT/PDT entries fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:39   ` Jason Gunthorpe
  2026-04-28 13:13 ` [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information fangyu.yu
  2026-05-04 19:53 ` [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains Andrew Jones
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel,
	Fangyu Yu

From: Fangyu Yu <fangyu.yu@linux.alibaba.com>

Implement the three dirty-tracking hooks required by the generic page
table framework for the RISC-V format:

  pt_entry_is_write_dirty():
    Check the D bit (bit 7) in the PTE.

  pt_entry_make_write_clean():
    Clear the D bit across the full contiguous range.

  pt_entry_make_write_dirty():
    Atomically set D via try_cmpxchg64() on a single PTE.

Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/generic_pt/fmt/riscv.h | 43 ++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/iommu/generic_pt/fmt/riscv.h b/drivers/iommu/generic_pt/fmt/riscv.h
index 0281356cfaf6..44b87e70f029 100644
--- a/drivers/iommu/generic_pt/fmt/riscv.h
+++ b/drivers/iommu/generic_pt/fmt/riscv.h
@@ -222,6 +222,49 @@ static inline void riscvpt_attr_from_entry(const struct pt_state *pts,
 }
 #define pt_attr_from_entry riscvpt_attr_from_entry
 
+/*
+ * Dirty tracking: RISC-V PTEs use D (bit 7) as the hardware dirty bit.
+ * When Svnapot 64K is active a leaf entry spans 16 consecutive PTEs; we
+ * must check / clear all of them so that no dirty indication is lost.
+ */
+static inline bool riscvpt_entry_is_write_dirty(const struct pt_state *pts)
+{
+	unsigned int num_contig_lg2 = riscvpt_entry_num_contig_lg2(pts);
+	const pt_riscv_entry_t *tablep =
+		pt_cur_table(pts, pt_riscv_entry_t) +
+		log2_set_mod(pts->index, 0, num_contig_lg2);
+	const pt_riscv_entry_t *end = tablep + log2_to_int(num_contig_lg2);
+
+	for (; tablep != end; tablep++)
+		if (READ_ONCE(*tablep) & RISCVPT_D)
+			return true;
+	return false;
+}
+#define pt_entry_is_write_dirty riscvpt_entry_is_write_dirty
+
+static inline void riscvpt_entry_make_write_clean(struct pt_state *pts)
+{
+	unsigned int num_contig_lg2 = riscvpt_entry_num_contig_lg2(pts);
+	pt_riscv_entry_t *tablep =
+		pt_cur_table(pts, pt_riscv_entry_t) +
+		log2_set_mod(pts->index, 0, num_contig_lg2);
+	pt_riscv_entry_t *end = tablep + log2_to_int(num_contig_lg2);
+
+	for (; tablep != end; tablep++)
+		WRITE_ONCE(*tablep, READ_ONCE(*tablep) & ~(pt_riscv_entry_t)RISCVPT_D);
+}
+#define pt_entry_make_write_clean riscvpt_entry_make_write_clean
+
+static inline bool riscvpt_entry_make_write_dirty(struct pt_state *pts)
+{
+	pt_riscv_entry_t *tablep =
+		pt_cur_table(pts, pt_riscv_entry_t) + pts->index;
+	pt_riscv_entry_t new = pts->entry | RISCVPT_D;
+
+	return try_cmpxchg64(tablep, &pts->entry, new);
+}
+#define pt_entry_make_write_dirty riscvpt_entry_make_write_dirty
+
 /* --- iommu */
 #include <linux/generic_pt/iommu.h>
 #include <linux/iommu.h>
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread
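
The three hooks described in patch 10 can be modelled in user space. This is a minimal sketch rather than the kernel code: the model_* names are hypothetical, a plain array stands in for the page table, and C11 atomics replace READ_ONCE()/WRITE_ONCE()/try_cmpxchg64(). In particular, the clean step uses an atomic fetch-and here, whereas the patch uses a plain read-modify-write. The group size of 16 PTEs and the D bit at bit 7 are taken from the patch text.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PTE_D      (UINT64_C(1) << 7) /* D bit, bit 7 per the commit message */
#define MODEL_CONTIG_LG2 4                  /* 2^4 = 16 PTEs per Svnapot 64K leaf */
#define MODEL_CONTIG     (1u << MODEL_CONTIG_LG2)

typedef _Atomic uint64_t model_pte_t;

/* Index of the first PTE in the contiguous group that contains index. */
static unsigned int model_group_start(unsigned int index)
{
	return index & ~(MODEL_CONTIG - 1);
}

/* A leaf is dirty if any PTE in its contiguous group has D set. */
static bool model_is_write_dirty(model_pte_t *table, unsigned int index)
{
	unsigned int start = model_group_start(index);

	for (unsigned int i = start; i < start + MODEL_CONTIG; i++)
		if (atomic_load(&table[i]) & MODEL_PTE_D)
			return true;
	return false;
}

/* Clear D on every PTE in the group so no stale dirty indication remains. */
static void model_make_write_clean(model_pte_t *table, unsigned int index)
{
	unsigned int start = model_group_start(index);

	for (unsigned int i = start; i < start + MODEL_CONTIG; i++)
		atomic_fetch_and(&table[i], ~MODEL_PTE_D);
}

/*
 * Set D on a single PTE with compare-and-exchange. On failure *expected is
 * refreshed with the current PTE value, mirroring try_cmpxchg64() semantics.
 */
static bool model_make_write_dirty(model_pte_t *table, unsigned int index,
				   uint64_t *expected)
{
	return atomic_compare_exchange_strong(&table[index], expected,
					      *expected | MODEL_PTE_D);
}
```

Setting D on any one PTE of a group makes the whole group report dirty, which is why the check and the clear both walk all 16 entries while the set only touches one.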

* [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (9 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops fangyu.yu
@ 2026-04-28 13:13 ` fangyu.yu
  2026-04-28 13:39   ` Jason Gunthorpe
  2026-05-04 19:53 ` [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains Andrew Jones
  11 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-28 13:13 UTC (permalink / raw)
  To: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg
  Cc: guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel, Zong Li,
	Fangyu Yu

From: Zong Li <zong.li@sifive.com>

This patch implements .hw_info operation and the related data
structures for passing the IOMMU hardware capabilities for iommufd.

Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
---
 drivers/iommu/riscv/iommu.c  | 19 +++++++++++++++++++
 include/uapi/linux/iommufd.h | 18 ++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index cb9d315e82ee..9abf446e1b85 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1556,8 +1556,27 @@ static void riscv_iommu_release_device(struct device *dev)
 	kfree_rcu_mightsleep(info);
 }
 
+static void *riscv_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+	struct iommu_hw_info_riscv_iommu *info;
+
+	info = kzalloc_obj(*info);
+	if (!info)
+		return ERR_PTR(-ENOMEM);
+
+	info->capability = iommu->caps;
+	info->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+
+	*length = sizeof(*info);
+	*type = IOMMU_HW_INFO_TYPE_RISCV_IOMMU;
+
+	return info;
+}
+
 static const struct iommu_ops riscv_iommu_ops = {
 	.of_xlate = riscv_iommu_of_xlate,
+	.hw_info = riscv_iommu_hw_info,
 	.capable = riscv_iommu_capable,
 	.identity_domain = &riscv_iommu_identity_domain,
 	.blocked_domain = &riscv_iommu_blocking_domain,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index e998dfbd6960..79d3dc5e8d19 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -660,6 +660,22 @@ struct iommu_hw_info_amd {
 	__aligned_u64 efr2;
 };
 
+/**
+ * struct iommu_hw_info_riscv_iommu - RISCV IOMMU hardware information
+ *
+ * @capability: Value of RISC-V IOMMU capability register defined in
+ *              RISC-V IOMMU spec section 5.3 IOMMU capabilities
+ * @fctl: Value of RISC-V IOMMU feature control register defined in
+ *              RISC-V IOMMU spec section 5.4 Features-control register
+ *
+ * Don't advertise ATS support to the guest because driver doesn't support it.
+ */
+struct iommu_hw_info_riscv_iommu {
+	__aligned_u64 capability;
+	__u32 fctl;
+	__u32 __reserved;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
@@ -670,6 +686,7 @@ struct iommu_hw_info_amd {
  * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
  *                                     SMMUv3) info type
  * @IOMMU_HW_INFO_TYPE_AMD: AMD IOMMU info type
+ * @IOMMU_HW_INFO_TYPE_RISCV_IOMMU: RISC-V iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE = 0,
@@ -678,6 +695,7 @@ enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
 	IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV = 3,
 	IOMMU_HW_INFO_TYPE_AMD = 4,
+	IOMMU_HW_INFO_TYPE_RISCV_IOMMU = 5,
 };
 
 /**
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support
  2026-04-28 13:13 ` [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support fangyu.yu
@ 2026-04-28 13:32   ` Jason Gunthorpe
  2026-04-29  1:06     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:32 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:49PM +0800, fangyu.yu@linux.alibaba.com wrote:
> @@ -263,6 +281,22 @@ riscvpt_iommu_fmt_init(struct pt_iommu_riscv_64 *iommu_table,
>  	case 57:
>  		pt_top_set_level(&table->common, 4);
>  		break;
> +	/*
> +	 * Second-stage (iohgatp): Sv39x4 / Sv48x4 / Sv57x4.
> +	 * The top level is the same as for the first-stage counterpart.
> +	 */
> +	case 41:
> +		pt_top_set_level(&table->common, 2);
> +		table->second_stage = true;
> +		break;

Second stage needs to be an explicit PT_FEAT not implicitly deduced
based on the vasz.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 02/11] iommu/riscv: report iommu capabilities
  2026-04-28 13:13 ` [RFC PATCH 02/11] iommu/riscv: report iommu capabilities fangyu.yu
@ 2026-04-28 13:33   ` Jason Gunthorpe
  2026-04-29  1:15     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:33 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:50PM +0800, fangyu.yu@linux.alibaba.com wrote:

> +static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +	case IOMMU_CAP_DEFERRED_FLUSH:

IOMMU_CAP_DEFERRED_FLUSH is not needed in v7.1

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain
  2026-04-28 13:13 ` [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain fangyu.yu
@ 2026-04-28 13:35   ` Jason Gunthorpe
  2026-04-29  1:21     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:35 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:54PM +0800, fangyu.yu@linux.alibaba.com wrote:
> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> 
> Replace .domain_alloc_paging with .domain_alloc_paging_flags so callers
> can pass allocation flags to select the appropriate page-table type.
> 
> When IOMMU_HWPT_ALLOC_NEST_PARENT or IOMMU_HWPT_ALLOC_DIRTY_TRACKING is
> set in @flags, allocate a second-stage (iohgatp) domain.
> 
> When @flags is 0 the behaviour is identical to the previous
> domain_alloc_paging: first-stage (iosatp) domain.
> 
> Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> ---
>  drivers/iommu/riscv/iommu.c | 66 ++++++++++++++++++++++++++++---------
>  1 file changed, 51 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 5dadf6d09139..0c13430ecc7f 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -1255,23 +1255,50 @@ static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
>  	.flush_iotlb_all = riscv_iommu_iotlb_flush_all,
>  };
>  
> -static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
> +static struct iommu_domain *riscv_iommu_domain_alloc_paging_flags(
> +		struct device *dev, u32 flags,
> +		const struct iommu_user_data *user_data)
>  {
> +	const bool second_stage = flags &
> +		(IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING);

This isn't the right logic, you should follow the switch/case design
from other drivers.

>  	struct pt_iommu_riscv_64_cfg cfg = {};
>  	struct riscv_iommu_domain *domain;
>  	struct riscv_iommu_device *iommu;
>  	int ret;
>  
> +	if (user_data)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
>  	iommu = dev_to_iommu(dev);
> -	if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57) {
> -		cfg.common.hw_max_vasz_lg2 = 57;
> -	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48) {
> -		cfg.common.hw_max_vasz_lg2 = 48;
> -	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39) {
> -		cfg.common.hw_max_vasz_lg2 = 39;
> +
> +	if (second_stage) {
> +		/*
> +		 * Second-stage (iohgatp) page table for KVM VFIO device
> +		 * pass-through and dirty tracking. The GPA space is 2 bits
> +		 * wider than the corresponding first-stage VA space (x4 root
> +		 * page table), so hw_max_vasz_lg2 values are 41/50/59.
> +		 */
> +		if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57X4) {
> +			cfg.common.hw_max_vasz_lg2 = 59;
> +		} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48X4) {
> +			cfg.common.hw_max_vasz_lg2 = 50;
> +		} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39X4) {
> +			cfg.common.hw_max_vasz_lg2 = 41;
> +		} else {
> +			dev_err(dev, "cannot find supported second-stage page table mode\n");
> +			return ERR_PTR(-ENODEV);

Do not make log messages for failing system calls.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread
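
The address-width arithmetic in the quoted hunk (second-stage GPA space is 2 bits wider than the first-stage VA space, giving 41/50/59) can be sketched as below. The SKETCH_CAP_* flags and the function name are hypothetical stand-ins for the RISCV_IOMMU_CAPABILITIES_SV*X4 bits; the failure path returns an error value without logging, in line with the review comment.

```c
#include <stdint.h>

/* Hypothetical capability flags standing in for the SV39X4/SV48X4/SV57X4 bits. */
enum {
	SKETCH_CAP_SV39X4 = 1u << 0,
	SKETCH_CAP_SV48X4 = 1u << 1,
	SKETCH_CAP_SV57X4 = 1u << 2,
};

/*
 * Pick the widest supported second-stage mode and return its address width
 * in bits: the first-stage width plus 2 for the x4 root table.
 * Returns -1 when no second-stage mode is available.
 */
static int sketch_s2_vasz_lg2(uint32_t caps)
{
	if (caps & SKETCH_CAP_SV57X4)
		return 57 + 2;
	if (caps & SKETCH_CAP_SV48X4)
		return 48 + 2;
	if (caps & SKETCH_CAP_SV39X4)
		return 39 + 2;
	return -1;
}
```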

* Re: [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on
  2026-04-28 13:13 ` [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on fangyu.yu
@ 2026-04-28 13:36   ` Jason Gunthorpe
  2026-04-29  1:41     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:36 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:55PM +0800, fangyu.yu@linux.alibaba.com wrote:
> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> 
> When mapping writable pages, the RISC-V format code currently
> pre-sets the PTE D bit unconditionally.
> 
> If hardware dirty tracking is active (DC.tc.GADE set), the IOMMU
> sets D autonomously on the first write. Pre-setting D makes every
> new mapping appear dirty immediately and breaks dirty tracking.
> 
> Introduce PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE and, when set, leave
> D cleared for new writable mappings so hardware can capture the
> first write. Keep pre-setting D when dirty tracking is inactive.
> 
> Only meaningful for second-stage (iohgatp) page tables.

You shouldn't need anything like this, the D bit is managed by the
iommufd core appropriately. It *should* start out pre-set as that is
faster when not tracking. Only once dirty tracking is started does D
get cleared. User space is supposed to assume that everything is dirty
prior to its first D clear.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread
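
The D-bit life cycle described in the reply above can be illustrated with a small user-space model. It is a sketch under stated assumptions, not the iommufd implementation: the sketch_* names are hypothetical, PTEs are plain 64-bit words, and the W and D positions follow the standard RISC-V PTE layout (W at bit 2, D at bit 7).

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PTE_W (UINT64_C(1) << 2)
#define SKETCH_PTE_D (UINT64_C(1) << 7)

/* New writable mappings start with D pre-set, so untracked writes cost nothing. */
static uint64_t sketch_map_writable(void)
{
	return SKETCH_PTE_W | SKETCH_PTE_D;
}

/* Model of a DMA write with hardware A/D updates enabled: D becomes set. */
static void sketch_device_write(uint64_t *pte)
{
	*pte |= SKETCH_PTE_D;
}

/*
 * Record every dirty page in the bitmap (one bit per page), clear its D bit,
 * and return the number of dirty pages found.
 */
static size_t sketch_read_and_clear_dirty(uint64_t *ptes, size_t n,
					  uint64_t *bitmap)
{
	size_t dirty = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i] & SKETCH_PTE_D) {
			bitmap[i / 64] |= UINT64_C(1) << (i % 64);
			ptes[i] &= ~SKETCH_PTE_D;
			dirty++;
		}
	}
	return dirty;
}
```

In this model the first read-and-clear pass reports every page as dirty, which matches the stated expectation that user space treats everything as dirty before its first clear. Only later passes reflect actual device writes.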

* Re: [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains
  2026-04-28 13:13 ` [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains fangyu.yu
@ 2026-04-28 13:38   ` Jason Gunthorpe
  2026-04-29  1:46     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:38 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:56PM +0800, fangyu.yu@linux.alibaba.com wrote:

> @@ -1247,6 +1247,84 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
>  	return 0;
>  }
>  
> +/*
> + * Enable or disable hardware A/D bit updates (GADE) in the device context for
> + * all devices attached to a second-stage domain. When dirty tracking is
> + * enabled the IOMMU hardware will set the dirty bit in PTEs on write access,
> + * making them visible to read_and_clear_dirty().
> + */
> +static int riscv_iommu_set_dirty_tracking(struct iommu_domain *iommu_domain,
> +					  bool enable)
> +{
> +	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> +	struct riscv_iommu_bond *bond;
> +	struct riscv_iommu_device *iommu, *prev;
> +	struct riscv_iommu_dc *dc;
> +	struct iommu_fwspec *fwspec;
> +	struct riscv_iommu_command cmd;
> +	u64 tc;
> +	int i;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> +		iommu = dev_to_iommu(bond->dev);
> +		fwspec = dev_iommu_fwspec_get(bond->dev);
> +
> +		for (i = 0; i < fwspec->num_ids; i++) {
> +			dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
> +			tc = READ_ONCE(dc->tc);
> +			if (!(tc & RISCV_IOMMU_DC_TC_V))
> +				continue;
> +
> +			if (enable)
> +				tc |= RISCV_IOMMU_DC_TC_GADE;
> +			else
> +				tc &= ~RISCV_IOMMU_DC_TC_GADE;
> +			WRITE_ONCE(dc->tc, tc);

I'm pretty sure you don't need to do this. Just preset GADE whenever
an S2 domain is attached, rely on the pre-set D to avoid any HW cost
and you are fine. No need to change it dynamically unless something is
really weird about riscv.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread
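
The suggestion above, setting GADE once at attach time instead of toggling it from set_dirty_tracking, might look roughly like the following when the device-context tc value is built. This is only a sketch: sketch_build_tc() is a hypothetical helper, the bit positions are illustrative placeholders for RISCV_IOMMU_DC_TC_V and RISCV_IOMMU_DC_TC_GADE, and gating on AMO_HWAD is an assumption carried over from the patch description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative positions; the driver uses RISCV_IOMMU_DC_TC_V / _GADE. */
#define SKETCH_DC_TC_V    (UINT64_C(1) << 0)
#define SKETCH_DC_TC_GADE (UINT64_C(1) << 7)

/*
 * Build the tc field for a device context at attach time. GADE is set once
 * for second-stage domains on hardware that can update A/D bits, and is
 * never changed afterwards; pre-set D bits keep the untracked case cheap.
 */
static uint64_t sketch_build_tc(bool second_stage, bool has_amo_hwad)
{
	uint64_t tc = SKETCH_DC_TC_V;

	if (second_stage && has_amo_hwad)
		tc |= SKETCH_DC_TC_GADE;
	return tc;
}
```

With this shape there is no per-device walk or device-context invalidation when tracking starts or stops; enabling tracking reduces to clearing D bits in the page table.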

* Re: [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops
  2026-04-28 13:13 ` [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops fangyu.yu
@ 2026-04-28 13:39   ` Jason Gunthorpe
  2026-04-29  1:52     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:39 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:58PM +0800, fangyu.yu@linux.alibaba.com wrote:
> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> 
> Implement the three dirty-tracking hooks required by the generic page
> table framework for the RISC-V format:
> 
>   pt_entry_is_write_dirty():
>     Check the D bit (bit 7) in the PTE.
> 
>   pt_entry_make_write_clean():
>     Clear the D bit across the full contiguous range.
> 
>   pt_entry_make_write_dirty():
>     Atomically set D via try_cmpxchg64() on a single PTE.
> 
> Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> ---
>  drivers/iommu/generic_pt/fmt/riscv.h | 43 ++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)

This patch should probably go earlier in your series, before adding
the alloc_paging flags at least.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information
  2026-04-28 13:13 ` [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information fangyu.yu
@ 2026-04-28 13:39   ` Jason Gunthorpe
  2026-04-29  2:37     ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-28 13:39 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel, Zong Li

On Tue, Apr 28, 2026 at 09:13:59PM +0800, fangyu.yu@linux.alibaba.com wrote:
> From: Zong Li <zong.li@sifive.com>
> 
> This patch implements .hw_info operation and the related data
> structures for passing the IOMMU hardware capabilities for iommufd.
> 
> Signed-off-by: Zong Li <zong.li@sifive.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> ---
>  drivers/iommu/riscv/iommu.c  | 19 +++++++++++++++++++
>  include/uapi/linux/iommufd.h | 18 ++++++++++++++++++
>  2 files changed, 37 insertions(+)

This has nothing to do with dirty tracking. It should go with a series
introducing viommu.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support
  2026-04-28 13:32   ` Jason Gunthorpe
@ 2026-04-29  1:06     ` fangyu.yu
  2026-04-29 12:18       ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  1:06 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

>> @@ -263,6 +281,22 @@ riscvpt_iommu_fmt_init(struct pt_iommu_riscv_64 *iommu_table,
>>  	case 57:
>>  		pt_top_set_level(&table->common, 4);
>>  		break;
>> +	/*
>> +	 * Second-stage (iohgatp): Sv39x4 / Sv48x4 / Sv57x4.
>> +	 * The top level is the same as for the first-stage counterpart.
>> +	 */
>> +	case 41:
>> +		pt_top_set_level(&table->common, 2);
>> +		table->second_stage = true;
>> +		break;
>
>Second stage needs to be an explicit PT_FEAT not implicitly deduced
>based on the vasz.

Agreed. I will add an explicit PT_FEAT_RISCV_SECOND_STAGE flag and
stop deriving second-stage semantics from vasz.

Thanks,
Fangyu

>
>Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: [RFC PATCH 02/11] iommu/riscv: report iommu capabilities
  2026-04-28 13:33   ` Jason Gunthorpe
@ 2026-04-29  1:15     ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  1:15 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

>> +static bool riscv_iommu_capable(struct device *dev, enum iommu_cap cap)
>> +{
>> +	switch (cap) {
>> +	case IOMMU_CAP_CACHE_COHERENCY:
>> +	case IOMMU_CAP_DEFERRED_FLUSH:
>
>IOMMU_CAP_DEFERRED_FLUSH is not needed in v7.1

Thanks, I will drop IOMMU_CAP_DEFERRED_FLUSH.

>
>Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain
  2026-04-28 13:35   ` Jason Gunthorpe
@ 2026-04-29  1:21     ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  1:21 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

Understood. I will rework the flags handling to follow the switch/case pattern
used by other drivers, and remove the dev_err() on the failure path.

Thanks,
Fangyu

>Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on
  2026-04-28 13:36   ` Jason Gunthorpe
@ 2026-04-29  1:41     ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  1:41 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

>> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
>> 
>> When mapping writable pages, the RISC-V format code currently
>> pre-sets the PTE D bit unconditionally.
>> 
>> If hardware dirty tracking is active (DC.tc.GADE set), the IOMMU
>> sets D autonomously on the first write. Pre-setting D makes every
>> new mapping appear dirty immediately and breaks dirty tracking.
>> 
>> Introduce PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE and, when set, leave
>> D cleared for new writable mappings so hardware can capture the
>> first write. Keep pre-setting D when dirty tracking is inactive.
>> 
>> Only meaningful for second-stage (iohgatp) page tables.
>
>You shouldn't need anything like this, the D bit is managed by the
>iommufd core appropriately. It *should* start out pre-set as that is
>faster when not tracking. Only once dirty tracking is started does D
>get cleared. User space is supposed to assume that everything is dirty
>prior to its first D clear.
>

Thanks, that makes sense. I will drop PT_FEAT_RISCV_DIRTY_TRACKING_ACTIVE
and rely on the iommufd core to manage the D bit.

Fangyu

>Jason
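
The D-bit lifecycle Jason describes can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions, not kernel code: every
sketch_* name is hypothetical, the PTE is a plain uint64_t using the standard
RISC-V V/W/D bit positions, and the hardware update is simulated by a helper.
It shows that new writable mappings start with D pre-set, so the first
read-and-clear reports everything dirty and later passes report only pages
written since the previous clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RISC-V PTE bits used by this model: V = bit 0, W = bit 2, D = bit 7. */
#define SKETCH_PTE_V (UINT64_C(1) << 0)
#define SKETCH_PTE_W (UINT64_C(1) << 2)
#define SKETCH_PTE_D (UINT64_C(1) << 7)

/* Hypothetical: build a writable PTE. D is pre-set regardless of tracking. */
static uint64_t sketch_map_writable(void)
{
	return SKETCH_PTE_V | SKETCH_PTE_W | SKETCH_PTE_D;
}

/* Hypothetical: simulate a DMA write with hardware A/D updates enabled. */
static void sketch_dma_write(uint64_t *pte)
{
	assert(*pte & SKETCH_PTE_W);
	*pte |= SKETCH_PTE_D;
}

/*
 * Hypothetical: report dirty pages as a bitmap (bit i for ptes[i]) and
 * clear D on each reported entry. Handles at most 64 PTEs.
 */
static uint64_t sketch_read_and_clear_dirty(uint64_t *ptes, size_t n)
{
	uint64_t bitmap = 0;

	assert(n <= 64);
	for (size_t i = 0; i < n; i++) {
		if (ptes[i] & SKETCH_PTE_D) {
			bitmap |= UINT64_C(1) << i;
			ptes[i] &= ~SKETCH_PTE_D;
		}
	}
	return bitmap;
}
```

With three fresh mappings, the first call reports all three pages; after one
simulated write to the second page, the next call reports only that page.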

* Re: Re: [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains
  2026-04-28 13:38   ` Jason Gunthorpe
@ 2026-04-29  1:46     ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  1:46 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

>> @@ -1247,6 +1247,84 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Enable or disable hardware A/D bit updates (GADE) in the device context for
>> + * all devices attached to a second-stage domain. When dirty tracking is
>> + * enabled the IOMMU hardware will set the dirty bit in PTEs on write access,
>> + * making them visible to read_and_clear_dirty().
>> + */
>> +static int riscv_iommu_set_dirty_tracking(struct iommu_domain *iommu_domain,
>> +					  bool enable)
>> +{
>> +	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
>> +	struct riscv_iommu_bond *bond;
>> +	struct riscv_iommu_device *iommu, *prev;
>> +	struct riscv_iommu_dc *dc;
>> +	struct iommu_fwspec *fwspec;
>> +	struct riscv_iommu_command cmd;
>> +	u64 tc;
>> +	int i;
>> +
>> +	rcu_read_lock();
>> +
>> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
>> +		iommu = dev_to_iommu(bond->dev);
>> +		fwspec = dev_iommu_fwspec_get(bond->dev);
>> +
>> +		for (i = 0; i < fwspec->num_ids; i++) {
>> +			dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
>> +			tc = READ_ONCE(dc->tc);
>> +			if (!(tc & RISCV_IOMMU_DC_TC_V))
>> +				continue;
>> +
>> +			if (enable)
>> +				tc |= RISCV_IOMMU_DC_TC_GADE;
>> +			else
>> +				tc &= ~RISCV_IOMMU_DC_TC_GADE;
>> +			WRITE_ONCE(dc->tc, tc);
>
>I'm pretty sure you don't need to do this. Just preset GADE whenever
>an S2 domain is attached, rely on the pre-set D to avoid any HW cost
>and you are fine. No need to change it dynamically unless something is
>really weird about riscv.
>

Thanks, that’s a good suggestion. I will follow that approach: preset GADE
on second-stage domain attach and rely on the core-managed D-bit behavior.

Fangyu

>Jason
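
A rough user-space model of the agreed approach, for illustration only. The
sketch_* names, the struct layout and the hw_amo_hwad parameter are assumptions
made for this example, and the bit positions are illustrative rather than
copied from the driver. The point is that GADE is decided once when the device
context is built at attach time, instead of being toggled later from
set_dirty_tracking().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions for the translation-control (tc) field. */
#define SKETCH_DC_TC_V    (UINT64_C(1) << 0)
#define SKETCH_DC_TC_GADE (UINT64_C(1) << 7)

/* Hypothetical, reduced device context: only the tc field is modelled. */
struct sketch_dc {
	uint64_t tc;
};

/*
 * Hypothetical attach helper: mark the context valid and, for a
 * second-stage domain on hardware that supports A/D updates, preset GADE.
 */
static void sketch_dc_attach(struct sketch_dc *dc, bool second_stage,
			     bool hw_amo_hwad)
{
	uint64_t tc = SKETCH_DC_TC_V;

	assert(dc != NULL);
	if (second_stage && hw_amo_hwad)
		tc |= SKETCH_DC_TC_GADE;
	dc->tc = tc;
}
```

Because D is pre-set on new mappings, leaving GADE on should not cost extra
PTE updates while tracking is idle, which is the argument made above.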

* Re: Re: [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops
  2026-04-28 13:39   ` Jason Gunthorpe
@ 2026-04-29  1:52     ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  1:52 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

>> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
>> 
>> Implement the three dirty-tracking hooks required by the generic page
>> table framework for the RISC-V format:
>> 
>>   pt_entry_is_write_dirty():
>>     Check the D bit (bit 7) in the PTE.
>> 
>>   pt_entry_make_write_clean():
>>     Clear the D bit across the full contiguous range.
>> 
>>   pt_entry_make_write_dirty():
>>     Atomically set D via try_cmpxchg64() on a single PTE.
>> 
>> Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
>> ---
>>  drivers/iommu/generic_pt/fmt/riscv.h | 43 ++++++++++++++++++++++++++++
>>  1 file changed, 43 insertions(+)
>
>This patch should probably go earlier in your series, before adding
>the alloc_paging flags at least.
>

Agreed. I will reorder the series and place this patch earlier.

Thanks,
Fangyu

>Jason
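
The three hooks listed in the commit message can be approximated in user space
as below. This is a sketch, not the code from the patch: the sketch_* names are
hypothetical, C11 atomics stand in for the kernel's try_cmpxchg64(), and a
plain array stands in for a contiguous PTE range. Only the D bit position
(bit 7) is taken from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* D bit position (bit 7), as stated in the commit message. */
#define SKETCH_PTE_D (UINT64_C(1) << 7)

/* Model of is_write_dirty: test the D bit of one PTE. */
static bool sketch_is_write_dirty(_Atomic uint64_t *pte)
{
	assert(pte != NULL);
	return (atomic_load(pte) & SKETCH_PTE_D) != 0;
}

/* Model of make_write_clean: clear D on every PTE of a contiguous range. */
static void sketch_make_write_clean(_Atomic uint64_t *ptes, size_t n)
{
	assert(ptes != NULL);
	for (size_t i = 0; i < n; i++)
		atomic_fetch_and(&ptes[i], ~SKETCH_PTE_D);
}

/*
 * Model of make_write_dirty: set D on a single PTE with one
 * compare-and-exchange. Returns false if the PTE changed underneath us,
 * leaving any retry to the caller.
 */
static bool sketch_make_write_dirty(_Atomic uint64_t *pte)
{
	uint64_t old;

	assert(pte != NULL);
	old = atomic_load(pte);
	if (old & SKETCH_PTE_D)
		return true;
	return atomic_compare_exchange_strong(pte, &old, old | SKETCH_PTE_D);
}
```

The compare-and-exchange avoids overwriting a concurrent update to the same
PTE, which a plain read-modify-write could do.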

* Re: Re: [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information
  2026-04-28 13:39   ` Jason Gunthorpe
@ 2026-04-29  2:37     ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29  2:37 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will,
	zong.li

>> From: Zong Li <zong.li@sifive.com>
>> 
>> This patch implements .hw_info operation and the related data
>> structures for passing the IOMMU hardware capabilities for iommufd.
>> 
>> Signed-off-by: Zong Li <zong.li@sifive.com>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
>> ---
>>  drivers/iommu/riscv/iommu.c  | 19 +++++++++++++++++++
>>  include/uapi/linux/iommufd.h | 18 ++++++++++++++++++
>>  2 files changed, 37 insertions(+)
>
>This has nothing to do with dirty tracking. It should go with a series
>introducing viommu.
>

Thanks for pointing that out.

I added the .hw_info-related patch because, during passthrough testing with a
VM, I observed that QEMU calls iommufd_get_hw_info, so I initially thought it
was required. However, it appears that the .hw_info implementation is not
necessary for this series. I will remove the .hw_info-related patch from my
dirty-tracking series in a follow-up revision.

Thanks,
Fangyu

>Jason

* Re: Re: [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support
  2026-04-29  1:06     ` fangyu.yu
@ 2026-04-29 12:18       ` Jason Gunthorpe
  2026-04-29 15:42         ` fangyu.yu
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2026-04-29 12:18 UTC (permalink / raw)
  To: fangyu.yu
  Cc: alex, anup, aou, atish.patra, baolu.lu, guoren, iommu, joro,
	kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv, palmer,
	pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

On Wed, Apr 29, 2026 at 09:06:50AM +0800, fangyu.yu@linux.alibaba.com wrote:
> >> @@ -263,6 +281,22 @@ riscvpt_iommu_fmt_init(struct pt_iommu_riscv_64 *iommu_table,
> >>  	case 57:
> >>  		pt_top_set_level(&table->common, 4);
> >>  		break;
> >> +	/*
> >> +	 * Second-stage (iohgatp): Sv39x4 / Sv48x4 / Sv57x4.
> >> +	 * The top level is the same as for the first-stage counterpart.
> >> +	 */
> >> +	case 41:
> >> +		pt_top_set_level(&table->common, 2);
> >> +		table->second_stage = true;
> >> +		break;
> >
> >Second stage needs to be an explicit PT_FEAT not implicitly deduced
> >based on the vasz.
> 
> Agreed. I will add an explicit PT_FEAT_RISCV_SECOND_STAGE flag and
> stop deriving second-stage semantics from vasz.

PT_FEAT_RISCV_S2 would match what I have for ARM

Jason

* Re: Re: Re: [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support
  2026-04-29 12:18       ` Jason Gunthorpe
@ 2026-04-29 15:42         ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-04-29 15:42 UTC (permalink / raw)
  To: jgg
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	joro, kevin.tian, kvm-riscv, kvm, linux-kernel, linux-riscv,
	palmer, pjw, robin.murphy, skhawaja, tjeznach, vasant.hegde, will

>> >> @@ -263,6 +281,22 @@ riscvpt_iommu_fmt_init(struct pt_iommu_riscv_64 *iommu_table,
>> >>  	case 57:
>> >>  		pt_top_set_level(&table->common, 4);
>> >>  		break;
>> >> +	/*
>> >> +	 * Second-stage (iohgatp): Sv39x4 / Sv48x4 / Sv57x4.
>> >> +	 * The top level is the same as for the first-stage counterpart.
>> >> +	 */
>> >> +	case 41:
>> >> +		pt_top_set_level(&table->common, 2);
>> >> +		table->second_stage = true;
>> >> +		break;
>> >
>> >Second stage needs to be an explicit PT_FEAT not implicitly deduced
>> >based on the vasz.
>> 
>> Agreed. I will add an explicit PT_FEAT_RISCV_SECOND_STAGE flag and
>> stop deriving second-stage semantics from vasz.
>
>PT_FEAT_RISCV_S2 would match what I have for ARM
>

Thanks for the suggestion, I’ll use PT_FEAT_RISCV_S2 to match the ARM naming.

Fangyu

>Jason
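
One way the explicit-flag approach could look, sketched in user space. This is
an assumption about the shape of the rework, not the planned patch:
SKETCH_FEAT_S2 stands in for PT_FEAT_RISCV_S2, struct sketch_table and
sketch_fmt_init() are hypothetical, and the 39/48 level mapping is filled in
from the standard Sv39/Sv48 layouts. The caller states the stage through the
feature flag, and the "x4" widths (41/50/59) are accepted only when that flag
is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PT_FEAT_RISCV_S2. */
#define SKETCH_FEAT_S2 (1u << 0)

struct sketch_table {
	unsigned int top_level;
	bool second_stage;
};

/*
 * Hypothetical init: second stage is requested via 'features', never
 * deduced from 'vasz'. For second stage, vasz is the SvNNx4 width, which
 * is two bits wider than the first-stage counterpart.
 */
static int sketch_fmt_init(struct sketch_table *t, unsigned int vasz,
			   unsigned int features)
{
	bool s2 = (features & SKETCH_FEAT_S2) != 0;

	assert(t != NULL);
	if (s2) {
		if (vasz < 2)
			return -EINVAL;
		vasz -= 2;
	}

	switch (vasz) {
	case 39:
		t->top_level = 2;
		break;
	case 48:
		t->top_level = 3;
		break;
	case 57:
		t->top_level = 4;
		break;
	default:
		return -EINVAL;
	}
	t->second_stage = s2;
	return 0;
}
```

In this model a width of 41 without the flag is rejected rather than silently
treated as a second-stage table.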

* Re: [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains
  2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
                   ` (10 preceding siblings ...)
  2026-04-28 13:13 ` [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information fangyu.yu
@ 2026-05-04 19:53 ` Andrew Jones
  2026-05-05 13:48   ` fangyu.yu
  11 siblings, 1 reply; 30+ messages in thread
From: Andrew Jones @ 2026-05-04 19:53 UTC (permalink / raw)
  To: fangyu.yu
  Cc: joro, will, robin.murphy, pjw, palmer, aou, alex, tjeznach, jgg,
	kevin.tian, baolu.lu, vasant.hegde, anup, atish.patra, skhawaja,
	jgg, guoren, kvm, iommu, kvm-riscv, linux-riscv, linux-kernel

On Tue, Apr 28, 2026 at 09:13:48PM +0800, fangyu.yu@linux.alibaba.com wrote:
> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
> 
> The RISC-V IOMMU architecture defines an AMO_HWAD capability (Hardware
> Access/Dirty update) that allows the IOMMU to atomically set the A/D bits
> in second-stage PTEs on DMA access.  When DC.tc.GADE is asserted, the IOMMU
> autonomously sets D on the first write to a page mapped by an iohgatp
> domain.  This series wires that capability up to the iommufd dirty-tracking
> interface (IOMMU_HWPT_SET_DIRTY_TRACKING / IOMMU_HWPT_GET_DIRTY_BITMAP) and
> reports IOMMU_CAP_DIRTY_TRACKING.
> 
> Design notes
> ------------
> 
> * The feature is scoped to second-stage (iohgatp) domains only; these are
>   the domains created for KVM / VFIO device pass-through when userspace
>   allocates an HWPT with IOMMU_HWPT_ALLOC_NEST_PARENT or
>   IOMMU_HWPT_ALLOC_DIRTY_TRACKING.  First-stage (iosatp) domains are not
>   touched by this series.
> 
> * The page-table side plugs into the existing generic_pt dirty hook
>   framework (amdv1 / vtdss style).  RISC-V adds the three required PTE
>   ops – is_write_dirty / make_write_clean / make_write_dirty.
> 
> Testing
> -------
> 
> * Tested on QEMU RISC-V: a virtio-net and an e1000e device were passed through
>   to an L2 guest via vfio-pci + iommufd.
> 
> * generic_pt KUnit: the existing test_dirty case now runs and passes for
>   the RISC-V 64-bit format.
> 
> Follow-up work
> --------------
> * Build a dedicated end-to-end test case that drives the full flow
>   (HWPT_ALLOC with DIRTY_TRACKING -> attach -> IOAS_MAP -> generate real
>   DMA -> SET_DIRTY_TRACKING -> GET_DIRTY_BITMAP -> verify bitmap against
>   expected IOVA footprint) so that the behaviour can be regression-tested
>   beyond the KUnit PTE-level coverage.
> 
> * If possible, rebase and retest on top of the updated "iommu irqbypass"
>   patchset.

Thanks for this series! I was starting to go down a similar road myself
in order to limit irqbypass to IOMMU_HWPT_ALLOC_NEST_PARENT domains since
I wasn't happy with other approaches, e.g. continuing to use s-stage, but
activating g-stage too with identity mappings since the MSI table can't be
activated otherwise. Or, simply using g-stage instead of s-stage in order
to get the MSI table enabled. In the end, I think the best is to require
nested for irqbypass and this series will provide a good base for that.

I'll rebase irqbypass on this series and test it out.

Thanks,
drew

* Re: Re: [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains
  2026-05-04 19:53 ` [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains Andrew Jones
@ 2026-05-05 13:48   ` fangyu.yu
  0 siblings, 0 replies; 30+ messages in thread
From: fangyu.yu @ 2026-05-05 13:48 UTC (permalink / raw)
  To: andrew.jones
  Cc: alex, anup, aou, atish.patra, baolu.lu, fangyu.yu, guoren, iommu,
	jgg, jgg, joro, kevin.tian, kvm-riscv, kvm, linux-kernel,
	linux-riscv, palmer, pjw, robin.murphy, skhawaja, tjeznach,
	vasant.hegde, will

>> From: Fangyu Yu <fangyu.yu@linux.alibaba.com>
>> 
>> The RISC-V IOMMU architecture defines an AMO_HWAD capability (Hardware
>> Access/Dirty update) that allows the IOMMU to atomically set the A/D bits
>> in second-stage PTEs on DMA access.  When DC.tc.GADE is asserted, the IOMMU
>> autonomously sets D on the first write to a page mapped by an iohgatp
>> domain.  This series wires that capability up to the iommufd dirty-tracking
>> interface (IOMMU_HWPT_SET_DIRTY_TRACKING / IOMMU_HWPT_GET_DIRTY_BITMAP) and
>> reports IOMMU_CAP_DIRTY_TRACKING.
>> 
>> Design notes
>> ------------
>> 
>> * The feature is scoped to second-stage (iohgatp) domains only; these are
>>   the domains created for KVM / VFIO device pass-through when userspace
>>   allocates an HWPT with IOMMU_HWPT_ALLOC_NEST_PARENT or
>>   IOMMU_HWPT_ALLOC_DIRTY_TRACKING.  First-stage (iosatp) domains are not
>>   touched by this series.
>> 
>> * The page-table side plugs into the existing generic_pt dirty hook
>>   framework (amdv1 / vtdss style).  RISC-V adds the three required PTE
>>   ops – is_write_dirty / make_write_clean / make_write_dirty.
>> 
>> Testing
>> -------
>> 
>> * Tested on QEMU RISC-V: a virtio-net and an e1000e device were passed through
>>   to an L2 guest via vfio-pci + iommufd.
>> 
>> * generic_pt KUnit: the existing test_dirty case now runs and passes for
>>   the RISC-V 64-bit format.
>> 
>> Follow-up work
>> --------------
>> * Build a dedicated end-to-end test case that drives the full flow
>>   (HWPT_ALLOC with DIRTY_TRACKING -> attach -> IOAS_MAP -> generate real
>>   DMA -> SET_DIRTY_TRACKING -> GET_DIRTY_BITMAP -> verify bitmap against
>>   expected IOVA footprint) so that the behaviour can be regression-tested
>>   beyond the KUnit PTE-level coverage.
>> 
>> * If possible, rebase and retest on top of the updated "iommu irqbypass"
>>   patchset.
>
>Thanks for this series! I was starting to go down a similar road myself
>in order to limit irqbypass to IOMMU_HWPT_ALLOC_NEST_PARENT domains since
>I wasn't happy with other approaches, e.g. continuing to use s-stage, but
>activating g-stage too with identity mappings since the MSI table can't be
>activated otherwise. Or, simply using g-stage instead of s-stage in order
>to get the MSI table enabled. In the end, I think the best is to require
>nested for irqbypass and this series will provide a good base for that.
>
>I'll rebase irqbypass on this series and test it out.
>

Thanks for the feedback. Jason has provided some helpful suggestions on this
series, and I am in the process of updating it. I expect to send out a new
version in the coming days.

Fangyu

>Thanks,
>drew

Thread overview:
2026-04-28 13:13 [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 01/11] iommupt: Add RISC-V Second-stage (iohgatp) page table support fangyu.yu
2026-04-28 13:32   ` Jason Gunthorpe
2026-04-29  1:06     ` fangyu.yu
2026-04-29 12:18       ` Jason Gunthorpe
2026-04-29 15:42         ` fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 02/11] iommu/riscv: report iommu capabilities fangyu.yu
2026-04-28 13:33   ` Jason Gunthorpe
2026-04-29  1:15     ` fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 03/11] iommu/riscv: use data structure instead of individual values fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 04/11] iommu/riscv: support GSCID and GVMA invalidation command fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 05/11] RISC-V: KVM: Enable KVM_VFIO interfaces on RISC-V arch fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 06/11] iommu/riscv: Add domain_alloc_paging_flags for second-stage domain fangyu.yu
2026-04-28 13:35   ` Jason Gunthorpe
2026-04-29  1:21     ` fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 07/11] iommupt: Don't preset D when RISC-V IOMMU dirty tracking on fangyu.yu
2026-04-28 13:36   ` Jason Gunthorpe
2026-04-29  1:41     ` fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 08/11] iommu/riscv: Add dirty tracking support for second-stage domains fangyu.yu
2026-04-28 13:38   ` Jason Gunthorpe
2026-04-29  1:46     ` fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 09/11] iommu/riscv: Add IOTINVAL.GVMA after updating DDT/PDT entries fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 10/11] iommupt: Add RISC-V dirty tracking PTE ops fangyu.yu
2026-04-28 13:39   ` Jason Gunthorpe
2026-04-29  1:52     ` fangyu.yu
2026-04-28 13:13 ` [RFC PATCH 11/11] iommu/riscv: support nested iommu for getting iommu hardware information fangyu.yu
2026-04-28 13:39   ` Jason Gunthorpe
2026-04-29  2:37     ` fangyu.yu
2026-05-04 19:53 ` [RFC PATCH 00/11] iommu/riscv: Add hardware dirty tracking for second-stage domains Andrew Jones
2026-05-05 13:48   ` fangyu.yu