* [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support
@ 2024-06-14 14:21 Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support Zong Li
                   ` (9 more replies)
  0 siblings, 10 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This series includes RISC-V IOMMU hardware performance monitor and
nested IOMMU support. It also introduces additional operations required
for nested IOMMU, such as the g-stage flush and iotlb_sync_map.

This series needs an additional patch from Robin Murphy to support
MSIs through nested domains; it is included here as patch 09/10.

This patch set is implemented on top of the RISC-V IOMMU v7 series [1],
and tested on top of the tree with more features supported [2] together
with some suggested fixes [3]. This patch series will be submitted as an
RFC until the RISC-V IOMMU series has been merged.

Changes from v1:
- Rebase on RISC-V IOMMU v7 series
- Include patch for supporting MSIs through nested domains
- Iterate bond list for g-stage flush
- Use data structure instead of passing individual parameters
- PMU: add IRQF_ONESHOT and IRQF_SHARED flags for the shared wired interrupt
- PMU: add mask of counter
- hw_info: remove unused check
- hw_info: add padding in data structure
- hw_info: add more comments for data structure
- cache_invalidate_user: remove warning message from userspace
- cache_invalidate_user: lock a riscv iommu device in riscv iommu domain
- cache_invalidate_user: link pass through device to s2 domain's bond
  list

[1] link: https://lists.infradead.org/pipermail/linux-riscv/2024-June/055413.html
[2] link: https://github.com/tjeznach/linux/tree/riscv_iommu_v7-rc2
[3] link: https://lists.infradead.org/pipermail/linux-riscv/2024-June/055426.html

Robin Murphy (1):
  iommu/dma: Support MSIs through nested domains

Zong Li (9):
  iommu/riscv: add RISC-V IOMMU PMU support
  iommu/riscv: support HPM and interrupt handling
  iommu/riscv: use data structure instead of individual values
  iommu/riscv: add iotlb_sync_map operation support
  iommu/riscv: support GSCID and GVMA invalidation command
  iommu/riscv: support nested iommu for getting iommu hardware
    information
  iommu/riscv: support nested iommu for creating domains owned by
    userspace
  iommu/riscv: support nested iommu for flushing cache
  iommu/riscv: support nested iommu for get_msi_mapping_domain operation

 drivers/iommu/dma-iommu.c        |  18 +-
 drivers/iommu/riscv/Makefile     |   2 +-
 drivers/iommu/riscv/iommu-bits.h |  23 ++
 drivers/iommu/riscv/iommu-pmu.c  | 479 ++++++++++++++++++++++++++++++
 drivers/iommu/riscv/iommu.c      | 492 ++++++++++++++++++++++++++++++-
 drivers/iommu/riscv/iommu.h      |   8 +
 include/linux/iommu.h            |   4 +
 include/uapi/linux/iommufd.h     |  46 +++
 8 files changed, 1053 insertions(+), 19 deletions(-)
 create mode 100644 drivers/iommu/riscv/iommu-pmu.c

-- 
2.17.1


* [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-17 14:55   ` Jason Gunthorpe
  2024-06-14 14:21 ` [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling Zong Li
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This patch implements the RISC-V IOMMU hardware performance monitor; it
includes both the counting and the sampling modes.

The specification doesn't define an event ID for counting the number of
clock cycles, so there is no associated iohpmevt0. However, perf needs an
event for counting cycles, so reserve the maximum event ID for it for now.
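
As a rough usage sketch (not part of this patch; the helper names below
are hypothetical), a counting or sampling event can be opened from
userspace with perf_event_open(), packing fields into
perf_event_attr::config per the format attributes added here (event in
config:0-14, did_gscid in config:36-59, filter_did_gscid in config:61):

  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/perf_event.h>

  /* Pack config per the PMU_FORMAT_ATTR layout in this patch. */
  static uint64_t iommu_pmu_config(uint64_t event, uint64_t did_gscid)
  {
          return (event & 0x7fffULL) |               /* config:0-14  */
                 ((did_gscid & 0xffffffULL) << 36) | /* config:36-59 */
                 (1ULL << 61);                       /* filter_did_gscid */
  }

  static int open_iommu_event(int pmu_type, uint64_t config)
  {
          struct perf_event_attr attr = {
                  .type = pmu_type, /* /sys/bus/event_source/devices/<pmu>/type */
                  .size = sizeof(attr),
                  .config = config,
                  .disabled = 1,
          };

          /* Uncore-style PMU (perf_invalid_context): pid must be -1. */
          return syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
  }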

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/Makefile     |   2 +-
 drivers/iommu/riscv/iommu-bits.h |  16 ++
 drivers/iommu/riscv/iommu-pmu.c  | 479 +++++++++++++++++++++++++++++++
 drivers/iommu/riscv/iommu.h      |   8 +
 4 files changed, 504 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/riscv/iommu-pmu.c

diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
index f54c9ed17d41..d36625a1fd08 100644
--- a/drivers/iommu/riscv/Makefile
+++ b/drivers/iommu/riscv/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
+obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o iommu-pmu.o
 obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index 98daf0e1a306..60523449f016 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -17,6 +17,7 @@
 #include <linux/types.h>
 #include <linux/bitfield.h>
 #include <linux/bits.h>
+#include <linux/perf_event.h>
 
 /*
  * Chapter 5: Memory Mapped register interface
@@ -207,6 +208,7 @@ enum riscv_iommu_ddtp_modes {
 /* 5.22 Performance monitoring event counters (31 * 64bits) */
 #define RISCV_IOMMU_REG_IOHPMCTR_BASE	0x0068
 #define RISCV_IOMMU_REG_IOHPMCTR(_n)	(RISCV_IOMMU_REG_IOHPMCTR_BASE + ((_n) * 0x8))
+#define RISCV_IOMMU_IOHPMCTR_COUNTER	GENMASK_ULL(63, 0)
 
 /* 5.23 Performance monitoring event selectors (31 * 64bits) */
 #define RISCV_IOMMU_REG_IOHPMEVT_BASE	0x0160
@@ -250,6 +252,20 @@ enum riscv_iommu_hpmevent_id {
 	RISCV_IOMMU_HPMEVENT_MAX        = 9
 };
 
+/* Use maximum event ID for cycle event */
+#define RISCV_IOMMU_HPMEVENT_CYCLE	GENMASK_ULL(14, 0)
+
+#define RISCV_IOMMU_HPM_COUNTER_NUM	32
+
+struct riscv_iommu_pmu {
+	struct pmu pmu;
+	void __iomem *reg;
+	int num_counters;
+	u64 mask_counter;
+	struct perf_event *events[RISCV_IOMMU_IOHPMEVT_CNT + 1];
+	DECLARE_BITMAP(used_counters, RISCV_IOMMU_IOHPMEVT_CNT + 1);
+};
+
 /* 5.24 Translation request IOVA (64bits) */
 #define RISCV_IOMMU_REG_TR_REQ_IOVA     0x0258
 #define RISCV_IOMMU_TR_REQ_IOVA_VPN	GENMASK_ULL(63, 12)
diff --git a/drivers/iommu/riscv/iommu-pmu.c b/drivers/iommu/riscv/iommu-pmu.c
new file mode 100644
index 000000000000..5fc45aaf4ca3
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-pmu.c
@@ -0,0 +1,479 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 SiFive
+ *
+ * Authors
+ *	Zong Li <zong.li@sifive.com>
+ */
+
+#include <linux/io-64-nonatomic-hi-lo.h>
+
+#include "iommu.h"
+#include "iommu-bits.h"
+
+#define to_riscv_iommu_pmu(p) (container_of(p, struct riscv_iommu_pmu, pmu))
+
+#define RISCV_IOMMU_PMU_ATTR_EXTRACTOR(_name, _mask)			\
+	static inline u32 get_##_name(struct perf_event *event)		\
+	{								\
+		return FIELD_GET(_mask, event->attr.config);		\
+	}								\
+
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(event, RISCV_IOMMU_IOHPMEVT_EVENTID);
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(partial_matching, RISCV_IOMMU_IOHPMEVT_DMASK);
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(pid_pscid, RISCV_IOMMU_IOHPMEVT_PID_PSCID);
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(did_gscid, RISCV_IOMMU_IOHPMEVT_DID_GSCID);
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(filter_pid_pscid, RISCV_IOMMU_IOHPMEVT_PV_PSCV);
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(filter_did_gscid, RISCV_IOMMU_IOHPMEVT_DV_GSCV);
+RISCV_IOMMU_PMU_ATTR_EXTRACTOR(filter_id_type, RISCV_IOMMU_IOHPMEVT_IDT);
+
+/* Formats */
+PMU_FORMAT_ATTR(event, "config:0-14");
+PMU_FORMAT_ATTR(partial_matching, "config:15");
+PMU_FORMAT_ATTR(pid_pscid, "config:16-35");
+PMU_FORMAT_ATTR(did_gscid, "config:36-59");
+PMU_FORMAT_ATTR(filter_pid_pscid, "config:60");
+PMU_FORMAT_ATTR(filter_did_gscid, "config:61");
+PMU_FORMAT_ATTR(filter_id_type, "config:62");
+
+static struct attribute *riscv_iommu_pmu_formats[] = {
+	&format_attr_event.attr,
+	&format_attr_partial_matching.attr,
+	&format_attr_pid_pscid.attr,
+	&format_attr_did_gscid.attr,
+	&format_attr_filter_pid_pscid.attr,
+	&format_attr_filter_did_gscid.attr,
+	&format_attr_filter_id_type.attr,
+	NULL,
+};
+
+static const struct attribute_group riscv_iommu_pmu_format_group = {
+	.name = "format",
+	.attrs = riscv_iommu_pmu_formats,
+};
+
+/* Events */
+static ssize_t riscv_iommu_pmu_event_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *page)
+{
+	struct perf_pmu_events_attr *pmu_attr;
+
+	pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+
+	return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
+}
+
+PMU_EVENT_ATTR(cycle, event_attr_cycle,
+	       RISCV_IOMMU_HPMEVENT_CYCLE, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(dont_count, event_attr_dont_count,
+	       RISCV_IOMMU_HPMEVENT_INVALID, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(untranslated_req, event_attr_untranslated_req,
+	       RISCV_IOMMU_HPMEVENT_URQ, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(translated_req, event_attr_translated_req,
+	       RISCV_IOMMU_HPMEVENT_TRQ, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(ats_trans_req, event_attr_ats_trans_req,
+	       RISCV_IOMMU_HPMEVENT_ATS_RQ, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(tlb_miss, event_attr_tlb_miss,
+	       RISCV_IOMMU_HPMEVENT_TLB_MISS, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(ddt_walks, event_attr_ddt_walks,
+	       RISCV_IOMMU_HPMEVENT_DD_WALK, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(pdt_walks, event_attr_pdt_walks,
+	       RISCV_IOMMU_HPMEVENT_PD_WALK, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(s_vs_pt_walks, event_attr_s_vs_pt_walks,
+	       RISCV_IOMMU_HPMEVENT_S_VS_WALKS, riscv_iommu_pmu_event_show);
+PMU_EVENT_ATTR(g_pt_walks, event_attr_g_pt_walks,
+	       RISCV_IOMMU_HPMEVENT_G_WALKS, riscv_iommu_pmu_event_show);
+
+static struct attribute *riscv_iommu_pmu_events[] = {
+	&event_attr_cycle.attr.attr,
+	&event_attr_dont_count.attr.attr,
+	&event_attr_untranslated_req.attr.attr,
+	&event_attr_translated_req.attr.attr,
+	&event_attr_ats_trans_req.attr.attr,
+	&event_attr_tlb_miss.attr.attr,
+	&event_attr_ddt_walks.attr.attr,
+	&event_attr_pdt_walks.attr.attr,
+	&event_attr_s_vs_pt_walks.attr.attr,
+	&event_attr_g_pt_walks.attr.attr,
+	NULL,
+};
+
+static const struct attribute_group riscv_iommu_pmu_events_group = {
+	.name = "events",
+	.attrs = riscv_iommu_pmu_events,
+};
+
+static const struct attribute_group *riscv_iommu_pmu_attr_grps[] = {
+	&riscv_iommu_pmu_format_group,
+	&riscv_iommu_pmu_events_group,
+	NULL,
+};
+
+/* PMU Operations */
+static void riscv_iommu_pmu_set_counter(struct riscv_iommu_pmu *pmu, u32 idx,
+					u64 value)
+{
+	void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES;
+
+	if (WARN_ON_ONCE(idx >= pmu->num_counters))
+		return;
+
+	writeq(FIELD_PREP(RISCV_IOMMU_IOHPMCTR_COUNTER, value), addr + idx * 8);
+}
+
+static u64 riscv_iommu_pmu_get_counter(struct riscv_iommu_pmu *pmu, u32 idx)
+{
+	void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES;
+	u64 value;
+
+	if (WARN_ON_ONCE(idx >= pmu->num_counters))
+		return -EINVAL;
+
+	value = readq(addr + idx * 8);
+
+	return FIELD_GET(RISCV_IOMMU_IOHPMCTR_COUNTER, value);
+}
+
+static u64 riscv_iommu_pmu_get_event(struct riscv_iommu_pmu *pmu, u32 idx)
+{
+	void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMEVT_BASE;
+
+	if (WARN_ON_ONCE(idx >= pmu->num_counters))
+		return 0;
+
+	/* There is no associated IOHPMEVT0 for IOHPMCYCLES */
+	if (idx == 0)
+		return 0;
+
+	return readq(addr + (idx - 1) * 8);
+}
+
+static void riscv_iommu_pmu_set_event(struct riscv_iommu_pmu *pmu, u32 idx,
+				      u64 value)
+{
+	void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMEVT_BASE;
+
+	if (WARN_ON_ONCE(idx >= pmu->num_counters))
+		return;
+
+	/* There is no associated IOHPMEVT0 for IOHPMCYCLES */
+	if (idx == 0)
+		return;
+
+	writeq(value, addr + (idx - 1) * 8);
+}
+
+static void riscv_iommu_pmu_enable_counter(struct riscv_iommu_pmu *pmu, u32 idx)
+{
+	void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOCOUNTINH;
+	u32 value = readl(addr);
+
+	writel(value & ~BIT(idx), addr);
+}
+
+static void riscv_iommu_pmu_disable_counter(struct riscv_iommu_pmu *pmu, u32 idx)
+{
+	void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOCOUNTINH;
+	u32 value = readl(addr);
+
+	writel(value | BIT(idx), addr);
+}
+
+static void riscv_iommu_pmu_enable_ovf_intr(struct riscv_iommu_pmu *pmu, u32 idx)
+{
+	u64 value;
+
+	if (get_event(pmu->events[idx]) == RISCV_IOMMU_HPMEVENT_CYCLE) {
+		value = riscv_iommu_pmu_get_counter(pmu, idx) & ~RISCV_IOMMU_IOHPMCYCLES_OF;
+		writeq(value, pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES);
+	} else {
+		value = riscv_iommu_pmu_get_event(pmu, idx) & ~RISCV_IOMMU_IOHPMEVT_OF;
+		writeq(value, pmu->reg + RISCV_IOMMU_REG_IOHPMEVT_BASE + (idx - 1) * 8);
+	}
+}
+
+static void riscv_iommu_pmu_disable_ovf_intr(struct riscv_iommu_pmu *pmu, u32 idx)
+{
+	u64 value;
+
+	if (get_event(pmu->events[idx]) == RISCV_IOMMU_HPMEVENT_CYCLE) {
+		value = riscv_iommu_pmu_get_counter(pmu, idx) | RISCV_IOMMU_IOHPMCYCLES_OF;
+		writeq(value, pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES);
+	} else {
+		value = riscv_iommu_pmu_get_event(pmu, idx) | RISCV_IOMMU_IOHPMEVT_OF;
+		writeq(value, pmu->reg + RISCV_IOMMU_REG_IOHPMEVT_BASE + (idx - 1) * 8);
+	}
+}
+
+static void riscv_iommu_pmu_start_all(struct riscv_iommu_pmu *pmu)
+{
+	int idx;
+
+	for_each_set_bit(idx, pmu->used_counters, pmu->num_counters) {
+		riscv_iommu_pmu_enable_ovf_intr(pmu, idx);
+		riscv_iommu_pmu_enable_counter(pmu, idx);
+	}
+}
+
+static void riscv_iommu_pmu_stop_all(struct riscv_iommu_pmu *pmu)
+{
+	writel(GENMASK_ULL(pmu->num_counters - 1, 0),
+	       pmu->reg + RISCV_IOMMU_REG_IOCOUNTINH);
+}
+
+/* PMU APIs */
+static int riscv_iommu_pmu_set_period(struct perf_event *event)
+{
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	s64 left = local64_read(&hwc->period_left);
+	s64 period = hwc->sample_period;
+	u64 max_period = pmu->mask_counter;
+	int ret = 0;
+
+	if (unlikely(left <= -period)) {
+		left = period;
+		local64_set(&hwc->period_left, left);
+		hwc->last_period = period;
+		ret = 1;
+	}
+
+	if (unlikely(left <= 0)) {
+		left += period;
+		local64_set(&hwc->period_left, left);
+		hwc->last_period = period;
+		ret = 1;
+	}
+
+	/*
+	 * Limit the maximum period to prevent the counter value
+	 * from overtaking the one we are about to program. In
+	 * effect we are reducing max_period to account for
+	 * interrupt latency (and we are being very conservative).
+	 */
+	if (left > (max_period >> 1))
+		left = (max_period >> 1);
+
+	local64_set(&hwc->prev_count, (u64)-left);
+	riscv_iommu_pmu_set_counter(pmu, hwc->idx, (u64)(-left) & max_period);
+	perf_event_update_userpage(event);
+
+	return ret;
+}
+
+static int riscv_iommu_pmu_event_init(struct perf_event *event)
+{
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->idx = -1;
+	hwc->config = event->attr.config;
+
+	if (!is_sampling_event(event)) {
+		/*
+		 * For non-sampling runs, limit the sample_period to half
+		 * of the counter width. That way, the new counter value
+		 * is far less likely to overtake the previous one unless
+		 * you have some serious IRQ latency issues.
+		 */
+		hwc->sample_period = pmu->mask_counter >> 1;
+		hwc->last_period = hwc->sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+	}
+
+	return 0;
+}
+
+static void riscv_iommu_pmu_update(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	u64 delta, prev, now;
+	u32 idx = hwc->idx;
+
+	do {
+		prev = local64_read(&hwc->prev_count);
+		now = riscv_iommu_pmu_get_counter(pmu, idx);
+	} while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+	delta = FIELD_GET(RISCV_IOMMU_IOHPMCTR_COUNTER, now - prev) & pmu->mask_counter;
+	local64_add(delta, &event->count);
+	local64_sub(delta, &hwc->period_left);
+}
+
+static void riscv_iommu_pmu_start(struct perf_event *event, int flags)
+{
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+		return;
+
+	if (flags & PERF_EF_RELOAD)
+		WARN_ON_ONCE(!(event->hw.state & PERF_HES_UPTODATE));
+
+	hwc->state = 0;
+	riscv_iommu_pmu_set_period(event);
+	riscv_iommu_pmu_set_event(pmu, hwc->idx, hwc->config);
+	riscv_iommu_pmu_enable_ovf_intr(pmu, hwc->idx);
+	riscv_iommu_pmu_enable_counter(pmu, hwc->idx);
+
+	perf_event_update_userpage(event);
+}
+
+static void riscv_iommu_pmu_stop(struct perf_event *event, int flags)
+{
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (hwc->state & PERF_HES_STOPPED)
+		return;
+
+	riscv_iommu_pmu_set_event(pmu, hwc->idx, RISCV_IOMMU_HPMEVENT_INVALID);
+	riscv_iommu_pmu_disable_counter(pmu, hwc->idx);
+
+	if ((flags & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE))
+		riscv_iommu_pmu_update(event);
+
+	hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
+}
+
+static int riscv_iommu_pmu_add(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	unsigned int num_counters = pmu->num_counters;
+	int idx;
+
+	/* Reserve index zero for iohpmcycles */
+	if (get_event(event) == RISCV_IOMMU_HPMEVENT_CYCLE)
+		idx = 0;
+	else
+		idx = find_next_zero_bit(pmu->used_counters, num_counters, 1);
+
+	if (idx == num_counters)
+		return -EAGAIN;
+
+	set_bit(idx, pmu->used_counters);
+
+	pmu->events[idx] = event;
+	hwc->idx = idx;
+	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+	if (flags & PERF_EF_START)
+		riscv_iommu_pmu_start(event, flags);
+
+	/* Propagate changes to the userspace mapping. */
+	perf_event_update_userpage(event);
+
+	return 0;
+}
+
+static void riscv_iommu_pmu_read(struct perf_event *event)
+{
+	riscv_iommu_pmu_update(event);
+}
+
+static void riscv_iommu_pmu_del(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	struct riscv_iommu_pmu *pmu = to_riscv_iommu_pmu(event->pmu);
+	int idx = hwc->idx;
+
+	riscv_iommu_pmu_stop(event, PERF_EF_UPDATE);
+	pmu->events[idx] = NULL;
+	clear_bit(idx, pmu->used_counters);
+	perf_event_update_userpage(event);
+}
+
+irqreturn_t riscv_iommu_pmu_handle_irq(struct riscv_iommu_pmu *pmu)
+{
+	struct perf_sample_data data;
+	struct pt_regs *regs;
+	unsigned long ovf = readl(pmu->reg + RISCV_IOMMU_REG_IOCOUNTOVF);
+	int idx;
+
+	if (!ovf)
+		return IRQ_NONE;
+
+	riscv_iommu_pmu_stop_all(pmu);
+
+	regs = get_irq_regs();
+
+	for_each_set_bit(idx, &ovf, pmu->num_counters) {
+		struct perf_event *event = pmu->events[idx];
+		struct hw_perf_event *hwc;
+
+		if (WARN_ON_ONCE(!event) || !is_sampling_event(event))
+			continue;
+
+		hwc = &event->hw;
+
+		riscv_iommu_pmu_update(event);
+		perf_sample_data_init(&data, 0, hwc->last_period);
+		if (!riscv_iommu_pmu_set_period(event))
+			continue;
+
+		if (perf_event_overflow(event, &data, regs))
+			riscv_iommu_pmu_stop(event, 0);
+	}
+
+	riscv_iommu_pmu_start_all(pmu);
+
+	return IRQ_HANDLED;
+}
+
+int riscv_iommu_pmu_init(struct riscv_iommu_pmu *pmu, void __iomem *reg,
+			 const char *dev_name)
+{
+	char *name;
+	int ret;
+
+	pmu->reg = reg;
+	pmu->num_counters = RISCV_IOMMU_HPM_COUNTER_NUM;
+	pmu->mask_counter = RISCV_IOMMU_IOHPMCTR_COUNTER;
+
+	pmu->pmu = (struct pmu) {
+		.task_ctx_nr	= perf_invalid_context,
+		.event_init	= riscv_iommu_pmu_event_init,
+		.add		= riscv_iommu_pmu_add,
+		.del		= riscv_iommu_pmu_del,
+		.start		= riscv_iommu_pmu_start,
+		.stop		= riscv_iommu_pmu_stop,
+		.read		= riscv_iommu_pmu_read,
+		.attr_groups	= riscv_iommu_pmu_attr_grps,
+		.capabilities	= PERF_PMU_CAP_NO_EXCLUDE,
+		.module		= THIS_MODULE,
+	};
+
+	name = kasprintf(GFP_KERNEL, "riscv_iommu_pmu_%s", dev_name);
+	if (!name)
+		return -ENOMEM;
+
+	ret = perf_pmu_register(&pmu->pmu, name, -1);
+	if (ret) {
+		pr_err("Failed to register riscv_iommu_pmu_%s: %d\n",
+		       dev_name, ret);
+		return ret;
+	}
+
+	/* Stop all counters and later start the counter with perf */
+	riscv_iommu_pmu_stop_all(pmu);
+
+	pr_info("riscv_iommu_pmu_%s: Registered with %d counters\n",
+		dev_name, pmu->num_counters);
+
+	return 0;
+}
+
+void riscv_iommu_pmu_uninit(struct riscv_iommu_pmu *pmu)
+{
+	int idx;
+
+	/* Disable interrupt and functions */
+	for_each_set_bit(idx, pmu->used_counters, pmu->num_counters) {
+		riscv_iommu_pmu_disable_counter(pmu, idx);
+		riscv_iommu_pmu_disable_ovf_intr(pmu, idx);
+	}
+
+	perf_pmu_unregister(&pmu->pmu);
+}
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index b1c4664542b4..92659a8a75ae 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -60,11 +60,19 @@ struct riscv_iommu_device {
 	unsigned int ddt_mode;
 	dma_addr_t ddt_phys;
 	u64 *ddt_root;
+
+	/* hardware performance monitor */
+	struct riscv_iommu_pmu pmu;
 };
 
 int riscv_iommu_init(struct riscv_iommu_device *iommu);
 void riscv_iommu_remove(struct riscv_iommu_device *iommu);
 
+int riscv_iommu_pmu_init(struct riscv_iommu_pmu *pmu, void __iomem *reg,
+			 const char *name);
+void riscv_iommu_pmu_uninit(struct riscv_iommu_pmu *pmu);
+irqreturn_t riscv_iommu_pmu_handle_irq(struct riscv_iommu_pmu *pmu);
+
 #define riscv_iommu_readl(iommu, addr) \
 	readl_relaxed((iommu)->reg + (addr))
 
-- 
2.17.1


* [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-12-10  7:54   ` [External] " yunhui cui
  2025-09-01 13:36   ` [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support niliqiang
  2024-06-14 14:21 ` [RFC PATCH v2 03/10] iommu/riscv: use data structure instead of individual values Zong Li
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This patch initializes the PMU when the driver probes and uninitializes
it when the driver is removed. Interrupt handling is also provided; the
handler needs to be a primary handler instead of a threaded function,
because pt_regs is empty when the IRQ is threaded, while pt_regs is
required by perf_event_overflow.
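
For reference, a minimal sketch of the constraint (the handler name is
hypothetical): get_irq_regs() only returns the interrupted register
state in hard-IRQ context, which is what perf_event_overflow() needs to
attribute the sample; a threaded handler runs in process context, where
it returns NULL.

  #include <linux/interrupt.h>
  #include <linux/perf_event.h>
  #include <asm/irq_regs.h>

  /* Primary (hard-IRQ) handler: safe to consume get_irq_regs() here. */
  static irqreturn_t example_pmu_irq(int irq, void *dev_id)
  {
          struct pt_regs *regs = get_irq_regs(); /* NULL if threaded */

          if (!regs)
                  return IRQ_NONE; /* nothing to attribute samples to */

          /* ... read the overflow status, update the event, then:
           *     perf_event_overflow(event, &data, regs);
           */
          return IRQ_HANDLED;
  }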

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c | 65 +++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 8b6a64c1ad8d..1716b2251f38 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -540,6 +540,62 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
+/*
+ * IOMMU Hardware performance monitor
+ */
+
+/* HPM interrupt primary handler */
+static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
+{
+	struct riscv_iommu_device *iommu = (struct riscv_iommu_device *)dev_id;
+
+	/* Process pmu irq */
+	riscv_iommu_pmu_handle_irq(&iommu->pmu);
+
+	/* Clear performance monitoring interrupt pending */
+	riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);
+
+	return IRQ_HANDLED;
+}
+
+/* HPM initialization */
+static int riscv_iommu_hpm_enable(struct riscv_iommu_device *iommu)
+{
+	int rc;
+
+	if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
+		return 0;
+
+	/*
+	 * pt_regs is empty when threading the IRQ, but pt_regs is necessary
+	 * by perf_event_overflow. Use primary handler instead of thread
+	 * function for PM IRQ.
+	 *
+	 * Set the IRQF_ONESHOT flag because this IRQ might be shared with
+	 * threaded IRQs used by the other queues.
+	 */
+	rc = devm_request_irq(iommu->dev,
+			      iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
+			      riscv_iommu_hpm_irq_handler, IRQF_ONESHOT | IRQF_SHARED, NULL, iommu);
+	if (rc)
+		return rc;
+
+	return riscv_iommu_pmu_init(&iommu->pmu, iommu->reg, dev_name(iommu->dev));
+}
+
+/* HPM uninitialization */
+static void riscv_iommu_hpm_disable(struct riscv_iommu_device *iommu)
+{
+	if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
+		return;
+
+	devm_free_irq(iommu->dev,
+		      iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
+		      iommu);
+
+	riscv_iommu_pmu_uninit(&iommu->pmu);
+}
+
 /* Lookup and initialize device context info structure. */
 static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
 						 unsigned int devid)
@@ -1612,6 +1668,9 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
 	riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
 	riscv_iommu_queue_disable(&iommu->cmdq);
 	riscv_iommu_queue_disable(&iommu->fltq);
+
+	if (iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM)
+		riscv_iommu_pmu_uninit(&iommu->pmu);
 }
 
 int riscv_iommu_init(struct riscv_iommu_device *iommu)
@@ -1651,6 +1710,10 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 	if (rc)
 		goto err_queue_disable;
 
+	rc = riscv_iommu_hpm_enable(iommu);
+	if (rc)
+		goto err_hpm_disable;
+
 	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
 				    dev_name(iommu->dev));
 	if (rc) {
@@ -1669,6 +1732,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 err_remove_sysfs:
 	iommu_device_sysfs_remove(&iommu->iommu);
 err_iodir_off:
+	riscv_iommu_hpm_disable(iommu);
+err_hpm_disable:
 	riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
 err_queue_disable:
 	riscv_iommu_queue_disable(&iommu->fltq);
-- 
2.17.1


* [RFC PATCH v2 03/10] iommu/riscv: use data structure instead of individual values
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support Zong Li
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

The number of parameters will grow as more bit fields of the device
context need to be set up. Use a data structure to wrap them up.

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 1716b2251f38..9aeb4b20c145 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1045,7 +1045,7 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
  * interim translation faults.
  */
 static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
-				     struct device *dev, u64 fsc, u64 ta)
+				     struct device *dev, struct riscv_iommu_dc *new_dc)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct riscv_iommu_dc *dc;
@@ -1079,10 +1079,10 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
 	for (i = 0; i < fwspec->num_ids; i++) {
 		dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
 		tc = READ_ONCE(dc->tc);
-		tc |= ta & RISCV_IOMMU_DC_TC_V;
+		tc |= new_dc->ta & RISCV_IOMMU_DC_TC_V;
 
-		WRITE_ONCE(dc->fsc, fsc);
-		WRITE_ONCE(dc->ta, ta & RISCV_IOMMU_PC_TA_PSCID);
+		WRITE_ONCE(dc->fsc, new_dc->fsc);
+		WRITE_ONCE(dc->ta, new_dc->ta & RISCV_IOMMU_PC_TA_PSCID);
 		/* Update device context, write TC.V as the last step. */
 		dma_wmb();
 		WRITE_ONCE(dc->tc, tc);
@@ -1369,20 +1369,20 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
 	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
 	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
-	u64 fsc, ta;
+	struct riscv_iommu_dc dc = {0};
 
 	if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
 		return -ENODEV;
 
-	fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
-	      FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
-	ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
-	     RISCV_IOMMU_PC_TA_V;
+	dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
+		 FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+	dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
+			   RISCV_IOMMU_PC_TA_V;
 
 	if (riscv_iommu_bond_link(domain, dev))
 		return -ENOMEM;
 
-	riscv_iommu_iodir_update(iommu, dev, fsc, ta);
+	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
 	info->domain = domain;
 
@@ -1484,9 +1484,12 @@ static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
 {
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
 	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+	struct riscv_iommu_dc dc = {0};
+
+	dc.fsc = RISCV_IOMMU_FSC_BARE;
 
 	/* Make device context invalid, translation requests will fault w/ #258 */
-	riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
+	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
 	info->domain = NULL;
 
@@ -1505,8 +1508,12 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
 {
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
 	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+	struct riscv_iommu_dc dc = {0};
+
+	dc.fsc = RISCV_IOMMU_FSC_BARE;
+	dc.ta = RISCV_IOMMU_PC_TA_V;
 
-	riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
+	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
 	info->domain = NULL;
 
-- 
2.17.1


* [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (2 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 03/10] iommu/riscv: use data structure instead of individual values Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-15  3:14   ` Baolu Lu
  2024-06-14 14:21 ` [RFC PATCH v2 05/10] iommu/riscv: support GSCID and GVMA invalidation command Zong Li
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

Add the iotlb_sync_map operation for flushing the IOTLB. Software must
flush the IOTLB after each page table update.
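
For context, a simplified sketch of the core path that drives this op
(adapted from drivers/iommu/iommu.c; the exact code may differ): the
core calls iotlb_sync_map() once after the range has been mapped, so the
new op invalidates exactly the range that was just installed.

  ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
  if (ret == 0 && ops->iotlb_sync_map) {
          ret = ops->iotlb_sync_map(domain, iova, size);
          if (ret)
                  iommu_unmap(domain, iova, size); /* roll back */
  }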

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 9aeb4b20c145..df7aeb2571ae 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1115,6 +1115,16 @@ static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
 	riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
 }
 
+static int riscv_iommu_iotlb_sync_map(struct iommu_domain *iommu_domain,
+				      unsigned long iova, size_t size)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+
+	riscv_iommu_iotlb_inval(domain, iova, iova + size - 1);
+
+	return 0;
+}
+
 static inline size_t get_page_size(size_t size)
 {
 	if (size >= IOMMU_PAGE_SIZE_512G)
@@ -1396,6 +1406,7 @@ static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
 	.unmap_pages = riscv_iommu_unmap_pages,
 	.iova_to_phys = riscv_iommu_iova_to_phys,
 	.iotlb_sync = riscv_iommu_iotlb_sync,
+	.iotlb_sync_map = riscv_iommu_iotlb_sync_map,
 	.flush_iotlb_all = riscv_iommu_iotlb_flush_all,
 };
 
-- 
2.17.1


* [RFC PATCH v2 05/10] iommu/riscv: support GSCID and GVMA invalidation command
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (3 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information Zong Li
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This patch adds an ID allocator for GSCIDs and a wrapper for setting
the GSCID in the IOTLB invalidation command.

Set up iohgatp to enable the second-stage table, and flush the stage-2
table if the GSCID is set.

The GSCID of a domain should be freed when the domain is released. A
GSCID will be allocated for the parent domain in the nested IOMMU flow.
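
For illustration, a sketch of how a stage-2 flush differs from a
stage-1 flush when built with this series' helpers (variable names are
illustrative):

  struct riscv_iommu_command cmd;

  if (domain->gscid) {
          /* stage-2: IOTINVAL.GVMA tagged with the GSCID */
          riscv_iommu_cmd_inval_gvma(&cmd);
          riscv_iommu_cmd_inval_set_gscid(&cmd, domain->gscid);
  } else {
          /* stage-1: IOTINVAL.VMA tagged with the PSCID */
          riscv_iommu_cmd_inval_vma(&cmd);
          riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
  }
  riscv_iommu_cmd_inval_set_addr(&cmd, iova);
  riscv_iommu_cmd_send(iommu, &cmd);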

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu-bits.h |  7 ++++++
 drivers/iommu/riscv/iommu.c      | 39 ++++++++++++++++++++++++++++----
 2 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index 60523449f016..214735a335fd 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -731,6 +731,13 @@ static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
 	cmd->dword1 = 0;
 }
 
+static inline void riscv_iommu_cmd_inval_gvma(struct riscv_iommu_command *cmd)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA);
+	cmd->dword1 = 0;
+}
+
 static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
 						  u64 addr)
 {
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index df7aeb2571ae..45309bd096e5 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -45,6 +45,10 @@
 static DEFINE_IDA(riscv_iommu_pscids);
 #define RISCV_IOMMU_MAX_PSCID		(BIT(20) - 1)
 
+/* IOMMU GSCID allocation namespace. */
+static DEFINE_IDA(riscv_iommu_gscids);
+#define RISCV_IOMMU_MAX_GSCID		(BIT(16) - 1)
+
 /* Device resource-managed allocations */
 struct riscv_iommu_devres {
 	void *addr;
@@ -845,6 +849,7 @@ struct riscv_iommu_domain {
 	struct list_head bonds;
 	spinlock_t lock;		/* protect bonds list updates. */
 	int pscid;
+	int gscid;
 	int amo_enabled:1;
 	int numa_node;
 	unsigned int pgd_mode;
@@ -993,20 +998,33 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
 	rcu_read_lock();
 
 	prev = NULL;
+
 	list_for_each_entry_rcu(bond, &domain->bonds, list) {
 		iommu = dev_to_iommu(bond->dev);
 
 		/*
 		 * IOTLB invalidation request can be safely omitted if already sent
-		 * to the IOMMU for the same PSCID, and with domain->bonds list
+		 * to the IOMMU for the same PSCID/GSCID, and with domain->bonds list
 		 * arranged based on the device's IOMMU, it's sufficient to check
 		 * last device the invalidation was sent to.
 		 */
 		if (iommu == prev)
 			continue;
 
-		riscv_iommu_cmd_inval_vma(&cmd);
-		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+		/*
+		 * An S2 domain needs to flush entries in the stage-2 page
+		 * table. Its bond list holds both host and pass-through
+		 * devices; the GVMA command has no effect on host devices,
+		 * because there are no host-device mappings in the stage-2
+		 * page table.
+		 */
+		if (domain->gscid) {
+			riscv_iommu_cmd_inval_gvma(&cmd);
+			riscv_iommu_cmd_inval_set_gscid(&cmd, domain->gscid);
+		} else {
+			riscv_iommu_cmd_inval_vma(&cmd);
+			riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+		}
+
 		if (len && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
 			for (iova = start; iova < end; iova += PAGE_SIZE) {
 				riscv_iommu_cmd_inval_set_addr(&cmd, iova);
@@ -1015,6 +1033,7 @@ static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
 		} else {
 			riscv_iommu_cmd_send(iommu, &cmd);
 		}
+
 		prev = iommu;
 	}
 
@@ -1083,6 +1102,7 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
 
 		WRITE_ONCE(dc->fsc, new_dc->fsc);
 		WRITE_ONCE(dc->ta, new_dc->ta & RISCV_IOMMU_PC_TA_PSCID);
+		WRITE_ONCE(dc->iohgatp, new_dc->iohgatp);
 		/* Update device context, write TC.V as the last step. */
 		dma_wmb();
 		WRITE_ONCE(dc->tc, tc);
@@ -1354,6 +1374,9 @@ static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
 	if ((int)domain->pscid > 0)
 		ida_free(&riscv_iommu_pscids, domain->pscid);
 
+	if ((int)domain->gscid > 0)
+		ida_free(&riscv_iommu_gscids, domain->gscid);
+
 	riscv_iommu_pte_free(domain, _io_pte_entry(pfn, _PAGE_TABLE), NULL);
 	kfree(domain);
 }
@@ -1384,8 +1407,14 @@ static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
 	if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
 		return -ENODEV;
 
-	dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
-		 FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+	if (domain->gscid)
+		dc.iohgatp = FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_MODE, domain->pgd_mode) |
+			     FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_GSCID, domain->gscid) |
+			     FIELD_PREP(RISCV_IOMMU_DC_IOHGATP_PPN, virt_to_pfn(domain->pgd_root));
+	else
+		dc.fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
+			 FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+
 	dc.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
 			   RISCV_IOMMU_PC_TA_V;
 
-- 
2.17.1


* [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (4 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 05/10] iommu/riscv: support GSCID and GVMA invalidation command Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-19 15:49   ` Jason Gunthorpe
  2024-06-14 14:21 ` [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace Zong Li
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This patch implements the .hw_info operation and the related data
structures for passing the IOMMU hardware capabilities to iommufd.
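
From the VMM side, the expected usage is roughly as follows (a hedged
sketch of the existing iommufd IOMMU_GET_HW_INFO uAPI; the function name
is hypothetical):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  static int get_riscv_iommu_info(int iommufd, uint32_t dev_id,
                                  struct iommu_hw_info_riscv_iommu *data)
  {
          struct iommu_hw_info cmd = {
                  .size = sizeof(cmd),
                  .dev_id = dev_id, /* iommufd device object id */
                  .data_len = sizeof(*data),
                  .data_uptr = (uintptr_t)data,
          };

          if (ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd))
                  return -1;

          /* cmd.out_data_type is IOMMU_HW_INFO_TYPE_RISCV_IOMMU;
           * data->capability and data->fctl mirror the host registers. */
          return 0;
  }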

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c  | 20 ++++++++++++++++++++
 include/uapi/linux/iommufd.h | 18 ++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 45309bd096e5..2130106e421f 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -19,6 +19,7 @@
 #include <linux/iopoll.h>
 #include <linux/kernel.h>
 #include <linux/pci.h>
+#include <uapi/linux/iommufd.h>
 
 #include "../iommu-pages.h"
 #include "iommu-bits.h"
@@ -1567,6 +1568,24 @@ static struct iommu_domain riscv_iommu_identity_domain = {
 	}
 };
 
+static void *riscv_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+	struct iommu_hw_info_riscv_iommu *info;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return ERR_PTR(-ENOMEM);
+
+	info->capability = iommu->caps;
+	info->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+
+	*length = sizeof(*info);
+	*type = IOMMU_HW_INFO_TYPE_RISCV_IOMMU;
+
+	return info;
+}
+
 static int riscv_iommu_device_domain_type(struct device *dev)
 {
 	return 0;
@@ -1644,6 +1663,7 @@ static void riscv_iommu_release_device(struct device *dev)
 static const struct iommu_ops riscv_iommu_ops = {
 	.pgsize_bitmap = SZ_4K,
 	.of_xlate = riscv_iommu_of_xlate,
+	.hw_info = riscv_iommu_hw_info,
 	.identity_domain = &riscv_iommu_identity_domain,
 	.blocked_domain = &riscv_iommu_blocking_domain,
 	.release_domain = &riscv_iommu_blocking_domain,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 1dfeaa2e649e..736f4408b5e0 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -475,15 +475,33 @@ struct iommu_hw_info_vtd {
 	__aligned_u64 ecap_reg;
 };
 
+/**
+ * struct iommu_hw_info_riscv_iommu - RISCV IOMMU hardware information
+ *
+ * @capability: Value of RISC-V IOMMU capability register defined in
+ *              RISC-V IOMMU spec section 5.3 IOMMU capabilities
+ * @fctl: Value of RISC-V IOMMU feature control register defined in
+ *              RISC-V IOMMU spec section 5.4 Features-control register
+ *
+ * Don't advertise ATS support to the guest because the driver doesn't support it.
+ */
+struct iommu_hw_info_riscv_iommu {
+	__aligned_u64 capability;
+	__u32 fctl;
+	__u32 __reserved;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
  *                           info
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_RISCV_IOMMU: RISC-V iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD,
+	IOMMU_HW_INFO_TYPE_RISCV_IOMMU,
 };
 
 /**
-- 
2.17.1


* [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (5 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-19 16:02   ` Jason Gunthorpe
  2024-06-19 16:34   ` Joao Martins
  2024-06-14 14:21 ` [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache Zong Li
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This patch implements the .domain_alloc_user operation for creating
domains owned by userspace, e.g. through IOMMUFD. An s2 domain is added
as the parent domain for the second stage; the s1 domain will be the
first stage.

Don't remove the IOMMU private data of the device in the blocked domain,
because it holds the user data of the device, which is used when
attaching the device to the s1 domain.
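
From the VMM side, the expected flow is roughly (a hedged sketch against
the iommufd IOMMU_HWPT_ALLOC uAPI; ids and buffers are placeholders):

  /* 1) allocate the nest-parent (s2) hwpt over an IOAS */
  struct iommu_hwpt_alloc s2 = {
          .size = sizeof(s2),
          .flags = IOMMU_HWPT_ALLOC_NEST_PARENT,
          .dev_id = dev_id,
          .pt_id = ioas_id,
  };
  ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2);

  /* 2) allocate the nested (s1) hwpt, passing the guest DC in */
  struct iommu_hwpt_riscv_iommu arg = {
          .dc_len = dc_len, /* full DC, or half without MSI_FLAT */
          .dc_uptr = (uintptr_t)&guest_dc,
          .event_len = sizeof(event),
          .out_event_uptr = (uintptr_t)&event,
  };
  struct iommu_hwpt_alloc s1 = {
          .size = sizeof(s1),
          .dev_id = dev_id,
          .pt_id = s2.out_hwpt_id, /* parent hwpt */
          .data_type = IOMMU_HWPT_DATA_RISCV_IOMMU,
          .data_len = sizeof(arg),
          .data_uptr = (uintptr_t)&arg,
  };
  ioctl(iommufd, IOMMU_HWPT_ALLOC, &s1);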

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c  | 236 ++++++++++++++++++++++++++++++++++-
 include/uapi/linux/iommufd.h |  17 +++
 2 files changed, 252 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 2130106e421f..410b236e9b24 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -846,6 +846,8 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
 
 /* This struct contains protection domain specific IOMMU driver data. */
 struct riscv_iommu_domain {
+	struct riscv_iommu_domain *s2;
+	struct riscv_iommu_device *iommu;
 	struct iommu_domain domain;
 	struct list_head bonds;
 	spinlock_t lock;		/* protect bonds list updates. */
@@ -863,6 +865,7 @@ struct riscv_iommu_domain {
 /* Private IOMMU data for managed devices, dev_iommu_priv_* */
 struct riscv_iommu_info {
 	struct riscv_iommu_domain *domain;
+	struct riscv_iommu_dc dc_user;
 };
 
 /*
@@ -1532,7 +1535,6 @@ static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
 	/* Make device context invalid, translation requests will fault w/ #258 */
 	riscv_iommu_iodir_update(iommu, dev, &dc);
 	riscv_iommu_bond_unlink(info->domain, dev);
-	info->domain = NULL;
 
 	return 0;
 }
@@ -1568,6 +1570,237 @@ static struct iommu_domain riscv_iommu_identity_domain = {
 	}
 };
 
+/*
+ * Nested IOMMU operations
+ */
+
+static int riscv_iommu_attach_dev_nested(struct iommu_domain *domain, struct device *dev)
+{
+	struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+
+	/*
+	 * Add the bond to the new domain's list, but don't unlink it from the
+	 * current domain. We need to flush entries in the stage-2 page table
+	 * by iterating the list.
+	 */
+	if (riscv_iommu_bond_link(riscv_domain, dev))
+		return -ENOMEM;
+
+	riscv_iommu_iotlb_inval(riscv_domain, 0, ULONG_MAX);
+	info->dc_user.ta |= RISCV_IOMMU_PC_TA_V;
+	riscv_iommu_iodir_update(iommu, dev, &info->dc_user);
+
+	info->domain = riscv_domain;
+
+	return 0;
+}
+
+static void riscv_iommu_domain_free_nested(struct iommu_domain *domain)
+{
+	struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
+	struct riscv_iommu_bond *bond;
+
+	/* Unlink bonds in the s2 domain; bonds are linked on both s1 and s2 */
+	list_for_each_entry_rcu(bond, &riscv_domain->s2->bonds, list)
+		riscv_iommu_bond_unlink(riscv_domain->s2, bond->dev);
+
+	if ((int)riscv_domain->pscid > 0)
+		ida_free(&riscv_iommu_pscids, riscv_domain->pscid);
+
+	kfree(riscv_domain);
+}
+
+static const struct iommu_domain_ops riscv_iommu_nested_domain_ops = {
+	.attach_dev	= riscv_iommu_attach_dev_nested,
+	.free		= riscv_iommu_domain_free_nested,
+};
+
+static int
+riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
+{
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+	struct riscv_iommu_dc dc;
+	struct riscv_iommu_fq_record event;
+	u64 dc_len = sizeof(struct riscv_iommu_dc) >>
+		     (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT));
+	u64 event_len = sizeof(struct riscv_iommu_fq_record);
+	void __user *event_user = NULL;
+
+	for (int i = 0; i < fwspec->num_ids; i++) {
+		event.hdr =
+			FIELD_PREP(RISCV_IOMMU_FQ_HDR_CAUSE, RISCV_IOMMU_FQ_CAUSE_DDT_INVALID) |
+			FIELD_PREP(RISCV_IOMMU_FQ_HDR_DID, fwspec->ids[i]);
+
+		/* Sanity check DC of stage-1 from user data */
+		if (!user_arg->out_event_uptr || user_arg->event_len != event_len)
+			return -EINVAL;
+
+		event_user = u64_to_user_ptr(user_arg->out_event_uptr);
+
+		if (!user_arg->dc_uptr || user_arg->dc_len != dc_len)
+			return -EINVAL;
+
+		if (copy_from_user(&dc, u64_to_user_ptr(user_arg->dc_uptr), dc_len))
+			return -EFAULT;
+
+		if (!(dc.tc & RISCV_IOMMU_DDTE_V)) {
+			dev_dbg(dev, "Invalid DDT from user data\n");
+			if (copy_to_user(event_user, &event, event_len))
+				return -EFAULT;
+		}
+
+		if (!dc.fsc || dc.iohgatp) {
+			dev_dbg(dev, "Wrong page table from user data\n");
+			if (copy_to_user(event_user, &event, event_len))
+				return -EFAULT;
+		}
+
+		/* Save DC of stage-1 from user data */
+		memcpy(&info->dc_user,
+		       riscv_iommu_get_dc(iommu, fwspec->ids[i]),
+		       sizeof(struct riscv_iommu_dc));
+		info->dc_user.fsc = dc.fsc;
+	}
+
+	return 0;
+}
+
+static struct iommu_domain *
+riscv_iommu_domain_alloc_nested(struct device *dev,
+				struct iommu_domain *parent,
+				const struct iommu_user_data *user_data)
+{
+	struct riscv_iommu_domain *s2_domain = iommu_domain_to_riscv(parent);
+	struct riscv_iommu_domain *s1_domain;
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+	struct iommu_hwpt_riscv_iommu arg;
+	int ret, va_bits;
+
+	if (user_data->type != IOMMU_HWPT_DATA_RISCV_IOMMU)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (parent->type != IOMMU_DOMAIN_UNMANAGED)
+		return ERR_PTR(-EINVAL);
+
+	ret = iommu_copy_struct_from_user(&arg,
+					  user_data,
+					  IOMMU_HWPT_DATA_RISCV_IOMMU,
+					  out_event_uptr);
+	if (ret)
+		return ERR_PTR(ret);
+
+	s1_domain = kzalloc(sizeof(*s1_domain), GFP_KERNEL);
+	if (!s1_domain)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&s1_domain->lock);
+	INIT_LIST_HEAD_RCU(&s1_domain->bonds);
+
+	s1_domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
+					   RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
+	if (s1_domain->pscid < 0) {
+		kfree(s1_domain);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Get device context of stage-1 from user */
+	ret = riscv_iommu_get_dc_user(dev, &arg);
+	if (ret) {
+		ida_free(&riscv_iommu_pscids, s1_domain->pscid);
+		kfree(s1_domain);
+		return ERR_PTR(ret);
+	}
+
+	if (!iommu) {
+		va_bits = VA_BITS;
+	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57) {
+		va_bits = 57;
+	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48) {
+		va_bits = 48;
+	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39) {
+		va_bits = 39;
+	} else {
+		dev_err(dev, "cannot find supported page table mode\n");
+		return ERR_PTR(-ENODEV);
+	}
+
+	/*
+	 * The ops->domain_alloc_user op can be called directly by the iommufd
+	 * core instead of the iommu core, so this function needs to set the
+	 * default values of the following data members:
+	 *  - domain->pgsize_bitmap
+	 *  - domain->geometry
+	 *  - domain->type
+	 *  - domain->ops
+	 */
+	s1_domain->s2 = s2_domain;
+	s1_domain->iommu = iommu;
+	s1_domain->domain.type = IOMMU_DOMAIN_NESTED;
+	s1_domain->domain.ops = &riscv_iommu_nested_domain_ops;
+	s1_domain->domain.pgsize_bitmap = SZ_4K;
+	s1_domain->domain.geometry.aperture_start = 0;
+	s1_domain->domain.geometry.aperture_end = DMA_BIT_MASK(va_bits - 1);
+	s1_domain->domain.geometry.force_aperture = true;
+
+	return &s1_domain->domain;
+}
+
+static struct iommu_domain *
+riscv_iommu_domain_alloc_user(struct device *dev, u32 flags,
+			      struct iommu_domain *parent,
+			      const struct iommu_user_data *user_data)
+{
+	struct iommu_domain *domain;
+	struct riscv_iommu_domain *riscv_domain;
+
+	/* Allocate stage-1 domain if it has stage-2 parent domain */
+	if (parent)
+		return riscv_iommu_domain_alloc_nested(dev, parent, user_data);
+
+	if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (user_data)
+		return ERR_PTR(-EINVAL);
+
+	/* The domain returned by domain_alloc_user must be fully initialized */
+	domain = iommu_domain_alloc(dev->bus);
+	if (!domain)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * We assume that only nest-parent (g-stage) domains come here.
+	 * TODO: Shadow page tables are not supported yet. We currently
+	 *       can't distinguish a g-stage table from a shadow page
+	 *       table here, and a shadow page table shouldn't be put
+	 *       at stage-2.
+	 */
+	riscv_domain = iommu_domain_to_riscv(domain);
+
+	/* pgd_root may be allocated in .domain_alloc_paging */
+	if (riscv_domain->pgd_root)
+		iommu_free_page(riscv_domain->pgd_root);
+
+	riscv_domain->pgd_root = iommu_alloc_pages_node(riscv_domain->numa_node,
+							GFP_KERNEL_ACCOUNT,
+							2);
+	if (!riscv_domain->pgd_root)
+		return ERR_PTR(-ENOMEM);
+
+	riscv_domain->gscid = ida_alloc_range(&riscv_iommu_gscids, 1,
+					      RISCV_IOMMU_MAX_GSCID, GFP_KERNEL);
+	if (riscv_domain->gscid < 0) {
+		iommu_free_pages(riscv_domain->pgd_root, 2);
+		kfree(riscv_domain);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return domain;
+}
+
 static void *riscv_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
 {
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
@@ -1668,6 +1901,7 @@ static const struct iommu_ops riscv_iommu_ops = {
 	.blocked_domain = &riscv_iommu_blocking_domain,
 	.release_domain = &riscv_iommu_blocking_domain,
 	.domain_alloc_paging = riscv_iommu_alloc_paging_domain,
+	.domain_alloc_user = riscv_iommu_domain_alloc_user,
 	.def_domain_type = riscv_iommu_device_domain_type,
 	.device_group = riscv_iommu_device_group,
 	.probe_device = riscv_iommu_probe_device,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 736f4408b5e0..514463fe85d3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -390,14 +390,31 @@ struct iommu_hwpt_vtd_s1 {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_riscv_iommu - RISCV IOMMU stage-1 device context table
+ *                                 info (IOMMU_HWPT_TYPE_RISCV_IOMMU)
+ * @dc_len: Length of device context
+ * @dc_uptr: User pointer to the address of device context
+ * @event_len: Length of an event record
+ * @out_event_uptr: User pointer to the address of event record
+ */
+struct iommu_hwpt_riscv_iommu {
+	__aligned_u64 dc_len;
+	__aligned_u64 dc_uptr;
+	__aligned_u64 event_len;
+	__aligned_u64 out_event_uptr;
+};
+
 /**
  * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
  * @IOMMU_HWPT_DATA_NONE: no data
  * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_DATA_RISCV_IOMMU: RISC-V IOMMU device context table
  */
 enum iommu_hwpt_data_type {
 	IOMMU_HWPT_DATA_NONE,
 	IOMMU_HWPT_DATA_VTD_S1,
+	IOMMU_HWPT_DATA_RISCV_IOMMU,
 };
 
 /**
-- 
2.17.1


* [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (6 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-15  3:22   ` Baolu Lu
  2024-06-19 16:17   ` Jason Gunthorpe
  2024-06-14 14:21 ` [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains Zong Li
  2024-06-14 14:21 ` [RFC PATCH v2 10/10] iommu/riscv: support nested iommu for get_msi_mapping_domain operation Zong Li
  9 siblings, 2 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

This patch implements the cache_invalidate_user operation, which allows
userspace to flush the hardware caches for a nested domain through
iommufd.
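
A hedged userspace sketch (existing iommufd IOMMU_HWPT_INVALIDATE uAPI;
this assumes @cmd in the new struct is a two-dword array, and the guest
command is forwarded verbatim to be fixed up by this patch):

  /* one raw IOTINVAL command lifted from the guest command queue */
  struct iommu_hwpt_riscv_iommu_invalidate inv = {
          .cmd = { guest_cmd.dword0, guest_cmd.dword1 },
  };
  struct iommu_hwpt_invalidate cmd = {
          .size = sizeof(cmd),
          .hwpt_id = s1_hwpt_id,
          .data_uptr = (uintptr_t)&inv,
          .data_type = IOMMU_HWPT_INVALIDATE_DATA_RISCV_IOMMU,
          .entry_len = sizeof(inv),
          .entry_num = 1,
  };
  ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd);
  /* on return, cmd.entry_num holds the number of handled entries */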

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c  | 90 ++++++++++++++++++++++++++++++++++--
 include/uapi/linux/iommufd.h | 11 +++++
 2 files changed, 97 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 410b236e9b24..d08eb0a2939e 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1587,8 +1587,9 @@ static int riscv_iommu_attach_dev_nested(struct iommu_domain *domain, struct dev
 	if (riscv_iommu_bond_link(riscv_domain, dev))
 		return -ENOMEM;
 
-	riscv_iommu_iotlb_inval(riscv_domain, 0, ULONG_MAX);
-	info->dc_user.ta |= RISCV_IOMMU_PC_TA_V;
+	if (riscv_iommu_bond_link(info->domain, dev))
+		return -ENOMEM;
+
 	riscv_iommu_iodir_update(iommu, dev, &info->dc_user);
 
 	info->domain = riscv_domain;
@@ -1611,13 +1612,92 @@ static void riscv_iommu_domain_free_nested(struct iommu_domain *domain)
 	kfree(riscv_domain);
 }
 
+static int riscv_iommu_fix_user_cmd(struct riscv_iommu_command *cmd,
+				    unsigned int pscid, unsigned int gscid)
+{
+	u32 opcode = FIELD_GET(RISCV_IOMMU_CMD_OPCODE, cmd->dword0);
+
+	switch (opcode) {
+	case RISCV_IOMMU_CMD_IOTINVAL_OPCODE: {
+		u32 func = FIELD_GET(RISCV_IOMMU_CMD_FUNC, cmd->dword0);
+
+		if (func != RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA &&
+		    func != RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA) {
+			pr_warn("The IOTINVAL function: 0x%x is not supported\n",
+				func);
+			return -EOPNOTSUPP;
+		}
+
+		if (func == RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA) {
+			cmd->dword0 &= ~RISCV_IOMMU_CMD_FUNC;
+			cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_FUNC,
+						  RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
+		}
+
+		cmd->dword0 &= ~(RISCV_IOMMU_CMD_IOTINVAL_PSCID |
+				 RISCV_IOMMU_CMD_IOTINVAL_GSCID);
+		riscv_iommu_cmd_inval_set_pscid(cmd, pscid);
+		riscv_iommu_cmd_inval_set_gscid(cmd, gscid);
+		break;
+	case RISCV_IOMMU_CMD_IODIR_OPCODE:
+		/*
+	 * Ensure the device ID is correct. We expect that the VMM has
+	 * already translated the device ID from the guest's to the host's.
+		 */
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int riscv_iommu_cache_invalidate_user(struct iommu_domain *domain,
+					     struct iommu_user_data_array *array)
+{
+	struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
+	struct iommu_hwpt_riscv_iommu_invalidate inv_info;
+	int ret, index;
+
+	if (array->type != IOMMU_HWPT_INVALIDATE_DATA_RISCV_IOMMU) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (index = 0; index < array->entry_num; index++) {
+		ret = iommu_copy_struct_from_user_array(&inv_info, array,
+							IOMMU_HWPT_INVALIDATE_DATA_RISCV_IOMMU,
+							index, cmd);
+		if (ret)
+			break;
+
+		ret = riscv_iommu_fix_user_cmd((struct riscv_iommu_command *)inv_info.cmd,
+					       riscv_domain->pscid,
+					       riscv_domain->s2->gscid);
+		if (ret == -EOPNOTSUPP)
+			continue;
+
+		riscv_iommu_cmd_send(riscv_domain->iommu,
+				     (struct riscv_iommu_command *)inv_info.cmd);
+		riscv_iommu_cmd_sync(riscv_domain->iommu,
+				     RISCV_IOMMU_IOTINVAL_TIMEOUT);
+	}
+
+out:
+	array->entry_num = index;
+
+	return ret;
+}
+
 static const struct iommu_domain_ops riscv_iommu_nested_domain_ops = {
 	.attach_dev	= riscv_iommu_attach_dev_nested,
 	.free		= riscv_iommu_domain_free_nested,
+	.cache_invalidate_user = riscv_iommu_cache_invalidate_user,
 };
 
 static int
-riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
+riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg,
+			struct riscv_iommu_domain *s1_domain)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
@@ -1663,6 +1743,8 @@ riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_
 		       riscv_iommu_get_dc(iommu, fwspec->ids[i]),
 		       sizeof(struct riscv_iommu_dc));
 		info->dc_user.fsc = dc.fsc;
+		info->dc_user.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, s1_domain->pscid) |
+					      RISCV_IOMMU_PC_TA_V;
 	}
 
 	return 0;
@@ -1708,7 +1790,7 @@ riscv_iommu_domain_alloc_nested(struct device *dev,
 	}
 
 	/* Get device context of stage-1 from user*/
-	ret = riscv_iommu_get_dc_user(dev, &arg);
+	ret = riscv_iommu_get_dc_user(dev, &arg, s1_domain);
 	if (ret) {
 		kfree(s1_domain);
 		return ERR_PTR(-EINVAL);
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 514463fe85d3..876cbe980a42 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -653,9 +653,11 @@ struct iommu_hwpt_get_dirty_bitmap {
  * enum iommu_hwpt_invalidate_data_type - IOMMU HWPT Cache Invalidation
  *                                        Data Type
  * @IOMMU_HWPT_INVALIDATE_DATA_VTD_S1: Invalidation data for VTD_S1
+ * @IOMMU_HWPT_INVALIDATE_DATA_RISCV_IOMMU: Invalidation data for RISCV_IOMMU
  */
 enum iommu_hwpt_invalidate_data_type {
 	IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+	IOMMU_HWPT_INVALIDATE_DATA_RISCV_IOMMU,
 };
 
 /**
@@ -694,6 +696,15 @@ struct iommu_hwpt_vtd_s1_invalidate {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_riscv_iommu_invalidate - RISCV IOMMU cache invalidation
+ *                                            (IOMMU_HWPT_TYPE_RISCV_IOMMU)
+ * @cmd: An array holds a command for cache invalidation
+ */
+struct iommu_hwpt_riscv_iommu_invalidate {
+	__aligned_u64 cmd[2];
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread
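
As a usage illustration for the invalidation uAPI above, here is a
hedged sketch of how a VMM might forward one guest command to the
host; it assumes the iommu_hwpt_invalidate layout from the iommufd
uAPI of this era and a hwpt_id that refers to a nested domain. The
driver-side riscv_iommu_fix_user_cmd() shown in the patch then
rewrites the PSCID/GSCID (and rejects unsupported opcodes) before the
command reaches the hardware queue.

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	static int forward_guest_inval(int iommufd, uint32_t hwpt_id,
				       const uint64_t guest_cmd[2])
	{
		struct iommu_hwpt_riscv_iommu_invalidate inv;
		struct iommu_hwpt_invalidate arg = {
			.size = sizeof(arg),
			.hwpt_id = hwpt_id,
			.data_uptr = (uintptr_t)&inv,
			.data_type = IOMMU_HWPT_INVALIDATE_DATA_RISCV_IOMMU,
			.entry_len = sizeof(inv),
			.entry_num = 1,
		};

		/* Hand the guest's two-dword command over unmodified. */
		memcpy(inv.cmd, guest_cmd, sizeof(inv.cmd));

		/* On return, arg.entry_num reports how many entries completed. */
		return ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &arg);
	}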

* [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (7 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache Zong Li
@ 2024-06-14 14:21 ` Zong Li
  2024-06-14 18:12   ` Nicolin Chen
  2024-06-14 14:21 ` [RFC PATCH v2 10/10] iommu:riscv: support nested iommu for get_msi_mapping_domain operation Zong Li
  9 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC (permalink / raw)
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Nicolin Chen

From: Robin Murphy <robin.murphy@arm.com>

Currently, iommu-dma is the only place outside of IOMMUFD and drivers
which might need to be aware of the stage 2 domain encapsulated within
a nested domain. This would be in the legacy-VFIO-style case where we're
using host-managed MSIs with an identity mapping at stage 1, where it is
the underlying stage 2 domain which owns an MSI cookie and holds the
corresponding dynamic mappings. Hook up the new op to resolve what we
need from a nested domain.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 18 ++++++++++++++++--
 include/linux/iommu.h     |  4 ++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f731e4b2a417..d4235bb0a427 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1806,6 +1806,20 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 	return NULL;
 }
 
+/*
+ * Nested domains may not have an MSI cookie or accept mappings, but they may
+ * be related to a domain which does, so we let them tell us what they need.
+ */
+static struct iommu_domain *iommu_dma_get_msi_mapping_domain(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	if (domain && domain->type == IOMMU_DOMAIN_NESTED &&
+	    domain->ops->get_msi_mapping_domain)
+		domain = domain->ops->get_msi_mapping_domain(domain);
+	return domain;
+}
+
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
  * @desc: MSI descriptor, will store the MSI page
@@ -1816,7 +1830,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
 int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
 {
 	struct device *dev = msi_desc_to_dev(desc);
-	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct iommu_domain *domain = iommu_dma_get_msi_mapping_domain(dev);
 	struct iommu_dma_msi_page *msi_page;
 	static DEFINE_MUTEX(msi_prepare_lock); /* see below */
 
@@ -1849,7 +1863,7 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
 void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 {
 	struct device *dev = msi_desc_to_dev(desc);
-	const struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	const struct iommu_domain *domain = iommu_dma_get_msi_mapping_domain(dev);
 	const struct iommu_dma_msi_page *msi_page;
 
 	msi_page = msi_desc_get_iommu_cookie(desc);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7bc8dff7cf6d..400df9ae7012 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -629,6 +629,8 @@ struct iommu_ops {
  * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
+ * @get_msi_mapping_domain: Return the related iommu_domain that should hold the
+ *                          MSI cookie and accept mapping(s).
  */
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
@@ -659,6 +661,8 @@ struct iommu_domain_ops {
 				  unsigned long quirks);
 
 	void (*free)(struct iommu_domain *domain);
+	struct iommu_domain *
+		(*get_msi_mapping_domain)(struct iommu_domain *domain);
 };
 
 /**
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH v2 10/10] iommu:riscv: support nested iommu for get_msi_mapping_domain operation
  2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
                   ` (8 preceding siblings ...)
  2024-06-14 14:21 ` [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains Zong Li
@ 2024-06-14 14:21 ` Zong Li
  9 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-14 14:21 UTC (permalink / raw)
  To: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: Zong Li

Return the iommu_domain that should hold the MSI cookie.

Signed-off-by: Zong Li <zong.li@sifive.com>
---
 drivers/iommu/riscv/iommu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index d08eb0a2939e..969a0ba32c9e 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -1689,10 +1689,22 @@ static int riscv_iommu_cache_invalidate_user(struct iommu_domain *domain,
 	return ret;
 }
 
+static struct iommu_domain *
+riscv_iommu_get_msi_mapping_domain(struct iommu_domain *domain)
+{
+	struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
+
+	if (riscv_domain->s2)
+		return &riscv_domain->s2->domain;
+
+	return domain;
+}
+
 static const struct iommu_domain_ops riscv_iommu_nested_domain_ops = {
 	.attach_dev	= riscv_iommu_attach_dev_nested,
 	.free		= riscv_iommu_domain_free_nested,
 	.cache_invalidate_user = riscv_iommu_cache_invalidate_user,
+	.get_msi_mapping_domain = riscv_iommu_get_msi_mapping_domain,
 };
 
 static int
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains
  2024-06-14 14:21 ` [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains Zong Li
@ 2024-06-14 18:12   ` Nicolin Chen
  2024-06-17  2:15     ` Zong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2024-06-14 18:12 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 14, 2024 at 10:21:55PM +0800, Zong Li wrote:
> From: Robin Murphy <robin.murphy@arm.com>
> 
> Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> which might need to be aware of the stage 2 domain encapsulated within
> a nested domain. This would be in the legacy-VFIO-style case where we're
> using host-managed MSIs with an identity mapping at stage 1, where it is
> the underlying stage 2 domain which owns an MSI cookie and holds the
> corresponding dynamic mappings. Hook up the new op to resolve what we
> need from a nested domain.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

I think there should be your Signed-off line at the end since you
act as a submitter :)

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support
  2024-06-14 14:21 ` [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support Zong Li
@ 2024-06-15  3:14   ` Baolu Lu
  2024-06-17 13:43     ` Zong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2024-06-15  3:14 UTC (permalink / raw)
  To: Zong Li, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: baolu.lu

On 6/14/24 10:21 PM, Zong Li wrote:
> Add iotlb_sync_map operation for flushing the IOTLB. Software must
> flush the IOTLB after each page table update.
> 
> Signed-off-by: Zong Li<zong.li@sifive.com>
> ---
>   drivers/iommu/riscv/Makefile |  1 +
>   drivers/iommu/riscv/iommu.c  | 11 +++++++++++
>   2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
> index d36625a1fd08..f02ce6ebfbd0 100644
> --- a/drivers/iommu/riscv/Makefile
> +++ b/drivers/iommu/riscv/Makefile
> @@ -1,3 +1,4 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o iommu-pmu.o
>   obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
> +obj-$(CONFIG_SIFIVE_IOMMU) += iommu-sifive.o
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 9aeb4b20c145..df7aeb2571ae 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -1115,6 +1115,16 @@ static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
>   	riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
>   }
>   
> +static int riscv_iommu_iotlb_sync_map(struct iommu_domain *iommu_domain,
> +				      unsigned long iova, size_t size)
> +{
> +	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> +
> +	riscv_iommu_iotlb_inval(domain, iova, iova + size - 1);

Does the RISC-V IOMMU architecture always cache the non-present or
erroneous translation entries? If so, can you please provide more
context in the commit message?

If not, why do you want to flush the cache when building a new
translation?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache
  2024-06-14 14:21 ` [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache Zong Li
@ 2024-06-15  3:22   ` Baolu Lu
  2024-06-17  2:16     ` Zong Li
  2024-06-19 16:17   ` Jason Gunthorpe
  1 sibling, 1 reply; 37+ messages in thread
From: Baolu Lu @ 2024-06-15  3:22 UTC (permalink / raw)
  To: Zong Li, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, jgg, kevin.tian, linux-kernel, iommu, linux-riscv
  Cc: baolu.lu

On 6/14/24 10:21 PM, Zong Li wrote:
> This patch implements cache_invalidate_user operation for the userspace
> to flush the hardware caches for a nested domain through iommufd.

$ grep "This patch" Documentation/process/submitting-patches.rst

Same to other commit messages.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains
  2024-06-14 18:12   ` Nicolin Chen
@ 2024-06-17  2:15     ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-17  2:15 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On Sat, Jun 15, 2024 at 2:12 AM Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> On Fri, Jun 14, 2024 at 10:21:55PM +0800, Zong Li wrote:
> > From: Robin Murphy <robin.murphy@arm.com>
> >
> > Currently, iommu-dma is the only place outside of IOMMUFD and drivers
> > which might need to be aware of the stage 2 domain encapsulated within
> > a nested domain. This would be in the legacy-VFIO-style case where we're
> > using host-managed MSIs with an identity mapping at stage 1, where it is
> > the underlying stage 2 domain which owns an MSI cookie and holds the
> > corresponding dynamic mappings. Hook up the new op to resolve what we
> > need from a nested domain.
> >
> > Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>
> I think there should be your Signed-off line at the end since you
> act as a submitter :)

Got it, I'll add it in the next version, thanks.

>
> Thanks
> Nicolin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache
  2024-06-15  3:22   ` Baolu Lu
@ 2024-06-17  2:16     ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-17  2:16 UTC (permalink / raw)
  To: Baolu Lu
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On Sat, Jun 15, 2024 at 11:24 AM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 6/14/24 10:21 PM, Zong Li wrote:
> > This patch implements cache_invalidate_user operation for the userspace
> > to flush the hardware caches for a nested domain through iommufd.
>
> $ grep "This patch" Documentation/process/submitting-patches.rst
>
> Same to other commit messages.
>

Thank you for your tips. I will modify them in the next version.

> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support
  2024-06-15  3:14   ` Baolu Lu
@ 2024-06-17 13:43     ` Zong Li
  2024-06-17 14:39       ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-17 13:43 UTC (permalink / raw)
  To: Baolu Lu
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On Sat, Jun 15, 2024 at 11:17 AM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 6/14/24 10:21 PM, Zong Li wrote:
> > Add iotlb_sync_map operation for flushing the IOTLB. Software must
> > flush the IOTLB after each page table update.
> >
> > Signed-off-by: Zong Li<zong.li@sifive.com>
> > ---
> >   drivers/iommu/riscv/Makefile |  1 +
> >   drivers/iommu/riscv/iommu.c  | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
> > index d36625a1fd08..f02ce6ebfbd0 100644
> > --- a/drivers/iommu/riscv/Makefile
> > +++ b/drivers/iommu/riscv/Makefile
> > @@ -1,3 +1,4 @@
> >   # SPDX-License-Identifier: GPL-2.0-only
> >   obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o iommu-pmu.o
> >   obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
> > +obj-$(CONFIG_SIFIVE_IOMMU) += iommu-sifive.o
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 9aeb4b20c145..df7aeb2571ae 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -1115,6 +1115,16 @@ static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
> >       riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
> >   }
> >
> > +static int riscv_iommu_iotlb_sync_map(struct iommu_domain *iommu_domain,
> > +                                   unsigned long iova, size_t size)
> > +{
> > +     struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > +
> > +     riscv_iommu_iotlb_inval(domain, iova, iova + size - 1);
>
> Does the RISC-V IOMMU architecture always cache the non-present or
> erroneous translation entries? If so, can you please provide more
> context in the commit message?
>
> If not, why do you want to flush the cache when building a new
> translation?
>

It seems to me that we can indeed remove this operation, because it
may be too aggressive given the following situation.

I added it for updating the MSI mapping when we change the irq
affinity of a pass-through device to another vCPU. The RISC-V IOMMU
spec allows MSI translation to go through the MSI flat table, MRIF, or
the normal page table. In the case of the normal page table, the MSI
mapping is created in the second-stage page table, mapping the GPA of
the guest's supervisor interrupt file to the HPA of host's guest
interrupt file. This MSI mapping needs to be updated when the HPA of
host's guest interrupt file is changed.

I think we can invalidate the cache after updating the MSI mapping,
rather than adding the iotlb_sync_map() operation for every mapping
created. Does it also make sense to you? If so, I will remove it in
the next version. Thanks.

> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support
  2024-06-17 13:43     ` Zong Li
@ 2024-06-17 14:39       ` Jason Gunthorpe
  2024-06-18  3:01         ` Zong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-17 14:39 UTC (permalink / raw)
  To: Zong Li
  Cc: Baolu Lu, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, kevin.tian, linux-kernel, iommu, linux-riscv

On Mon, Jun 17, 2024 at 09:43:35PM +0800, Zong Li wrote:

> I added it for updating the MSI mapping when we change the irq
> affinity of a pass-through device to another vCPU. The RISC-V IOMMU
> spec allows MSI translation to go through the MSI flat table, MRIF, or
> the normal page table. In the case of the normal page table, the MSI
> mapping is created in the second-stage page table, mapping the GPA of
> the guest's supervisor interrupt file to the HPA of host's guest
> interrupt file. This MSI mapping needs to be updated when the HPA of
> host's guest interrupt file is changed.

It sounds like more thought is needed for the MSI architecture, having
the host read the guest page table to mirror weird MSI stuff seems
kind of wrong..

The S2 really needs to have the correct physical MSI pages statically
at boot time.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support
  2024-06-14 14:21 ` [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support Zong Li
@ 2024-06-17 14:55   ` Jason Gunthorpe
  2024-06-18  1:14     ` Zong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-17 14:55 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 14, 2024 at 10:21:47PM +0800, Zong Li wrote:
> This patch implements the RISC-V IOMMU hardware performance monitor; it
> includes the counting and sampling modes.
>
> The specification doesn't define an event ID for counting the number of
> clock cycles, so there is no associated iohpmevt0. But we need an event
> for counting cycles in perf, so reserve the maximum event ID for it for now.

Why is this part of the nesting series?

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support
  2024-06-17 14:55   ` Jason Gunthorpe
@ 2024-06-18  1:14     ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-18  1:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Mon, Jun 17, 2024 at 10:55 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jun 14, 2024 at 10:21:47PM +0800, Zong Li wrote:
> > This patch implements the RISC-V IOMMU hardware performance monitor; it
> > includes the counting and sampling modes.
> >
> > The specification doesn't define an event ID for counting the number of
> > clock cycles, so there is no associated iohpmevt0. But we need an event
> > for counting cycles in perf, so reserve the maximum event ID for it for now.
>
> Why is this part of the nesting series?

As you mentioned, it should be a separate patch set; let me submit it
separately in the next version. Thanks.

>
> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support
  2024-06-17 14:39       ` Jason Gunthorpe
@ 2024-06-18  3:01         ` Zong Li
  2024-06-18 13:31           ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-18  3:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, kevin.tian, linux-kernel, iommu, linux-riscv

On Mon, Jun 17, 2024 at 10:39 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Jun 17, 2024 at 09:43:35PM +0800, Zong Li wrote:
>
> > I added it for updating the MSI mapping when we change the irq
> > affinity of a pass-through device to another vCPU. The RISC-V IOMMU
> > spec allows MSI translation to go through the MSI flat table, MRIF, or
> > the normal page table. In the case of the normal page table, the MSI
> > mapping is created in the second-stage page table, mapping the GPA of
> > the guest's supervisor interrupt file to the HPA of host's guest
> > interrupt file. This MSI mapping needs to be updated when the HPA of
> > host's guest interrupt file is changed.
>
> It sounds like more thought is needed for the MSI architecture, having
> the host read the guest page table to mirror weird MSI stuff seems
> kind of wrong..
>

Perhaps I should rephrase it. The host doesn't read the guest page
table. In a RISC-V system, MSIs are directed to a specific privilege
level of a specific hart, including a specific virtual hart. A hart's
IMSIC (Incoming MSI Controller) contains 'interrupt files' for these
privilege levels. For instance, if the target address of an MSI is the
address of the interrupt file for a specific supervisor-level hart,
then that hart's supervisor mode will receive the MSI. Furthermore,
when a hart implements the hypervisor extension, its IMSIC will have
interrupt files for virtual harts, called 'guest interrupt files'.
We first create the MSI mapping in the S2 page table at boot time. The
mapping goes from the GPA of the supervisor-level interrupt file (in
the guest's view, it thinks it uses a supervisor-level interrupt file)
to the HPA of the 'guest interrupt file' (in the host's view, the
device actually uses a guest interrupt file). When the vCPU is
migrated to another physical hart, the guest interrupt file should be
switched to one of the new physical hart's IMSIC's guest interrupt
files, which means that the HPA of this MSI mapping in the S2 page
table needs to be updated.

> The S2 really needs to have the correct physical MSI pages statically
> at boot time.
>
> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread
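
To make the remapping above concrete, here is a rough sketch, under
stated assumptions, of what switching a guest interrupt file amounts
to with the current in-kernel API: an unmap followed by a map of the
new HPA. The function and its GPA/HPA parameters are hypothetical,
not from the series; the unmap/map pair leaves a window with no
translation, which is exactly the gap the reply below points at when
it calls for a proper replace operation.

	/* Hypothetical sketch: remap one 4K guest interrupt file in the
	 * S2 domain when its vCPU migrates to another physical hart.
	 */
	static int remap_guest_intr_file(struct iommu_domain *s2_domain,
					 unsigned long gpa,
					 phys_addr_t new_hpa)
	{
		if (iommu_unmap(s2_domain, gpa, SZ_4K) != SZ_4K)
			return -EINVAL;

		return iommu_map(s2_domain, gpa, new_hpa, SZ_4K,
				 IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
	}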

* Re: [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support
  2024-06-18  3:01         ` Zong Li
@ 2024-06-18 13:31           ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-18 13:31 UTC (permalink / raw)
  To: Zong Li
  Cc: Baolu Lu, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, kevin.tian, linux-kernel, iommu, linux-riscv

On Tue, Jun 18, 2024 at 11:01:48AM +0800, Zong Li wrote:
> On Mon, Jun 17, 2024 at 10:39 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Jun 17, 2024 at 09:43:35PM +0800, Zong Li wrote:
> >
> > > I added it for updating the MSI mapping when we change the irq
> > > affinity of a pass-through device to another vCPU. The RISC-V IOMMU
> > > spec allows MSI translation to go through the MSI flat table, MRIF, or
> > > the normal page table. In the case of the normal page table, the MSI
> > > mapping is created in the second-stage page table, mapping the GPA of
> > > the guest's supervisor interrupt file to the HPA of host's guest
> > > interrupt file. This MSI mapping needs to be updated when the HPA of
> > > host's guest interrupt file is changed.
> >
> > It sounds like more thought is needed for the MSI architecture, having
> > the host read the guest page table to mirror weird MSI stuff seems
> > kind of wrong..
> 
> Perhaps I should rephrase it. The host doesn't read the guest page
> table. In a RISC-V system, MSIs are directed to a specific privilege
> level of a specific hart, including a specific virtual hart. A hart's
> IMSIC (Incoming MSI Controller) contains 'interrupt files' for these
> privilege levels. For instance, if the target address of an MSI is the
> address of the interrupt file for a specific supervisor-level hart,
> then that hart's supervisor mode will receive the MSI. Furthermore,
> when a hart implements the hypervisor extension, its IMSIC will have
> interrupt files for virtual harts, called 'guest interrupt files'.
> We first create the MSI mapping in the S2 page table at boot time. The
> mapping goes from the GPA of the supervisor-level interrupt file (in
> the guest's view, it thinks it uses a supervisor-level interrupt file)
> to the HPA of the 'guest interrupt file' (in the host's view, the
> device actually uses a guest interrupt file). When the vCPU is
> migrated to another physical hart, the guest interrupt file should be
> switched to one of the new physical hart's IMSIC's guest interrupt
> files, which means that the HPA of this MSI mapping in the S2 page
> table needs to be updated.

I am vaguely aware of these details, but it is good to hear them again.

However, none of that really explains why this is messing with
invalidation logic..

If you need to replace MSI pages in the S2 atomically as you migrate
vCPUs then you need a proper replace operation for the io page table.

map is supposed to fail if there are already mappings at that address,
you can't use it to replace existing mappings with something else.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information
  2024-06-14 14:21 ` [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information Zong Li
@ 2024-06-19 15:49   ` Jason Gunthorpe
  2024-06-21  7:32     ` Zong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-19 15:49 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 14, 2024 at 10:21:52PM +0800, Zong Li wrote:
> This patch implements .hw_info operation and the related data
> structures for passing the IOMMU hardware capabilities for iommufd.
> 
> Signed-off-by: Zong Li <zong.li@sifive.com>
> ---
>  drivers/iommu/riscv/iommu.c  | 20 ++++++++++++++++++++
>  include/uapi/linux/iommufd.h | 18 ++++++++++++++++++
>  2 files changed, 38 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

>  /**
>   * enum iommu_hw_info_type - IOMMU Hardware Info Types
>   * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
>   *                           info
>   * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> + * @IOMMU_HW_INFO_TYPE_RISCV_IOMMU: RISC-V iommu info type
>   */

Is there a more formal name than "RISCV IOMMU" here? It seems like you
will probably have a RISCV_IOMMU_V2 someday, is that naming OK?

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace
  2024-06-14 14:21 ` [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace Zong Li
@ 2024-06-19 16:02   ` Jason Gunthorpe
  2024-06-28  9:03     ` Zong Li
  2024-06-19 16:34   ` Joao Martins
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-19 16:02 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 14, 2024 at 10:21:53PM +0800, Zong Li wrote:
> This patch implements .domain_alloc_user operation for creating domains
> owend by userspace, e.g. through IOMMUFD. Add s2 domain for parent
> domain for second stage, s1 domain will be the first stage.
> 
> Don't remove IOMMU private data of dev in blocked domain, because it
> holds the user data of device, which is used when attaching device into
> s1 domain.
> 
> Signed-off-by: Zong Li <zong.li@sifive.com>
> ---
>  drivers/iommu/riscv/iommu.c  | 236 ++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/iommufd.h |  17 +++
>  2 files changed, 252 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 2130106e421f..410b236e9b24 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -846,6 +846,8 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
>  
>  /* This struct contains protection domain specific IOMMU driver data. */
>  struct riscv_iommu_domain {
> +	struct riscv_iommu_domain *s2;
> +	struct riscv_iommu_device *iommu;

IMHO you should create a riscv_iommu_domain_nested and not put these
here, like ARM did.

The kernel can't change the nested domain so it can't receive and
distribute invalidations.

> +/**
> + * Nested IOMMU operations
> + */
> +
> +static int riscv_iommu_attach_dev_nested(struct iommu_domain *domain, struct device *dev)
> +{
> +	struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> +
> +	/*
> +	 * Add bond to the new domain's list, but don't unlink in current domain.
> +	 * We need to flush entries in stage-2 page table by iterating the list.
> +	 */
> +	if (riscv_iommu_bond_link(riscv_domain, dev))
> +		return -ENOMEM;
> +
> +	riscv_iommu_iotlb_inval(riscv_domain, 0, ULONG_MAX);
> +	info->dc_user.ta |= RISCV_IOMMU_PC_TA_V;

Seems odd??

> +	riscv_iommu_iodir_update(iommu, dev, &info->dc_user);

This will be need some updating to allow changes that don't toggle
V=0, like in arm.

> +	info->domain = riscv_domain;
> +
> +	return 0;
> +}
> +
> +static void riscv_iommu_domain_free_nested(struct iommu_domain *domain)
> +{
> +	struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
> +	struct riscv_iommu_bond *bond;
> +
> +	/* Unlink bond in s2 domain, because we link bond both on s1 and s2 domain */
> +	list_for_each_entry_rcu(bond, &riscv_domain->s2->bonds, list)
> +		riscv_iommu_bond_unlink(riscv_domain->s2, bond->dev);
> +
> +	if ((int)riscv_domain->pscid > 0)
> +		ida_free(&riscv_iommu_pscids, riscv_domain->pscid);
> +
> +	kfree(riscv_domain);
> +}
> +
> +static const struct iommu_domain_ops riscv_iommu_nested_domain_ops = {
> +	.attach_dev	= riscv_iommu_attach_dev_nested,
> +	.free		= riscv_iommu_domain_free_nested,
> +};
> +
> +static int
> +riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
> +{
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +	struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> +	struct riscv_iommu_dc dc;
> +	struct riscv_iommu_fq_record event;
> +	u64 dc_len = sizeof(struct riscv_iommu_dc) >>
> +		     (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT));
> +	u64 event_len = sizeof(struct riscv_iommu_fq_record);
> +	void __user *event_user = NULL;
> +
> +	for (int i = 0; i < fwspec->num_ids; i++) {
> +		event.hdr =
> +			FIELD_PREP(RISCV_IOMMU_FQ_HDR_CAUSE, RISCV_IOMMU_FQ_CAUSE_DDT_INVALID) |
> +			FIELD_PREP(RISCV_IOMMU_FQ_HDR_DID, fwspec->ids[i]);
> +
> +		/* Sanity check DC of stage-1 from user data */
> +		if (!user_arg->out_event_uptr || user_arg->event_len != event_len)
> +			return -EINVAL;

This is not extensible, see below about just inlining it.

> +		event_user = u64_to_user_ptr(user_arg->out_event_uptr);
> +
> +		if (!user_arg->dc_uptr || user_arg->dc_len != dc_len)
> +			return -EINVAL;
> +
> +		if (copy_from_user(&dc, u64_to_user_ptr(user_arg->dc_uptr), dc_len))
> +			return -EFAULT;
> +
> +		if (!(dc.tc & RISCV_IOMMU_DDTE_V)) {
> +			dev_dbg(dev, "Invalid DDT from user data\n");
> +			if (copy_to_user(event_user, &event, event_len))
> +				return -EFAULT;
> +		}

On ARM we are going to support non-valid STEs. It should put the
translation into blocking and ideally emulate translation failure
events.

> +
> +		if (!dc.fsc || dc.iohgatp) {
> +			dev_dbg(dev, "Wrong page table from user data\n");
> +			if (copy_to_user(event_user, &event, event_len))
> +				return -EFAULT;
> +		}
> +
> +		/* Save DC of stage-1 from user data */
> +		memcpy(&info->dc_user,
> +		       riscv_iommu_get_dc(iommu, fwspec->ids[i]),

This does not seem right, the fwspec shouldn't be part of domain
allocation, even arguably for nesting. The nesting domain should
represent the user_dc only. Any customization of kernel controlled bits
should be done during attach only. Nor do I really understand why this
is looping over all the fwspecs but only memcpying the last one.

> +		       sizeof(struct riscv_iommu_dc));
> +		info->dc_user.fsc = dc.fsc;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct iommu_domain *
> +riscv_iommu_domain_alloc_nested(struct device *dev,
> +				struct iommu_domain *parent,
> +				const struct iommu_user_data *user_data)
> +{
> +	struct riscv_iommu_domain *s2_domain = iommu_domain_to_riscv(parent);
> +	struct riscv_iommu_domain *s1_domain;
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +	struct iommu_hwpt_riscv_iommu arg;
> +	int ret, va_bits;
> +
> +	if (user_data->type != IOMMU_HWPT_DATA_RISCV_IOMMU)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	if (parent->type != IOMMU_DOMAIN_UNMANAGED)
> +		return ERR_PTR(-EINVAL);
> +
> +	ret = iommu_copy_struct_from_user(&arg,
> +					  user_data,
> +					  IOMMU_HWPT_DATA_RISCV_IOMMU,
> +					  out_event_uptr);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	s1_domain = kzalloc(sizeof(*s1_domain), GFP_KERNEL);
> +	if (!s1_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	spin_lock_init(&s1_domain->lock);
> +	INIT_LIST_HEAD_RCU(&s1_domain->bonds);
> +
> +	s1_domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
> +					   RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
> +	if (s1_domain->pscid < 0) {
> +		iommu_free_page(s1_domain->pgd_root);
> +		kfree(s1_domain);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* Get device context of stage-1 from user*/
> +	ret = riscv_iommu_get_dc_user(dev, &arg);
> +	if (ret) {
> +		kfree(s1_domain);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	if (!iommu) {
> +		va_bits = VA_BITS;
> +	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57) {
> +		va_bits = 57;
> +	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48) {
> +		va_bits = 48;
> +	} else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39) {
> +		va_bits = 39;
> +	} else {
> +		dev_err(dev, "cannot find supported page table mode\n");
> +		return ERR_PTR(-ENODEV);
> +	}
> +
> +	/*
> +	 * The ops->domain_alloc_user could be directly called by the iommufd core,
> +	 * instead of iommu core. So, this function need to set the default value of
> +	 * following data member:
> +	 *  - domain->pgsize_bitmap
> +	 *  - domain->geometry
> +	 *  - domain->type
> +	 *  - domain->ops
> +	 */
> +	s1_domain->s2 = s2_domain;
> +	s1_domain->iommu = iommu;
> +	s1_domain->domain.type = IOMMU_DOMAIN_NESTED;
> +	s1_domain->domain.ops = &riscv_iommu_nested_domain_ops;
> +	s1_domain->domain.pgsize_bitmap = SZ_4K;
> +	s1_domain->domain.geometry.aperture_start = 0;
> +	s1_domain->domain.geometry.aperture_end = DMA_BIT_MASK(va_bits - 1);
> +	s1_domain->domain.geometry.force_aperture = true;

There is no geometry or page size of nesting domains.

> +
> +	return &s1_domain->domain;
> +}
> +
> +static struct iommu_domain *
> +riscv_iommu_domain_alloc_user(struct device *dev, u32 flags,
> +			      struct iommu_domain *parent,
> +			      const struct iommu_user_data *user_data)
> +{
> +	struct iommu_domain *domain;
> +	struct riscv_iommu_domain *riscv_domain;
> +
> +	/* Allocate stage-1 domain if it has stage-2 parent domain */
> +	if (parent)
> +		return riscv_iommu_domain_alloc_nested(dev, parent, user_data);
> +
> +	if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	if (user_data)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* domain_alloc_user op needs to be fully initialized */
> +	domain = iommu_domain_alloc(dev->bus);

Please organize your driver to avoid calling this core function
through a pointer and return the correct type from the start so you
don't have to cast.

> +	if (!domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * We assume that nest-parent or g-stage only will come here
> +	 * TODO: Shadow page table doesn't be supported now.
> +	 *       We currently can't distinguish g-stage and shadow
> +	 *       page table here. Shadow page table shouldn't be
> +	 *       put at stage-2.
> +	 */
> +	riscv_domain = iommu_domain_to_riscv(domain);
> +
> +	/* pgd_root may be allocated in .domain_alloc_paging */
> +	if (riscv_domain->pgd_root)
> +		iommu_free_page(riscv_domain->pgd_root);

And don't do stuff like this, if domain_alloc didn't do the right
stuff then reorganize it so that it does. Most likely pass in a flag
that you are allocating the nest so it can setup properly if it is
only a small change like this.

> +/**
> + * struct iommu_hwpt_riscv_iommu - RISCV IOMMU stage-1 device context table
> + *                                 info (IOMMU_HWPT_TYPE_RISCV_IOMMU)
> + * @dc_len: Length of device context
> + * @dc_uptr: User pointer to the address of device context
> + * @event_len: Length of an event record
> + * @out_event_uptr: User pointer to the address of event record
> + */
> +struct iommu_hwpt_riscv_iommu {
> +	__aligned_u64 dc_len;
> +	__aligned_u64 dc_uptr;

Do we really want this to be a pointer? ARM just inlined it in the
struct, why not do that?

> +	__aligned_u64 event_len;
> +	__aligned_u64 out_event_uptr;
> +};

Similar here too, why not just inline the response memory?

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread
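
For reference, the riscv_iommu_domain_nested split suggested above
would look roughly like ARM's arm_smmu_nested_domain: a dedicated type
carrying only the stage-2 parent and the user-provided device context.
A minimal sketch, with field names that are illustrative rather than
taken from the series:

	struct riscv_iommu_domain_nested {
		struct iommu_domain domain;
		struct riscv_iommu_domain *s2;	/* kernel-owned parent */
		struct riscv_iommu_dc dc_user;	/* user-provided DC bits */
	};

	static inline struct riscv_iommu_domain_nested *
	to_riscv_nested_domain(struct iommu_domain *domain)
	{
		return container_of(domain,
				    struct riscv_iommu_domain_nested, domain);
	}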

* Re: [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache
  2024-06-14 14:21 ` [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache Zong Li
  2024-06-15  3:22   ` Baolu Lu
@ 2024-06-19 16:17   ` Jason Gunthorpe
  2024-06-28  8:19     ` Zong Li
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-19 16:17 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 14, 2024 at 10:21:54PM +0800, Zong Li wrote:
> This patch implements cache_invalidate_user operation for the userspace
> to flush the hardware caches for a nested domain through iommufd.
> 
> Signed-off-by: Zong Li <zong.li@sifive.com>
> ---
>  drivers/iommu/riscv/iommu.c  | 90 ++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/iommufd.h | 11 +++++
>  2 files changed, 97 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 410b236e9b24..d08eb0a2939e 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -1587,8 +1587,9 @@ static int riscv_iommu_attach_dev_nested(struct iommu_domain *domain, struct dev
>  	if (riscv_iommu_bond_link(riscv_domain, dev))
>  		return -ENOMEM;
>  
> -	riscv_iommu_iotlb_inval(riscv_domain, 0, ULONG_MAX);
> -	info->dc_user.ta |= RISCV_IOMMU_PC_TA_V;
> +	if (riscv_iommu_bond_link(info->domain, dev))
> +		return -ENOMEM;

?? Is this in the wrong patch then? Confused

>  	riscv_iommu_iodir_update(iommu, dev, &info->dc_user);
>  
>  	info->domain = riscv_domain;
> @@ -1611,13 +1612,92 @@ static void riscv_iommu_domain_free_nested(struct iommu_domain *domain)
>  	kfree(riscv_domain);
>  }
>  
> +static int riscv_iommu_fix_user_cmd(struct riscv_iommu_command *cmd,
> +				    unsigned int pscid, unsigned int gscid)
> +{
> +	u32 opcode = FIELD_GET(RISCV_IOMMU_CMD_OPCODE, cmd->dword0);
> +
> +	switch (opcode) {
> +	case RISCV_IOMMU_CMD_IOTINVAL_OPCODE:
> +		u32 func = FIELD_GET(RISCV_IOMMU_CMD_FUNC, cmd->dword0);
> +
> +		if (func != RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA &&
> +		    func != RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA) {
> +			pr_warn("The IOTINVAL function: 0x%x is not supported\n",
> +				func);
> +			return -EOPNOTSUPP;
> +		}
> +
> +		if (func == RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA) {
> +			cmd->dword0 &= ~RISCV_IOMMU_CMD_FUNC;
> +			cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_FUNC,
> +						  RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
> +		}
> +
> +		cmd->dword0 &= ~(RISCV_IOMMU_CMD_IOTINVAL_PSCID |
> +				 RISCV_IOMMU_CMD_IOTINVAL_GSCID);
> +		riscv_iommu_cmd_inval_set_pscid(cmd, pscid);
> +		riscv_iommu_cmd_inval_set_gscid(cmd, gscid);
> +		break;
> +	case RISCV_IOMMU_CMD_IODIR_OPCODE:
> +		/*
> +		 * Ensure the device ID is correct. We expect that the VMM has
> +		 * already translated the device ID from the guest's to the host's.
> +		 */

I'm not sure what this remark means, but I expect you will need to
translate any device IDs from virtual to physical.

>  
>  static int
> -riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
> +riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg,
> +			struct riscv_iommu_domain *s1_domain)
>  {
>  	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
>  	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> @@ -1663,6 +1743,8 @@ riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_
>  		       riscv_iommu_get_dc(iommu, fwspec->ids[i]),
>  		       sizeof(struct riscv_iommu_dc));
>  		info->dc_user.fsc = dc.fsc;
> +		info->dc_user.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, s1_domain->pscid) |
> +					      RISCV_IOMMU_PC_TA_V;
>  	}

It is really weird that the s1 domain has any kind of id. What is the
PSCID? Is it analogous to VMID on ARM?

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace
  2024-06-14 14:21 ` [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace Zong Li
  2024-06-19 16:02   ` Jason Gunthorpe
@ 2024-06-19 16:34   ` Joao Martins
  2024-06-21  7:34     ` Zong Li
  1 sibling, 1 reply; 37+ messages in thread
From: Joao Martins @ 2024-06-19 16:34 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On 14/06/2024 15:21, Zong Li wrote:
> +static struct iommu_domain *
> +riscv_iommu_domain_alloc_user(struct device *dev, u32 flags,
> +			      struct iommu_domain *parent,
> +			      const struct iommu_user_data *user_data)
> +{
> +	struct iommu_domain *domain;
> +	struct riscv_iommu_domain *riscv_domain;
> +
> +	/* Allocate stage-1 domain if it has stage-2 parent domain */
> +	if (parent)
> +		return riscv_iommu_domain_alloc_nested(dev, parent, user_data);
> +
> +	if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
> +		return ERR_PTR(-EOPNOTSUPP);
> +

IOMMU_HWPT_ALLOC_DIRTY_TRACKING flag check should be dropped if it's not
supported in code (which looks to be the case in your series) e.g.

	if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT)))
		return ERR_PTR(-EOPNOTSUPP);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information
  2024-06-19 15:49   ` Jason Gunthorpe
@ 2024-06-21  7:32     ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-21  7:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Wed, Jun 19, 2024 at 11:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jun 14, 2024 at 10:21:52PM +0800, Zong Li wrote:
> > This patch implements .hw_info operation and the related data
> > structures for passing the IOMMU hardware capabilities for iommufd.
> >
> > Signed-off-by: Zong Li <zong.li@sifive.com>
> > ---
> >  drivers/iommu/riscv/iommu.c  | 20 ++++++++++++++++++++
> >  include/uapi/linux/iommufd.h | 18 ++++++++++++++++++
> >  2 files changed, 38 insertions(+)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>
> >  /**
> >   * enum iommu_hw_info_type - IOMMU Hardware Info Types
> >   * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
> >   *                           info
> >   * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> > + * @IOMMU_HW_INFO_TYPE_RISCV_IOMMU: RISC-V iommu info type
> >   */
>
> Is there a more formal name than "RISCV IOMMU" here? It seems like you
> will probably have a RISCV_IOMMU_V2 someday, is that naming OK?
>

RISC-V IOMMU currently doesn't have another name. If we are really
unfortunate enough to need to overhaul the entire structure, maybe we
can give it another proper name. But personally, RISCV_IOMMU_V2 is
fine with me.

I have also taken a look at the comments on the other patches; please
allow me a moment to digest them, and I will reply later. Thanks.

> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace
  2024-06-19 16:34   ` Joao Martins
@ 2024-06-21  7:34     ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-06-21  7:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On Thu, Jun 20, 2024 at 12:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 14/06/2024 15:21, Zong Li wrote:
> > +static struct iommu_domain *
> > +riscv_iommu_domain_alloc_user(struct device *dev, u32 flags,
> > +                           struct iommu_domain *parent,
> > +                           const struct iommu_user_data *user_data)
> > +{
> > +     struct iommu_domain *domain;
> > +     struct riscv_iommu_domain *riscv_domain;
> > +
> > +     /* Allocate stage-1 domain if it has stage-2 parent domain */
> > +     if (parent)
> > +             return riscv_iommu_domain_alloc_nested(dev, parent, user_data);
> > +
> > +     if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
> > +             return ERR_PTR(-EOPNOTSUPP);
> > +
>
> IOMMU_HWPT_ALLOC_DIRTY_TRACKING flag check should be dropped if it's not
> supported in code (which looks to be the case in your series) e.g.

Thanks for your tips; I will remove it in the next version.

>
>         if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT)))
>                 return ERR_PTR(-EOPNOTSUPP);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache
  2024-06-19 16:17   ` Jason Gunthorpe
@ 2024-06-28  8:19     ` Zong Li
  2024-06-28 22:26       ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-28  8:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Thu, Jun 20, 2024 at 12:17 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jun 14, 2024 at 10:21:54PM +0800, Zong Li wrote:
> > This patch implements cache_invalidate_user operation for the userspace
> > to flush the hardware caches for a nested domain through iommufd.
> >
> > Signed-off-by: Zong Li <zong.li@sifive.com>
> > ---
> >  drivers/iommu/riscv/iommu.c  | 90 ++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/iommufd.h | 11 +++++
> >  2 files changed, 97 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 410b236e9b24..d08eb0a2939e 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -1587,8 +1587,9 @@ static int riscv_iommu_attach_dev_nested(struct iommu_domain *domain, struct dev
> >       if (riscv_iommu_bond_link(riscv_domain, dev))
> >               return -ENOMEM;
> >
> > -     riscv_iommu_iotlb_inval(riscv_domain, 0, ULONG_MAX);
> > -     info->dc_user.ta |= RISCV_IOMMU_PC_TA_V;
> > +     if (riscv_iommu_bond_link(info->domain, dev))
> > +             return -ENOMEM;
>
> ?? Is this in the wrong patch then? Confused

Yes, it should be in the 7th patch in this series. I will fix it in the next version.

>
> >       riscv_iommu_iodir_update(iommu, dev, &info->dc_user);
> >
> >       info->domain = riscv_domain;
> > @@ -1611,13 +1612,92 @@ static void riscv_iommu_domain_free_nested(struct iommu_domain *domain)
> >       kfree(riscv_domain);
> >  }
> >
> > +static int riscv_iommu_fix_user_cmd(struct riscv_iommu_command *cmd,
> > +                                 unsigned int pscid, unsigned int gscid)
> > +{
> > +     u32 opcode = FIELD_GET(RISCV_IOMMU_CMD_OPCODE, cmd->dword0);
> > +
> > +     switch (opcode) {
> > +     case RISCV_IOMMU_CMD_IOTINVAL_OPCODE:
> > +             u32 func = FIELD_GET(RISCV_IOMMU_CMD_FUNC, cmd->dword0);
> > +
> > +             if (func != RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA &&
> > +                 func != RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA) {
> > +                     pr_warn("The IOTINVAL function: 0x%x is not supported\n",
> > +                             func);
> > +                     return -EOPNOTSUPP;
> > +             }
> > +
> > +             if (func == RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA) {
> > +                     cmd->dword0 &= ~RISCV_IOMMU_CMD_FUNC;
> > +                     cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_FUNC,
> > +                                               RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
> > +             }
> > +
> > +             cmd->dword0 &= ~(RISCV_IOMMU_CMD_IOTINVAL_PSCID |
> > +                              RISCV_IOMMU_CMD_IOTINVAL_GSCID);
> > +             riscv_iommu_cmd_inval_set_pscid(cmd, pscid);
> > +             riscv_iommu_cmd_inval_set_gscid(cmd, gscid);
> > +             break;
> > +     case RISCV_IOMMU_CMD_IODIR_OPCODE:
> > +             /*
> > +              * Ensure the device ID is correct. We expect that the VMM has
> > +              * already translated the device ID from the guest's to the host's.
> > +              */
>
> I'm not sure what this remark means, but I expect you will need to
> translate any device IDs from virtual to physical.

I think we need some data structure to map it. I didn't do that here
because our internal implementation translates to the right ID in the
VMM, but as you mentioned, we can't expect that the VMM will do that
for the kernel.

>
> >
> >  static int
> > -riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
> > +riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg,
> > +                     struct riscv_iommu_domain *s1_domain)
> >  {
> >       struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> >       struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > @@ -1663,6 +1743,8 @@ riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_
> >                      riscv_iommu_get_dc(iommu, fwspec->ids[i]),
> >                      sizeof(struct riscv_iommu_dc));
> >               info->dc_user.fsc = dc.fsc;
> > +             info->dc_user.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, s1_domain->pscid) |
> > +                                           RISCV_IOMMU_PC_TA_V;
> >       }
>
> It is really weird that the s1 domain has any kind of id. What is the
> PSCID? Is it analogous to VMID on ARM?

I think the VMID is closer to the GSCID. The PSCID might be more like
the ASID, as it is used as the address space ID for the process
identified by the first-stage page table.
The GSCID is used to tag the G-stage TLB, the PSCID is used to tag the
single-stage TLB, and the tuple {GSCID, PSCID} is used to tag the
VS-stage TLB. The IOTINVAL.VMA command can flush the mapping by
matching GSCID only, PSCID only, or the tuple {GSCID, PSCID}. Consider
the situation where two devices are passed through to a guest: we will
then have two s1 domains under the same s2 domain, and we can flush
their mappings by {GSCID, PSCID} and {GSCID, PSCID'} respectively.

>
> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread
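
To illustrate the tagging described above, here is a hedged sketch of
a {GSCID, PSCID}-scoped flush built from the command helpers this
series relies on (the helper names are assumed from the underlying
RISC-V IOMMU driver; the send/sync pattern follows the invalidation
path in patch 08):

	static void flush_one_s1_domain(struct riscv_iommu_device *iommu,
					unsigned int gscid,
					unsigned int pscid)
	{
		struct riscv_iommu_command cmd;

		/* IOTINVAL.VMA matching the {GSCID, PSCID} tuple flushes
		 * only this s1 domain's VS-stage entries; a sibling s1
		 * domain under the same s2 keeps its entries because its
		 * PSCID differs.
		 */
		riscv_iommu_cmd_inval_vma(&cmd);
		riscv_iommu_cmd_inval_set_gscid(&cmd, gscid);
		riscv_iommu_cmd_inval_set_pscid(&cmd, pscid);
		riscv_iommu_cmd_send(iommu, &cmd);
		riscv_iommu_cmd_sync(iommu, RISCV_IOMMU_IOTINVAL_TIMEOUT);
	}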

* Re: [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace
  2024-06-19 16:02   ` Jason Gunthorpe
@ 2024-06-28  9:03     ` Zong Li
  2024-06-28 22:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Zong Li @ 2024-06-28  9:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Thu, Jun 20, 2024 at 12:02 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jun 14, 2024 at 10:21:53PM +0800, Zong Li wrote:
> > This patch implements .domain_alloc_user operation for creating domains
> > owend by userspace, e.g. through IOMMUFD. Add s2 domain for parent
> > domain for second stage, s1 domain will be the first stage.
> >
> > Don't remove IOMMU private data of dev in blocked domain, because it
> > holds the user data of device, which is used when attaching device into
> > s1 domain.
> >
> > Signed-off-by: Zong Li <zong.li@sifive.com>
> > ---
> >  drivers/iommu/riscv/iommu.c  | 236 ++++++++++++++++++++++++++++++++++-
> >  include/uapi/linux/iommufd.h |  17 +++
> >  2 files changed, 252 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 2130106e421f..410b236e9b24 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -846,6 +846,8 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> >
> >  /* This struct contains protection domain specific IOMMU driver data. */
> >  struct riscv_iommu_domain {
> > +     struct riscv_iommu_domain *s2;
> > +     struct riscv_iommu_device *iommu;
>
> IMHO you should create a riscv_iommu_domain_nested and not put these
> here, like ARM did.
>
> The kernel can't change the nested domain so it can't receive and
> distribute invalidations.

Ok, as you mentioned, there are many data elements in that data
structure that won't be used in the s1 domain.

>
> > +/**
> > + * Nested IOMMU operations
> > + */
> > +
> > +static int riscv_iommu_attach_dev_nested(struct iommu_domain *domain, struct device *dev)
> > +{
> > +     struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +     struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> > +
> > +     /*
> > +      * Add bond to the new domain's list, but don't unlink in current domain.
> > +      * We need to flush entries in stage-2 page table by iterating the list.
> > +      */
> > +     if (riscv_iommu_bond_link(riscv_domain, dev))
> > +             return -ENOMEM;
> > +
> > +     riscv_iommu_iotlb_inval(riscv_domain, 0, ULONG_MAX);
> > +     info->dc_user.ta |= RISCV_IOMMU_PC_TA_V;
>
> Seems odd??
>
> > +     riscv_iommu_iodir_update(iommu, dev, &info->dc_user);
>
> This will need some updating to allow changes that don't toggle
> V=0, like in arm.

I think the right code snippet is in the 8th patch, as you pointed out
there. I will correct it in the next version.

>
> > +     info->domain = riscv_domain;
> > +
> > +     return 0;
> > +}
> > +
> > +static void riscv_iommu_domain_free_nested(struct iommu_domain *domain)
> > +{
> > +     struct riscv_iommu_domain *riscv_domain = iommu_domain_to_riscv(domain);
> > +     struct riscv_iommu_bond *bond;
> > +
> > +     /* Unlink bond in s2 domain, because we link bond both on s1 and s2 domain */
> > +     list_for_each_entry_rcu(bond, &riscv_domain->s2->bonds, list)
> > +             riscv_iommu_bond_unlink(riscv_domain->s2, bond->dev);
> > +
> > +     if ((int)riscv_domain->pscid > 0)
> > +             ida_free(&riscv_iommu_pscids, riscv_domain->pscid);
> > +
> > +     kfree(riscv_domain);
> > +}
> > +
> > +static const struct iommu_domain_ops riscv_iommu_nested_domain_ops = {
> > +     .attach_dev     = riscv_iommu_attach_dev_nested,
> > +     .free           = riscv_iommu_domain_free_nested,
> > +};
> > +
> > +static int
> > +riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
> > +{
> > +     struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +     struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> > +     struct riscv_iommu_dc dc;
> > +     struct riscv_iommu_fq_record event;
> > +     u64 dc_len = sizeof(struct riscv_iommu_dc) >>
> > +                  (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT));
> > +     u64 event_len = sizeof(struct riscv_iommu_fq_record);
> > +     void __user *event_user = NULL;
> > +
> > +     for (int i = 0; i < fwspec->num_ids; i++) {
> > +             event.hdr =
> > +                     FIELD_PREP(RISCV_IOMMU_FQ_HDR_CAUSE, RISCV_IOMMU_FQ_CAUSE_DDT_INVALID) |
> > +                     FIELD_PREP(RISCV_IOMMU_FQ_HDR_DID, fwspec->ids[i]);
> > +
> > +             /* Sanity check DC of stage-1 from user data */
> > +             if (!user_arg->out_event_uptr || user_arg->event_len != event_len)
> > +                     return -EINVAL;
>
> This is not extensible, see below about just inlining it.
>
> > +             event_user = u64_to_user_ptr(user_arg->out_event_uptr);
> > +
> > +             if (!user_arg->dc_uptr || user_arg->dc_len != dc_len)
> > +                     return -EINVAL;
> > +
> > +             if (copy_from_user(&dc, u64_to_user_ptr(user_arg->dc_uptr), dc_len))
> > +                     return -EFAULT;
> > +
> > +             if (!(dc.tc & RISCV_IOMMU_DDTE_V)) {
> > +                     dev_dbg(dev, "Invalid DDT from user data\n");
> > +                     if (copy_to_user(event_user, &event, event_len))
> > +                             return -EFAULT;
> > +             }
>
> On ARM we are going to support non-valid STEs. It should put the
> translation into blocking and ideally emulate translation failure
> events.

Ok, let me consider this situation in next version.

>
> > +
> > +             if (!dc.fsc || dc.iohgatp) {
> > +                     dev_dbg(dev, "Wrong page table from user data\n");
> > +                     if (copy_to_user(event_user, &event, event_len))
> > +                             return -EFAULT;
> > +             }
> > +
> > +             /* Save DC of stage-1 from user data */
> > +             memcpy(&info->dc_user,
> > +                    riscv_iommu_get_dc(iommu, fwspec->ids[i]),
>
> This does not seem right, the fwspec shouldn't be part of domain
> allocation, even arguably for nesting. The nesting domain should
> represent the user_dc only. Any customization of kernel controlled bits
> should be done during attach only, nor do I really understand why this
> is looping over all the fwspecs but only memcpying the last one...
>

The fwspec is used to get the value of the current DC, because we want
to also back up the address of the second-stage table (i.e. iohgatp).
The reason is that this value is cleaned when the device is attached to
the blocking domain, so by the time the device attaches to the s1
domain, we can't get the original value of iohgatp anymore.
For the issue of only memcpying the last one, I will fix it in the next
version; we might need to allocate multiple user_dc entries at runtime,
because we don't statically know how many IDs will be used in one
platform device. Does that make sense to you?

> > +                    sizeof(struct riscv_iommu_dc));
> > +             info->dc_user.fsc = dc.fsc;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static struct iommu_domain *
> > +riscv_iommu_domain_alloc_nested(struct device *dev,
> > +                             struct iommu_domain *parent,
> > +                             const struct iommu_user_data *user_data)
> > +{
> > +     struct riscv_iommu_domain *s2_domain = iommu_domain_to_riscv(parent);
> > +     struct riscv_iommu_domain *s1_domain;
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +     struct iommu_hwpt_riscv_iommu arg;
> > +     int ret, va_bits;
> > +
> > +     if (user_data->type != IOMMU_HWPT_DATA_RISCV_IOMMU)
> > +             return ERR_PTR(-EOPNOTSUPP);
> > +
> > +     if (parent->type != IOMMU_DOMAIN_UNMANAGED)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     ret = iommu_copy_struct_from_user(&arg,
> > +                                       user_data,
> > +                                       IOMMU_HWPT_DATA_RISCV_IOMMU,
> > +                                       out_event_uptr);
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     s1_domain = kzalloc(sizeof(*s1_domain), GFP_KERNEL);
> > +     if (!s1_domain)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     spin_lock_init(&s1_domain->lock);
> > +     INIT_LIST_HEAD_RCU(&s1_domain->bonds);
> > +
> > +     s1_domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
> > +                                        RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
> > +     if (s1_domain->pscid < 0) {
> > +             iommu_free_page(s1_domain->pgd_root);
> > +             kfree(s1_domain);
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> > +
> > +     /* Get device context of stage-1 from user*/
> > +     ret = riscv_iommu_get_dc_user(dev, &arg);
> > +     if (ret) {
> > +             kfree(s1_domain);
> > +             return ERR_PTR(-EINVAL);
> > +     }
> > +
> > +     if (!iommu) {
> > +             va_bits = VA_BITS;
> > +     } else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV57) {
> > +             va_bits = 57;
> > +     } else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV48) {
> > +             va_bits = 48;
> > +     } else if (iommu->caps & RISCV_IOMMU_CAPABILITIES_SV39) {
> > +             va_bits = 39;
> > +     } else {
> > +             dev_err(dev, "cannot find supported page table mode\n");
> > +             return ERR_PTR(-ENODEV);
> > +     }
> > +
> > +     /*
> > +      * The ops->domain_alloc_user could be called directly by the iommufd core,
> > +      * instead of the iommu core. So, this function needs to set the default
> > +      * values of the following data members:
> > +      *  - domain->pgsize_bitmap
> > +      *  - domain->geometry
> > +      *  - domain->type
> > +      *  - domain->ops
> > +      */
> > +     s1_domain->s2 = s2_domain;
> > +     s1_domain->iommu = iommu;
> > +     s1_domain->domain.type = IOMMU_DOMAIN_NESTED;
> > +     s1_domain->domain.ops = &riscv_iommu_nested_domain_ops;
> > +     s1_domain->domain.pgsize_bitmap = SZ_4K;
> > +     s1_domain->domain.geometry.aperture_start = 0;
> > +     s1_domain->domain.geometry.aperture_end = DMA_BIT_MASK(va_bits - 1);
> > +     s1_domain->domain.geometry.force_aperture = true;
>
> There is no geometry or page size of nesting domains.
>

Thanks for pointing it out. I will fix it in the next version.

> > +
> > +     return &s1_domain->domain;
> > +}
> > +
> > +static struct iommu_domain *
> > +riscv_iommu_domain_alloc_user(struct device *dev, u32 flags,
> > +                           struct iommu_domain *parent,
> > +                           const struct iommu_user_data *user_data)
> > +{
> > +     struct iommu_domain *domain;
> > +     struct riscv_iommu_domain *riscv_domain;
> > +
> > +     /* Allocate stage-1 domain if it has stage-2 parent domain */
> > +     if (parent)
> > +             return riscv_iommu_domain_alloc_nested(dev, parent, user_data);
> > +
> > +     if (flags & ~((IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING)))
> > +             return ERR_PTR(-EOPNOTSUPP);
> > +
> > +     if (user_data)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     /* domain_alloc_user op needs to be fully initialized */
> > +     domain = iommu_domain_alloc(dev->bus);
>
> Please organize your driver to avoid calling this core function
> through a pointer and return the correct type from the start so you
> don't have to cast.

Ok, let me modify this part. Thanks.

>
> > +     if (!domain)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     /*
> > +      * We assume that only a nest-parent or g-stage will come here.
> > +      * TODO: Shadow page tables aren't supported yet.
> > +      *       We currently can't distinguish a g-stage and a shadow
> > +      *       page table here. A shadow page table shouldn't be
> > +      *       put at stage-2.
> > +      */
> > +     riscv_domain = iommu_domain_to_riscv(domain);
> > +
> > +     /* pgd_root may be allocated in .domain_alloc_paging */
> > +     if (riscv_domain->pgd_root)
> > +             iommu_free_page(riscv_domain->pgd_root);
>
> And don't do stuff like this, if domain_alloc didn't do the right
> stuff then reorganize it so that it does. Most likely pass in a flag
> that you are allocating the nest so it can setup properly if it is
> only a small change like this.

Yes, if we don't rely on the original domain allocation, it won't be
as weird as it is now.

>
> > +/**
> > + * struct iommu_hwpt_riscv_iommu - RISCV IOMMU stage-1 device context table
> > + *                                 info (IOMMU_HWPT_TYPE_RISCV_IOMMU)
> > + * @dc_len: Length of device context
> > + * @dc_uptr: User pointer to the address of device context
> > + * @event_len: Length of an event record
> > + * @out_event_uptr: User pointer to the address of event record
> > + */
> > +struct iommu_hwpt_riscv_iommu {
> > +     __aligned_u64 dc_len;
> > +     __aligned_u64 dc_uptr;
>
> Do we really want this to be a pointer? ARM just inlined it in the
> struct, why not do that?
>
> > +     __aligned_u64 event_len;
> > +     __aligned_u64 out_event_uptr;
> > +};
>
> Similar here too, why not just inline the response memory?

I think we can just inline them, just like what we do in
'iommu_hwpt_riscv_iommu_invalidate'. Do I understand correctly?

>
> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache
  2024-06-28  8:19     ` Zong Li
@ 2024-06-28 22:26       ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-28 22:26 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 28, 2024 at 04:19:28PM +0800, Zong Li wrote:

> > > +     case RISCV_IOMMU_CMD_IODIR_OPCODE:
> > > +             /*
> > > +              * Ensure the device ID is right. We expect that VMM has
> > > +              * transferred the device ID to host's from guest's.
> > > +              */
> >
> > I'm not sure what this remark means, but I expect you will need to
> > translate any device IDs from virtual to physical.
> 
> I think we need some data structure to map it. I didn't do that here
> because our internal implementation translates to the right ID in the
> VMM, but as you mentioned, we can't expect that the VMM will do that
> for the kernel.

Yes, you need the viommu stuff Nicolin is working on to hold the
translation, same as the ARM driver.

In the meantime you can't support this invalidation opcode.
 
> > >  static int
> > > -riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg)
> > > +riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_arg,
> > > +                     struct riscv_iommu_domain *s1_domain)
> > >  {
> > >       struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > >       struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > > @@ -1663,6 +1743,8 @@ riscv_iommu_get_dc_user(struct device *dev, struct iommu_hwpt_riscv_iommu *user_
> > >                      riscv_iommu_get_dc(iommu, fwspec->ids[i]),
> > >                      sizeof(struct riscv_iommu_dc));
> > >               info->dc_user.fsc = dc.fsc;
> > > +             info->dc_user.ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, s1_domain->pscid) |
> > > +                                           RISCV_IOMMU_PC_TA_V;
> > >       }
> >
> > It is really weird that the s1 domain has any kind of id. What is the
> > PSCID? Is it analogous to VMID on ARM?
> 
> I think the VMID is closer to the GSCID. The PSCID might be more like
> the ASID, as it is used as the address space ID for the process
> identified by the first-stage page table.

That does sound like the ASID, but I would expect this to work by
using the VM-provided PSCID and just flowing the PSCID through
transparently during the invalidation.

Why have the kernel allocate and override a PSCID when the PSCID is
scoped by the GSCID and can be safely delegated to the VM?
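
Concretely, the nested attach path could take the guest-chosen PSCID
straight out of the user DC instead of calling ida_alloc_range() (a
sketch, assuming the nested domain keeps a copy of the user DC):

        /* The PSCID is scoped by the GSCID, so the VM's value is safe as-is */
        s1_domain->pscid = FIELD_GET(RISCV_IOMMU_PC_TA_PSCID, dc.ta);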

This is going to be necessary if you ever want to support the direct
invalidate queues like ARM/AMD have already as it will not be
desirable to translate the PSCID on that performance path.

It will also be necessary to implement the viommu invalidation path
since there is no domain there, which is needed for the ATS as above.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace
  2024-06-28  9:03     ` Zong Li
@ 2024-06-28 22:32       ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2024-06-28 22:32 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	kevin.tian, linux-kernel, iommu, linux-riscv

On Fri, Jun 28, 2024 at 05:03:41PM +0800, Zong Li wrote:
> > > +
> > > +             if (!dc.fsc || dc.iohgatp) {
> > > +                     dev_dbg(dev, "Wrong page table from user data\n");
> > > +                     if (copy_to_user(event_user, &event, event_len))
> > > +                             return -EFAULT;
> > > +             }
> > > +
> > > +             /* Save DC of stage-1 from user data */
> > > +             memcpy(&info->dc_user,
> > > +                    riscv_iommu_get_dc(iommu, fwspec->ids[i]),
> >
> > This does not seem right, the fwspec shouldn't be part of domain
> > allocation, even arguably for nesting. The nesting domain should
> > represent the user_dc only. Any customization of kernel controlled bits
> > should be done during attach only, nor do I really understand why this
> > is looping over all the fwspecs but only memcpying the last one...
> >
> 
> The fwspec is used to get the value of the current DC, because we want
> to also back up the address of the second-stage table (i.e. iohgatp).
> The reason is that this value is cleaned when the device is attached to
> the blocking domain, so by the time the device attaches to the s1
> domain, we can't get the original value of iohgatp anymore.

This is wrong; you get the value of iohgatp from the S2 domain, which
the nest knows directly. You must never make assumptions about domain
attach order or rely on the current value of the HW tables to
construct any attachment.

Follow the design like ARM has now where the value of the device table
entry is computed wholly from scratch using only the contents of the
domain pointer, including combining the S1 and S2 domain information.
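
A rough sketch of that shape (the nested domain handling is
hypothetical, and riscv_iommu_domain_atp() is assumed to be the
existing helper that encodes a domain's page-table pointer):

static void riscv_iommu_make_nested_dc(struct riscv_iommu_dc *dc,
                                       struct riscv_iommu_domain *s2,
                                       const struct riscv_iommu_dc *user_dc)
{
        memset(dc, 0, sizeof(*dc));
        /* First stage comes only from the user-provided DC */
        dc->fsc = user_dc->fsc;
        dc->ta = user_dc->ta;
        /* Second stage comes only from the s2 domain */
        dc->iohgatp = riscv_iommu_domain_atp(s2);
        /* Kernel-controlled bits are set here, at attach time */
        dc->tc = RISCV_IOMMU_DC_TC_V;
}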

And then you need to refactor and use the programmer I wrote for ARM
to be able to do the correct hitless transitions without a V=0
step. It is not too hard but will clean this all up.

> > > +/**
> > > + * struct iommu_hwpt_riscv_iommu - RISCV IOMMU stage-1 device context table
> > > + *                                 info (IOMMU_HWPT_TYPE_RISCV_IOMMU)
> > > + * @dc_len: Length of device context
> > > + * @dc_uptr: User pointer to the address of device context
> > > + * @event_len: Length of an event record
> > > + * @out_event_uptr: User pointer to the address of event record
> > > + */
> > > +struct iommu_hwpt_riscv_iommu {
> > > +     __aligned_u64 dc_len;
> > > +     __aligned_u64 dc_uptr;
> >
> > Do we really want this to be a pointer? ARM just inlined it in the
> > struct, why not do that?
> >
> > > +     __aligned_u64 event_len;
> > > +     __aligned_u64 out_event_uptr;
> > > +};
> >
> > Similar here too, why not just inline the response memory?
> 
> I think we can just inline them, just like what we do in
> 'iommu_hwpt_riscv_iommu_invalidate'. Do I understand correctly?

Yeah I think so
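
For reference, an inlined layout along the lines of iommu_hwpt_arm_smmuv3
could look like this (a sketch; the sizes assume the extended-format DC of
eight 64-bit words and the 32-byte fault record):

struct iommu_hwpt_riscv_iommu {
        __aligned_u64 dc[8];        /* device context from user */
        __aligned_u64 out_event[4]; /* fault record written back on error */
};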

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling
  2024-06-14 14:21 ` [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling Zong Li
@ 2024-12-10  7:54   ` yunhui cui
  2024-12-10  8:48     ` Xu Lu
  2025-09-01 13:36   ` [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support niliqiang
  1 sibling, 1 reply; 37+ messages in thread
From: yunhui cui @ 2024-12-10  7:54 UTC (permalink / raw)
  To: Zong Li
  Cc: joro, will, robin.murphy, tjeznach, paul.walmsley, palmer, aou,
	jgg, kevin.tian, linux-kernel, iommu, linux-riscv, luxu.kernel

Add Luxu in the loop.

On Fri, Jun 14, 2024 at 10:22 PM Zong Li <zong.li@sifive.com> wrote:
>
> This patch initializes the PMU stuff and uninitializes it when the
> driver is removed. The interrupt handling is also provided; this
> handler needs to be a primary handler instead of a thread function,
> because pt_regs is empty when threading the IRQ, but pt_regs is
> required by perf_event_overflow.
>
> Signed-off-by: Zong Li <zong.li@sifive.com>
> ---
>  drivers/iommu/riscv/iommu.c | 65 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 8b6a64c1ad8d..1716b2251f38 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -540,6 +540,62 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
>         return IRQ_HANDLED;
>  }
>
> +/*
> + * IOMMU Hardware performance monitor
> + */
> +
> +/* HPM interrupt primary handler */
> +static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
> +{
> +       struct riscv_iommu_device *iommu = (struct riscv_iommu_device *)dev_id;
> +
> +       /* Process pmu irq */
> +       riscv_iommu_pmu_handle_irq(&iommu->pmu);
> +
> +       /* Clear performance monitoring interrupt pending */
> +       riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);
> +
> +       return IRQ_HANDLED;
> +}
> +
> +/* HPM initialization */
> +static int riscv_iommu_hpm_enable(struct riscv_iommu_device *iommu)
> +{
> +       int rc;
> +
> +       if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> +               return 0;
> +
> +       /*
> +        * pt_regs is empty when threading the IRQ, but pt_regs is necessary
> +        * by perf_event_overflow. Use primary handler instead of thread
> +        * function for PM IRQ.
> +        *
> +        * Set the IRQF_ONESHOT flag because this IRQ might be shared with
> +        * other threaded IRQs by other queues.
> +        */
> +       rc = devm_request_irq(iommu->dev,
> +                             iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> +                             riscv_iommu_hpm_irq_handler, IRQF_ONESHOT | IRQF_SHARED, NULL, iommu);
> +       if (rc)
> +               return rc;
> +
> +       return riscv_iommu_pmu_init(&iommu->pmu, iommu->reg, dev_name(iommu->dev));
> +}
> +
> +/* HPM uninitialization */
> +static void riscv_iommu_hpm_disable(struct riscv_iommu_device *iommu)
> +{
> +       if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> +               return;
> +
> +       devm_free_irq(iommu->dev,
> +                     iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> +                     iommu);
> +
> +       riscv_iommu_pmu_uninit(&iommu->pmu);
> +}
> +
>  /* Lookup and initialize device context info structure. */
>  static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
>                                                  unsigned int devid)
> @@ -1612,6 +1668,9 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
>         riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
>         riscv_iommu_queue_disable(&iommu->cmdq);
>         riscv_iommu_queue_disable(&iommu->fltq);
> +
> +       if (iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM)
> +               riscv_iommu_pmu_uninit(&iommu->pmu);
>  }
>
>  int riscv_iommu_init(struct riscv_iommu_device *iommu)
> @@ -1651,6 +1710,10 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>         if (rc)
>                 goto err_queue_disable;
>
> +       rc = riscv_iommu_hpm_enable(iommu);
> +       if (rc)
> +               goto err_hpm_disable;
> +
>         rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
>                                     dev_name(iommu->dev));
>         if (rc) {
> @@ -1669,6 +1732,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>  err_remove_sysfs:
>         iommu_device_sysfs_remove(&iommu->iommu);
>  err_iodir_off:
> +       riscv_iommu_hpm_disable(iommu);
> +err_hpm_disable:
>         riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
>  err_queue_disable:
>         riscv_iommu_queue_disable(&iommu->fltq);
> --
> 2.17.1
>
>

Thanks,
Yunhui

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling
  2024-12-10  7:54   ` [External] " yunhui cui
@ 2024-12-10  8:48     ` Xu Lu
  2024-12-27  8:37       ` Zong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Xu Lu @ 2024-12-10  8:48 UTC (permalink / raw)
  To: yunhui cui
  Cc: Zong Li, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, jgg, kevin.tian, linux-kernel, iommu, linux-riscv

Hi Zong Li,

Thanks for your work. We have tested your IOMMU PMU driver and have
some feedback.

1. Maybe it is better to clear ipsr.PMIP first and then handle the PMU
overflow IRQ in riscv_iommu_hpm_irq_handler(). Otherwise, if a new
overflow happens after riscv_iommu_pmu_handle_irq() and before the
PMIP clear, we will drop it.
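
That is, just swapping the two steps of the handler from the patch:

static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
{
        struct riscv_iommu_device *iommu = dev_id;

        /*
         * Clear the pending bit first, so an overflow that arrives while
         * the counters are being scanned re-raises the interrupt instead
         * of being lost.
         */
        riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);

        /* Process pmu irq */
        riscv_iommu_pmu_handle_irq(&iommu->pmu);

        return IRQ_HANDLED;
}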

2. The period_left can get messed up in riscv_iommu_pmu_update(), as
riscv_iommu_pmu_get_counter() always returns the whole register value,
while bit 63 in iohpmcycles actually indicates whether an overflow
happened rather than being part of the current value. Maybe these two
functions should be implemented as:

static void riscv_iommu_pmu_set_counter(struct riscv_iommu_pmu *pmu, u32 idx,
                                        u64 value)
{
        void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES;

        if (WARN_ON_ONCE(idx < 0 || idx > pmu->num_counters))
                return;

        if (idx == 0)
                value = (value & ~RISCV_IOMMU_IOHPMCYCLES_OF) |
                        (readq(addr) & RISCV_IOMMU_IOHPMCYCLES_OF);

        writeq(FIELD_PREP(RISCV_IOMMU_IOHPMCTR_COUNTER, value), addr + idx * 8);
}

static u64 riscv_iommu_pmu_get_counter(struct riscv_iommu_pmu *pmu, u32 idx)
{
        void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES;
        u64 value;

        if (WARN_ON_ONCE(idx < 0 || idx > pmu->num_counters))
                return -EINVAL;

        value = readq(addr + idx * 8);

        if (idx == 0)
                return FIELD_GET(RISCV_IOMMU_IOHPMCYCLES_COUNTER, value);

        return FIELD_GET(RISCV_IOMMU_IOHPMCTR_COUNTER, value);
}

Please ignore me if these issues have already been discussed.

Best regards,

Xu Lu

On Tue, Dec 10, 2024 at 3:55 PM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Add Luxu in the loop.
>
> On Fri, Jun 14, 2024 at 10:22 PM Zong Li <zong.li@sifive.com> wrote:
> >
> > This patch initializes the PMU stuff and uninitializes it when the
> > driver is removed. The interrupt handling is also provided; this
> > handler needs to be a primary handler instead of a thread function,
> > because pt_regs is empty when threading the IRQ, but pt_regs is
> > required by perf_event_overflow.
> >
> > Signed-off-by: Zong Li <zong.li@sifive.com>
> > ---
> >  drivers/iommu/riscv/iommu.c | 65 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 65 insertions(+)
> >
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 8b6a64c1ad8d..1716b2251f38 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -540,6 +540,62 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
> >         return IRQ_HANDLED;
> >  }
> >
> > +/*
> > + * IOMMU Hardware performance monitor
> > + */
> > +
> > +/* HPM interrupt primary handler */
> > +static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
> > +{
> > +       struct riscv_iommu_device *iommu = (struct riscv_iommu_device *)dev_id;
> > +
> > +       /* Process pmu irq */
> > +       riscv_iommu_pmu_handle_irq(&iommu->pmu);
> > +
> > +       /* Clear performance monitoring interrupt pending */
> > +       riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);
> > +
> > +       return IRQ_HANDLED;
> > +}
> > +
> > +/* HPM initialization */
> > +static int riscv_iommu_hpm_enable(struct riscv_iommu_device *iommu)
> > +{
> > +       int rc;
> > +
> > +       if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> > +               return 0;
> > +
> > +       /*
> > +        * pt_regs is empty when threading the IRQ, but pt_regs is necessary
> > +        * by perf_event_overflow. Use primary handler instead of thread
> > +        * function for PM IRQ.
> > +        *
> > +        * Set the IRQF_ONESHOT flag because this IRQ might be shared with
> > +        * other threaded IRQs by other queues.
> > +        */
> > +       rc = devm_request_irq(iommu->dev,
> > +                             iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> > +                             riscv_iommu_hpm_irq_handler, IRQF_ONESHOT | IRQF_SHARED, NULL, iommu);
> > +       if (rc)
> > +               return rc;
> > +
> > +       return riscv_iommu_pmu_init(&iommu->pmu, iommu->reg, dev_name(iommu->dev));
> > +}
> > +
> > +/* HPM uninitialization */
> > +static void riscv_iommu_hpm_disable(struct riscv_iommu_device *iommu)
> > +{
> > +       if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> > +               return;
> > +
> > +       devm_free_irq(iommu->dev,
> > +                     iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> > +                     iommu);
> > +
> > +       riscv_iommu_pmu_uninit(&iommu->pmu);
> > +}
> > +
> >  /* Lookup and initialize device context info structure. */
> >  static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
> >                                                  unsigned int devid)
> > @@ -1612,6 +1668,9 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> >         riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
> >         riscv_iommu_queue_disable(&iommu->cmdq);
> >         riscv_iommu_queue_disable(&iommu->fltq);
> > +
> > +       if (iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM)
> > +               riscv_iommu_pmu_uninit(&iommu->pmu);
> >  }
> >
> >  int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > @@ -1651,6 +1710,10 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> >         if (rc)
> >                 goto err_queue_disable;
> >
> > +       rc = riscv_iommu_hpm_enable(iommu);
> > +       if (rc)
> > +               goto err_hpm_disable;
> > +
> >         rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> >                                     dev_name(iommu->dev));
> >         if (rc) {
> > @@ -1669,6 +1732,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> >  err_remove_sysfs:
> >         iommu_device_sysfs_remove(&iommu->iommu);
> >  err_iodir_off:
> > +       riscv_iommu_hpm_disable(iommu);
> > +err_hpm_disable:
> >         riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
> >  err_queue_disable:
> >         riscv_iommu_queue_disable(&iommu->fltq);
> > --
> > 2.17.1
> >
> >
>
> Thanks,
> Yunhui

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling
  2024-12-10  8:48     ` Xu Lu
@ 2024-12-27  8:37       ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2024-12-27  8:37 UTC (permalink / raw)
  To: Xu Lu
  Cc: yunhui cui, joro, will, robin.murphy, tjeznach, paul.walmsley,
	palmer, aou, jgg, kevin.tian, linux-kernel, iommu, linux-riscv

On Tue, Dec 10, 2024 at 4:48 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> Hi Zong Li,
>
> Thanks for your work. We have tested your IOMMU PMU driver and have
> some feedback.
>
> 1. Maybe it is better to clear ipsr.PMIP first and then handle the PMU
> overflow IRQ in riscv_iommu_hpm_irq_handler(). Otherwise, if a new
> overflow happens after riscv_iommu_pmu_handle_irq() and before the
> PMIP clear, we will drop it.

Yes, you are right. Let me change the order in the next version.

>
> 2. The period_left can get messed up in riscv_iommu_pmu_update(), as
> riscv_iommu_pmu_get_counter() always returns the whole register value,
> while bit 63 in iohpmcycles actually indicates whether an overflow
> happened rather than being part of the current value. Maybe these two
> functions should be implemented as:

Thanks for catching that. I will fix them in the next version.

>
> static void riscv_iommu_pmu_set_counter(struct riscv_iommu_pmu *pmu, u32 idx,
>                                         u64 value)
> {
>         void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES;
>
>         if (WARN_ON_ONCE(idx < 0 || idx > pmu->num_counters))
>                 return;
>
>         if (idx == 0)
>                 value = (value & ~RISCV_IOMMU_IOHPMCYCLES_OF) |
>                         (readq(addr) & RISCV_IOMMU_IOHPMCYCLES_OF);
>
>         writeq(FIELD_PREP(RISCV_IOMMU_IOHPMCTR_COUNTER, value), addr + idx * 8);
> }
>
> static u64 riscv_iommu_pmu_get_counter(struct riscv_iommu_pmu *pmu, u32 idx)
> {
>         void __iomem *addr = pmu->reg + RISCV_IOMMU_REG_IOHPMCYCLES;
>         u64 value;
>
>         if (WARN_ON_ONCE(idx < 0 || idx > pmu->num_counters))
>                 return -EINVAL;
>
>         value = readq(addr + idx * 8);
>
>         if (idx == 0)
>                 return FIELD_GET(RISCV_IOMMU_IOHPMCYCLES_COUNTER, value);
>
>         return FIELD_GET(RISCV_IOMMU_IOHPMCTR_COUNTER, value);
> }
>
> Please ignore me if these issues have already been discussed.
>
> Best regards,
>
> Xu Lu
>
> On Tue, Dec 10, 2024 at 3:55 PM yunhui cui <cuiyunhui@bytedance.com> wrote:
> >
> > Add Luxu in the loop.
> >
> > On Fri, Jun 14, 2024 at 10:22 PM Zong Li <zong.li@sifive.com> wrote:
> > >
> > > This patch initializes the PMU stuff and uninitializes it when the
> > > driver is removed. The interrupt handling is also provided; this
> > > handler needs to be a primary handler instead of a thread function,
> > > because pt_regs is empty when threading the IRQ, but pt_regs is
> > > required by perf_event_overflow.
> > >
> > > Signed-off-by: Zong Li <zong.li@sifive.com>
> > > ---
> > >  drivers/iommu/riscv/iommu.c | 65 +++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 65 insertions(+)
> > >
> > > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > > index 8b6a64c1ad8d..1716b2251f38 100644
> > > --- a/drivers/iommu/riscv/iommu.c
> > > +++ b/drivers/iommu/riscv/iommu.c
> > > @@ -540,6 +540,62 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
> > >         return IRQ_HANDLED;
> > >  }
> > >
> > > +/*
> > > + * IOMMU Hardware performance monitor
> > > + */
> > > +
> > > +/* HPM interrupt primary handler */
> > > +static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
> > > +{
> > > +       struct riscv_iommu_device *iommu = (struct riscv_iommu_device *)dev_id;
> > > +
> > > +       /* Process pmu irq */
> > > +       riscv_iommu_pmu_handle_irq(&iommu->pmu);
> > > +
> > > +       /* Clear performance monitoring interrupt pending */
> > > +       riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);
> > > +
> > > +       return IRQ_HANDLED;
> > > +}
> > > +
> > > +/* HPM initialization */
> > > +static int riscv_iommu_hpm_enable(struct riscv_iommu_device *iommu)
> > > +{
> > > +       int rc;
> > > +
> > > +       if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> > > +               return 0;
> > > +
> > > +       /*
> > > +        * pt_regs is empty when threading the IRQ, but pt_regs is necessary
> > > +        * by perf_event_overflow. Use primary handler instead of thread
> > > +        * function for PM IRQ.
> > > +        *
> > > +        * Set the IRQF_ONESHOT flag because this IRQ might be shared with
> > > +        * other threaded IRQs by other queues.
> > > +        */
> > > +       rc = devm_request_irq(iommu->dev,
> > > +                             iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> > > +                             riscv_iommu_hpm_irq_handler, IRQF_ONESHOT | IRQF_SHARED, NULL, iommu);
> > > +       if (rc)
> > > +               return rc;
> > > +
> > > +       return riscv_iommu_pmu_init(&iommu->pmu, iommu->reg, dev_name(iommu->dev));
> > > +}
> > > +
> > > +/* HPM uninitialization */
> > > +static void riscv_iommu_hpm_disable(struct riscv_iommu_device *iommu)
> > > +{
> > > +       if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> > > +               return;
> > > +
> > > +       devm_free_irq(iommu->dev,
> > > +                     iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> > > +                     iommu);
> > > +
> > > +       riscv_iommu_pmu_uninit(&iommu->pmu);
> > > +}
> > > +
> > >  /* Lookup and initialize device context info structure. */
> > >  static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
> > >                                                  unsigned int devid)
> > > @@ -1612,6 +1668,9 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> > >         riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
> > >         riscv_iommu_queue_disable(&iommu->cmdq);
> > >         riscv_iommu_queue_disable(&iommu->fltq);
> > > +
> > > +       if (iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM)
> > > +               riscv_iommu_pmu_uninit(&iommu->pmu);
> > >  }
> > >
> > >  int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > > @@ -1651,6 +1710,10 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > >         if (rc)
> > >                 goto err_queue_disable;
> > >
> > > +       rc = riscv_iommu_hpm_enable(iommu);
> > > +       if (rc)
> > > +               goto err_hpm_disable;
> > > +
> > >         rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> > >                                     dev_name(iommu->dev));
> > >         if (rc) {
> > > @@ -1669,6 +1732,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > >  err_remove_sysfs:
> > >         iommu_device_sysfs_remove(&iommu->iommu);
> > >  err_iodir_off:
> > > +       riscv_iommu_hpm_disable(iommu);
> > > +err_hpm_disable:
> > >         riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_IOMMU_MODE_OFF);
> > >  err_queue_disable:
> > >         riscv_iommu_queue_disable(&iommu->fltq);
> > > --
> > > 2.17.1
> > >
> > >
> >
> > Thanks,
> > Yunhui

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support
  2024-06-14 14:21 ` [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling Zong Li
  2024-12-10  7:54   ` [External] " yunhui cui
@ 2025-09-01 13:36   ` niliqiang
  2025-09-02  4:01     ` Zong Li
  1 sibling, 1 reply; 37+ messages in thread
From: niliqiang @ 2025-09-01 13:36 UTC (permalink / raw)
  To: zong.li
  Cc: aou, iommu, jgg, joro, kevin.tian, linux-kernel, linux-riscv,
	palmer, paul.walmsley, robin.murphy, tjeznach, will, chenruisust

Hi Zong

Fri, 14 Jun 2024 22:21:48 +0800, Zong Li <zong.li@sifive.com> wrote:

> This patch initializes the PMU stuff and uninitializes it when the
> driver is removed. The interrupt handling is also provided; this
> handler needs to be a primary handler instead of a thread function,
> because pt_regs is empty when threading the IRQ, but pt_regs is
> required by perf_event_overflow.
>
> Signed-off-by: Zong Li <zong.li@sifive.com>
> ---
>  drivers/iommu/riscv/iommu.c | 65 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 8b6a64c1ad8d..1716b2251f38 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -540,6 +540,62 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
>   return IRQ_HANDLED;
>  }
>  
> +/*
> + * IOMMU Hardware performance monitor
> + */
> +
> +/* HPM interrupt primary handler */
> +static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
> +{
> + struct riscv_iommu_device *iommu = (struct riscv_iommu_device *)dev_id;
> +
> + /* Process pmu irq */
> + riscv_iommu_pmu_handle_irq(&iommu->pmu);
> +
> + /* Clear performance monitoring interrupt pending */
> + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);
> +
> + return IRQ_HANDLED;
> +}
> +
> +/* HPM initialization */
> +static int riscv_iommu_hpm_enable(struct riscv_iommu_device *iommu)
> +{
> + int rc;
> +
> + if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> +     return 0;
> +
> + /*
> +  * pt_regs is empty when threading the IRQ, but pt_regs is necessary
> +  * by perf_event_overflow. Use primary handler instead of thread
> +  * function for PM IRQ.
> +  *
> +  * Set the IRQF_ONESHOT flag because this IRQ might be shared with
> +  * other threaded IRQs by other queues.
> +  */
> + rc = devm_request_irq(iommu->dev,
> +               iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> +               riscv_iommu_hpm_irq_handler, IRQF_ONESHOT | IRQF_SHARED, NULL, iommu);
> + if (rc)
> +     return rc;
> +
> + return riscv_iommu_pmu_init(&iommu->pmu, iommu->reg, dev_name(iommu->dev));
> +}
> +

What are the benefits of initializing the iommu-pmu driver in the iommu driver? 

It might be better for the RISC-V IOMMU PMU driver to be loaded as a separate module, as this would allow greater flexibility since different vendors may need to add custom events.

Also, I'm not quite clear on how custom events should be added if the RISC-V iommu-pmu is placed within the iommu driver.


Best regards,
Liqiang


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support
  2025-09-01 13:36   ` [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support niliqiang
@ 2025-09-02  4:01     ` Zong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Zong Li @ 2025-09-02  4:01 UTC (permalink / raw)
  To: niliqiang
  Cc: aou, iommu, jgg, joro, kevin.tian, linux-kernel, linux-riscv,
	palmer, paul.walmsley, robin.murphy, tjeznach, will, chenruisust

On Mon, Sep 1, 2025 at 9:37 PM niliqiang <ni_liqiang@126.com> wrote:
>
> Hi Zong
>
> Fri, 14 Jun 2024 22:21:48 +0800, Zong Li <zong.li@sifive.com> wrote:
>
> > This patch initializes the PMU stuff and uninitializes it when the
> > driver is removed. The interrupt handling is also provided; this
> > handler needs to be a primary handler instead of a thread function,
> > because pt_regs is empty when threading the IRQ, but pt_regs is
> > required by perf_event_overflow.
> >
> > Signed-off-by: Zong Li <zong.li@sifive.com>
> > ---
> >  drivers/iommu/riscv/iommu.c | 65 +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 65 insertions(+)
> >
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 8b6a64c1ad8d..1716b2251f38 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -540,6 +540,62 @@ static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
> >   return IRQ_HANDLED;
> >  }
> >
> > +/*
> > + * IOMMU Hardware performance monitor
> > + */
> > +
> > +/* HPM interrupt primary handler */
> > +static irqreturn_t riscv_iommu_hpm_irq_handler(int irq, void *dev_id)
> > +{
> > + struct riscv_iommu_device *iommu = (struct riscv_iommu_device *)dev_id;
> > +
> > + /* Process pmu irq */
> > + riscv_iommu_pmu_handle_irq(&iommu->pmu);
> > +
> > + /* Clear performance monitoring interrupt pending */
> > + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, RISCV_IOMMU_IPSR_PMIP);
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> > +/* HPM initialization */
> > +static int riscv_iommu_hpm_enable(struct riscv_iommu_device *iommu)
> > +{
> > + int rc;
> > +
> > + if (!(iommu->caps & RISCV_IOMMU_CAPABILITIES_HPM))
> > +     return 0;
> > +
> > + /*
> > +  * pt_regs is empty when threading the IRQ, but pt_regs is necessary
> > +  * by perf_event_overflow. Use primary handler instead of thread
> > +  * function for PM IRQ.
> > +  *
> > +  * Set the IRQF_ONESHOT flag because this IRQ might be shared with
> > +  * other threaded IRQs by other queues.
> > +  */
> > + rc = devm_request_irq(iommu->dev,
> > +               iommu->irqs[riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IPSR_PMIP)],
> > +               riscv_iommu_hpm_irq_handler, IRQF_ONESHOT | IRQF_SHARED, NULL, iommu);
> > + if (rc)
> > +     return rc;
> > +
> > + return riscv_iommu_pmu_init(&iommu->pmu, iommu->reg, dev_name(iommu->dev));
> > +}
> > +
>
> What are the benefits of initializing the iommu-pmu driver in the iommu driver?
>
> It might be better for the RISC-V IOMMU PMU driver to be loaded as a separate module, as this would allow greater flexibility since different vendors may need to add custom events.
>
> Also, I'm not quite clear on how custom events should be added if the RISC-V iommu-pmu is placed within the iommu driver.

Hi Liqiang,
My original idea is that, since the IOMMU HPM is not always present
(it depends on the capabilities.HPM bit), if we separated the HPM into
an individual module, I assume the PMU driver might not have access to
the IOMMU's complete MMIO region. I'm not sure how we would check the
capability register in the PMU driver and avoid the following
situation: capabilities.HPM is zero, but the IOMMU PMU driver is still
loaded because the PMU node is present in the DTS. It would be helpful
if you have any suggestions on this.

Regarding custom events, since we don't have the driver data, my
current rough idea is to add a vendor event map table to list the
vendor events and use Kconfig to define them respectively. This is
just an initial thought and may not be a good solution, so feel free
to share any recommendations. Of course, if we eventually decide to
move it to drivers/perf as an individual module, then we could use the
driver data for custom events, similar to what ARM does.
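
For example, a table along these lines (purely illustrative; neither
the structure nor the Kconfig symbol exists today):

struct riscv_iommu_pmu_vendor_event {
        const char *name;
        u32 event_id;
};

static const struct riscv_iommu_pmu_vendor_event riscv_iommu_vendor_events[] = {
#ifdef CONFIG_RISCV_IOMMU_PMU_VENDOR_FOO
        /* vendor-defined event IDs live above the standard ones */
        { .name = "foo_tlb_miss", .event_id = 0x100 },
        { .name = "foo_ddt_walk", .event_id = 0x101 },
#endif
};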

Thanks

>
>
> Best regards,
> Liqiang
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-09-02  4:01 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-14 14:21 [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 01/10] iommu/riscv: add RISC-V IOMMU PMU support Zong Li
2024-06-17 14:55   ` Jason Gunthorpe
2024-06-18  1:14     ` Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 02/10] iommu/riscv: support HPM and interrupt handling Zong Li
2024-12-10  7:54   ` [External] " yunhui cui
2024-12-10  8:48     ` Xu Lu
2024-12-27  8:37       ` Zong Li
2025-09-01 13:36   ` [RFC PATCH v2 00/10] RISC-V IOMMU HPM and nested IOMMU support niliqiang
2025-09-02  4:01     ` Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 03/10] iommu/riscv: use data structure instead of individual values Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 04/10] iommu/riscv: add iotlb_sync_map operation support Zong Li
2024-06-15  3:14   ` Baolu Lu
2024-06-17 13:43     ` Zong Li
2024-06-17 14:39       ` Jason Gunthorpe
2024-06-18  3:01         ` Zong Li
2024-06-18 13:31           ` Jason Gunthorpe
2024-06-14 14:21 ` [RFC PATCH v2 05/10] iommu/riscv: support GSCID and GVMA invalidation command Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 06/10] iommu/riscv: support nested iommu for getting iommu hardware information Zong Li
2024-06-19 15:49   ` Jason Gunthorpe
2024-06-21  7:32     ` Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 07/10] iommu/riscv: support nested iommu for creating domains owned by userspace Zong Li
2024-06-19 16:02   ` Jason Gunthorpe
2024-06-28  9:03     ` Zong Li
2024-06-28 22:32       ` Jason Gunthorpe
2024-06-19 16:34   ` Joao Martins
2024-06-21  7:34     ` Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 08/10] iommu/riscv: support nested iommu for flushing cache Zong Li
2024-06-15  3:22   ` Baolu Lu
2024-06-17  2:16     ` Zong Li
2024-06-19 16:17   ` Jason Gunthorpe
2024-06-28  8:19     ` Zong Li
2024-06-28 22:26       ` Jason Gunthorpe
2024-06-14 14:21 ` [RFC PATCH v2 09/10] iommu/dma: Support MSIs through nested domains Zong Li
2024-06-14 18:12   ` Nicolin Chen
2024-06-17  2:15     ` Zong Li
2024-06-14 14:21 ` [RFC PATCH v2 10/10] iommu:riscv: support nested iommu for get_msi_mapping_domain operation Zong Li
