Linux IOMMU Development


* Re: [RFC PATCH v4 7/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU
From: Jonas Karlman @ 2026-06-13  8:18 UTC (permalink / raw)
  To: MidG971
  Cc: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson,
	dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik
In-Reply-To: <20260613070116.438906-8-midgy971@gmail.com>

Hi Midgy,

On 6/13/2026 9:01 AM, MidG971 wrote:
> From: Midgy BALON <midgy971@gmail.com>
> 
> The RK3568 has an NVDLA-derived NPU at fde40000 with its own IOMMU at
> fde4b000. Add both nodes (disabled by default) and the NPU power-domain
> child under the PMU power-controller, and point rockchip,pmu at the PMU
> syscon that controls the NPU NoC bus-idle.
> 
> Besides the SCMI compute clock, assign the CRU CLK_NPU so the NPU AXI
> bus clock comes up at 200 MHz rather than the 12 MHz boot default.
> 
> The power-domain deliberately carries no pm_qos: qos_npu sits behind the
> NPU NoC, which is gated until the NPU is brought up, so a genpd power-off
> QoS save would fault reading it.
> 
> Signed-off-by: Midgy BALON <midgy971@gmail.com>
> ---
>  arch/arm64/boot/dts/rockchip/rk356x-base.dtsi | 38 +++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi b/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
> index 64bdd8b7754b5..313db59c1aed8 100644
> --- a/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
> @@ -512,6 +512,13 @@ power-domain@RK3568_PD_GPU {
>  				#power-domain-cells = <0>;
>  			};
>  
> +			pd_npu: power-domain@RK3568_PD_NPU {
> +				reg = <RK3568_PD_NPU>;
> +				clocks = <&cru ACLK_NPU_PRE>,
> +					 <&cru HCLK_NPU_PRE>;
> +				#power-domain-cells = <0>;
> +			};
> +
>  			/* These power domains are grouped by VD_LOGIC */
>  			power-domain@RK3568_PD_VI {
>  				reg = <RK3568_PD_VI>;
> @@ -572,6 +579,37 @@ power-domain@RK3568_PD_RKVENC {
>  		};
>  	};
>  
> +	rknn_core_0: npu@fde40000 {
> +		compatible = "rockchip,rk3568-rknn-core";
> +		reg = <0x0 0xfde40000 0x0 0x1000>,
> +		      <0x0 0xfde41000 0x0 0x1000>,
> +		      <0x0 0xfde43000 0x0 0x1000>;
> +		reg-names = "pc", "cna", "core";
> +		interrupts = <GIC_SPI 151 IRQ_TYPE_LEVEL_HIGH>;
> +		clocks = <&cru ACLK_NPU>, <&cru HCLK_NPU>,
> +			 <&scmi_clk SCMI_CLK_NPU>, <&cru PCLK_NPU_PRE>;
> +		clock-names = "aclk", "hclk", "npu", "pclk";
> +		assigned-clocks = <&scmi_clk SCMI_CLK_NPU>, <&cru CLK_NPU>;
> +		assigned-clock-rates = <200000000>, <600000000>;

This looks strange. The SCMI clk can be seen as a virtual clock that
switches between the normal CRU NPU clk and a PVTPLL NPU clk, depending
on the requested rate. Requesting 200 MHz, a typical opp-suspend value
(use the CRU clk instead of PVTPLL), sets the CLK_NPU rate to 200 MHz.
Then setting CLK_NPU to 600 MHz (the lowest rate that activates PVTPLL
mode) outside of SCMI control looks strange.

I suggest you only set the SCMI NPU clk rate to 200 or 400 MHz and drop
any special handling, e.g. noc_init, to more closely match RK3588, and
defer any use of the PVTPLL clk to a future series that also adds OPP
support.

> +		resets = <&cru SRST_A_NPU>, <&cru SRST_H_NPU>;
> +		reset-names = "srst_a", "srst_h";
> +		power-domains = <&power RK3568_PD_NPU>;
> +		rockchip,pmu = <&pmu>;

This looks wrong. The rockchip,pmu prop is typically used to reference
the PMU GRF (see e.g. the pinctrl node), not the power management, which
seems to be correctly abstracted using power-domains above. Please drop
this prop.

Regards,
Jonas

> +		iommus = <&rknn_mmu_0>;
> +		status = "disabled";
> +	};
> +
> +	rknn_mmu_0: iommu@fde4b000 {
> +		compatible = "rockchip,iommu";
> +		reg = <0x0 0xfde4b000 0x0 0x40>;
> +		interrupts = <GIC_SPI 151 IRQ_TYPE_LEVEL_HIGH>;
> +		clock-names = "aclk", "iface";
> +		clocks = <&cru ACLK_NPU>, <&cru HCLK_NPU>;
> +		power-domains = <&power RK3568_PD_NPU>;
> +		#iommu-cells = <0>;
> +		status = "disabled";
> +	};
> +
>  	gpu: gpu@fde60000 {
>  		compatible = "rockchip,rk3568-mali", "arm,mali-bifrost";
>  		reg = <0x0 0xfde60000 0x0 0x4000>;


* Re: [RFC PATCH v4 8/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU
From: Jonas Karlman @ 2026-06-13  7:40 UTC (permalink / raw)
  To: MidG971
  Cc: tomeu@tomeuvizoso.net, ogabbay@kernel.org, heiko@sntech.de,
	robh@kernel.org, krzk+dt@kernel.org, conor+dt@kernel.org,
	ulf.hansson@linaro.org, dri-devel@lists.freedesktop.org,
	linux-rockchip@lists.infradead.org, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-pm@vger.kernel.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	xxm@rock-chips.com, chaoyi.chen@rock-chips.com,
	finley.xiao@rock-chips.com, diederik@cknow-tech.com
In-Reply-To: <20260613070116.438906-9-midgy971@gmail.com>

Hi Midgy,

On 6/13/2026 9:01 AM, MidG971 wrote:
> From: Midgy BALON <midgy971@gmail.com>
> 
> Enable the NPU and its IOMMU on ROCK 3B and wire vdd_npu as the NPU
> power domain's domain-supply, so genpd brings the rail up and down with
> the domain (the domain is marked need_regulator). The PVTPLL compute
> clock is brought up later by the driver.
> 
> The rail is no longer kept always-on, so pin it to 1000 mV (the NPU's
> 1 GHz operating voltage; the driver runs a fixed compute rate with no
> devfreq voltage scaling) and mark it boot-on, so it is up before the
> power domain de-idles the NPU NoC at power-on.
> 
> Signed-off-by: Midgy BALON <midgy971@gmail.com>
> ---
>  .../arm64/boot/dts/rockchip/rk3568-rock-3b.dts | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts b/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts
> index 69001e453732e..d3f9776c2bdc3 100644
> --- a/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts
> +++ b/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts
> @@ -330,9 +330,10 @@ regulator-state-mem {
>  
>  			vdd_npu: DCDC_REG4 {
>  				regulator-name = "vdd_npu";
> +				regulator-boot-on;

There is no need for the NPU in the bootloader. Do not use DT as a
workaround for software issues.

This series mentions the PVTPLL NPU clk and seems to contain some
workarounds related to how the PVTPLL clock is handled in TF-A.

The PVTPLL block typically requires the pclk and power domain to be
enabled to function, and this series seems to add workarounds to try to
ensure this, e.g. with noc_init to activate PVTPLL usage.

I would suggest that you do not involve the PVTPLL clock in this initial
NPU support for RK3568, set CLK_NPU to 400 MHz and use it instead of the
SCMI clock, or keep SCMI clk rate less than or equal to 400 MHz to
disable PVTPLL_NEED mode in TF-A.

In a future series you can extend Linux with a proper PVTPLL clk driver
and OPP support for the rocket driver to correctly ensure that pclk and
pd are enabled when a PVTPLL clock is managed.

>  				regulator-initial-mode = <0x2>;
> -				regulator-min-microvolt = <500000>;
> -				regulator-max-microvolt = <1350000>;
> +				regulator-min-microvolt = <1000000>;
> +				regulator-max-microvolt = <1000000>;

Please describe the HW, do not add workarounds for software issues or
shortcomings.

Regards,
Jonas

>  				regulator-ramp-delay = <6001>;
>  
>  				regulator-state-mem {
> @@ -787,3 +788,16 @@ vp0_out_hdmi: endpoint@ROCKCHIP_VOP2_EP_HDMI0 {
>  		remote-endpoint = <&hdmi_in_vp0>;
>  	};
>  };
> +
> +&pd_npu {
> +	domain-supply = <&vdd_npu>;
> +};
> +
> +&rknn_core_0 {
> +	npu-supply = <&vdd_npu>;
> +	status = "okay";
> +};
> +
> +&rknn_mmu_0 {
> +	status = "okay";
> +};


* [RFC PATCH v4 9/9] pmdomain: rockchip: Add a regulator to the RK3568 NPU power domain
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

The RK3568 NPU rail (vdd_npu) needs to be enabled before the domain is
powered on and disabled after it is powered off. Give DOMAIN_RK3568 a
regulator parameter (like DOMAIN_RK3588 already has) so the NPU domain
can set need_regulator, letting genpd manage the rail wired up as the
domain's domain-supply instead of marking it always-on in DT.

Suggested-by: Chaoyi Chen <chaoyi.chen@rock-chips.com>
Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 drivers/pmdomain/rockchip/pm-domains.c | 36 ++++++++++++++++++--------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/pmdomain/rockchip/pm-domains.c b/drivers/pmdomain/rockchip/pm-domains.c
index 490bbb1d1d8e8..19db307e3811d 100644
--- a/drivers/pmdomain/rockchip/pm-domains.c
+++ b/drivers/pmdomain/rockchip/pm-domains.c
@@ -138,6 +138,20 @@ struct rockchip_pmu {
 	.active_wakeup = wakeup,			\
 }
 
+#define DOMAIN_M_R(_name, pwr, status, req, idle, ack, wakeup, regulator)	\
+{							\
+	.name = _name,				\
+	.pwr_w_mask = (pwr) << 16,			\
+	.pwr_mask = (pwr),				\
+	.status_mask = (status),			\
+	.req_w_mask = (req) << 16,			\
+	.req_mask = (req),				\
+	.idle_mask = (idle),				\
+	.ack_mask = (ack),				\
+	.active_wakeup = wakeup,			\
+	.need_regulator = regulator,			\
+}
+
 #define DOMAIN_M_G(_name, pwr, status, req, idle, ack, g_mask, wakeup, keepon)	\
 {							\
 	.name = _name,					\
@@ -241,8 +255,8 @@ struct rockchip_pmu {
 #define DOMAIN_RK3562(name, pwr, req, g_mask, mem, wakeup)		\
 	DOMAIN_M_G_SD(name, pwr, pwr, req, req, req, g_mask, mem, wakeup, false)
 
-#define DOMAIN_RK3568(name, pwr, req, wakeup)		\
-	DOMAIN_M(name, pwr, pwr, req, req, req, wakeup)
+#define DOMAIN_RK3568(name, pwr, req, wakeup, regulator)		\
+	DOMAIN_M_R(name, pwr, pwr, req, req, req, wakeup, regulator)
 
 #define DOMAIN_RK3576(name, p_offset, pwr, status, r_status, r_offset, req, idle, g_mask, wakeup)	\
 	DOMAIN_M_O_R_G(name, p_offset, pwr, status, 0, r_status, r_status, r_offset, req, idle, idle, g_mask, wakeup)
@@ -1274,15 +1288,15 @@ static const struct rockchip_domain_info rk3562_pm_domains[] = {
 };
 
 static const struct rockchip_domain_info rk3568_pm_domains[] = {
-	[RK3568_PD_NPU]		= DOMAIN_RK3568("npu",  BIT(1), BIT(2),  false),
-	[RK3568_PD_GPU]		= DOMAIN_RK3568("gpu",  BIT(0), BIT(1),  false),
-	[RK3568_PD_VI]		= DOMAIN_RK3568("vi",   BIT(6), BIT(3),  false),
-	[RK3568_PD_VO]		= DOMAIN_RK3568("vo",   BIT(7), BIT(4),  false),
-	[RK3568_PD_RGA]		= DOMAIN_RK3568("rga",  BIT(5), BIT(5),  false),
-	[RK3568_PD_VPU]		= DOMAIN_RK3568("vpu",  BIT(2), BIT(6),  false),
-	[RK3568_PD_RKVDEC]	= DOMAIN_RK3568("vdec", BIT(4), BIT(8),  false),
-	[RK3568_PD_RKVENC]	= DOMAIN_RK3568("venc", BIT(3), BIT(7),  false),
-	[RK3568_PD_PIPE]	= DOMAIN_RK3568("pipe", BIT(8), BIT(11), false),
+	[RK3568_PD_NPU]		= DOMAIN_RK3568("npu",  BIT(1), BIT(2),  false, true),
+	[RK3568_PD_GPU]		= DOMAIN_RK3568("gpu",  BIT(0), BIT(1),  false, false),
+	[RK3568_PD_VI]		= DOMAIN_RK3568("vi",   BIT(6), BIT(3),  false, false),
+	[RK3568_PD_VO]		= DOMAIN_RK3568("vo",   BIT(7), BIT(4),  false, false),
+	[RK3568_PD_RGA]		= DOMAIN_RK3568("rga",  BIT(5), BIT(5),  false, false),
+	[RK3568_PD_VPU]		= DOMAIN_RK3568("vpu",  BIT(2), BIT(6),  false, false),
+	[RK3568_PD_RKVDEC]	= DOMAIN_RK3568("vdec", BIT(4), BIT(8),  false, false),
+	[RK3568_PD_RKVENC]	= DOMAIN_RK3568("venc", BIT(3), BIT(7),  false, false),
+	[RK3568_PD_PIPE]	= DOMAIN_RK3568("pipe", BIT(8), BIT(11), false, false),
 };
 
 static const struct rockchip_domain_info rk3576_pm_domains[] = {
-- 
2.39.5
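
For readers following the ordering described in the commit message above
(rail up before the domain is powered on, rail down after it is powered
off), here is a minimal userspace sketch. It is not the pm-domains.c
code: struct board, hiword_write() and DOMAIN_SKETCH() are hypothetical
names, and the register model (upper 16 bits act as write-enable for the
lower 16, a set bit requests power-down) is an assumption inferred from
the `(pwr) << 16` write mask in the macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Trimmed-down, hypothetical stand-in for struct rockchip_domain_info. */
struct domain_info {
	const char *name;
	uint32_t pwr_mask;	/* power-down request bit (assumed polarity) */
	uint32_t pwr_w_mask;	/* write-enable bit for pwr_mask, upper half */
	bool need_regulator;
};

#define DOMAIN_SKETCH(_name, pwr, regulator)	\
{						\
	.name = _name,				\
	.pwr_mask = (pwr),			\
	.pwr_w_mask = (pwr) << 16,		\
	.need_regulator = regulator,		\
}

/* Hypothetical board state: one supply rail and one PMU power register. */
struct board {
	bool rail_on;
	uint32_t pmu_pwr_reg;
};

/* Model of a register whose upper 16 bits gate writes to the lower 16. */
static void hiword_write(uint32_t *reg, uint32_t val)
{
	uint32_t wmask = val >> 16;

	*reg = (*reg & ~wmask) | (val & wmask);
}

static bool domain_is_on(const struct board *b, const struct domain_info *d)
{
	return !(b->pmu_pwr_reg & d->pwr_mask);
}

static void domain_power_on(struct board *b, const struct domain_info *d)
{
	if (d->need_regulator)
		b->rail_on = true;	/* rail comes up first */

	/* The rail must be up before the domain leaves power-down. */
	assert(!d->need_regulator || b->rail_on);
	hiword_write(&b->pmu_pwr_reg, d->pwr_w_mask);
}

static void domain_power_off(struct board *b, const struct domain_info *d)
{
	hiword_write(&b->pmu_pwr_reg, d->pwr_mask | d->pwr_w_mask);

	if (d->need_regulator)
		b->rail_on = false;	/* rail goes down last */
}
```

In this model, only the bit named in the write mask changes, so powering
the "npu" entry on or off leaves the other domains' bits in the same
register untouched.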


* [RFC PATCH v4 8/9] arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

Enable the NPU and its IOMMU on ROCK 3B and wire vdd_npu as the NPU
power domain's domain-supply, so genpd brings the rail up and down with
the domain (the domain is marked need_regulator). The PVTPLL compute
clock is brought up later by the driver.

The rail is no longer kept always-on, so pin it to 1000 mV (the NPU's
1 GHz operating voltage; the driver runs a fixed compute rate with no
devfreq voltage scaling) and mark it boot-on, so it is up before the
power domain de-idles the NPU NoC at power-on.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 .../arm64/boot/dts/rockchip/rk3568-rock-3b.dts | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts b/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts
index 69001e453732e..d3f9776c2bdc3 100644
--- a/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts
+++ b/arch/arm64/boot/dts/rockchip/rk3568-rock-3b.dts
@@ -330,9 +330,10 @@ regulator-state-mem {
 
 			vdd_npu: DCDC_REG4 {
 				regulator-name = "vdd_npu";
+				regulator-boot-on;
 				regulator-initial-mode = <0x2>;
-				regulator-min-microvolt = <500000>;
-				regulator-max-microvolt = <1350000>;
+				regulator-min-microvolt = <1000000>;
+				regulator-max-microvolt = <1000000>;
 				regulator-ramp-delay = <6001>;
 
 				regulator-state-mem {
@@ -787,3 +788,16 @@ vp0_out_hdmi: endpoint@ROCKCHIP_VOP2_EP_HDMI0 {
 		remote-endpoint = <&hdmi_in_vp0>;
 	};
 };
+
+&pd_npu {
+	domain-supply = <&vdd_npu>;
+};
+
+&rknn_core_0 {
+	npu-supply = <&vdd_npu>;
+	status = "okay";
+};
+
+&rknn_mmu_0 {
+	status = "okay";
+};
-- 
2.39.5


* [RFC PATCH v4 7/9] arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

The RK3568 has an NVDLA-derived NPU at fde40000 with its own IOMMU at
fde4b000. Add both nodes (disabled by default) and the NPU power-domain
child under the PMU power-controller, and point rockchip,pmu at the PMU
syscon that controls the NPU NoC bus-idle.

Besides the SCMI compute clock, assign the CRU CLK_NPU so the NPU AXI
bus clock comes up at 200 MHz rather than the 12 MHz boot default.

The power-domain deliberately carries no pm_qos: qos_npu sits behind the
NPU NoC, which is gated until the NPU is brought up, so a genpd power-off
QoS save would fault reading it.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 arch/arm64/boot/dts/rockchip/rk356x-base.dtsi | 38 +++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi b/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
index 64bdd8b7754b5..313db59c1aed8 100644
--- a/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk356x-base.dtsi
@@ -512,6 +512,13 @@ power-domain@RK3568_PD_GPU {
 				#power-domain-cells = <0>;
 			};
 
+			pd_npu: power-domain@RK3568_PD_NPU {
+				reg = <RK3568_PD_NPU>;
+				clocks = <&cru ACLK_NPU_PRE>,
+					 <&cru HCLK_NPU_PRE>;
+				#power-domain-cells = <0>;
+			};
+
 			/* These power domains are grouped by VD_LOGIC */
 			power-domain@RK3568_PD_VI {
 				reg = <RK3568_PD_VI>;
@@ -572,6 +579,37 @@ power-domain@RK3568_PD_RKVENC {
 		};
 	};
 
+	rknn_core_0: npu@fde40000 {
+		compatible = "rockchip,rk3568-rknn-core";
+		reg = <0x0 0xfde40000 0x0 0x1000>,
+		      <0x0 0xfde41000 0x0 0x1000>,
+		      <0x0 0xfde43000 0x0 0x1000>;
+		reg-names = "pc", "cna", "core";
+		interrupts = <GIC_SPI 151 IRQ_TYPE_LEVEL_HIGH>;
+		clocks = <&cru ACLK_NPU>, <&cru HCLK_NPU>,
+			 <&scmi_clk SCMI_CLK_NPU>, <&cru PCLK_NPU_PRE>;
+		clock-names = "aclk", "hclk", "npu", "pclk";
+		assigned-clocks = <&scmi_clk SCMI_CLK_NPU>, <&cru CLK_NPU>;
+		assigned-clock-rates = <200000000>, <600000000>;
+		resets = <&cru SRST_A_NPU>, <&cru SRST_H_NPU>;
+		reset-names = "srst_a", "srst_h";
+		power-domains = <&power RK3568_PD_NPU>;
+		rockchip,pmu = <&pmu>;
+		iommus = <&rknn_mmu_0>;
+		status = "disabled";
+	};
+
+	rknn_mmu_0: iommu@fde4b000 {
+		compatible = "rockchip,iommu";
+		reg = <0x0 0xfde4b000 0x0 0x40>;
+		interrupts = <GIC_SPI 151 IRQ_TYPE_LEVEL_HIGH>;
+		clock-names = "aclk", "iface";
+		clocks = <&cru ACLK_NPU>, <&cru HCLK_NPU>;
+		power-domains = <&power RK3568_PD_NPU>;
+		#iommu-cells = <0>;
+		status = "disabled";
+	};
+
 	gpu: gpu@fde60000 {
 		compatible = "rockchip,rk3568-mali", "arm,mali-bifrost";
 		reg = <0x0 0xfde60000 0x0 0x4000>;
-- 
2.39.5


* [RFC PATCH v4 6/9] dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

The RK3568 carries a single core of the same NVDLA-derived NPU IP as the
RK3588.  Add its compatible.

On RK3568 the NPU NoC bus-idle and power gating are controlled through the
system PMU rather than a dedicated register block, so add a rockchip,pmu
phandle to that syscon.  The RK3568 NPU has no dedicated SRAM rail, so
sram-supply is required only on RK3588.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 .../npu/rockchip,rk3588-rknn-core.yaml        | 27 ++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/npu/rockchip,rk3588-rknn-core.yaml b/Documentation/devicetree/bindings/npu/rockchip,rk3588-rknn-core.yaml
index caca2a4903cd1..e0b948ac47d45 100644
--- a/Documentation/devicetree/bindings/npu/rockchip,rk3588-rknn-core.yaml
+++ b/Documentation/devicetree/bindings/npu/rockchip,rk3588-rknn-core.yaml
@@ -21,6 +21,7 @@ properties:
 
   compatible:
     enum:
+      - rockchip,rk3568-rknn-core
       - rockchip,rk3588-rknn-core
 
   reg:
@@ -50,6 +51,13 @@ properties:
 
   npu-supply: true
 
+  rockchip,pmu:
+    $ref: /schemas/types.yaml#/definitions/phandle
+    description:
+      Phandle to the PMU syscon.  On RK3568 the NPU's NoC bus-idle and
+      power gating are controlled through the PMU; this points to that
+      syscon so those registers can be reached.
+
   power-domains:
     maxItems: 1
 
@@ -75,7 +83,24 @@ required:
   - resets
   - reset-names
   - npu-supply
-  - sram-supply
+
+allOf:
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: rockchip,rk3588-rknn-core
+    then:
+      required:
+        - sram-supply
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: rockchip,rk3568-rknn-core
+    then:
+      required:
+        - rockchip,pmu
 
 additionalProperties: false
 
-- 
2.39.5


* [RFC PATCH v4 5/9] accel: rocket: Keep the IOMMU domain attached across jobs
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

rocket attached the job's IOMMU domain in rocket_job_run() and
detached it again on every completion and reset. Each attach/detach
toggles the rk_iommu stall/force-reset/paging handshake, and on
RK3568 the NPU MMU is idle between jobs, so that handshake times out
and logs a burst of "stall/paging request timed out" errors for
every job.

Attach the per-context domain once and keep it: track the attached
domain in the core, swap it only when a job from a different context
runs, and detach it at core teardown. A reference on the attached
domain is held so it outlives the job that first attached it and is
released on swap/teardown.

Because a hardware reset (on job timeout) wipes the IOMMU page-table
base register, drop the attached domain after rocket_core_reset() so
the next job re-attaches and reprograms it. Also tear down the
scheduler before detaching the IOMMU in rocket_core_fini(), so an
in-flight job can no longer reach the domain being detached.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 drivers/accel/rocket/rocket_core.c | 14 +++++++++++-
 drivers/accel/rocket/rocket_core.h |  3 +++
 drivers/accel/rocket/rocket_job.c  | 35 +++++++++++++++++++++++++-----
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
index 779e951596a15..6c128f585cff4 100644
--- a/drivers/accel/rocket/rocket_core.c
+++ b/drivers/accel/rocket/rocket_core.c
@@ -13,6 +13,7 @@
 #include <linux/reset.h>
 
 #include "rocket_core.h"
+#include "rocket_drv.h"
 #include "rocket_job.h"
 
 int rocket_core_init(struct rocket_core *core)
@@ -112,9 +113,20 @@ void rocket_core_fini(struct rocket_core *core)
 {
 	pm_runtime_dont_use_autosuspend(core->dev);
 	pm_runtime_disable(core->dev);
+
+	/*
+	 * Stop the scheduler before tearing down the IOMMU so an in-flight
+	 * job can no longer touch the (about to be detached) domain.
+	 */
+	rocket_job_fini(core);
+
+	if (core->attached_domain) {
+		iommu_detach_group(NULL, core->iommu_group);
+		rocket_iommu_domain_put(core->attached_domain);
+		core->attached_domain = NULL;
+	}
 	iommu_group_put(core->iommu_group);
 	core->iommu_group = NULL;
-	rocket_job_fini(core);
 }
 
 void rocket_core_reset(struct rocket_core *core)
diff --git a/drivers/accel/rocket/rocket_core.h b/drivers/accel/rocket/rocket_core.h
index 5a145ba8c5a92..78791ecb32e75 100644
--- a/drivers/accel/rocket/rocket_core.h
+++ b/drivers/accel/rocket/rocket_core.h
@@ -42,6 +42,8 @@ struct rocket_soc_data {
 #define rocket_core_writel(core, reg, value) \
 	writel(value, (core)->core_iomem + (REG_CORE_##reg) - REG_CORE_S_STATUS)
 
+struct rocket_iommu_domain;
+
 struct rocket_core {
 	struct device *dev;
 	struct rocket_device *rdev;
@@ -56,6 +58,7 @@ struct rocket_core {
 	struct reset_control_bulk_data resets[2];
 
 	struct iommu_group *iommu_group;
+	struct rocket_iommu_domain *attached_domain;
 
 	struct mutex job_lock;
 	struct rocket_job *in_flight_job;
diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index e25234261536b..368b2ebead1b3 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -9,6 +9,7 @@
 #include <drm/rocket_accel.h>
 #include <linux/interrupt.h>
 #include <linux/iommu.h>
+#include <linux/kref.h>
 #include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
 
@@ -314,9 +315,26 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 	if (ret < 0)
 		return fence;
 
-	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	/*
+	 * Attach the job's IOMMU domain only when it differs from the one
+	 * already attached. Re-attaching per job toggles the rk_iommu
+	 * stall/reset handshake on an idle NPU MMU, which is slow and
+	 * noisy; keep the domain attached across jobs instead.
+	 */
+	if (core->attached_domain != job->domain) {
+		if (core->attached_domain) {
+			iommu_detach_group(NULL, core->iommu_group);
+			rocket_iommu_domain_put(core->attached_domain);
+			core->attached_domain = NULL;
+		}
+
+		ret = iommu_attach_group(job->domain->domain, core->iommu_group);
+		if (ret < 0)
+			return fence;
+
+		kref_get(&job->domain->kref);
+		core->attached_domain = job->domain;
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
@@ -340,7 +358,6 @@ static void rocket_job_handle_irq(struct rocket_core *core)
 				return;
 			}
 
-			iommu_detach_group(NULL, iommu_group_get(core->dev));
 			dma_fence_signal(core->in_flight_job->done_fence);
 			pm_runtime_put_autosuspend(core->dev);
 			core->in_flight_job = NULL;
@@ -376,7 +393,15 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
 	 */
 	rocket_core_reset(core);
 
-	iommu_detach_group(NULL, core->iommu_group);
+	/*
+	 * The reset wipes the IOMMU page-table base, so drop the attached
+	 * domain to force the next job to re-attach and reprogram it.
+	 */
+	if (core->attached_domain) {
+		iommu_detach_group(NULL, core->iommu_group);
+		rocket_iommu_domain_put(core->attached_domain);
+		core->attached_domain = NULL;
+	}
 
 	/* NPU has been reset, we can clear the reset pending bit. */
 	atomic_set(&core->reset.pending, 0);
-- 
2.39.5
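
The attach-once scheme described in this patch can be illustrated with a
small userspace model. This is only a sketch: struct domain, struct core,
core_run_job() and core_drop_domain() are hypothetical stand-ins, and the
`handshakes` counter merely represents each attach or detach that would
reach the IOMMU hardware. It does not call any kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct rocket_iommu_domain: only the reference count. */
struct domain {
	int refcount;
};

/* Stand-in for struct rocket_core, plus a counter for the handshake. */
struct core {
	struct domain *attached_domain;
	int handshakes;	/* attach/detach operations that hit "hardware" */
};

static void domain_get(struct domain *d)
{
	d->refcount++;
}

static void domain_put(struct domain *d)
{
	assert(d->refcount > 0);
	d->refcount--;
}

/* Drop whatever is attached; used on swap, reset and teardown. */
static void core_drop_domain(struct core *c)
{
	if (!c->attached_domain)
		return;

	c->handshakes++;		/* models the detach */
	domain_put(c->attached_domain);
	c->attached_domain = NULL;
}

/* Models the attach decision at the top of the job-run path. */
static void core_run_job(struct core *c, struct domain *job_domain)
{
	if (c->attached_domain == job_domain)
		return;			/* same context: keep it attached */

	core_drop_domain(c);
	c->handshakes++;		/* models the attach */
	domain_get(job_domain);		/* core holds its own reference */
	c->attached_domain = job_domain;
}
```

Running several jobs from one context costs a single handshake in this
model, while a job from another context or a reset (core_drop_domain())
forces the next job to attach again.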


* [RFC PATCH v4 4/9] accel: rocket: Reset the NPU before detaching the IOMMU on timeout
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

On a job timeout the NPU AXI master can be left wedged with
outstanding transactions. rocket_reset() detached the IOMMU group
before resetting the hardware, so iommu_detach_group() ->
__iommu_group_set_core_domain() asked the rk_iommu to stall and wait
for the in-flight transactions to drain. They never did, the stall
request timed out (-ETIMEDOUT) and the IOMMU core WARNed:

  WARNING: drivers/iommu/iommu.c:157 __iommu_group_set_core_domain
    iommu_detach_group
    rocket_reset
    rocket_job_timedout

Assert the core reset first: it quiesces the AXI master so the
following IOMMU detach completes cleanly. Move the detach after
rocket_core_reset() and out of the job_lock (it does not touch
in_flight_job).

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 drivers/accel/rocket/rocket_job.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833f..e25234261536b 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -364,14 +364,20 @@ rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
 		if (core->in_flight_job)
 			pm_runtime_put_noidle(core->dev);
 
-		iommu_detach_group(NULL, core->iommu_group);
-
 		core->in_flight_job = NULL;
 	}
 
-	/* Proceed with reset now. */
+	/*
+	 * Reset the NPU hardware before detaching the IOMMU. A timed-out job
+	 * leaves the NPU AXI master wedged; detaching the IOMMU then issues a
+	 * stall request that never drains and times out (warning in the IOMMU
+	 * core). Asserting the core reset first quiesces the master so the
+	 * detach completes cleanly.
+	 */
 	rocket_core_reset(core);
 
+	iommu_detach_group(NULL, core->iommu_group);
+
 	/* NPU has been reset, we can clear the reset pending bit. */
 	atomic_set(&core->reset.pending, 0);
 
-- 
2.39.5
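
To make the ordering argument of this patch concrete, here is a toy model
of the two recovery sequences. It is a sketch under the assumptions
stated in the commit message (a wedged AXI master prevents the stall from
draining, and the core reset quiesces it); struct npu_model and the
function names are hypothetical and do not correspond to driver symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct npu_model {
	bool axi_wedged;	/* outstanding transactions never drain */
	bool iommu_attached;
};

/* Models the core reset: it quiesces the AXI master. */
static void npu_reset(struct npu_model *n)
{
	n->axi_wedged = false;
}

/* Models the detach: the stall only completes if transactions drain. */
static int iommu_detach(struct npu_model *n)
{
	/* Detaching a group that is not attached would be a caller bug. */
	assert(n->iommu_attached);

	if (n->axi_wedged)
		return -ETIMEDOUT;

	n->iommu_attached = false;
	return 0;
}

/* Old order: detach while the master may still be wedged. */
static int recover_detach_first(struct npu_model *n)
{
	int ret = iommu_detach(n);

	npu_reset(n);
	return ret;
}

/* New order: reset first, then detach. */
static int recover_reset_first(struct npu_model *n)
{
	npu_reset(n);
	return iommu_detach(n);
}
```

Starting both sequences from the same wedged state, only the
detach-first order returns -ETIMEDOUT in this model, which corresponds to
the WARN quoted in the commit message.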


* [RFC PATCH v4 3/9] accel: rocket: Add RK3568 SoC support
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

The RK3568 has a single core of the same NVDLA-derived NPU IP as the
RK3588, with a 32-bit AXI master.  Add rk3568_soc_data and its
compatible.

Unlike the RK3588, the RK3568 NPU's compute clock is a PVTPLL managed by
TF-A via SCMI; start it from an noc_init callback with a real rate change
(an intermediate rate defeats the clock framework's unchanged-rate
shortcut).  Powering on and de-idling the NPU NoC are left to the power
domain (genpd), which performs them when the IOMMU supplier is resumed,
so the driver does not poke the PMU directly.

If noc_init fails, unwind through rocket_core_fini() so the core is torn
down completely rather than leaking the runtime-PM and IOMMU state.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 drivers/accel/rocket/rocket_core.c |  9 +++++++++
 drivers/accel/rocket/rocket_core.h |  3 +++
 drivers/accel/rocket/rocket_drv.c  | 31 ++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
index 09c445af7de73..779e951596a15 100644
--- a/drivers/accel/rocket/rocket_core.c
+++ b/drivers/accel/rocket/rocket_core.c
@@ -88,6 +88,15 @@ int rocket_core_init(struct rocket_core *core)
 		return err;
 	}
 
+	if (core->soc_data->noc_init) {
+		err = core->soc_data->noc_init(core);
+		if (err) {
+			pm_runtime_put_sync(dev);
+			rocket_core_fini(core);
+			return err;
+		}
+	}
+
 	version = rocket_pc_readl(core, VERSION);
 	version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
 
diff --git a/drivers/accel/rocket/rocket_core.h b/drivers/accel/rocket/rocket_core.h
index d6421251670dc..5a145ba8c5a92 100644
--- a/drivers/accel/rocket/rocket_core.h
+++ b/drivers/accel/rocket/rocket_core.h
@@ -18,10 +18,13 @@ struct rocket_core;
  * struct rocket_soc_data - per-SoC configuration data
  * @num_cores: Number of NPU cores in this SoC.
  * @dma_bits: Physical address width reachable by the NPU's AXI master.
+ * @noc_init: Optional callback to bring up the NPU before it is reachable.
+ *            Used on RK3568 to start the PVTPLL compute clock via SCMI.
  */
 struct rocket_soc_data {
 	unsigned int num_cores;
 	unsigned int dma_bits;
+	int (*noc_init)(struct rocket_core *core);
 };
 
 #define rocket_pc_readl(core, reg) \
diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
index f0beed2d522c7..86484110ad6f0 100644
--- a/drivers/accel/rocket/rocket_drv.c
+++ b/drivers/accel/rocket/rocket_drv.c
@@ -10,6 +10,7 @@
 #include <linux/err.h>
 #include <linux/iommu.h>
 #include <linux/of.h>
+#include <linux/of_clk.h>
 #include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
 
@@ -223,12 +224,42 @@ static void rocket_remove(struct platform_device *pdev)
 	}
 }
 
+/*
+ * The NPU compute clock is a PVTPLL managed by TF-A via SCMI; spin it up
+ * with a real rate change (an intermediate rate defeats the clock
+ * framework's unchanged-rate shortcut).  Powering on and de-idling the NPU
+ * NoC are handled by the power domain (genpd) before the NPU is accessed.
+ */
+static int rk3568_noc_init(struct rocket_core *core)
+{
+	struct clk *npu_clk;
+
+	npu_clk = of_clk_get_by_name(core->dev->of_node, "npu");
+	if (IS_ERR(npu_clk))
+		return dev_err_probe(core->dev, PTR_ERR(npu_clk),
+				     "failed to get the NPU SCMI clock\n");
+
+	if (clk_set_rate(npu_clk, 600000000UL) ||
+	    clk_set_rate(npu_clk, 1000000000UL))
+		dev_warn(core->dev, "failed to set the NPU compute clock rate\n");
+	clk_put(npu_clk);
+
+	return 0;
+}
+
+static const struct rocket_soc_data rk3568_soc_data = {
+	.num_cores = 1,
+	.dma_bits = 32,
+	.noc_init = rk3568_noc_init,
+};
+
 static const struct rocket_soc_data rk3588_soc_data = {
 	.num_cores = 3,
 	.dma_bits = 40,
 };
 
 static const struct of_device_id dt_match[] = {
+	{ .compatible = "rockchip,rk3568-rknn-core", .data = &rk3568_soc_data },
 	{ .compatible = "rockchip,rk3588-rknn-core", .data = &rk3588_soc_data },
 	{}
 };
-- 
2.39.5
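
The "unchanged-rate shortcut" mentioned in this patch can be pictured
with a simplified model. This is a sketch only: struct fake_clk and its
helpers are hypothetical, and the model assumes the framework's cached
rate already equals the 1 GHz target while the firmware-managed clock has
not actually been started. The real clock framework and SCMI path are
considerably more involved.

```c
#include <assert.h>

struct fake_clk {
	unsigned long cached_rate;	/* what the framework believes */
	int firmware_calls;		/* requests that reach firmware */
};

static int fake_clk_set_rate(struct fake_clk *clk, unsigned long rate)
{
	assert(rate != 0);

	if (rate == clk->cached_rate)
		return 0;		/* shortcut: firmware never sees it */

	clk->firmware_calls++;
	clk->cached_rate = rate;
	return 0;
}

/* Mirrors the two-step sequence used by rk3568_noc_init() in the patch. */
static int bring_up_compute_clock(struct fake_clk *clk)
{
	int ret = fake_clk_set_rate(clk, 600000000UL);

	if (ret)
		return ret;

	return fake_clk_set_rate(clk, 1000000000UL);
}
```

In this model a direct request for the cached rate is dropped, while
stepping through the intermediate 600 MHz rate makes both requests reach
firmware.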


* [RFC PATCH v4 2/9] accel: rocket: Derive DMA width and core count from match data
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

The probe already has the per-SoC match data, which now records the core
count and DMA width.  Use it for the cores array allocation and the
device DMA mask instead of re-scanning the device tree for available core
nodes.

While at it, reject a device tree that declares more NPU core nodes than
the SoC has, so the fixed-size cores array can never be overrun.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 drivers/accel/rocket/rocket_core.h   |  2 ++
 drivers/accel/rocket/rocket_device.c | 15 +++++----------
 drivers/accel/rocket/rocket_device.h |  3 ++-
 drivers/accel/rocket/rocket_drv.c    | 13 ++++++++++++-
 4 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/accel/rocket/rocket_core.h b/drivers/accel/rocket/rocket_core.h
index 8ee105a0be40e..d6421251670dc 100644
--- a/drivers/accel/rocket/rocket_core.h
+++ b/drivers/accel/rocket/rocket_core.h
@@ -16,9 +16,11 @@ struct rocket_core;
 
 /**
  * struct rocket_soc_data - per-SoC configuration data
+ * @num_cores: Number of NPU cores in this SoC.
  * @dma_bits: Physical address width reachable by the NPU's AXI master.
  */
 struct rocket_soc_data {
+	unsigned int num_cores;
 	unsigned int dma_bits;
 };
 
diff --git a/drivers/accel/rocket/rocket_device.c b/drivers/accel/rocket/rocket_device.c
index 46e6ee1e72c5f..6186f4faa3a2a 100644
--- a/drivers/accel/rocket/rocket_device.c
+++ b/drivers/accel/rocket/rocket_device.c
@@ -6,18 +6,16 @@
 #include <linux/clk.h>
 #include <linux/dma-mapping.h>
 #include <linux/platform_device.h>
-#include <linux/of.h>
 
 #include "rocket_device.h"
 
 struct rocket_device *rocket_device_init(struct platform_device *pdev,
-					 const struct drm_driver *rocket_drm_driver)
+					 const struct drm_driver *rocket_drm_driver,
+					 const struct rocket_soc_data *soc_data)
 {
 	struct device *dev = &pdev->dev;
-	struct device_node *core_node;
 	struct rocket_device *rdev;
 	struct drm_device *ddev;
-	unsigned int num_cores = 0;
 	int err;
 
 	rdev = devm_drm_dev_alloc(dev, rocket_drm_driver, struct rocket_device, ddev);
@@ -27,17 +25,14 @@ struct rocket_device *rocket_device_init(struct platform_device *pdev,
 	ddev = &rdev->ddev;
 	dev_set_drvdata(dev, rdev);
 
-	for_each_compatible_node(core_node, NULL, "rockchip,rk3588-rknn-core")
-		if (of_device_is_available(core_node))
-			num_cores++;
-
-	rdev->cores = devm_kcalloc(dev, num_cores, sizeof(*rdev->cores), GFP_KERNEL);
+	rdev->cores = devm_kcalloc(dev, soc_data->num_cores, sizeof(*rdev->cores),
+				   GFP_KERNEL);
 	if (!rdev->cores)
 		return ERR_PTR(-ENOMEM);
 
 	dma_set_max_seg_size(dev, UINT_MAX);
 
-	err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40));
+	err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(soc_data->dma_bits));
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/drivers/accel/rocket/rocket_device.h b/drivers/accel/rocket/rocket_device.h
index ce662abc01d3d..2f74e078974e3 100644
--- a/drivers/accel/rocket/rocket_device.h
+++ b/drivers/accel/rocket/rocket_device.h
@@ -22,7 +22,8 @@ struct rocket_device {
 };
 
 struct rocket_device *rocket_device_init(struct platform_device *pdev,
-					 const struct drm_driver *rocket_drm_driver);
+					 const struct drm_driver *rocket_drm_driver,
+					 const struct rocket_soc_data *soc_data);
 void rocket_device_fini(struct rocket_device *rdev);
 #define to_rocket_device(drm_dev) \
 	((struct rocket_device *)(container_of((drm_dev), struct rocket_device, ddev)))
diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
index 384c38e13acce..f0beed2d522c7 100644
--- a/drivers/accel/rocket/rocket_drv.c
+++ b/drivers/accel/rocket/rocket_drv.c
@@ -159,11 +159,15 @@ static const struct drm_driver rocket_drm_driver = {
 
 static int rocket_probe(struct platform_device *pdev)
 {
+	const struct rocket_soc_data *soc_data = of_device_get_match_data(&pdev->dev);
 	int ret;
 
+	if (!soc_data)
+		return -EINVAL;
+
 	if (rdev == NULL) {
 		/* First core probing, initialize DRM device. */
-		rdev = rocket_device_init(drm_dev, &rocket_drm_driver);
+		rdev = rocket_device_init(drm_dev, &rocket_drm_driver, soc_data);
 		if (IS_ERR(rdev)) {
 			dev_err(&pdev->dev, "failed to initialize rocket device\n");
 			return PTR_ERR(rdev);
@@ -172,6 +176,12 @@ static int rocket_probe(struct platform_device *pdev)
 
 	unsigned int core = rdev->num_cores;
 
+	if (core >= soc_data->num_cores) {
+		dev_err(&pdev->dev, "too many NPU core nodes (max %u)\n",
+			soc_data->num_cores);
+		return -EINVAL;
+	}
+
 	dev_set_drvdata(&pdev->dev, rdev);
 
 	rdev->cores[core].rdev = rdev;
@@ -214,6 +224,7 @@ static void rocket_remove(struct platform_device *pdev)
 }
 
 static const struct rocket_soc_data rk3588_soc_data = {
+	.num_cores = 3,
 	.dma_bits = 40,
 };
 
-- 
2.39.5
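
The bounds check added to rocket_probe() can be sketched outside the
kernel as follows. All toy_* names are hypothetical; the model only
assumes what the patch states, namely that the cores array is sized from
the per-SoC core count and that each probed core takes the next slot.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_soc_data { unsigned int num_cores; };
struct toy_core { int in_use; };

struct toy_device {
	struct toy_core *cores;
	unsigned int num_cores;		/* cores probed so far */
};

/* Size the array from the per-SoC data, as the patch does. */
static int toy_device_init(struct toy_device *rdev,
			   const struct toy_soc_data *soc)
{
	assert(rdev && soc);
	rdev->cores = calloc(soc->num_cores, sizeof(*rdev->cores));
	if (!rdev->cores)
		return -ENOMEM;
	rdev->num_cores = 0;
	return 0;
}

/* Reject a core node beyond what the SoC has before touching the array. */
static int toy_probe_core(struct toy_device *rdev,
			  const struct toy_soc_data *soc)
{
	unsigned int core = rdev->num_cores;

	if (core >= soc->num_cores)
		return -EINVAL;

	rdev->cores[core].in_use = 1;
	rdev->num_cores++;
	return 0;
}
```

With num_cores = 1, the first toy_probe_core() call succeeds and a second
call returns -EINVAL without writing past the array.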



* [RFC PATCH v4 1/9] accel: rocket: Introduce per-SoC rocket_soc_data
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON
In-Reply-To: <20260613070116.438906-1-midgy971@gmail.com>

From: Midgy BALON <midgy971@gmail.com>

Add a per-SoC data structure carried in the OF match table, currently
holding only the NPU AXI address width, and use it for the per-core DMA
mask instead of a hardcoded 40-bit value.  No functional change: the
RK3588 AXI master is 40-bit.  This prepares for SoCs with a narrower
address width.

Signed-off-by: Midgy BALON <midgy971@gmail.com>
---
 drivers/accel/rocket/rocket_core.c |  7 ++++++-
 drivers/accel/rocket/rocket_core.h | 11 +++++++++++
 drivers/accel/rocket/rocket_drv.c  |  6 +++++-
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
index b3b2fa9ba645a..09c445af7de73 100644
--- a/drivers/accel/rocket/rocket_core.c
+++ b/drivers/accel/rocket/rocket_core.c
@@ -7,6 +7,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/err.h>
 #include <linux/iommu.h>
+#include <linux/of.h>
 #include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
 #include <linux/reset.h>
@@ -21,6 +22,10 @@ int rocket_core_init(struct rocket_core *core)
 	u32 version;
 	int err = 0;
 
+	core->soc_data = of_device_get_match_data(dev);
+	if (!core->soc_data)
+		return dev_err_probe(dev, -EINVAL, "missing SoC match data\n");
+
 	core->resets[0].id = "srst_a";
 	core->resets[1].id = "srst_h";
 	err = devm_reset_control_bulk_get_exclusive(&pdev->dev, ARRAY_SIZE(core->resets),
@@ -52,7 +57,7 @@ int rocket_core_init(struct rocket_core *core)
 
 	dma_set_max_seg_size(dev, UINT_MAX);
 
-	err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(40));
+	err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(core->soc_data->dma_bits));
 	if (err)
 		return err;
 
diff --git a/drivers/accel/rocket/rocket_core.h b/drivers/accel/rocket/rocket_core.h
index f6d7382854ca9..8ee105a0be40e 100644
--- a/drivers/accel/rocket/rocket_core.h
+++ b/drivers/accel/rocket/rocket_core.h
@@ -12,6 +12,16 @@
 
 #include "rocket_registers.h"
 
+struct rocket_core;
+
+/**
+ * struct rocket_soc_data - per-SoC configuration data
+ * @dma_bits: Physical address width reachable by the NPU's AXI master.
+ */
+struct rocket_soc_data {
+	unsigned int dma_bits;
+};
+
 #define rocket_pc_readl(core, reg) \
 	readl((core)->pc_iomem + (REG_PC_##reg))
 #define rocket_pc_writel(core, reg, value) \
@@ -31,6 +41,7 @@ struct rocket_core {
 	struct device *dev;
 	struct rocket_device *rdev;
 	unsigned int index;
+	const struct rocket_soc_data *soc_data;
 
 	int irq;
 	void __iomem *pc_iomem;
diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
index 8bbbce594883e..384c38e13acce 100644
--- a/drivers/accel/rocket/rocket_drv.c
+++ b/drivers/accel/rocket/rocket_drv.c
@@ -213,8 +213,12 @@ static void rocket_remove(struct platform_device *pdev)
 	}
 }
 
+static const struct rocket_soc_data rk3588_soc_data = {
+	.dma_bits = 40,
+};
+
 static const struct of_device_id dt_match[] = {
-	{ .compatible = "rockchip,rk3588-rknn-core" },
+	{ .compatible = "rockchip,rk3588-rknn-core", .data = &rk3588_soc_data },
 	{}
 };
 MODULE_DEVICE_TABLE(of, dt_match);
-- 
2.39.5
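
To make the effect of dma_bits concrete, here is a user-space sketch of
what a 32-bit versus 40-bit mask allows. TOY_DMA_BIT_MASK uses the same
expression as the kernel's DMA_BIT_MASK() but is a local copy, and
toy_addr_reachable() is a hypothetical helper for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the DMA_BIT_MASK() expression, renamed for this sketch. */
#define TOY_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True when addr has no bits set above the device's address width. */
static int toy_addr_reachable(uint64_t addr, unsigned int dma_bits)
{
	assert(dma_bits >= 1 && dma_bits <= 64);
	return (addr & ~TOY_DMA_BIT_MASK(dma_bits)) == 0;
}
```

An address at exactly 4 GiB (0x100000000) passes the 40-bit check used
for RK3588 but fails a 32-bit check, which is the case a narrower SoC
would need to exclude.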



* [RFC PATCH v4 0/9] accel: rocket: Add RK3568 NPU support
From: MidG971 @ 2026-06-13  7:01 UTC (permalink / raw)
  To: tomeu, ogabbay, heiko, robh, krzk+dt, conor+dt, ulf.hansson
  Cc: dri-devel, linux-rockchip, devicetree, linux-arm-kernel, linux-pm,
	iommu, linux-kernel, xxm, chaoyi.chen, finley.xiao, diederik,
	jonas, Midgy BALON

From: Midgy BALON <midgy971@gmail.com>

RFC, not for merge. End-to-end inference does not produce correct output
yet (see Status), so per the v2 discussion this is a request for design
feedback. It probes, attaches, and submits cleanly on a stock v7.1-rc6
tree; what remains is one hardware-internal issue.

The RK3568 has a single NVDLA-derived NPU core, the same IP family as the
RK3588 NPU the driver already supports; the register layout matches. The
RK3568 differences are a 32-bit NPU AXI/IOMMU (vs 40-bit) and explicit
PVTPLL/PMU bring-up to power and de-idle the NPU before it is reachable.

Patches:
  1-2  rocket: per-SoC data struct, then derive DMA width and core count
       from match data (refactors, no functional change); patch 2 also
       bounds-checks the per-SoC cores array.
  3    rocket: RK3568 SoC data; start the PVTPLL compute clock via SCMI.
       Powering on and de-idling the NPU NoC are left to the power domain.
  4    rocket: reset the NPU before detaching the IOMMU on a job timeout
       (the detach otherwise stalls a wedged AXI master and WARNs).
  5    rocket: keep the IOMMU domain attached across jobs instead of
       re-attaching per job (the per-job rk_iommu handshake on the idle
       NPU MMU is slow and noisy); also drop the domain on reset and stop
       the scheduler before IOMMU teardown.
  6    dt-bindings: add the RK3568 NPU compatible; require rockchip,pmu
       for RK3568.
  7-8  arm64 dts: add the NPU and its IOMMU, and enable them on ROCK 3B.
  9    pmdomain: give the RK3568 NPU power domain a regulator so genpd
       owns vdd_npu via domain-supply (Suggested-by Chaoyi Chen).

Dependencies. This series no longer touches the IOMMU driver; two
in-flight Rockchip IOMMU changes are relevant but not part of it:
  - Simon Xue's "iommu/rockchip: Drop global rk_ops in favor of
    per-device ops" [1]. On boards with more than 4 GiB of RAM the NPU
    MMU's DTE must stay below 4 GiB (its DTE address is 32-bit), so the
    NPU IOMMU is described with the "rockchip,iommu" compatible, whose ops
    allocate the page tables with GFP_DMA32; the SoC's other IOMMUs use
    the "rockchip,rk3568-iommu" (40-bit) ops. The driver keeps a single
    global ops pointer, so two ops on one SoC trip its coexistence check;
    this series therefore sits on top of Simon's per-device-ops change,
    which Rockchip (Chaoyi Chen) confirmed is the intended way to give the
    NPU MMU its 32-bit DTE.
  - "iommu/rockchip: disable fetch dte time limit" [2] (Simon Xue / Sven
    Pueschel, in the iommu tree), which sets AUTO_GATING bit 31. v3 carried
    a local AUTO_GATING patch; that unconditional fix has since been merged,
    so this series drops its IOMMU patch. The bit is a no-op on this
    hardware in any case (the page walk completes on its reset value).
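
The coexistence problem described in the first dependency can be modelled
in a few lines. This is a simplified sketch, not the rk_iommu code: the
toy_* names are hypothetical, and the model only assumes what is stated
above, that a single global ops pointer rejects a second, different ops
table while per-device ops do not.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_iommu_ops { unsigned int dte_addr_bits; };

static const struct toy_iommu_ops toy_ops_32 = { .dte_addr_bits = 32 };
static const struct toy_iommu_ops toy_ops_40 = { .dte_addr_bits = 40 };

/* Global model: the first probed IOMMU fixes the ops for every instance. */
static const struct toy_iommu_ops *toy_global_ops;

static int toy_probe_global(const struct toy_iommu_ops *ops)
{
	if (!toy_global_ops)
		toy_global_ops = ops;
	if (toy_global_ops != ops)
		return -EINVAL;		/* the coexistence check trips */
	return 0;
}

/* Per-device model: each instance carries its own ops pointer. */
struct toy_iommu { const struct toy_iommu_ops *ops; };

static int toy_probe_per_device(struct toy_iommu *iommu,
				const struct toy_iommu_ops *ops)
{
	assert(iommu && ops);
	iommu->ops = ops;
	return 0;
}
```

In the global model a 40-bit IOMMU probing first makes the 32-bit NPU
IOMMU fail; in the per-device model both probe and keep their own DTE
width.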

Power bring-up. The NPU is brought up through the power-domain layer (no
driver hack): the NPU power-domain keeps its clocks but drops the pm_qos
phandle (qos_npu sits behind the gated NPU NoC, so genpd's power-off QoS
save faults reading it), and vdd_npu is wired as the domain's
domain-supply with the domain marked need_regulator (patch 9), so genpd
brings the rail up before it de-idles the NoC at power-on. The PMU de-idle
then ACKs without PVTPLL running; PVTPLL is only needed for compute.

Status. On v7.1-rc6 the driver probes, creates /dev/accel/accel0,
attaches an IOMMU domain, and submits jobs; the program controller
fetches and broadcasts the command list. Inference output is still
wrong. The kernel side (this series) appears complete; what remains is
mesa/Teflon userspace, which still emits RK3588-tuned config (to be
filed on mesa-dev), and the hardware: with corrected config the NPU
reads the full input and weight tensors (per its DMA counters) but the
MAC/output stage never completes and the job times out, leaving the
output at the buffer's zero-point. The cause is not the command list (a
byte-exact replay of the vendor's command list behaves the same).
Pointers from anyone with RK3568 NPU experience are welcome.

Known residual. On the first IOMMU attach the NPU MMU is idle with paging
already enabled; the rk_iommu stall/reset handshake does not complete in
that state and logs one burst of timeouts before the (kept) domain
settles. It is harmless here because the job times out regardless, but it
points at an idle-MMU reconfiguration corner the rk_iommu code does not
handle on this block.

[1] https://lore.kernel.org/linux-rockchip/20260310105303.128859-1-xxm@rock-chips.com/
[2] https://lore.kernel.org/all/20260428-spu-iommudtefix-v2-1-f592f579e508@pengutronix.de/

Changes since v3:
  - Dropped the local AUTO_GATING patch: the correct fix (set AUTO_GATING
    bit 31, "disable fetch dte time limit") has since been merged upstream
    [2], so the series no longer touches the IOMMU driver.
  - vdd_npu: new pmdomain patch (9) gives the RK3568 NPU domain a regulator
    (need_regulator) and the board wires domain-supply, dropping the
    regulator-always-on workaround (Suggested-by Chaoyi Chen). It relies on
    the in-tree pmdomain default-off-if-need_regulator handling. The
    "Failed to create device link ... <pmic>" line at pmdomain probe is a
    pre-existing fw_devlink cyclic-dependency warning (the single
    power-controller provides every domain, including the one the I2C PMIC
    needs), seen the same way on RK3588; it is harmless here beyond a few
    wasted EPROBE_DEFER retries, and a proper fix belongs in the
    power-controller driver, not this series.
  - rk356x dts: also assign the CRU CLK_NPU so the NPU AXI bus clock comes
    up at 200 MHz instead of the 12 MHz boot default; order the NPU/IOMMU
    nodes by unit address.
  - rocket RK3568: fetch the SCMI/PVTPLL clock by name (the v3 bulk index
    resolved to the wrong clock); drop the redundant driver PMU de-idle
    writes (handled by the power domain).
  - rocket: clear the attached IOMMU domain on reset; unwind through
    rocket_core_fini() on noc_init failure; stop the scheduler before the
    IOMMU teardown.
  - rocket: bounds-check the cores array against the per-SoC core count.
  - Binding: require rockchip,pmu on RK3568.
  - Dependency framing: confirmed by Rockchip as v2 + 32-bit DTE via
    Simon's per-device-ops series (was framed as v1 in v3).

Midgy BALON (9):
  accel: rocket: Introduce per-SoC rocket_soc_data
  accel: rocket: Derive DMA width and core count from match data
  accel: rocket: Add RK3568 SoC support
  accel: rocket: Reset the NPU before detaching the IOMMU on timeout
  accel: rocket: Keep the IOMMU domain attached across jobs
  dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568
  arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU
  arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU
  pmdomain: rockchip: Add a regulator to the RK3568 NPU power domain

 .../npu/rockchip,rk3588-rknn-core.yaml        | 27 +++++++++-
 .../boot/dts/rockchip/rk3568-rock-3b.dts      | 18 ++++++-
 arch/arm64/boot/dts/rockchip/rk356x-base.dtsi | 38 ++++++++++++++
 drivers/accel/rocket/rocket_core.c            | 30 ++++++++++-
 drivers/accel/rocket/rocket_core.h            | 19 +++++++
 drivers/accel/rocket/rocket_device.c          | 15 ++----
 drivers/accel/rocket/rocket_device.h          |  3 +-
 drivers/accel/rocket/rocket_drv.c             | 50 ++++++++++++++++++-
 drivers/accel/rocket/rocket_job.c             | 45 ++++++++++++++---
 drivers/pmdomain/rockchip/pm-domains.c        | 36 +++++++++----
 10 files changed, 245 insertions(+), 36 deletions(-)


base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.39.5



* Re: [PATCH] iommu/vt-d: Clear Present bit before tearing down scalable-mode context entry
From: Baolu Lu @ 2026-06-13  1:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michael Bommarito, David Woodhouse, Joerg Roedel, Will Deacon,
	Robin Murphy, iommu, linux-kernel
In-Reply-To: <20260611115257.GD1066031@ziepe.ca>

On 6/11/26 19:52, Jason Gunthorpe wrote:
> On Mon, Jun 01, 2026 at 01:35:08PM +0800, Baolu Lu wrote:
>> On 5/28/26 10:55, Michael Bommarito wrote:
>>> device_pasid_table_teardown() zeroes the 128-bit scalable-mode context
>>> entry with context_clear_entry() while the Present bit is still set. This
>>> creates a window where the hardware can fetch a torn entry, with some
>>> fields already zeroed while Present is still set, leading to unpredictable
>>> behavior or spurious faults. The context-cache invalidation is issued only
>>> after the entry has been zeroed, and intel_pasid_free_table() then frees
>>> the PASID directory pages, so the IOMMU can keep walking a stale Present=1
>>> entry that points at freed memory.
>>>
>>> While x86 provides strong write ordering, the compiler may reorder the two
>>> 64-bit writes to the entry, and the hardware fetch is not guaranteed to be
>>> atomic with respect to multiple CPU writes.
>>>
>>> Commit c1e4f1dccbe9d ("iommu/vt-d: Clear Present bit before tearing down
>>> context entry") fixed this exact pattern in domain_context_clear_one() and
>>> the copied-context path, but device_pasid_table_teardown() was not
>>> converted.
>>>
>>> Align it with the "Guidance to Software for Invalidations" in the VT-d
>>> spec, Section 6.5.3.3, using the same ownership handshake as the sibling
>>> fix: clear only the Present bit, flush it to the IOMMU, perform the
>>> context-cache invalidation, and only then zero the rest of the entry.
>>>
>>> Fixes: 81e921fd32161 ("iommu/vt-d: Fix NULL domain on device release")
>>> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
>>> Assisted-by: Claude:claude-opus-4-7
>>> ---
>>> Found by static analysis while auditing the callers of context_clear_entry()
>>> for the same teardown ordering that c1e4f1dccbe9d addressed. This site is
>>> reachable only in scalable mode, so it does not manifest on the legacy-mode
>>> hardware available to me; I could not trigger a runtime fault and the change
>>> is verified by code inspection only, on the same basis as the sibling fix.
>>> Compile-tested on x86_64 with CONFIG_INTEL_IOMMU; no new warnings.
>>>
>>>    drivers/iommu/intel/pasid.c | 4 +++-
>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>> Queued for linux-next. Thank you!
> What happened to your work to move over to the ARM updater that
> doesn't have any of these bugs? 🙂

I am working on that series, but since this is a fix that should be
backported, I queued it for this merge cycle.
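
For readers following the quoted description, the teardown order it
spells out can be sketched as a toy model. Nothing here is the intel-iommu
code: the toy_* types are hypothetical, the flush and invalidation steps
are only recorded as markers, and the Present bit is taken to be bit 0 of
the low quadword.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PRESENT 1ULL

struct toy_context_entry { uint64_t lo; uint64_t hi; };

/* Records the order of steps so the sequence can be checked. */
struct toy_trace { char steps[8]; unsigned int n; };

static void toy_note(struct toy_trace *t, char step)
{
	assert(t->n < sizeof(t->steps) - 1);
	t->steps[t->n++] = step;
}

static void toy_teardown(struct toy_context_entry *e, struct toy_trace *t)
{
	e->lo &= ~TOY_PRESENT;	/* 1. clear only the Present bit */
	toy_note(t, 'P');
	toy_note(t, 'F');	/* 2. flush the entry to the IOMMU (marker) */
	toy_note(t, 'I');	/* 3. context-cache invalidation (marker) */
	e->lo = 0;		/* 4. only now zero the rest of the entry */
	e->hi = 0;
	toy_note(t, 'Z');
}
```

The point of the ordering is that hardware never observes an entry with
Present set and other fields already zeroed.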

Thanks,
baolu


* Re: [RFC PATCH v3 0/9] accel: rocket: Add RK3568 NPU support
From: Sebastian Reichel @ 2026-06-12 21:15 UTC (permalink / raw)
  To: Diederik de Haas
  Cc: Midgy Balon, Chaoyi Chen, tomeu, ogabbay, heiko, robh, krzk+dt,
	conor+dt, joro, will, robin.murphy, dri-devel, linux-rockchip,
	devicetree, linux-arm-kernel, iommu, linux-kernel, Simon Xue,
	Finley Xiao, Jonas Karlman
In-Reply-To: <DJ5FUW50YM2N.6ZTY4WK27ZP5@cknow-tech.com>


Hi,

On Wed, Jun 10, 2026 at 04:28:17PM +0200, Diederik de Haas wrote:
> On Wed Jun 10, 2026 at 3:36 PM CEST, Midgy Balon wrote:
> [    2.110935] rockchip-pm-domain fd8d8000.power-management:power-controller: Failed to create device link (0x180) with supplier 2-0042 for /power-management@fd8d8000/power-controller/power-domain@8
> [    2.557459] sdhci-dwcmshc fe2e0000.mmc: Can't reduce the clock below 52MHz in HS200/HS400 mode
> [    2.647174] rockchip-pm-domain fd8d8000.power-management:power-controller: Failed to create device link (0x180) with supplier 2-0042 for /power-management@fd8d8000/power-controller/power-domain@8
> [    2.945089] rockchip-pm-domain fd8d8000.power-management:power-controller: Failed to create device link (0x180) with supplier spi2.0 for /power-management@fd8d8000/power-controller/power-domain@12
> 
> 8 = NPU; 12 = GPU
> 
> on both nanopc-t6-lts and nanopc-t6-plus (both RK3588).
> And on a 6.18 dmesg output I have for Rock 5B, I see the ~ same, but then
> it's 1-0042 instead of 2-0042. 
> 
> I don't know if it's bad or harmless, but it is consistent.

The fw_devlink framework tries to figure out a sensible probe order
by analyzing links between devices. The warning is because there is
a cyclic dependency. This happens because all power domains are
provided by one device (power-controller).

Now if you want to probe the I2C regulator 2-0042, you need the
I2C controller and to probe the I2C controller you need the I2C
power domain and for that you need the power-controller. But for
the power-controller you need 2-0042 (for the NPU power-domain).
At this point fw_devlink gives up and prints the warning.

Apart from the warning this results in the kernel missing dependency
information, so there might be some extra probe calls ending in
-EPROBE_DEFER (which wastes CPU power and delays the boot process).

So it's neither super bad, nor completely harmless. Fixing this
properly requires some heavy restructuring of the Rockchip
power-controller driver.
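
The dependency loop described above can be shown with a small cycle check
over a hand-built graph. This is an illustration only and does not
reflect how fw_devlink is implemented; the node names and toy_* helpers
are hypothetical, and the graph folds the I2C power domain into its
provider, the power-controller.

```c
#include <assert.h>
#include <stdbool.h>

#define NODE_COUNT 3
enum toy_node { PWR_CTRL, PMIC, I2C_CTRL };
enum toy_state { UNSEEN, IN_PROGRESS, DONE };

/* needs[a][b] is true when a cannot probe until b has probed. */
static bool toy_dfs(bool needs[NODE_COUNT][NODE_COUNT],
		    enum toy_state *state, int node)
{
	state[node] = IN_PROGRESS;
	for (int next = 0; next < NODE_COUNT; next++) {
		if (!needs[node][next])
			continue;
		if (state[next] == IN_PROGRESS)
			return true;	/* back edge: dependency cycle */
		if (state[next] == UNSEEN && toy_dfs(needs, state, next))
			return true;
	}
	state[node] = DONE;
	return false;
}

static bool toy_has_cycle(bool needs[NODE_COUNT][NODE_COUNT])
{
	enum toy_state state[NODE_COUNT] = { UNSEEN };

	for (int n = 0; n < NODE_COUNT; n++)
		if (state[n] == UNSEEN && toy_dfs(needs, state, n))
			return true;
	return false;
}
```

With only "PMIC needs I2C controller" and "I2C controller needs
power-controller" the graph is acyclic; adding "power-controller needs
PMIC" (the NPU domain supply) closes the loop.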

Greetings,

-- Sebastian



* Re: AMD iopt levels question
From: Jerry Snitselaar @ 2026-06-12 19:52 UTC (permalink / raw)
  To: Vasant Hegde; +Cc: Jörg Rödel, suravee.suthikulpanit, iommu
In-Reply-To: <95921d73-f108-4383-88b5-e706f10da047@amd.com>

On Fri, Jun 12, 2026 at 03:22:05PM +0530, Vasant Hegde wrote:
> Hi Jerry,
> 
> On 6/11/2026 11:11 PM, Jerry Snitselaar wrote:
> > I don't recall, is there a reason amd_iommu doesn't initialize the io
> > page tables to the number of levels supported on the system?
> 
> It's because the HW supports increasing the level dynamically, and for
> performance reasons we start with 3 levels.
> 
> > 
> > I was looking at an issue recently with an i40e controller where
> > someone had a set of them in a bond. When it would fail over to
> > another device, the device would make a request to map as part of
> > that, and the domain would transition into the 64-bit iova address
> > space. When that happens the i40e would start generating requests
> > causing io page faults to monotonically increasing addresses starting
> > at 0x0, repeatedly causing event log resets.
> 
> I didn't get the problem. Why would it start from address 0x0? Our default is
> 3 levels. Even when it's increasing the page table level, it shouldn't hit a
> fault for existing maps.
>

I don't think the 0x0 comes from the iova management code; it more
likely comes from the firmware on these controllers. I also don't think
it is generating faults for any valid dma addresses, except potentially
for permission errors. The controller just gets into a state where it
clears some register and walks the entire iova address space for the
domain. The vast majority of the callbacks are suppressed, and the
event log keeps needing to restart, but in one of the logs the requests
had walked up past the 32-bit boundary before anyone dealt with the
system. To be clear, I think the problem is with the controller, not
the iommu code. Using iommu.forcedac works around the issue. I tried to
make it happen with some other controller and had no luck. It was just
a thought that there is a window, while the DTE update for the page
level increase is in progress, where the IOMMU will reject requests.
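
As a rough illustration of when a level increase would be needed, the
sketch below computes how many page-table levels an iova requires. It
assumes 4 KiB pages, 9 index bits per level and at most 6 levels;
toy_levels_for_iova() is a hypothetical helper, not a function from the
amd_iommu driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: level n covers bits [0, 12 + 9 * n), capped at 6 levels. */
static unsigned int toy_levels_for_iova(uint64_t iova)
{
	unsigned int level = 1;

	/* The level < 6 test runs first, so the shift never reaches 64. */
	while (level < 6 && (iova >> (12 + 9 * level)) != 0)
		level++;
	return level;
}
```

Under these assumptions a 3-level table covers 39 bits, so an iova at or
above 1ULL << 39 is the first to need a fourth level, which is where a
DTE update for the level increase would come in.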

> 
> > 
> > I didn't have a system where I could reproduce with bonded group of
> > them, but I could induce the behavior of the i40e generating the dma
> > requests by doing horrible things like throwing a bunch of io at it
> > from a remote system, and then mess with ring buffer sizes for the
> > device. I was able to capture a vmcore, and near as I could tell it is
> > during a window where the DTE gets updated as part of increasing the
> > io page table levels, a UR would be sent back to the i40e in response
> > to a request, and the controller would start sending these dma
> > requests.
> > 
> > With the HATS support in the kernel now would it make sense to
> > initialize pgtable->mode to amd_iommu_hpt_level if it is set? 
> 
> Are you referring to upstream code with generic page table support?
> 
> -Vasant
> 

The dangers of looking at a downstream issue, and writing email. Sorry
about that. The original reported issue was a downstream release with
earlier code. I'd reproduced the behavior from the i40e with a build
of 7.0-rc1 at the time, but I'm verifying now if it still happens with
a current build. So with the generic_pt code it would be
cfg->starting_level. Really though this is something for Intel to look
into with that controller. Console log below to show what it looks
like when the i40e goes off the rails.

Regards,
Jerry

---

    [root@lenovo-sd535v3-04 ~]# uname -r
    7.1.0-0.rc7.260611g9716c086c8e8.50.eln157.x86_64

    start sending traffic to the system
    mess with the mtu and tx/rx ring sizes

    ...

    [ 3175.119566] i40e 0000:81:00.0 ens1f0np0: Changing Tx descriptor count from 512 to 8160.
    [ 3175.133364] i40e 0000:81:00.0 ens1f0np0: Changing Rx descriptor count from 512 to 8160
    [ 3175.264538] i40e 0000:81:00.0: Using 64-bit DMA addresses
    [ 3175.270271] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x14 flags=0x0000]
    [ 3175.281573] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x1014 flags=0x0000]
    [ 3175.292722] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x2014 flags=0x0050]
    [ 3175.303870] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x3014 flags=0x0050]
    [ 3175.315017] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x3214 flags=0x0050]
    [ 3175.326166] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x3414 flags=0x0050]
    [ 3175.337313] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x3614 flags=0x0050]
    [ 3175.410119] i40e 0000:81:00.0: VSI seid 390 Tx ring 0 disable timeout
    [root@lenovo-sd535v3-04 ~]# [ 3193.727720] i40e 0000:81:00.0 ens1f0np0: NETDEV WATCHDOG: CPU: 128: transmit queue 86 timed out 5504 ms
    [ 3193.738292] i40e 0000:81:00.0 ens1f0np0: tx_timeout: VSI_seid: 390, Q 86, NTC: 0x0, HWB: 0x0, NTU: 0x2, TAIL: 0x0, INT: 0x1
    [ 3193.750798] i40e 0000:81:00.0 ens1f0np0: tx_timeout recovery level 1, txqueue 86
    [ 3193.811965] i40e 0000:81:00.0: VSI seid 390 Tx ring 0 disable timeout
    [ 3194.033833] i40e 0000:81:00.0: VSI seid 392 Tx ring 767 disable timeout
    [ 3194.274965] i40e 0000:81:00.1: VSI seid 391 Tx ring 0 disable timeout
    [ 3194.341842] i40e 0000:81:00.1: VSI seid 393 Tx ring 767 disable timeout
    [ 3197.343334] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x3814 flags=0x0000]
    [ 3197.354485] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x4014 flags=0x0000]
    [ 3197.365633] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5014 flags=0x0000]
    [ 3197.376781] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x6014 flags=0x0000]
    [ 3197.387929] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x7014 flags=0x0000]
    [ 3197.399076] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x8014 flags=0x0000]
    [ 3197.410227] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x8a14 flags=0x0000]
    [ 3197.421376] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x9014 flags=0x0000]
    [ 3197.432523] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x9a14 flags=0x0000]
    [ 3197.443670] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0xa014 flags=0x0000]
    [ 3197.457770] AMD-Vi: IOMMU Event log restarting
    [ 3197.465917] AMD-Vi: IOMMU Event log restarting
    [ 3197.474064] AMD-Vi: IOMMU Event log restarting
    [ 3197.482210] AMD-Vi: IOMMU Event log restarting
    [ 3197.490358] AMD-Vi: IOMMU Event log restarting
    [ 3197.498504] AMD-Vi: IOMMU Event log restarting
    [ 3197.506644] AMD-Vi: IOMMU Event log restarting
    [ 3197.514793] AMD-Vi: IOMMU Event log restarting
    [ 3197.522941] AMD-Vi: IOMMU Event log restarting
    [ 3197.531086] AMD-Vi: IOMMU Event log restarting
    [ 3200.145031] i40e 0000:81:00.0: capability discovery failed, err -EIO aq_err LIBIE_AQ_RC_OK
    [ 3200.410262] i40e 0000:81:00.0: ignoring delete macvlan error on PF, err -EIO, aq_err LIBIE_AQ_RC_OK
    [ 3202.343533] amd_iommu_report_page_fault: 815972 callbacks suppressed
    [ 3202.343535] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b843014 flags=0x0000]
    [ 3202.363252] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b843814 flags=0x0000]
    [ 3202.391691] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b844014 flags=0x0000]
    [ 3202.420100] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b844814 flags=0x0000]
    [ 3202.443571] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b845014 flags=0x0000]
    [ 3202.455260] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b845814 flags=0x0000]
    [ 3202.466946] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b846014 flags=0x0000]
    [ 3202.478627] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b846814 flags=0x0000]
    [ 3202.490307] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b847014 flags=0x0000]
    [ 3202.501987] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x94b847c14 flags=0x0000]

    ... lots of event log restarts and io page fault messages ...

    [ 3242.351205] amd_iommu_report_page_fault: 802222 callbacks suppressed
    [ 3242.351206] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cb8814 flags=0x0000]
    [ 3242.386829] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cb9014 flags=0x0000]
    [ 3242.415291] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cb9814 flags=0x0000]
    [ 3242.431241] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cba014 flags=0x0000]
    [ 3242.443004] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cba814 flags=0x0000]
    [ 3242.454768] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cbb014 flags=0x0000]
    [ 3242.466529] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cbbe14 flags=0x0000]
    [ 3242.478292] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cbc014 flags=0x0000]
    [ 3242.490054] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cbc614 flags=0x0000]
    [ 3242.501821] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x53b3cbce14 flags=0x0000]
    [ 3242.529911] amd_iommu_restart_log: 1502 callbacks suppressed
    [ 3242.529912] AMD-Vi: IOMMU Event log restarting
    [ 3242.544484] AMD-Vi: IOMMU Event log restarting
    [ 3242.552667] AMD-Vi: IOMMU Event log restarting
    [ 3242.560844] AMD-Vi: IOMMU Event log restarting
    [ 3242.569018] AMD-Vi: IOMMU Event log restarting
    [ 3242.577193] AMD-Vi: IOMMU Event log restarting
    [ 3242.585368] AMD-Vi: IOMMU Event log restarting
    [ 3242.593543] AMD-Vi: IOMMU Event log restarting
    [ 3242.601719] AMD-Vi: IOMMU Event log restarting
    [ 3242.609892] AMD-Vi: IOMMU Event log restarting
    [ 3247.352164] amd_iommu_report_page_fault: 802273 callbacks suppressed
    [ 3247.352165] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d0163f814 flags=0x0000]
    [ 3247.387790] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01640014 flags=0x0000]
    [ 3247.416251] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01640814 flags=0x0000]
    [ 3247.431201] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01641014 flags=0x0000]
    [ 3247.442964] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01641814 flags=0x0000]
    [ 3247.454729] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01642014 flags=0x0000]
    [ 3247.466492] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01642814 flags=0x0000]
    [ 3247.478255] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01643014 flags=0x0000]
    [ 3247.490016] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01643814 flags=0x0000]
    [ 3247.501779] i40e 0000:81:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0011 address=0x5d01644014 flags=0x0000]


With forcedac set:

    [root@lenovo-sd535v3-04 ~]# uname -r
    7.1.0-0.rc7.260611g9716c086c8e8.50.eln157.x86_64
    [root@lenovo-sd535v3-04 ~]# ip link set ens1f0np0 mtu 9000

    start scp of iso on remote system to i40e system

    [root@lenovo-sd535v3-04 ~]# ethtool -G ens1f0np0 rx 8160 tx 8160
    [   80.033523] i40e 0000:81:00.0 ens1f0np0: Changing Tx descriptor count from 512 to 8160.
    [   80.047691] i40e 0000:81:00.0 ens1f0np0: Changing Rx descriptor count from 512 to 8160
    [root@lenovo-sd535v3-04 ~]#
    [root@lenovo-sd535v3-04 ~]# cat /proc/cmdline
    BOOT_IMAGE=(hd0,gpt2)/vmlinuz-7.1.0-0.rc7.260611g9716c086c8e8.50.eln157.x86_64 root=/dev/mapper/rhel_lenovo--sd535v3--04-root ro crashkernel=2G-64G:256M,64G-:512M resume=UUID=12da0907-a209-4e4a-a294-ea9a06817efe rd.lvm.lv=rhel_lenovo-sd535v3-04/root rd.lvm.lv=rhel_lenovo-sd535v3-04/swap console=tty0 console=ttyS0,115200n81 iommu.forcedac=1


^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/arm-smmu-v3: Process vIOMMU invalidations in batches
From: Nicolin Chen @ 2026-06-12 19:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Kevin Tian, Robin Murphy, Joerg Roedel, Shuah Khan,
	Pranjal Shrivastava, Kees Cook, Yi Liu, Eric Auger,
	linux-arm-kernel, iommu, linux-kernel, linux-kselftest
In-Reply-To: <20260612135409.GI1962447@nvidia.com>

On Fri, Jun 12, 2026 at 10:54:09AM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2026 at 02:26:56PM -0700, Nicolin Chen wrote:
> > +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
> > +			       struct iommu_user_data_array *array)
> > +{
> > +	struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
> > +	u32 issued = 0;
> > +	int ret = 0;
> > +
> > +	if (array->type != IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3) {
> > +		array->entry_num = 0;
> > +		return -EINVAL;
> > +	}
> > +
> > +	while (issued != array->entry_num) {
> > +		/* Process and issue the command(s) in batch */
> > +		ret = arm_vsmmu_cache_invalidate_batch(vsmmu, array, &issued);
> > +		if (ret)
> > +			break;
> > +	}
> > +
> > +	array->entry_num = issued;
> >  	return ret;
> 
> I think every driver will have this same problem, how about lifting
> this loop to the core code?

Sure. I think that makes things a bit cleaner. I'll try that. Then,
this would become another iommufd series.

> Also not sure I like the validation flow, I think it will be easier to
> understand for everything if either num is 0 and nothing was done with
> an error code
> 
> Or num is non zero and no error code.
> 
> Like it doesn't make sense to fail immediately if zero pad is nonzero
> in iommu_copy_struct_from_full_user_array() but then to try to
> partially continue if arm_vsmmu_convert_user_cmd() finds illegal data
> in the very same buffer. Be consistent, validate the user buffer, if
> it is not valid, fail immediately. Then execute a fully valid user buffer.

I don't think fully validating the user buffer is correct.

The VMM has to know which command failed so that it can flag it in the
CONS register, indicating that: a) the commands prior to CONS have been
issued, and b) the command pointed to by CONS is illegal.

Then, the guest kernel reads the CONS register to pinpoint this illegal
command and swaps it with a CMD_SYNC (__arm_smmu_cmdq_skip_err).

E.g., if the 16th command in a 64-command array is illegal, the kernel
should issue the first 15 commands and return -EIO; then, the VMM should
flag the error in CONS, pointing to the 16th command.

The design in this patch is implemented in this way. And arguably,
I think the nonzero-padding case is the VMM violating the ABI, in which
case the return code would be different from -EIO, and the VMM should
fix itself instead of flagging an illegal command in the CONS register.
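
To make the 16th-of-64 example concrete, here is a minimal userspace
sketch of the "issue until the first illegal command" flow. It is only a
model: struct toy_cmd, toy_cmd_is_legal(), toy_invalidate() and
toy_vmm_cons_after() are hypothetical names, the 0xff "illegal opcode" is
made up, and nothing here is the actual driver or VMM code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest command: opcode only, not the real SMMUv3 layout. */
struct toy_cmd {
	uint8_t opcode;
};

/* Hypothetical legality check; 0xff is a made-up illegal opcode. */
static int toy_cmd_is_legal(const struct toy_cmd *cmd)
{
	return cmd->opcode != 0xff;
}

/*
 * Issue commands in order and stop at the first illegal one.
 * On return, *issued holds the number of commands issued, which is also
 * the array index of the offending command when -EIO is returned.
 */
static int toy_invalidate(const struct toy_cmd *cmds, uint32_t num,
			  uint32_t *issued)
{
	uint32_t i;

	assert(cmds && issued);
	for (i = 0; i < num; i++) {
		if (!toy_cmd_is_legal(&cmds[i])) {
			*issued = i;
			return -EIO;
		}
		/* A real implementation would forward cmds[i] here. */
	}
	*issued = num;
	return 0;
}

/*
 * VMM side (assumed behaviour): advance the guest-visible CONS index by
 * the number of issued commands, wrapping at the queue size, so that it
 * points at the command that was rejected.
 */
static uint32_t toy_vmm_cons_after(uint32_t cons_base, uint32_t issued,
				   uint32_t queue_entries)
{
	return (cons_base + issued) % queue_entries;
}
```

With a 64-entry array whose entry at index 15 is illegal, toy_invalidate()
reports 15 issued commands together with -EIO, and the modelled CONS index
lands on the 16th command.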

Do you agree?

Thanks
Nicolin

^ permalink raw reply

* Re: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
From: Liam R. Howlett @ 2026-06-12 18:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Rik van Riel, linux-kernel, kernel-team, robin.murphy, joro, will,
	iommu, kyle, Rik van Riel, maple-tree
In-Reply-To: <20260612180303.GO1066031@ziepe.ca>

+Cc maple-tree list.

On 26/06/12 03:03PM, Jason Gunthorpe wrote:
> On Fri, Jun 12, 2026 at 01:23:58PM -0400, Rik van Riel wrote:
> > On Fri, 2026-06-12 at 13:48 -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 12, 2026 at 12:02:55PM -0400, Rik van Riel wrote:
> > > > 
> > > > The mas_erase() function calls mas_nomem(mas, GFP_KERNEL),
> > > > which is not safe to call while holding a spinlock.
> > > 
> > > Oh, the kdoc doesn't say that, it doesn't return any error code if it
> > > can't allocate memory, and not a single caller checks for erase
> > > failures.

I should fix that.

> > > 
> > > I assumed internally it "somehow worked out" even though there are
> > > allocations in the callchains..
> > > 
> > > This is probably a better question for Liam? Can mtree_erase actually
> > > fail ENOMEM? Is it safe to call it in an atomic context?
> > 
> > Yes, it can fail.
> 
> Currently it never returns a failure to the caller. Look at mas_erase():
> 
> 	entry = mas_state_walk(mas);
> 	if (!entry)
> 		return NULL;
> [..]
> 	if (mas_is_err(mas))
> 		goto out;
> [..]
> out:
> 	mas_destroy(mas);
> 	return entry;
> 
> There is no propagation of ENOMEM, it returns success. No caller
> checks for any error here either.

At one point this was considered to be impossible to fail, and it is
documented to return the entry or null.

> 
> So I think the intention is that it cannot fail, yet it does have the
> memory allocations and busted failure path. Hence asking Liam what it
> should be, and what about an atomic context.
> 
> Perhaps this might be relying on the modern kernels "small allocations
> never fail", meaning mas_erase never fails, but then you can't
> call it from an atomic context..

It _can_ fail, but right now that error will not be propagated to the
caller.  The caller could infer the failure due to the return of NULL...
if that's what would happen, so there's very much an issue on failure
that I need to investigate and fix.

It is possible that it fails with -ENOMEM.  The spinlock can be
dropped if you are not using an external lock and the gfp flags allow
blocking.  If the allocation fails and the lock was dropped, we retry the
operation from the start in case there was a race with another writer.

I think I should probably change this to return -ENOMEM, fix the docs on
it, and probably audit the callers (most use external locks or are
early-boot-so-don't-worry-about-it).  Any issue here should also be
caught by lockdep pretty quickly.

> 
> In any case, it does look like you can't use mas_erase from an atomic
> context anyhow so your prior option with the mas_store_gfp() and
> failure handling seems reasonable.

mas_store_gfp() works if you know the range of your entry.  You could
also write the XA_ZERO_ENTRY over the entry so that there are no internal
node changes - just a value swap.  If you do this, you have to be
careful when reading things back when using the mas_ interface.

I think, in your case, hitting an XA_ZERO_ENTRY would be necessary to
indicate that we cannot reuse this particular location until it is
correctly dealt with?  Or is the maple tree the only reason it is
considered unusable?

One thing to remember is that each write can cause allocations to occur,
so if you have a list of items being overwritten, then you are causing
the tree to do each write and potentially to rebalance (as you shrink the
data beyond the lower limit of the node).

One way around that is to write XA_ZERO_ENTRY over each one as you
deal with your entry.  Then, when you are done you do a single
mas_store_gfp() of NULL over the whole range.  It will be a larger tree
operation, but smaller than the incremental steps.
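
As a rough illustration of that two-phase idea, here is a small userspace
model. It is not maple tree code: struct toy_tree, TOY_ZERO_ENTRY and the
toy_* helpers are hypothetical, and a flat array stands in for the tree
so that the "value swap now, one range store later" pattern is visible.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* Made-up sentinel playing the role XA_ZERO_ENTRY plays in the text. */
static int toy_zero_marker;
#define TOY_ZERO_ENTRY ((void *)&toy_zero_marker)

struct toy_tree {
	void *slot[TOY_SLOTS];
	unsigned int range_stores;	/* counts modelled range writes */
};

/* Phase 1: swap the value in place; no structural change is modelled. */
static void toy_mark_dead(struct toy_tree *t, size_t idx)
{
	assert(idx < TOY_SLOTS);
	t->slot[idx] = TOY_ZERO_ENTRY;
}

/* Readers must treat the sentinel as "not a usable entry". */
static void *toy_load(const struct toy_tree *t, size_t idx)
{
	void *entry;

	assert(idx < TOY_SLOTS);
	entry = t->slot[idx];
	return entry == TOY_ZERO_ENTRY ? NULL : entry;
}

/* Phase 2: one store of NULL over [first, last], counted once. */
static void toy_clear_range(struct toy_tree *t, size_t first, size_t last)
{
	size_t i;

	assert(first <= last && last < TOY_SLOTS);
	for (i = first; i <= last; i++)
		t->slot[i] = NULL;
	t->range_stores++;
}
```

In this model, marking three neighbouring entries dead costs three value
swaps and a single counted range store, and toy_load() hides the sentinel
from readers in the meantime.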

Thanks,
Liam

^ permalink raw reply

* Re: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
From: Jason Gunthorpe @ 2026-06-12 18:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Liam R. Howlett, linux-kernel, kernel-team, robin.murphy, joro,
	will, iommu, kyle, Rik van Riel
In-Reply-To: <ff5f9e09edbcb9c87feac6af00a1d835c783be4f.camel@surriel.com>

On Fri, Jun 12, 2026 at 01:23:58PM -0400, Rik van Riel wrote:
> On Fri, 2026-06-12 at 13:48 -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 12, 2026 at 12:02:55PM -0400, Rik van Riel wrote:
> > > 
> > > The mas_erase() function calls mas_nomem(mas, GFP_KERNEL),
> > > which is not safe to call while holding a spinlock.
> > 
> > Oh, the kdoc doesn't say that, it doesn't return any error code if it
> > can't allocate memory, and not a single caller checks for erase
> > failures.
> > 
> > I assumed internally it "somehow worked out" even though there are
> > allocations in the callchains..
> > 
> > This is probably a better question for Liam? Can mtree_erase actually
> > fail ENOMEM? Is it safe to call it in an atomic context?
> 
> Yes, it can fail.

Currently it never returns a failure to the caller. Look at mas_erase():

	entry = mas_state_walk(mas);
	if (!entry)
		return NULL;
[..]
	if (mas_is_err(mas))
		goto out;
[..]
out:
	mas_destroy(mas);
	return entry;

There is no propagation of ENOMEM, it returns success. No caller
checks for any error here either.

So I think the intention is that it cannot fail, yet it does have the
memory allocations and busted failure path. Hence asking Liam what it
should be, and what about an atomic context.

Perhaps this might be relying on the modern kernels "small allocations
never fail", meaning mas_erase never fails, but then you can't
call it from an atomic context..

In any case, it does look like you can't use mas_erase from an atomic
context anyhow so your prior option with the mas_store_gfp() and
failure handling seems reasonable.

Jason

^ permalink raw reply

* Re: AMD iopt levels question
From: Jason Gunthorpe @ 2026-06-12 17:31 UTC (permalink / raw)
  To: Jerry Snitselaar
  Cc: Jörg Rödel, suravee.suthikulpanit, Vasant Hegde, iommu
In-Reply-To: <airgbnJLPm3tnNig@jsnitsel-thinkpadt14sgen2i.remote.csb>

On Thu, Jun 11, 2026 at 10:41:31AM -0700, Jerry Snitselaar wrote:

> I didn't have a system where I could reproduce with bonded group of
> them, but I could induce the behavior of the i40e generating the dma
> requests by doing horrible things like throwing a bunch of io at it
> from a remote system, and then mess with ring buffer sizes for the
> device. I was able to capture a vmcore, and near as I could tell it is
> during a window where the DTE gets updated as part of increasing the
> io page table levels, a UR would be sent back to the i40e in response
> to a request, and the controller would start sending these dma
> requests.

IIRC there were some bugs around this flow that were races between the
DTE/IOPTE changes and concurrent DMA. I believe the current upstream
kernel has fixed them all.

You shouldn't get any kind of UR during the level-increase process for
any in-use valid dma_addr_t.

Jason

^ permalink raw reply

* Re: [PATCH] dma-iommu: Introduce API to reserve IOVA regions for dynamically created devices
From: Jason Gunthorpe @ 2026-06-12 17:26 UTC (permalink / raw)
  To: Vishnu Reddy
  Cc: Robin Murphy, joro, will, m.szyprowski, iommu, linux-kernel,
	vikash.garodia, dikshita.agarwal
In-Reply-To: <bb59f07e-ca7e-f012-6a4b-0a148350b69c@oss.qualcomm.com>

On Wed, Jun 10, 2026 at 07:57:50PM +0530, Vishnu Reddy wrote:

>   +--------------------------------------------------+
>   |                  VPU Hardware                    |
>   |                                                  |
>   |  +------------+   SID-0   IOVA: 600MB - 3500MB   |
>   |  |  Block 0   |                                  |
>   |  +------------+                                  |
>   |                                                  |
>   |  +------------+   SID-1   IOVA: 0MB  - 3500MB    |
>   |  |  Block 1   |                                  |
>   |  +------------+                                  |
>   |                                                  |
>   |  +------------+   SID-2   IOVA: 16MB - 600MB     |
>   |  |  Block 2   |                                  |
>   |  +------------+                                  |
>   +--------------------------------------------------+
> 
> Each Stream ID maps to a distinct IOMMU context bank, and each context
> bank enforces a different IOVA range.

I think Robin is saying you have to describe your HW properly in
device tree. In Linux a single struct device should not own multiple
*different* IOMMU contexts.

So your DT should describe all those blocks as unique DT nodes with
the proper dma ranges and related data so they can do DMA
correctly. Then the parent device has to assemble itself from that
collection of struct devices.

> These are synthetic child devices created at runtime and do not have their
> own of_node. Inheriting the parent of_node might not be the correct way.

Why would you create child devices at runtime? Linux doesn't really
have a good way to create a fully DMA capable struct device at runtime
without a DT backing description. The fact you immediately hit API
problems like this is a big clue :)

Jason

^ permalink raw reply

* Re: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
From: Rik van Riel @ 2026-06-12 17:23 UTC (permalink / raw)
  To: Jason Gunthorpe, Liam R. Howlett
  Cc: linux-kernel, kernel-team, robin.murphy, joro, will, iommu, kyle,
	Rik van Riel
In-Reply-To: <20260612164852.GL1066031@ziepe.ca>

On Fri, 2026-06-12 at 13:48 -0300, Jason Gunthorpe wrote:
> On Fri, Jun 12, 2026 at 12:02:55PM -0400, Rik van Riel wrote:
> > 
> > The mas_erase() function calls mas_nomem(mas, GFP_KERNEL),
> > which is not safe to call while holding a spinlock.
> 
> Oh, the kdoc doesn't say that, it doesn't return any error code if it
> can't allocate memory, and not a single caller checks for erase
> failures.
> 
> I assumed internally it "somehow worked out" even though there are
> allocations in the callchains..
> 
> This is probably a better question for Liam? Can mtree_erase actually
> fail ENOMEM? Is it safe to call it in an atomic context?

Yes, it can fail.

When it does, __free_iova and friends fall back to
asynchronously freeing the iova from a worker.
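
A minimal userspace sketch of that fallback, assuming a simple deferred
list drained by a worker. All names here (struct toy_iova, struct
toy_domain, toy_remove(), toy_free_iova(), toy_deferred_worker()) are
hypothetical, and the fail_next_erase knob only simulates a failed
GFP_ATOMIC allocation; this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_iova {
	unsigned long pfn_lo, pfn_hi;
	bool in_tree;
	struct toy_iova *next_deferred;
};

struct toy_domain {
	struct toy_iova *deferred;	/* list drained by the worker */
	bool fail_next_erase;		/* simulate one allocation failure */
	unsigned int freed;
};

/* Returns false if the erase failed; the entry then stays in the tree. */
static bool toy_remove(struct toy_domain *d, struct toy_iova *iova)
{
	if (d->fail_next_erase) {
		d->fail_next_erase = false;
		return false;
	}
	iova->in_tree = false;
	return true;
}

/* Fast path frees immediately; on failure the iova is queued instead. */
static void toy_free_iova(struct toy_domain *d, struct toy_iova *iova)
{
	if (toy_remove(d, iova)) {
		d->freed++;
		return;
	}
	iova->next_deferred = d->deferred;
	d->deferred = iova;
}

/* Worker: detach the deferred list and retry each entry. */
static void toy_deferred_worker(struct toy_domain *d)
{
	struct toy_iova *iova = d->deferred;

	d->deferred = NULL;
	while (iova) {
		struct toy_iova *next = iova->next_deferred;

		iova->next_deferred = NULL;
		toy_free_iova(d, iova);
		iova = next;
	}
}
```

The point of the model is the invariant from the patch comment: a struct
iova is only counted as freed after its entry has actually left the tree.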

If we are ok with always asynchronously freeing
iovas, we might be able to simplify the code by
always going through that helper.

If there are cases where asynchronously freeing
the iova breaks the system, we cannot use the
maple tree, but need the augmented rbtree, instead.

I do still have a cleaned up version of the augmented
rbtree, if asynchronous freeing is a real concern.

-- 
All Rights Reversed.

^ permalink raw reply

* Re: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
From: Jason Gunthorpe @ 2026-06-12 16:48 UTC (permalink / raw)
  To: Rik van Riel, Liam R. Howlett
  Cc: linux-kernel, kernel-team, robin.murphy, joro, will, iommu, kyle,
	Rik van Riel
In-Reply-To: <61d51d4b5779d80145ceb38e9632a7cc8a79dbec.camel@surriel.com>

On Fri, Jun 12, 2026 at 12:02:55PM -0400, Rik van Riel wrote:
> On Tue, 2026-06-09 at 10:04 -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 02, 2026 at 11:35:48PM -0400, Rik van Riel wrote:
> > > +/*
> > > + * Remove an IOVA entry from the maple tree. Returns true on
> > > success.
> > > + * On failure (maple tree node allocation under GFP_ATOMIC
> > > failed),
> > > + * returns false — the entry remains in the tree and the caller
> > > must
> > > + * not free the struct iova.
> > > + */
> > > +static bool remove_iova(struct iova_domain *iovad, struct iova
> > > *iova)
> > >  {
> > >  	MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi);
> > >  
> > > @@ -165,7 +175,36 @@ static void remove_iova(struct iova_domain
> > > *iovad, struct iova *iova)
> > >  	if (iova->pfn_lo < iovad->dma_32bit_pfn)
> > >  		iovad->max32_alloc_size = iovad->dma_32bit_pfn;
> > >  
> > > -	mas_store_gfp(&mas, NULL, GFP_ATOMIC);
> > > +	if (mas_store_gfp(&mas, NULL, GFP_ATOMIC))
> > > +		return false;
> > 
> > But why does it use mas_store(NULL) instead of mas_erase()? I thought
> > the iova alloc/free has to be pair wise, we don't split allocations?
> > 
> I just looked into this some more, and I was
> confused earlier this week.
> 
> The mas_erase() function calls mas_nomem(mas, GFP_KERNEL),
> which is not safe to call while holding a spinlock.

Oh, the kdoc doesn't say that, it doesn't return any error code if it
can't allocate memory, and not a single caller checks for erase
failures.

I assumed internally it "somehow worked out" even though there are
allocations in the callchains..

This is probably a better question for Liam? Can mtree_erase actually
fail ENOMEM? Is it safe to call it in an atomic context?

Jason

^ permalink raw reply

* Re: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
From: Rik van Riel @ 2026-06-12 16:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, kernel-team, robin.murphy, joro, will, iommu, kyle,
	Rik van Riel
In-Reply-To: <20260609130418.GI2764304@ziepe.ca>

On Tue, 2026-06-09 at 10:04 -0300, Jason Gunthorpe wrote:
> On Tue, Jun 02, 2026 at 11:35:48PM -0400, Rik van Riel wrote:
> > +/*
> > + * Remove an IOVA entry from the maple tree. Returns true on
> > success.
> > + * On failure (maple tree node allocation under GFP_ATOMIC
> > failed),
> > + * returns false — the entry remains in the tree and the caller
> > must
> > + * not free the struct iova.
> > + */
> > +static bool remove_iova(struct iova_domain *iovad, struct iova
> > *iova)
> >  {
> >  	MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi);
> >  
> > @@ -165,7 +175,36 @@ static void remove_iova(struct iova_domain
> > *iovad, struct iova *iova)
> >  	if (iova->pfn_lo < iovad->dma_32bit_pfn)
> >  		iovad->max32_alloc_size = iovad->dma_32bit_pfn;
> >  
> > -	mas_store_gfp(&mas, NULL, GFP_ATOMIC);
> > +	if (mas_store_gfp(&mas, NULL, GFP_ATOMIC))
> > +		return false;
> 
> But why does it use mas_store(NULL) instead of mas_erase()? I thought
> the iova alloc/free has to be pair wise, we don't split allocations?
> 
I just looked into this some more, and I was
confused earlier this week.

The mas_erase() function calls mas_nomem(mas, GFP_KERNEL),
which is not safe to call while holding a spinlock.

The remove_iova() function holds a spinlock, with
interrupts blocked, and needs to run like that because
it could be called from places like IO completion
handlers.

That leaves the option of either having slightly
uglier maple tree code, or going back to the
augmented rbtree (but cleaning that up a little).

Just let me know what you prefer, I'm happy to do
either.

-- 
All Rights Reversed.

^ permalink raw reply

* Re: [GIT PULL] dma-mapping fixes for Linux 7.1
From: pr-tracker-bot @ 2026-06-12 15:55 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Linus Torvalds, linux-kernel, iommu, Marek Szyprowski,
	Robin Murphy, Jason Gunthorpe, Li RongQing
In-Reply-To: <20260611205751.1255411-1-m.szyprowski@samsung.com>

The pull request you sent on Thu, 11 Jun 2026 22:57:51 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux.git tags/dma-mapping-7.1-2026-06-11

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f51cae6603c05b4b1fac65c773592e5bc8037251

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH V2] iommu/hyperv: Create hyperv subdirectory under drivers/iommu
From: Easwar Hariharan @ 2026-06-12 15:49 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, iommu, easwar.hariharan, mhklinux, wei.liu,
	zhangyu1, jacob.pan, schakrabarti
In-Reply-To: <20260603225010.1347623-1-mrathor@linux.microsoft.com>

On 6/3/2026 15:50, Mukesh R wrote:
> Create hyperv subdirectory under drivers/iommu in anticipation of more
> hyperv related files from upcoming PCI passthru and pv-IOMMU patches.
> Also, the current file hyperv-iommu.c actually implements irq remapping on
> x86, so rename to more appropriate hv-irq-remap-x86.c and move it under
> the new hyperv subdirectory. Since this file implements irq_remap_ops
> exposed by drivers/iommu/irq_remapping.h, it cannot be relocated to the
> irq directory. This is in sync with other backend directories like amd
> and intel there.
> 
> Lastly, this file should not be tied to CONFIG_HYPERV_IOMMU, but to
> CONFIG_HYPERV and CONFIG_IRQ_REMAP.
> 
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
> V2: rename hv-irq-remap.c to hv-irq-remap-x86.c
> ---
>  MAINTAINERS                                              | 2 +-
>  drivers/iommu/Kconfig                                    | 9 ---------
>  drivers/iommu/Makefile                                   | 2 +-
>  drivers/iommu/hyperv/Makefile                            | 2 ++
>  .../iommu/{hyperv-iommu.c => hyperv/hv-irq-remap-x86.c}  | 8 +-------
>  drivers/iommu/irq_remapping.c                            | 2 +-
>  6 files changed, 6 insertions(+), 19 deletions(-)
>  create mode 100644 drivers/iommu/hyperv/Makefile
>  rename drivers/iommu/{hyperv-iommu.c => hyperv/hv-irq-remap-x86.c} (99%)

Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>

^ permalink raw reply
