Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH net,v3 1/1] net: stmmac: Update default_an_inband before passing value to phylink_config
From: patchwork-bot+netdevbpf @ 2026-04-18 19:20 UTC (permalink / raw)
  To: KhaiWenTan
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, mcoquelin.stm32,
	alexandre.torgue, rmk+kernel, maxime.chevallier, netdev,
	linux-stm32, linux-arm-kernel, linux-kernel, yoong.siang.song,
	hong.aun.looi, khai.wen.tan
In-Reply-To: <20260416102609.7953-1-khai.wen.tan@intel.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 16 Apr 2026 18:26:09 +0800 you wrote:
> From: KhaiWenTan <khai.wen.tan@linux.intel.com>
> 
> get_interfaces() will update both the plat->phy_interfaces and
> mdio_bus_data->default_an_inband based on reading a SERDES register. As
> get_interfaces() will be called after default_an_inband had already been
> read, dwmac-intel regressed as a result with incorrect default_an_inband
> value in phylink_config.
> 
> [...]

Here is the summary with links:
  - [net,v3,1/1] net: stmmac: Update default_an_inband before passing value to phylink_config
    https://git.kernel.org/netdev/net/c/8cff9dbe89d8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




^ permalink raw reply

* [PATCH wireless] wifi: mt76: mt7996: Fix NULL pointer dereference in mt7996_init_tx_queues()
From: Lorenzo Bianconi @ 2026-04-18 18:04 UTC (permalink / raw)
  To: Felix Fietkau, Ryder Lee, Shayne Chen, Sean Wang, Johannes Berg,
	Matthias Brugger, AngeloGioacchino Del Regno, Lorenzo Bianconi
  Cc: linux-wireless, linux-arm-kernel, linux-mediatek

When MT76_NPU and CONFIG_NET_MEDIATEK_SOC_WED are enabled and
mt76 detects properly the Airoha NPU SoC, mt7996_init_tx_queues() will
dereference a NULL WED pointer.
Fix the issue by always passing the WED pointer from mt7996_dma_init().

Fixes: cd7951f242a7 ("wifi: mt76: mt7996: Integrate MT7990 dma configuration for NPU")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/wireless/mediatek/mt76/mt7996/dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7996/dma.c b/drivers/net/wireless/mediatek/mt76/mt7996/dma.c
index 8f5d297dafce..3d9353811a02 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7996/dma.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7996/dma.c
@@ -683,7 +683,7 @@ int mt7996_dma_init(struct mt7996_dev *dev)
 		ret = mt7996_init_tx_queues(&dev->phy, MT_TXQ_ID(0),
 					    MT7996_NPU_TX_RING_SIZE,
 					    MT_TXQ_RING_BASE(0) + hif1_ofs,
-					    NULL);
+					    wed);
 	else
 		ret = mt7996_init_tx_queues(&dev->phy,
 					    MT_TXQ_ID(dev->mphy.band_idx),

---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260418-mt7996-dma-init-npu-fix-7f2840c49208

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>



^ permalink raw reply related

* Re: [PATCH net] net: dsa: mt7530: fix .get_stats64 sleeping in atomic context
From: Jakub Kicinski @ 2026-04-18 18:01 UTC (permalink / raw)
  To: daniel
  Cc: Jakub Kicinski, chester.a.unal, andrew, olteanv, davem, edumazet,
	pabeni, matthias.bgg, angelogioacchino.delregno, linux,
	ansuelsmth, netdev, linux-kernel, linux-arm-kernel,
	linux-mediatek, frank-w, john
In-Reply-To: <79dc0ec5b6be698b14cb66339d6f63033ca2934a.1776397542.git.daniel@makrotopia.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: dsa: mt7530: fix .get_stats64 sleeping in atomic context

This commit addresses a sleep-in-atomic bug when reading MDIO-connected
switch port statistics by delegating the register reads to a delayed
workqueue. The get_stats64 callback now returns cached values protected
by a spinlock instead of synchronously reading from the MDIO bus.
MMIO-connected switches are unaffected and continue to read counters
directly.

> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index b9423389c2ef0..786d3a8492bcb 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c

[ ... ]

> @@ -951,6 +952,43 @@ static void mt7530_get_stats64(struct dsa_switch *ds, int port,
>  			       &storage->rx_crc_errors);
>  }
>  
> +static void mt7530_stats_poll(struct work_struct *work)
> +{
> +	struct mt7530_priv *priv = container_of(work, struct mt7530_priv,
> +						stats_work.work);
> +	struct rtnl_link_stats64 stats = {};
> +	struct dsa_port *dp;
> +	int port;
> +
> +	dsa_switch_for_each_user_port(dp, priv->ds) {
> +		port = dp->index;
> +
> +		mt7530_read_port_stats64(priv, port, &stats);
> +
> +		spin_lock(&priv->stats_lock);

Are we at risk of a deadlock here?

The get_stats64 callback can be called by dev_get_stats(), which runs in
softirq context.

If this delayed work runs in process context and is preempted by a softirq
that attempts to acquire the lock via dev_get_stats(), a deadlock could
occur. Both paths should likely use spin_lock_bh() and spin_unlock_bh().

> +		priv->ports[port].stats = stats;
> +		spin_unlock(&priv->stats_lock);
> +	}
> +
> +	schedule_delayed_work(&priv->stats_work,
> +			      MT7530_STATS_POLL_INTERVAL);
> +}
> +
> +static void mt7530_get_stats64(struct dsa_switch *ds, int port,
> +			       struct rtnl_link_stats64 *storage)
> +{
> +	struct mt7530_priv *priv = ds->priv;
> +
> +	if (priv->bus) {
> +		spin_lock(&priv->stats_lock);
> +		*storage = priv->ports[port].stats;
> +		spin_unlock(&priv->stats_lock);
> +		mod_delayed_work(system_wq, &priv->stats_work, 0);

Could this unconditional background refresh cause a denial of service on
the MDIO bus?

If an unprivileged user continually reads network statistics in a tight
loop, mt7530_stats_poll() will execute continuously, potentially starving
other MDIO tasks like PHY link-state monitoring.

Additionally, is it possible for this call to re-arm the workqueue during
teardown?

In mt7530_remove_common(), cancel_delayed_work_sync() is called before
dsa_unregister_switch(). If get_stats64() is called during or after
dsa_unregister_switch(), it would re-arm the workqueue after it was
explicitly cancelled, potentially causing a use-after-free when priv is
freed.

> +	} else {
> +		mt7530_read_port_stats64(priv, port, storage);
> +	}
> +}

[ ... ]

> @@ -3137,6 +3175,13 @@ mt753x_setup(struct dsa_switch *ds)
>  	if (ret && priv->irq_domain)
>  		mt7530_free_mdio_irq(priv);
>  
> +	if (!ret && priv->bus) {
> +		spin_lock_init(&priv->stats_lock);
> +		INIT_DELAYED_WORK(&priv->stats_work, mt7530_stats_poll);
> +		schedule_delayed_work(&priv->stats_work,
> +				      MT7530_STATS_POLL_INTERVAL);

If dsa_register_switch() encounters an error after this setup callback
completes successfully, the DSA core will attempt to clean up.

Since mt7530_switch_ops does not implement a teardown callback to cancel
this delayed work, will the work remain scheduled and eventually
dereference freed priv memory once the driver core frees it via devres?

> +	}
> +
>  	return ret;
>  }
-- 
pw-bot: cr


^ permalink raw reply

* Re: [PATCH] cpu/hotplug: Fix NULL kobject warning in cpuhp_smt_enable()
From: Catalin Marinas @ 2026-04-18 15:05 UTC (permalink / raw)
  To: Jinjie Ruan
  Cc: tglx, peterz, sudeep.holla, yangyicong, dietmar.eggemann,
	Jonathan.Cameron, linux-kernel, James Morse, linux-arm-kernel
In-Reply-To: <aeNxKpHzTQX4_kId@arm.com>

On Sat, Apr 18, 2026 at 12:55:22PM +0100, Catalin Marinas wrote:
> On Fri, Apr 17, 2026 at 03:55:34PM +0800, Jinjie Ruan wrote:
> > When booting with `maxcpus` greater than the number of present CPUs (e.g.,
> > QEMU -smp cpus=4,maxcpus=8), some CPUs are marked as 'present' but have not
> > yet been registered via register_cpu(). Consequently, the per-cpu device
> > objects for these CPUs are not yet initialized.
[...]
> Another option would have been to avoid marking such CPUs present but I
> think this will break other things. Yet another option is to register
> all CPU devices even if they never come up (like maxcpus greater than
> actual CPUs).

Something like below, untested (and I don't claim I properly understand
this code; just lots of tokens used trying to make sense of it ;))

------------------------8<-------------------------
diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c
index a9d884fd1d00..4c0a5ed906ea 100644
--- a/arch/arm64/kernel/acpi.c
+++ b/arch/arm64/kernel/acpi.c
@@ -448,12 +448,14 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 apci_id,
 		return *pcpu;
 	}
 
+	set_cpu_present(*pcpu, true);
 	return 0;
 }
 EXPORT_SYMBOL(acpi_map_cpu);
 
 int acpi_unmap_cpu(int cpu)
 {
+	set_cpu_present(cpu, false);
 	return 0;
 }
 EXPORT_SYMBOL(acpi_unmap_cpu);
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 1aa324104afb..751a74d997e1 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -510,8 +510,10 @@ int arch_register_cpu(int cpu)
 	struct cpu *c = &per_cpu(cpu_devices, cpu);
 
 	if (!acpi_disabled && !acpi_handle &&
-	    IS_ENABLED(CONFIG_ACPI_HOTPLUG_CPU))
+	    IS_ENABLED(CONFIG_ACPI_HOTPLUG_CPU)) {
+		set_cpu_present(cpu, false);
 		return -EPROBE_DEFER;
+	}
 
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 	/* For now block anything that looks like physical CPU Hotplug */


^ permalink raw reply related

* Re: [PATCH] arm: dts: allwinner: t113s mangopi: enable watchdog for reboot
From: Andre Przywara @ 2026-04-18 11:55 UTC (permalink / raw)
  To: Jernej Škrabec
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Chen-Yu Tsai,
	Samuel Holland, Michal Piekos, devicetree, linux-arm-kernel,
	linux-sunxi, linux-kernel
In-Reply-To: <2825865.mvXUDI8C0e@jernej-laptop>

On Fri, 17 Apr 2026 20:19:20 +0200
Jernej Škrabec <jernej.skrabec@gmail.com> wrote:

> Hi,
> 
> Dne nedelja, 12. april 2026 ob 19:42:10 Srednjeevropski poletni čas je Michal Piekos napisal(a):
> > Reboot hangs on MangoPi MQ-R T113s because no restart handler is
> > available.
> > 
> > Enable the SoC watchdog whose driver registers a restart handler.
> > 
> > Tested on MangoPi MQ-R T113s.
> > 
> > Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
> > ---
> >  arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts b/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts
> > index 8b3a75383816..f0232a5e903b 100644
> > --- a/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts
> > +++ b/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts
> > @@ -33,3 +33,7 @@ rtl8189ftv: wifi@1 {
> >  		interrupt-names = "host-wake";
> >  	};
> >  };
> > +
> > +&wdt {
> > +	status = "okay";
> > +};  
> 
> Move this to sun8i-t113s.dtsi. All t113 boards have the same issue.
> Watchdog should be always enabled on ARM.

We actually have that line in U-Boot:
https://github.com/u-boot/u-boot/blob/master/arch/arm/dts/sunxi-u-boot.dtsi#L22-L27

IIRC, the idea was that it is *firmware* that chooses the watchdog, so
the generic DT should not be the place to set this.

If people use $fdtcontroladdr as the DT source, everything falls in
place neatly, no need for changes or runtime patching.

Cheers,
Andre


^ permalink raw reply

* Re: [PATCH] cpu/hotplug: Fix NULL kobject warning in cpuhp_smt_enable()
From: Catalin Marinas @ 2026-04-18 11:55 UTC (permalink / raw)
  To: Jinjie Ruan
  Cc: tglx, peterz, sudeep.holla, yangyicong, dietmar.eggemann,
	Jonathan.Cameron, linux-kernel, James Morse, linux-arm-kernel
In-Reply-To: <20260417075534.3745793-1-ruanjinjie@huawei.com>

+ James Morse, linux-arm-kernel

On Fri, Apr 17, 2026 at 03:55:34PM +0800, Jinjie Ruan wrote:
> When booting with `maxcpus` greater than the number of present CPUs (e.g.,
> QEMU -smp cpus=4,maxcpus=8), some CPUs are marked as 'present' but have not
> yet been registered via register_cpu(). Consequently, the per-cpu device
> objects for these CPUs are not yet initialized.
> 
> In cpuhp_smt_enable(), the code iterates over all present CPUs. Calling
> _cpu_up() for these unregistered CPUs eventually leads to
> sysfs_create_group() being called with a NULL kobject (or a kobject
> without a directory), triggering the following warning in
> fs/sysfs/group.c:
> 
> 	if (WARN_ON(!kobj || (!update && !kobj->sd)))
> 		return -EINVAL;
> 
> Fix this by adding a check for get_cpu_device(cpu). This ensures that
> SMT is only enabled for CPUs that have successfully completed their
> base device registration in sysfs.
> 
> How to reproduce:
> 
> 	1. echo off > /sys/devices/system/cpu/smt/control
> 		psci: CPU1 killed (polled 0 ms)
> 		psci: CPU3 killed (polled 0 ms)
> 
> 	2. echo 2 > /sys/devices/system/cpu/smt/control
> 
> 	Detected PIPT I-cache on CPU1
> 	GICv3: CPU1: found redistributor 1 region 0:0x00000000080c0000
> 	CPU1: Booted secondary processor 0x0000000001 [0x410fd082]
> 	Detected PIPT I-cache on CPU3
> 	GICv3: CPU3: found redistributor 3 region 0:0x0000000008100000
> 	CPU3: Booted secondary processor 0x0000000003 [0x410fd082]
> 	------------[ cut here ]------------
> 	WARNING: fs/sysfs/group.c:137 at internal_create_group+0x41c/0x4bc, CPU#2: sh/181
> 	Modules linked in:
> 	CPU: 2 UID: 0 PID: 181 Comm: sh Not tainted 7.0.0-rc1-00010-g8d13386c7624 #142 PREEMPT
> 	Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> 	pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> 	pc : internal_create_group+0x41c/0x4bc
> 	lr : sysfs_create_group+0x18/0x24
> 	sp : ffff80008078ba40
> 	x29: ffff80008078ba40 x28: ffff296c980ad000 x27: ffff00007fb94128
> 	x26: 0000000000000054 x25: ffffd693e845f3f0 x24: 0000000000000001
> 	x23: 0000000000000001 x22: 0000000000000004 x21: 0000000000000000
> 	x20: ffffd693e845fc10 x19: 0000000000000004 x18: 00000000ffffffff
> 	x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> 	x14: 0000000000000358 x13: 0000000000000007 x12: 0000000000000350
> 	x11: 0000000000000008 x10: 0000000000000407 x9 : 0000000000000400
> 	x8 : ffff00007fbf3b60 x7 : 0000000000000000 x6 : ffffd693e845f3f0
> 	x5 : ffff00007fb94128 x4 : 0000000000000000 x3 : ffff000000f4eac0
> 	x2 : ffffd693e7095a08 x1 : 0000000000000000 x0 : 0000000000000000
> 	Call trace:
> 	 internal_create_group+0x41c/0x4bc (P)
> 	 sysfs_create_group+0x18/0x24
> 	 topology_add_dev+0x1c/0x28
> 	 cpuhp_invoke_callback+0x104/0x20c
> 	 __cpuhp_invoke_callback_range+0x94/0x11c
> 	 _cpu_up+0x200/0x37c
> 	 cpuhp_smt_enable+0xbc/0x114
> 	 control_store+0xe8/0x1d4
> 	 dev_attr_store+0x18/0x2c
> 	 sysfs_kf_write+0x7c/0x94
> 	 kernfs_fop_write_iter+0x128/0x1b8
> 	 vfs_write+0x2b0/0x354
> 	 ksys_write+0x68/0xfc
> 	 __arm64_sys_write+0x1c/0x28
> 	 invoke_syscall+0x48/0x10c
> 	 el0_svc_common.constprop.0+0x40/0xe8
> 	 do_el0_svc+0x20/0x2c
> 	 el0_svc+0x34/0x124
> 	 el0t_64_sync_handler+0xa0/0xe4
> 	 el0t_64_sync+0x198/0x19c
> 	---[ end trace 0000000000000000 ]---
> 
> Cc: stable@vger.kernel.org
> Fixes: eed4583bcf9a6 ("arm64: Kconfig: Enable HOTPLUG_SMT")
> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
> ---
>  kernel/cpu.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bc4f7a9ba64e..df725d92ad4f 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2706,6 +2706,13 @@ int cpuhp_smt_enable(void)
>  	cpu_maps_update_begin();
>  	cpu_smt_control = CPU_SMT_ENABLED;
>  	for_each_present_cpu(cpu) {
> +		/*
> +		 * Skip CPUs that have not been registered in sysfs yet.
> +		 * This avoids triggering NULL kobject warnings for maxcpus.
> +		 */
> +		if (!get_cpu_device(cpu))
> +			continue;
> +
>  		/* Skip online CPUs and CPUs on offline nodes */
>  		if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
>  			continue;

I spent some time trying to understand how we can get into this
situation. It seems to be an ACPI only thing.

Since commit eba4675008a6 ("arm64: arch_register_cpu() variant to check
if an ACPI handle is now available.") as part of the virtual CPU hotplug
series, arch_register_cpu() can defer registering the CPU devices. IIUC
this is done later via acpi_processor_hotadd_init(). smp_prepare_cpus(),
however, still marks them as present.

cpuhp_smt_enable(), if triggered later, walks the present cpus and
attempts _cpu_up(). cpuhp_up_callbacks() will call topology_add_dev()
(registered as a CPUHP_TOPOLOGY_PREPARE callback) and warn since
get_cpu_device() returns NULL. I think this can't happen during early
_cpu_up() calls since smp_init() is called before the topology callback
has been registered (slightly later during do_basic_setup()).

I'm not sure what the best fix should be. If we go with something along
the lines of the above, I wonder whether we should instead return an
error in the topology_add_dev() callback if get_cpu_device() returns
NULL. It feels a bit wrong to add a check for sysfs in
cpuhp_smt_enable() just because of some callbacks attempted by
_cpu_up(). There's precedent in cpu_capacity_sysctl_add() returning
-ENOENT. However, this messes up the smt control enabling since
cpuhp_smt_enable() will bail out early and return an error.

Another option would have been to avoid marking such CPUs present but I
think this will break other things. Yet another option is to register
all CPU devices even if they never come up (like maxcpus greater than
actual CPUs).

Opinions? It might be an arm64+ACPI-only thing.

-- 
Catalin


^ permalink raw reply

* Re: [PATCH v2 2/9] driver core: Add dev_set_drv_queue_sync_state()
From: Danilo Krummrich @ 2026-04-18 11:23 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Saravana Kannan, Rafael J . Wysocki, Greg Kroah-Hartman, linux-pm,
	Sudeep Holla, Cristian Marussi, Kevin Hilman, Stephen Boyd,
	Marek Szyprowski, Bjorn Andersson, Abel Vesa, Peng Fan,
	Tomi Valkeinen, Maulik Shah, Konrad Dybcio, Thierry Reding,
	Jonathan Hunter, Geert Uytterhoeven, Dmitry Baryshkov,
	linux-arm-kernel, linux-kernel, Geert Uytterhoeven, driver-core
In-Reply-To: <20260410104058.83748-3-ulf.hansson@linaro.org>

On Fri Apr 10, 2026 at 12:40 PM CEST, Ulf Hansson wrote:
> diff --git a/include/linux/device.h b/include/linux/device.h
> index e65d564f01cd..f812e70bdf22 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -994,6 +994,18 @@ static inline int dev_set_drv_sync_state(struct device *dev,
>  	return 0;
>  }
>  
> +static inline int dev_set_drv_queue_sync_state(struct device *dev,
> +					       void (*fn)(struct device *dev))

As this is a public function, please add some documentation.

> +{
> +	if (!dev || !dev->driver)
> +		return 0;
> +	if (dev->driver->queue_sync_state && dev->driver->queue_sync_state != fn)
> +		return -EBUSY;
> +	if (!dev->driver->queue_sync_state)
> +		dev->driver->queue_sync_state = fn;

I think this follows dev_set_drv_sync_state(), but I think it is worth pointing
out that it is yet another blocker for moving towards
const struct device_driver.

> +	return 0;
> +}
> +
>  static inline void dev_set_removable(struct device *dev,
>  				     enum device_removable removable)
>  {
> -- 
> 2.43.0



^ permalink raw reply

* Re: [PATCH v2 1/9] driver core: Enable suppliers to implement fine grained sync_state support
From: Danilo Krummrich @ 2026-04-18 11:23 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Saravana Kannan, Rafael J . Wysocki, Greg Kroah-Hartman, linux-pm,
	Sudeep Holla, Cristian Marussi, Kevin Hilman, Stephen Boyd,
	Marek Szyprowski, Bjorn Andersson, Abel Vesa, Peng Fan,
	Tomi Valkeinen, Maulik Shah, Konrad Dybcio, Thierry Reding,
	Jonathan Hunter, Geert Uytterhoeven, Dmitry Baryshkov,
	linux-arm-kernel, linux-kernel, Geert Uytterhoeven, driver-core
In-Reply-To: <20260410104058.83748-2-ulf.hansson@linaro.org>

On Fri Apr 10, 2026 at 12:40 PM CEST, Ulf Hansson wrote:
> The common sync_state support isn't fine grained enough for some types of
> suppliers, like power domains for example. Especially when a supplier
> provides multiple independent power domains, each with their own set of
> consumers. In these cases we need to wait for all consumers for all the
> provided power domains before invoking the supplier's ->sync_state().
>
> To allow a more fine grained sync_state support to be implemented on per
> supplier's driver basis, let's add a new optional callback. As soon as
> there is an update worth to consider in regards to managing sync_state for
> a supplier device, __device_links_queue_sync_state() invokes the callback.
>
> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
>  drivers/base/core.c           | 7 ++++++-
>  include/linux/device/driver.h | 7 +++++++
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 09b98f02f559..4085a011d8ca 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1106,7 +1106,9 @@ int device_links_check_suppliers(struct device *dev)
>   * Queues a device for a sync_state() callback when the device links write lock
>   * isn't held. This allows the sync_state() execution flow to use device links
>   * APIs.  The caller must ensure this function is called with
> - * device_links_write_lock() held.
> + * device_links_write_lock() held.  Note, if the optional queue_sync_state()
> + * callback has been assigned too, it gets called for every update to allowing a

s/allowing/allow/

> + * more fine grained support to be implemented on per supplier basis.
>   *
>   * This function does a get_device() to make sure the device is not freed while
>   * on this list.
> @@ -1126,6 +1128,9 @@ static void __device_links_queue_sync_state(struct device *dev,
>  	if (dev->state_synced)
>  		return;
>  
> +	if (dev->driver && dev->driver->queue_sync_state)
> +		dev->driver->queue_sync_state(dev);

This seems to be called without the device lock being held, which seems to allow
the queue_sync_state() callback to execute concurrently with remove(). This
opens the door for all kinds of UAF conditions in drivers.

This also made me aware that the above dev_has_sync_state() is probably broken,
as it also performs the following check without the device lock being held.

	dev->driver && dev->driver->sync_state

I think nothing prevents dev->driver to become NULL concurrently; in this case
READ_ONCE() should be sufficient though as it doesn't execute the callback.

I will send a patch for this.

> +
>  	list_for_each_entry(link, &dev->links.consumers, s_node) {
>  		if (!device_link_test(link, DL_FLAG_MANAGED))
>  			continue;
> diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
> index bbc67ec513ed..bc9ae1cbe03c 100644
> --- a/include/linux/device/driver.h
> +++ b/include/linux/device/driver.h
> @@ -68,6 +68,12 @@ enum probe_type {
>   *		be called at late_initcall_sync level. If the device has
>   *		consumers that are never bound to a driver, this function
>   *		will never get called until they do.
> + * @queue_sync_state: Similar to the ->sync_state() callback, but called to
> + *		allow syncing device state to software state in a more fine
> + *		grained way. It is called when there is an updated state that
> + *		may be worth to consider for any of the consumers linked to
> + *		this device. If implemented, the ->sync_state() callback is
> + *		required too.

What happens if this is not the case? Maybe worth to check and warn about this
in driver_register().

>   * @remove:	Called when the device is removed from the system to
>   *		unbind a device from this driver.
>   * @shutdown:	Called at shut-down time to quiesce the device.
> @@ -110,6 +116,7 @@ struct device_driver {
>  
>  	int (*probe) (struct device *dev);
>  	void (*sync_state)(struct device *dev);
> +	void (*queue_sync_state)(struct device *dev);
>  	int (*remove) (struct device *dev);
>  	void (*shutdown) (struct device *dev);
>  	int (*suspend) (struct device *dev, pm_message_t state);
> -- 
> 2.43.0



^ permalink raw reply

* Re: [PATCH v2 0/9] driver core / pmdomain: Add support for fined grained sync_state
From: Danilo Krummrich @ 2026-04-18 11:23 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Saravana Kannan, Rafael J . Wysocki, Greg Kroah-Hartman, linux-pm,
	Sudeep Holla, Cristian Marussi, Kevin Hilman, Stephen Boyd,
	Marek Szyprowski, Bjorn Andersson, Abel Vesa, Peng Fan,
	Tomi Valkeinen, Maulik Shah, Konrad Dybcio, Thierry Reding,
	Jonathan Hunter, Geert Uytterhoeven, Dmitry Baryshkov,
	linux-arm-kernel, linux-kernel, driver-core
In-Reply-To: <CAPDyKFrPz9gaBBp6xV1=KkoemEfapc0p3POZxuBTvDw7Vamxtg@mail.gmail.com>

On Fri Apr 17, 2026 at 1:27 PM CEST, Ulf Hansson wrote:
> + Danilo (for the driver core changes)

Thanks -- please also remember to Cc: driver-core@lists.linux.dev.


^ permalink raw reply

* Re: [RFC PATCH 4/4] firmware: arm_ffa: check pkvm initailised when initailise ffa driver
From: Yeoreum Yun @ 2026-04-18 10:34 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-security-module, linux-kernel, linux-integrity,
	linux-arm-kernel, kvmarm, paul, jmorris, serge, zohar,
	roberto.sassu, dmitry.kasatkin, eric.snowberg, peterhuewe, jarkko,
	jgg, sudeep.holla, oupton, joey.gouly, suzuki.poulose, yuzenghui,
	catalin.marinas, will
In-Reply-To: <87se8sbozv.wl-maz@kernel.org>

Hi Marc,

> On Fri, 17 Apr 2026 18:57:59 +0100,
> Yeoreum Yun <yeoreum.yun@arm.com> wrote:
> >
> > When pKVM is enabled, the FF-A driver must be initialized after pKVM.
> > Otherwise, pKVM cannot negotiate the FF-A version or
> > obtain RX/TX buffer information, leading to failures in FF-A calls.
> >
> > During FF-A driver initialization, check whether pKVM has been initialized.
> > If not, defer probing of the FF-A driver.
> >
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > ---
> >  arch/arm64/kvm/arm.c              |  1 +
> >  drivers/firmware/arm_ffa/driver.c | 12 ++++++++++++
> >  2 files changed, 13 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 410ffd41fd73..0f517b1c05cd 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -119,6 +119,7 @@ bool is_kvm_arm_initialised(void)
> >  {
> >  	return kvm_arm_initialised;
> >  }
> > +EXPORT_SYMBOL(is_kvm_arm_initialised);
>
> EXPORT_SYMBOL_GPL(), please.

Okay.

>
> >
> >  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
> >  {
> > diff --git a/drivers/firmware/arm_ffa/driver.c b/drivers/firmware/arm_ffa/driver.c
> > index 02c76ac1570b..2647d6554afd 100644
> > --- a/drivers/firmware/arm_ffa/driver.c
> > +++ b/drivers/firmware/arm_ffa/driver.c
> > @@ -42,6 +42,8 @@
> >  #include <linux/uuid.h>
> >  #include <linux/xarray.h>
> >
> > +#include <asm/virt.h>
> > +
> >  #include "common.h"
> >
> >  #define FFA_DRIVER_VERSION	FFA_VERSION_1_2
> > @@ -2035,6 +2037,16 @@ static int __init ffa_init(void)
> >  	u32 buf_sz;
> >  	size_t rxtx_bufsz = SZ_4K;
> >
> > +	/*
> > +	 * When pKVM is enabled, the FF-A driver must be initialized
> > +	 * after pKVM initialization. Otherwise, pKVM cannot negotiate
> > +	 * the FF-A version or obtain RX/TX buffer information,
> > +	 * which leads to failures in FF-A calls.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_KVM) && is_protected_kvm_enabled() &&
> > +	    !is_kvm_arm_initialised())
> > +		return -EPROBE_DEFER;
> > +
>
> That's still fundamentally wrong: pkvm is not ready until
> finalize_pkvm() has finished, and that's not indicated by
> is_kvm_arm_initialised().

Thanks. I miss the TSC bit set in here.
IMHO, I'd like to make an new state check function --
is_pkvm_arm_initialised() so that ff-a driver to know whether
pkvm is initialised.

or any other suggestion?

Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply

* Re: [RFC PATCH 4/4] firmware: arm_ffa: check pkvm initailised when initailise ffa driver
From: Marc Zyngier @ 2026-04-18  9:24 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: linux-security-module, linux-kernel, linux-integrity,
	linux-arm-kernel, kvmarm, paul, jmorris, serge, zohar,
	roberto.sassu, dmitry.kasatkin, eric.snowberg, peterhuewe, jarkko,
	jgg, sudeep.holla, oupton, joey.gouly, suzuki.poulose, yuzenghui,
	catalin.marinas, will
In-Reply-To: <20260417175759.3191279-5-yeoreum.yun@arm.com>

On Fri, 17 Apr 2026 18:57:59 +0100,
Yeoreum Yun <yeoreum.yun@arm.com> wrote:
> 
> When pKVM is enabled, the FF-A driver must be initialized after pKVM.
> Otherwise, pKVM cannot negotiate the FF-A version or
> obtain RX/TX buffer information, leading to failures in FF-A calls.
> 
> During FF-A driver initialization, check whether pKVM has been initialized.
> If not, defer probing of the FF-A driver.
> 
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/kvm/arm.c              |  1 +
>  drivers/firmware/arm_ffa/driver.c | 12 ++++++++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 410ffd41fd73..0f517b1c05cd 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -119,6 +119,7 @@ bool is_kvm_arm_initialised(void)
>  {
>  	return kvm_arm_initialised;
>  }
> +EXPORT_SYMBOL(is_kvm_arm_initialised);

EXPORT_SYMBOL_GPL(), please.

> 
>  int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
>  {
> diff --git a/drivers/firmware/arm_ffa/driver.c b/drivers/firmware/arm_ffa/driver.c
> index 02c76ac1570b..2647d6554afd 100644
> --- a/drivers/firmware/arm_ffa/driver.c
> +++ b/drivers/firmware/arm_ffa/driver.c
> @@ -42,6 +42,8 @@
>  #include <linux/uuid.h>
>  #include <linux/xarray.h>
> 
> +#include <asm/virt.h>
> +
>  #include "common.h"
> 
>  #define FFA_DRIVER_VERSION	FFA_VERSION_1_2
> @@ -2035,6 +2037,16 @@ static int __init ffa_init(void)
>  	u32 buf_sz;
>  	size_t rxtx_bufsz = SZ_4K;
> 
> +	/*
> +	 * When pKVM is enabled, the FF-A driver must be initialized
> +	 * after pKVM initialization. Otherwise, pKVM cannot negotiate
> +	 * the FF-A version or obtain RX/TX buffer information,
> +	 * which leads to failures in FF-A calls.
> +	 */
> +	if (IS_ENABLED(CONFIG_KVM) && is_protected_kvm_enabled() &&
> +	    !is_kvm_arm_initialised())
> +		return -EPROBE_DEFER;
> +

That's still fundamentally wrong: pkvm is not ready until
finalize_pkvm() has finished, and that's not indicated by
is_kvm_arm_initialised().

	M.

-- 
Jazz isn't dead. It just smells funny.


^ permalink raw reply

* [PATCH v7 1/4] KVM: arm64: PMU: Add kvm_pmu_enabled_counter_mask()
From: Akihiko Odaki @ 2026-04-18  8:14 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Kees Cook,
	Gustavo A. R. Silva, Paolo Bonzini, Jonathan Corbet, Shuah Khan
  Cc: linux-arm-kernel, kvmarm, linux-kernel, linux-hardening, devel,
	kvm, linux-doc, linux-kselftest, Akihiko Odaki
In-Reply-To: <20260418-hybrid-v7-0-2bf39ad009bf@rsg.ci.i.u-tokyo.ac.jp>

This function will be useful to enumerate enabled counters.

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
 arch/arm64/kvm/pmu-emul.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index b03dbda7f1ab..59ec96e09321 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -619,18 +619,24 @@ void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val)
 	}
 }
 
-static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc)
+static u64 kvm_pmu_enabled_counter_mask(struct kvm_vcpu *vcpu)
 {
-	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
-	unsigned int mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2);
+	u64 mask = 0;
 
-	if (!(__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(pmc->idx)))
-		return false;
+	if (__vcpu_sys_reg(vcpu, MDCR_EL2) & MDCR_EL2_HPME)
+		mask |= kvm_pmu_hyp_counter_mask(vcpu);
 
-	if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx))
-		return mdcr & MDCR_EL2_HPME;
+	if (kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E)
+		mask |= ~kvm_pmu_hyp_counter_mask(vcpu);
+
+	return __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & mask;
+}
+
+static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc)
+{
+	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
 
-	return kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E;
+	return kvm_pmu_enabled_counter_mask(vcpu) & BIT(pmc->idx);
 }
 
 static bool kvm_pmc_counts_at_el0(struct kvm_pmc *pmc)

-- 
2.53.0



^ permalink raw reply related

* [PATCH v7 0/4] KVM: arm64: PMU: Use multiple host PMUs
From: Akihiko Odaki @ 2026-04-18  8:14 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Kees Cook,
	Gustavo A. R. Silva, Paolo Bonzini, Jonathan Corbet, Shuah Khan
  Cc: linux-arm-kernel, kvmarm, linux-kernel, linux-hardening, devel,
	kvm, linux-doc, linux-kselftest, Akihiko Odaki

On a heterogeneous arm64 system, KVM's PMU emulation is based on the
features of a single host PMU instance. When a vCPU is migrated to a
pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
incrementing.

Although this behavior is permitted by the architecture, Windows does
not handle it gracefully and may crash with a division-by-zero error.

The current workaround requires VMMs to pin vCPUs to a set of pCPUs
that share a compatible PMU. This is difficult to implement correctly in
QEMU/libvirt, where pinning occurs after vCPU initialization, and it
also restricts the guest to a subset of available pCPUs.

This patch introduces the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
attribute. If set, PMUv3 will be emulated without programmable event
counters. KVM will be able to run VCPUs on any physical CPUs with a
compatible hardware PMU.

This allows Windows guests to run reliably on heterogeneous systems
without crashing, even without vCPU pinning, and enables VMMs to
schedule vCPUs across all available pCPUs, making full use of the host
hardware.

A QEMU patch that demonstrates the usage of the new attribute is
available at:
https://lore.kernel.org/qemu-devel/20260225-kvm-v2-1-b8d743db0f73@rsg.ci.i.u-tokyo.ac.jp/
("[PATCH RFC v2] target/arm/kvm: Choose PMU backend")

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
Changes in v7:
- Fixed the vCPU run hang in test_fixed_counters_only().
- Link to v6: https://lore.kernel.org/r/20260413-hybrid-v6-0-e79d760f7f1b@rsg.ci.i.u-tokyo.ac.jp

Changes in v6:
- Removed WARN_ON_ONCE() in kvm_pmu_create_perf_event(). It can be
  triggered in kvm_arch_vcpu_load() before it checks supported_cpus.
- Removed an extra lockdep assertion in kvm_arm_pmu_v3_get_attr().
- Fixed error messages in test_fixed_counters_only().
- Fixed the vCPU run in test_fixed_counters_only().
- Link to v5: https://lore.kernel.org/r/20260411-hybrid-v5-0-b043b4d9f49e@rsg.ci.i.u-tokyo.ac.jp

Changes in v5:
- Rebased.
- Fixed the order to clear KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY in
  kvm_arm_pmu_v3_set_pmu().
- Fixed the setting of KVM_ARM_VCPU_PMU_V3_IRQ in
  test_fixed_counters_only().
- Changed to WARN_ON_ONCE() when kvm_pmu_probe_armpmu() returns NULL in
  kvm_pmu_create_perf_event(), which is no longer supposed to happen.
- Link to v4: https://lore.kernel.org/r/20260317-hybrid-v4-0-bd62bcd48644@rsg.ci.i.u-tokyo.ac.jp

Changes in v4:
- Extracted kvm_pmu_enabled_counter_mask() into a separate patch.
- Added patch "KVM: arm64: PMU: Protect the list of PMUs with RCU".
- Merged KVM_REQ_CREATE_PMU into KVM_REQ_RELOAD_PMU.
- Added a check to avoid unnecessary KVM_REQ_RELOAD_PMU requests.
- Dropped the change to avoid setting kvm_arm_set_default_pmu() when
  KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY is not set.
- Link to v3: https://lore.kernel.org/r/20260225-hybrid-v3-0-46e8fe220880@rsg.ci.i.u-tokyo.ac.jp

Changes in v3:
- Renamed the attribute to KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY.
- Changed to request the creation of perf counters when loading vCPU.
- Link to v2: https://lore.kernel.org/r/20250806-hybrid-v2-0-0661aec3af8c@rsg.ci.i.u-tokyo.ac.jp

Changes in v2:
- Added the KVM_ARM_VCPU_PMU_V3_COMPOSITION attribute to opt in the
  feature.
- Added code to handle overflow.
- Link to v1: https://lore.kernel.org/r/20250319-hybrid-v1-1-4d1ada10e705@daynix.com

---
Akihiko Odaki (4):
      KVM: arm64: PMU: Add kvm_pmu_enabled_counter_mask()
      KVM: arm64: PMU: Protect the list of PMUs with RCU
      KVM: arm64: PMU: Introduce FIXED_COUNTERS_ONLY
      KVM: arm64: selftests: Test PMU_V3_FIXED_COUNTERS_ONLY

 Documentation/virt/kvm/devices/vcpu.rst            |  29 ++++
 arch/arm64/include/asm/kvm_host.h                  |   2 +
 arch/arm64/include/uapi/asm/kvm.h                  |   1 +
 arch/arm64/kvm/arm.c                               |   1 +
 arch/arm64/kvm/pmu-emul.c                          | 187 ++++++++++++++-------
 include/kvm/arm_pmu.h                              |   2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c      | 153 ++++++++++++++---
 7 files changed, 292 insertions(+), 83 deletions(-)
---
base-commit: 94b4ae79ebb42a8a6f2124b4d4b033b15a98e4f9
change-id: 20250224-hybrid-01d5ff47edd2

Best regards,
--  
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>



^ permalink raw reply

* [PATCH v7 4/4] KVM: arm64: selftests: Test PMU_V3_FIXED_COUNTERS_ONLY
From: Akihiko Odaki @ 2026-04-18  8:14 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Kees Cook,
	Gustavo A. R. Silva, Paolo Bonzini, Jonathan Corbet, Shuah Khan
  Cc: linux-arm-kernel, kvmarm, linux-kernel, linux-hardening, devel,
	kvm, linux-doc, linux-kselftest, Akihiko Odaki
In-Reply-To: <20260418-hybrid-v7-0-2bf39ad009bf@rsg.ci.i.u-tokyo.ac.jp>

Assert the following:
- KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY is unset at initialization.
- KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY can be set.
- Setting KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY for the first time
  after setting an event filter results in EBUSY.
- KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY can be set again even if an
  event filter has already been set.
- Setting KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY after running a VCPU
  results in EBUSY.
- The existing test cases pass with
  KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY set.

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
 .../selftests/kvm/arm64/vpmu_counter_access.c      | 153 +++++++++++++++++----
 1 file changed, 127 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c b/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c
index ae36325c022f..0ed0a8513b03 100644
--- a/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c
+++ b/tools/testing/selftests/kvm/arm64/vpmu_counter_access.c
@@ -403,12 +403,7 @@ static void create_vpmu_vm(void *guest_code)
 {
 	struct kvm_vcpu_init init;
 	uint8_t pmuver, ec;
-	uint64_t dfr0, irq = 23;
-	struct kvm_device_attr irq_attr = {
-		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
-		.attr = KVM_ARM_VCPU_PMU_V3_IRQ,
-		.addr = (uint64_t)&irq,
-	};
+	uint64_t dfr0;
 
 	/* The test creates the vpmu_vm multiple times. Ensure a clean state */
 	memset(&vpmu_vm, 0, sizeof(vpmu_vm));
@@ -434,8 +429,6 @@ static void create_vpmu_vm(void *guest_code)
 	TEST_ASSERT(pmuver != ID_AA64DFR0_EL1_PMUVer_IMP_DEF &&
 		    pmuver >= ID_AA64DFR0_EL1_PMUVer_IMP,
 		    "Unexpected PMUVER (0x%x) on the vCPU with PMUv3", pmuver);
-
-	vcpu_ioctl(vpmu_vm.vcpu, KVM_SET_DEVICE_ATTR, &irq_attr);
 }
 
 static void destroy_vpmu_vm(void)
@@ -461,15 +454,30 @@ static void run_vcpu(struct kvm_vcpu *vcpu, uint64_t pmcr_n)
 	}
 }
 
-static void test_create_vpmu_vm_with_nr_counters(unsigned int nr_counters, bool expect_fail)
+static void guest_code_done(void)
+{
+	GUEST_DONE();
+}
+
+static void test_create_vpmu_vm_with_nr_counters(unsigned int nr_counters,
+						 bool fixed_counters_only,
+						 bool expect_fail)
 {
 	struct kvm_vcpu *vcpu;
 	unsigned int prev;
 	int ret;
+	uint64_t irq = 23;
 
 	create_vpmu_vm(guest_code);
 	vcpu = vpmu_vm.vcpu;
 
+	if (fixed_counters_only)
+		vcpu_device_attr_set(vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+				     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+
+	vcpu_device_attr_set(vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_IRQ, &irq);
+
 	prev = get_pmcr_n(vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_PMCR_EL0)));
 
 	ret = __vcpu_device_attr_set(vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
@@ -489,15 +497,15 @@ static void test_create_vpmu_vm_with_nr_counters(unsigned int nr_counters, bool
  * Create a guest with one vCPU, set the PMCR_EL0.N for the vCPU to @pmcr_n,
  * and run the test.
  */
-static void run_access_test(uint64_t pmcr_n)
+static void run_access_test(uint64_t pmcr_n, bool fixed_counters_only)
 {
 	uint64_t sp;
 	struct kvm_vcpu *vcpu;
 	struct kvm_vcpu_init init;
 
-	pr_debug("Test with pmcr_n %lu\n", pmcr_n);
+	pr_debug("Test with pmcr_n %lu, fixed_counters_only %d\n", pmcr_n, fixed_counters_only);
 
-	test_create_vpmu_vm_with_nr_counters(pmcr_n, false);
+	test_create_vpmu_vm_with_nr_counters(pmcr_n, fixed_counters_only, false);
 	vcpu = vpmu_vm.vcpu;
 
 	/* Save the initial sp to restore them later to run the guest again */
@@ -531,14 +539,14 @@ static struct pmreg_sets validity_check_reg_sets[] = {
  * Create a VM, and check if KVM handles the userspace accesses of
  * the PMU register sets in @validity_check_reg_sets[] correctly.
  */
-static void run_pmregs_validity_test(uint64_t pmcr_n)
+static void run_pmregs_validity_test(uint64_t pmcr_n, bool fixed_counters_only)
 {
 	int i;
 	struct kvm_vcpu *vcpu;
 	uint64_t set_reg_id, clr_reg_id, reg_val;
 	uint64_t valid_counters_mask, max_counters_mask;
 
-	test_create_vpmu_vm_with_nr_counters(pmcr_n, false);
+	test_create_vpmu_vm_with_nr_counters(pmcr_n, fixed_counters_only, false);
 	vcpu = vpmu_vm.vcpu;
 
 	valid_counters_mask = get_counters_mask(pmcr_n);
@@ -588,11 +596,11 @@ static void run_pmregs_validity_test(uint64_t pmcr_n)
  * the vCPU to @pmcr_n, which is larger than the host value.
  * The attempt should fail as @pmcr_n is too big to set for the vCPU.
  */
-static void run_error_test(uint64_t pmcr_n)
+static void run_error_test(uint64_t pmcr_n, bool fixed_counters_only)
 {
 	pr_debug("Error test with pmcr_n %lu (larger than the host)\n", pmcr_n);
 
-	test_create_vpmu_vm_with_nr_counters(pmcr_n, true);
+	test_create_vpmu_vm_with_nr_counters(pmcr_n, fixed_counters_only, true);
 	destroy_vpmu_vm();
 }
 
@@ -622,22 +630,115 @@ static bool kvm_supports_nr_counters_attr(void)
 	return supported;
 }
 
-int main(void)
+static void test_config(uint64_t pmcr_n, bool fixed_counters_only)
 {
-	uint64_t i, pmcr_n;
-
-	TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_PMU_V3));
-	TEST_REQUIRE(kvm_supports_vgic_v3());
-	TEST_REQUIRE(kvm_supports_nr_counters_attr());
+	uint64_t i;
 
-	pmcr_n = get_pmcr_n_limit();
 	for (i = 0; i <= pmcr_n; i++) {
-		run_access_test(i);
-		run_pmregs_validity_test(i);
+		run_access_test(i, fixed_counters_only);
+		run_pmregs_validity_test(i, fixed_counters_only);
 	}
 
 	for (i = pmcr_n + 1; i < ARMV8_PMU_MAX_COUNTERS; i++)
-		run_error_test(i);
+		run_error_test(i, fixed_counters_only);
+}
+
+static void test_fixed_counters_only(void)
+{
+	struct kvm_pmu_event_filter filter = { .nevents = 0 };
+	struct kvm_vm *vm;
+	struct kvm_vcpu *running_vcpu;
+	struct kvm_vcpu *stopped_vcpu;
+	struct kvm_vcpu_init init;
+	int ret;
+	uint64_t irq = 23;
+
+	create_vpmu_vm(guest_code);
+	ret = __vcpu_has_device_attr(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+				     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY);
+	if (ret) {
+		TEST_ASSERT(ret == -1 && errno == ENXIO,
+			    KVM_IOCTL_ERROR(KVM_HAS_DEVICE_ATTR, ret));
+		destroy_vpmu_vm();
+		return;
+	}
+
+	/* Assert that FIXED_COUNTERS_ONLY is unset at initialization. */
+	ret = __vcpu_device_attr_get(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+				     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+	TEST_ASSERT(ret == -1 && errno == ENXIO,
+		    KVM_IOCTL_ERROR(KVM_GET_DEVICE_ATTR, ret));
+
+	/* Assert that setting FIXED_COUNTERS_ONLY succeeds. */
+	vcpu_device_attr_set(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+
+	/* Assert that getting FIXED_COUNTERS_ONLY succeeds. */
+	vcpu_device_attr_get(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+
+	/*
+	 * Assert that setting FIXED_COUNTERS_ONLY again succeeds even if an
+	 * event filter has already been set.
+	 */
+	vcpu_device_attr_set(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_FILTER, &filter);
+
+	vcpu_device_attr_set(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+
+	destroy_vpmu_vm();
+
+	create_vpmu_vm(guest_code);
+
+	/*
+	 * Assert that setting FIXED_COUNTERS_ONLY results in EBUSY if an event
+	 * filter has already been set while FIXED_COUNTERS_ONLY has not.
+	 */
+	vcpu_device_attr_set(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_FILTER, &filter);
+
+	ret = __vcpu_device_attr_set(vpmu_vm.vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+				     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+	TEST_ASSERT(ret == -1 && errno == EBUSY,
+		    KVM_IOCTL_ERROR(KVM_SET_DEVICE_ATTR, ret));
+
+	destroy_vpmu_vm();
+
+	/*
+	 * Assert that setting FIXED_COUNTERS_ONLY after running a VCPU results
+	 * in EBUSY.
+	 */
+	vm = vm_create(2);
+	vm_ioctl(vm, KVM_ARM_PREFERRED_TARGET, &init);
+	init.features[0] |= (1 << KVM_ARM_VCPU_PMU_V3);
+	running_vcpu = aarch64_vcpu_add(vm, 0, &init, guest_code_done);
+	stopped_vcpu = aarch64_vcpu_add(vm, 1, &init, guest_code_done);
+	kvm_arch_vm_finalize_vcpus(vm);
+	vcpu_device_attr_set(running_vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_IRQ, &irq);
+	vcpu_device_attr_set(running_vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+			     KVM_ARM_VCPU_PMU_V3_INIT, NULL);
+	vcpu_run(running_vcpu);
+
+	ret = __vcpu_device_attr_set(stopped_vcpu, KVM_ARM_VCPU_PMU_V3_CTRL,
+				     KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY, NULL);
+	TEST_ASSERT(ret == -1 && errno == EBUSY,
+		    KVM_IOCTL_ERROR(KVM_SET_DEVICE_ATTR, ret));
+
+	kvm_vm_free(vm);
+
+	test_config(0, true);
+}
+
+int main(void)
+{
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_PMU_V3));
+	TEST_REQUIRE(kvm_supports_vgic_v3());
+	TEST_REQUIRE(kvm_supports_nr_counters_attr());
+
+	test_config(get_pmcr_n_limit(), false);
+	test_fixed_counters_only();
 
 	return 0;
 }

-- 
2.53.0



^ permalink raw reply related

* [PATCH v7 3/4] KVM: arm64: PMU: Introduce FIXED_COUNTERS_ONLY
From: Akihiko Odaki @ 2026-04-18  8:14 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Kees Cook,
	Gustavo A. R. Silva, Paolo Bonzini, Jonathan Corbet, Shuah Khan
  Cc: linux-arm-kernel, kvmarm, linux-kernel, linux-hardening, devel,
	kvm, linux-doc, linux-kselftest, Akihiko Odaki
In-Reply-To: <20260418-hybrid-v7-0-2bf39ad009bf@rsg.ci.i.u-tokyo.ac.jp>

On a heterogeneous arm64 system, KVM's PMU emulation is based on the
features of a single host PMU instance. When a vCPU is migrated to a
pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
incrementing.

Although this behavior is permitted by the architecture, Windows does
not handle it gracefully and may crash with a division-by-zero error.

The current workaround requires VMMs to pin vCPUs to a set of pCPUs
that share a compatible PMU. This is difficult to implement correctly in
QEMU/libvirt, where pinning occurs after vCPU initialization, and it
also restricts the guest to a subset of available pCPUs.

Introduce the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY attribute to
create a "fixed-counters-only" PMU. When set, KVM exposes a PMU that is
compatible with all pCPUs but that does not support programmable
event counters which may have different feature sets on different PMUs.

This allows Windows guests to run reliably on heterogeneous systems
without crashing, even without vCPU pinning, and enables VMMs to
schedule vCPUs across all available pCPUs, making full use of the host
hardware.

Much like KVM_ARM_VCPU_PMU_V3_IRQ and other read-write attributes, this
attribute provides a getter that facilitates kernel and userspace
debugging/testing.

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
 Documentation/virt/kvm/devices/vcpu.rst |  29 ++++++
 arch/arm64/include/asm/kvm_host.h       |   2 +
 arch/arm64/include/uapi/asm/kvm.h       |   1 +
 arch/arm64/kvm/arm.c                    |   1 +
 arch/arm64/kvm/pmu-emul.c               | 155 +++++++++++++++++++++++---------
 include/kvm/arm_pmu.h                   |   2 +
 6 files changed, 147 insertions(+), 43 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60bf205cb373..e0aeb1897d77 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,6 +161,35 @@ explicitly selected, or the number of counters is out of range for the
 selected PMU. Selecting a new PMU cancels the effect of setting this
 attribute.
 
+1.6 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
+------------------------------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+:Returns:
+
+	 =======  =====================================================
+	 -EBUSY   Attempted to set after initializing PMUv3 or running
+		  VCPU, or attempted to set for the first time after
+		  setting an event filter
+	 -ENXIO   Attempted to get before setting
+	 -ENODEV  Attempted to set while PMUv3 not supported
+	 =======  =====================================================
+
+If set, PMUv3 will be emulated without programmable event counters. The VCPU
+will use any compatible hardware PMU. This attribute is particularly useful on
+heterogeneous systems where different hardware PMUs cover different physical
+CPUs. The compatibility of hardware PMUs can be checked with
+KVM_ARM_VCPU_PMU_V3_SET_PMU. All VCPUs in a VM share this attribute. It isn't
+possible to set it for the first time if a PMU event filter is already present.
+
+Note that KVM will not make any attempts to run the VCPU on the physical CPUs
+with compatible hardware PMUs. This is entirely left to userspace. However,
+attempting to run the VCPU on an unsupported CPU will fail and KVM_RUN will
+return with exit_reason = KVM_EXIT_FAIL_ENTRY and populate the fail_entry struct
+by setting hardware_entry_failure_reason field to
+KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED and the cpu field to the processor id.
+
 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
 =================================
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 59f25b85be2b..b59e0182472c 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -353,6 +353,8 @@ struct kvm_arch {
 #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS		10
 	/* Unhandled SEAs are taken to userspace */
 #define KVM_ARCH_FLAG_EXIT_SEA				11
+	/* PMUv3 is emulated without progammable event counters */
+#define KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY	12
 	unsigned long flags;
 
 	/* VM-wide vCPU feature set */
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index a792a599b9d6..474c84fa757f 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -436,6 +436,7 @@ enum {
 #define   KVM_ARM_VCPU_PMU_V3_FILTER		2
 #define   KVM_ARM_VCPU_PMU_V3_SET_PMU		3
 #define   KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS	4
+#define   KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY	5
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 620a465248d1..dca16ca26d32 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -634,6 +634,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (has_vhe())
 		kvm_vcpu_load_vhe(vcpu);
 	kvm_arch_vcpu_load_fp(vcpu);
+	kvm_vcpu_load_pmu(vcpu);
 	kvm_vcpu_pmu_restore_guest(vcpu);
 	if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
 		kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index ef5140bbfe28..d1009c144581 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -326,7 +326,10 @@ u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu)
 
 static void kvm_pmc_enable_perf_event(struct kvm_pmc *pmc)
 {
-	if (!pmc->perf_event) {
+	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
+
+	if (!pmc->perf_event ||
+	    !cpumask_test_cpu(vcpu->cpu, &to_arm_pmu(pmc->perf_event->pmu)->supported_cpus)) {
 		kvm_pmu_create_perf_event(pmc);
 		return;
 	}
@@ -667,10 +670,8 @@ static bool kvm_pmc_counts_at_el2(struct kvm_pmc *pmc)
 	return kvm_pmc_read_evtreg(pmc) & ARMV8_PMU_INCLUDE_EL2;
 }
 
-static int kvm_map_pmu_event(struct kvm *kvm, unsigned int eventsel)
+static int kvm_map_pmu_event(struct arm_pmu *pmu, unsigned int eventsel)
 {
-	struct arm_pmu *pmu = kvm->arch.arm_pmu;
-
 	/*
 	 * The CPU PMU likely isn't PMUv3; let the driver provide a mapping
 	 * for the guest's PMUv3 event ID.
@@ -681,6 +682,23 @@ static int kvm_map_pmu_event(struct kvm *kvm, unsigned int eventsel)
 	return eventsel;
 }
 
+static struct arm_pmu *kvm_pmu_probe_armpmu(int cpu)
+{
+	struct arm_pmu_entry *entry;
+	struct arm_pmu *pmu;
+
+	guard(rcu)();
+
+	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
+		pmu = entry->arm_pmu;
+
+		if (cpumask_test_cpu(cpu, &pmu->supported_cpus))
+			return pmu;
+	}
+
+	return NULL;
+}
+
 /**
  * kvm_pmu_create_perf_event - create a perf event for a counter
  * @pmc: Counter context
@@ -694,6 +712,12 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
 	int eventsel;
 	u64 evtreg;
 
+	if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags)) {
+		arm_pmu = kvm_pmu_probe_armpmu(vcpu->cpu);
+		if (!arm_pmu)
+			return;
+	}
+
 	evtreg = kvm_pmc_read_evtreg(pmc);
 
 	kvm_pmu_stop_counter(pmc);
@@ -722,7 +746,7 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
 	 * Don't create an event if we're running on hardware that requires
 	 * PMUv3 event translation and we couldn't find a valid mapping.
 	 */
-	eventsel = kvm_map_pmu_event(vcpu->kvm, eventsel);
+	eventsel = kvm_map_pmu_event(arm_pmu, eventsel);
 	if (eventsel < 0)
 		return;
 
@@ -810,42 +834,6 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 	list_add_tail_rcu(&entry->entry, &arm_pmus);
 }
 
-static struct arm_pmu *kvm_pmu_probe_armpmu(void)
-{
-	struct arm_pmu_entry *entry;
-	struct arm_pmu *pmu;
-	int cpu;
-
-	guard(rcu)();
-
-	/*
-	 * It is safe to use a stale cpu to iterate the list of PMUs so long as
-	 * the same value is used for the entirety of the loop. Given this, and
-	 * the fact that no percpu data is used for the lookup there is no need
-	 * to disable preemption.
-	 *
-	 * It is still necessary to get a valid cpu, though, to probe for the
-	 * default PMU instance as userspace is not required to specify a PMU
-	 * type. In order to uphold the preexisting behavior KVM selects the
-	 * PMU instance for the core during vcpu init. A dependent use
-	 * case would be a user with disdain of all things big.LITTLE that
-	 * affines the VMM to a particular cluster of cores.
-	 *
-	 * In any case, userspace should just do the sane thing and use the UAPI
-	 * to select a PMU type directly. But, be wary of the baggage being
-	 * carried here.
-	 */
-	cpu = raw_smp_processor_id();
-	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
-		pmu = entry->arm_pmu;
-
-		if (cpumask_test_cpu(cpu, &pmu->supported_cpus))
-			return pmu;
-	}
-
-	return NULL;
-}
-
 static u64 __compute_pmceid(struct arm_pmu *pmu, bool pmceid1)
 {
 	u32 hi[2], lo[2];
@@ -888,6 +876,9 @@ u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
 	u64 val, mask = 0;
 	int base, i, nr_events;
 
+	if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags))
+		return 0;
+
 	if (!pmceid1) {
 		val = compute_pmceid0(cpu_pmu);
 		base = 0;
@@ -915,6 +906,26 @@ u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
 	return val & mask;
 }
 
+void kvm_vcpu_load_pmu(struct kvm_vcpu *vcpu)
+{
+	unsigned long mask = kvm_pmu_enabled_counter_mask(vcpu);
+	struct kvm_pmc *pmc;
+	struct arm_pmu *cpu_pmu;
+	int i;
+
+	for_each_set_bit(i, &mask, 32) {
+		pmc = kvm_vcpu_idx_to_pmc(vcpu, i);
+		if (!pmc->perf_event)
+			continue;
+
+		cpu_pmu = to_arm_pmu(pmc->perf_event->pmu);
+		if (!cpumask_test_cpu(vcpu->cpu, &cpu_pmu->supported_cpus)) {
+			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
+			break;
+		}
+	}
+}
+
 void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu)
 {
 	u64 mask = kvm_pmu_implemented_counter_mask(vcpu);
@@ -1016,6 +1027,9 @@ u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm)
 {
 	struct arm_pmu *arm_pmu = kvm->arch.arm_pmu;
 
+	if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags))
+		return 0;
+
 	/*
 	 * PMUv3 requires that all event counters are capable of counting any
 	 * event, though the same may not be true of non-PMUv3 hardware.
@@ -1070,7 +1084,24 @@ static void kvm_arm_set_pmu(struct kvm *kvm, struct arm_pmu *arm_pmu)
  */
 int kvm_arm_set_default_pmu(struct kvm *kvm)
 {
-	struct arm_pmu *arm_pmu = kvm_pmu_probe_armpmu();
+	/*
+	 * It is safe to use a stale cpu to iterate the list of PMUs so long as
+	 * the same value is used for the entirety of the loop. Given this, and
+	 * the fact that no percpu data is used for the lookup there is no need
+	 * to disable preemption.
+	 *
+	 * It is still necessary to get a valid cpu, though, to probe for the
+	 * default PMU instance as userspace is not required to specify a PMU
+	 * type. In order to uphold the preexisting behavior KVM selects the
+	 * PMU instance for the core during vcpu init. A dependent use
+	 * case would be a user with disdain of all things big.LITTLE that
+	 * affines the VMM to a particular cluster of cores.
+	 *
+	 * In any case, userspace should just do the sane thing and use the UAPI
+	 * to select a PMU type directly. But, be wary of the baggage being
+	 * carried here.
+	 */
+	struct arm_pmu *arm_pmu = kvm_pmu_probe_armpmu(raw_smp_processor_id());
 
 	if (!arm_pmu)
 		return -ENODEV;
@@ -1098,6 +1129,7 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
 				break;
 			}
 
+			clear_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags);
 			kvm_arm_set_pmu(kvm, arm_pmu);
 			cpumask_copy(kvm->arch.supported_cpus, &arm_pmu->supported_cpus);
 			ret = 0;
@@ -1108,11 +1140,42 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
 	return ret;
 }
 
+static int kvm_arm_pmu_v3_set_pmu_fixed_counters_only(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct arm_pmu_entry *entry;
+	struct arm_pmu *arm_pmu;
+	struct cpumask *supported_cpus = kvm->arch.supported_cpus;
+
+	lockdep_assert_held(&kvm->arch.config_lock);
+
+	if (kvm_vm_has_ran_once(kvm) ||
+	    (kvm->arch.pmu_filter &&
+	     !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags)))
+		return -EBUSY;
+
+	set_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags);
+	kvm_arm_set_nr_counters(kvm, 0);
+	cpumask_clear(supported_cpus);
+
+	guard(rcu)();
+
+	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
+		arm_pmu = entry->arm_pmu;
+		cpumask_or(supported_cpus, supported_cpus, &arm_pmu->supported_cpus);
+	}
+
+	return 0;
+}
+
 static int kvm_arm_pmu_v3_set_nr_counters(struct kvm_vcpu *vcpu, unsigned int n)
 {
 	struct kvm *kvm = vcpu->kvm;
 
-	if (!kvm->arch.arm_pmu)
+	lockdep_assert_held(&kvm->arch.config_lock);
+
+	if (!kvm->arch.arm_pmu &&
+	    !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags))
 		return -EINVAL;
 
 	if (n > kvm_arm_pmu_get_max_counters(kvm))
@@ -1227,6 +1290,8 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return kvm_arm_pmu_v3_set_nr_counters(vcpu, n);
 	}
+	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
+		return kvm_arm_pmu_v3_set_pmu_fixed_counters_only(vcpu);
 	case KVM_ARM_VCPU_PMU_V3_INIT:
 		return kvm_arm_pmu_v3_init(vcpu);
 	}
@@ -1253,6 +1318,9 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 		irq = vcpu->arch.pmu.irq_num;
 		return put_user(irq, uaddr);
 	}
+	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
+		if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags))
+			return 0;
 	}
 
 	return -ENXIO;
@@ -1266,6 +1334,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	case KVM_ARM_VCPU_PMU_V3_FILTER:
 	case KVM_ARM_VCPU_PMU_V3_SET_PMU:
 	case KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS:
+	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
 		if (kvm_vcpu_has_pmu(vcpu))
 			return 0;
 	}
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 96754b51b411..1375cbaf97b2 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -56,6 +56,7 @@ void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val);
 void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val);
 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
 				    u64 select_idx);
+void kvm_vcpu_load_pmu(struct kvm_vcpu *vcpu);
 void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu);
 int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu,
 			    struct kvm_device_attr *attr);
@@ -161,6 +162,7 @@ static inline u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
 static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {}
 static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {}
 static inline void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) {}
+static inline void kvm_vcpu_load_pmu(struct kvm_vcpu *vcpu) {}
 static inline void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) {}
 static inline u8 kvm_arm_pmu_get_pmuver_limit(void)
 {

-- 
2.53.0



^ permalink raw reply related

* [PATCH v7 2/4] KVM: arm64: PMU: Protect the list of PMUs with RCU
From: Akihiko Odaki @ 2026-04-18  8:14 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Kees Cook,
	Gustavo A. R. Silva, Paolo Bonzini, Jonathan Corbet, Shuah Khan
  Cc: linux-arm-kernel, kvmarm, linux-kernel, linux-hardening, devel,
	kvm, linux-doc, linux-kselftest, Akihiko Odaki
In-Reply-To: <20260418-hybrid-v7-0-2bf39ad009bf@rsg.ci.i.u-tokyo.ac.jp>

Convert the list of PMUs to a RCU-protected list that has primitives to
avoid read-side contention.

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
 arch/arm64/kvm/pmu-emul.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 59ec96e09321..ef5140bbfe28 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -7,9 +7,9 @@
 #include <linux/cpu.h>
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
-#include <linux/list.h>
 #include <linux/perf_event.h>
 #include <linux/perf/arm_pmu.h>
+#include <linux/rculist.h>
 #include <linux/uaccess.h>
 #include <asm/kvm_emulate.h>
 #include <kvm/arm_pmu.h>
@@ -26,7 +26,6 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc);
 
 bool kvm_supports_guest_pmuv3(void)
 {
-	guard(mutex)(&arm_pmus_lock);
 	return !list_empty(&arm_pmus);
 }
 
@@ -808,7 +807,7 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 		return;
 
 	entry->arm_pmu = pmu;
-	list_add_tail(&entry->entry, &arm_pmus);
+	list_add_tail_rcu(&entry->entry, &arm_pmus);
 }
 
 static struct arm_pmu *kvm_pmu_probe_armpmu(void)
@@ -817,7 +816,7 @@ static struct arm_pmu *kvm_pmu_probe_armpmu(void)
 	struct arm_pmu *pmu;
 	int cpu;
 
-	guard(mutex)(&arm_pmus_lock);
+	guard(rcu)();
 
 	/*
 	 * It is safe to use a stale cpu to iterate the list of PMUs so long as
@@ -837,7 +836,7 @@ static struct arm_pmu *kvm_pmu_probe_armpmu(void)
 	 * carried here.
 	 */
 	cpu = raw_smp_processor_id();
-	list_for_each_entry(entry, &arm_pmus, entry) {
+	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
 		pmu = entry->arm_pmu;
 
 		if (cpumask_test_cpu(cpu, &pmu->supported_cpus))
@@ -1088,9 +1087,9 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
 	int ret = -ENXIO;
 
 	lockdep_assert_held(&kvm->arch.config_lock);
-	mutex_lock(&arm_pmus_lock);
+	guard(rcu)();
 
-	list_for_each_entry(entry, &arm_pmus, entry) {
+	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
 		arm_pmu = entry->arm_pmu;
 		if (arm_pmu->pmu.type == pmu_id) {
 			if (kvm_vm_has_ran_once(kvm) ||
@@ -1106,7 +1105,6 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
 		}
 	}
 
-	mutex_unlock(&arm_pmus_lock);
 	return ret;
 }
 

-- 
2.53.0



^ permalink raw reply related

* [PATCH] iommu/arm-smmu-v3: Stop queue allocation retry at PAGE_SIZE
From: leo.jiang1224 @ 2026-04-18  5:31 UTC (permalink / raw)
  To: will; +Cc: robin.murphy, joro, iommu, linux-arm-kernel, LoserJL

From: LoserJL <leo.jiang1224@foxmail.com>

In arm_smmu_init_one_queue(), the loop reduces max_n_shift if
dmam_alloc_coherent() fails. However, since dmam_alloc_coherent()
allocates at least PAGE_SIZE, retrying with a smaller size after
a PAGE_SIZE failure is logically redundant.

Moreover, if a sub-page retry were to succeed due to concurrent memory
release, the hardware would be configured with a smaller queue depth
despite a full page being allocated. This leads to inefficient memory
usage and unnecessary hardware performance limitation.

Terminate the loop once qsz reaches PAGE_SIZE to ensure logical
consistency and optimal hardware configuration.

Signed-off-by: LoserJL <leo.jiang1224@foxmail.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e8d7dbe495f0..e0ec118ff560 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4418,7 +4418,14 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
 		q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma,
 					      GFP_KERNEL);
-		if (q->base || qsz < PAGE_SIZE)
+		/*
+		 * If allocation succeeds, we're done. If it fails, only retry
+		 * if the requested size is still larger than a page. Since
+		 * dmam_alloc_coherent() allocates at least PAGE_SIZE, retrying
+		 * with a sub-page size is logically redundant and could lead
+		 * to sub-optimal hardware configuration.
+		 */
+		if (q->base || qsz <= PAGE_SIZE)
 			break;
 
 		q->llq.max_n_shift--;
-- 
2.43.0



^ permalink raw reply related

* Re: [PATCH] arm: dts: allwinner: t113s mangopi: enable watchdog for reboot
From: Jernej Škrabec @ 2026-04-17 18:19 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Chen-Yu Tsai,
	Samuel Holland, Michal Piekos
  Cc: devicetree, linux-arm-kernel, linux-sunxi, linux-kernel,
	Michal Piekos
In-Reply-To: <20260412-t113-mangopi-reboot-hang-v1-1-5002cfa6e0cc@mmpsystems.pl>

Hi,

Dne nedelja, 12. april 2026 ob 19:42:10 Srednjeevropski poletni čas je Michal Piekos napisal(a):
> Reboot hangs on MangoPi MQ-R T113s because no restart handler is
> available.
> 
> Enable the SoC watchdog whose driver registers a restart handler.
> 
> Tested on MangoPi MQ-R T113s.
> 
> Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
> ---
>  arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts b/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts
> index 8b3a75383816..f0232a5e903b 100644
> --- a/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts
> +++ b/arch/arm/boot/dts/allwinner/sun8i-t113s-mangopi-mq-r-t113.dts
> @@ -33,3 +33,7 @@ rtl8189ftv: wifi@1 {
>  		interrupt-names = "host-wake";
>  	};
>  };
> +
> +&wdt {
> +	status = "okay";
> +};

Move this to sun8i-t113s.dtsi. All t113 boards have the same issue.
Watchdog should be always enabled on ARM.

Best regards,
Jernej





^ permalink raw reply

* Re: [PATCH v3 2/2] mailbox: Make mbox_send_message() return error code when tx fails
From: Joonwon Kang @ 2026-04-18  3:38 UTC (permalink / raw)
  To: jassisinghbrar
  Cc: akpm, angelogioacchino.delregno, jonathanh, joonwonkang,
	linux-arm-kernel, linux-kernel, linux-mediatek, linux-tegra,
	matthias.bgg, stable, thierry.reding
In-Reply-To: <CABb+yY2yBZ+hgr-=Uh_sRk-TJZRfsk2AYtoS5rPtUN8kVsUScA@mail.gmail.com>

> On Fri, Apr 17, 2026 at 3:43 AM Joonwon Kang <joonwonkang@google.com> wrote:
> >
> > > On Fri, Apr 3, 2026 at 10:19 AM Joonwon Kang <joonwonkang@google.com> wrote:
> > > >
> > > > > On Thu, Apr 2, 2026 at 12:07 PM Joonwon Kang <joonwonkang@google.com> wrote:
> > > > > >
> > > > > > When the mailbox controller failed transmitting message, the error code
> > > > > > was only passed to the client's tx done handler and not to
> > > > > > mbox_send_message(). For this reason, the function could return a false
> > > > > > success. This commit resolves the issue by introducing the tx status and
> > > > > > checking it before mbox_send_message() returns.
> > > > > >
> > > > > Can you please share the scenario when this becomes necessary? This
> > > > > can potentially change the ground underneath some clients, so we have
> > > > > to be sure this is really useful.
> > > >
> > > > I would say the problem here is generic enough to apply to all the cases where
> > > > the send result needs to be checked. Since the return value of the send API is
> > > > not the real send result, any users who believe that this blocking send API
> > > > will return the real send result could fall for that. For example, users may
> > > > think the send was successful even though it was not actually. I believe it is
> > > > uncommon that users have to register a callback solely to get the send result
> > > > even though they are using the blocking send API already. Also, I guess there
> > > > is no special reason why only the mailbox send API should work this way among
> > > > other typical blocking send APIs. For these reasons, this patch makes the send
> > > > API return the real send result. This way, users will not need to register the
> > > > redundant callback and I think the return value will align with their common
> > > > expectation.
> > > >
> > > Clients submit a message into the Mailbox subsystem to be sent out to
> > > the remote side which can happen immediately or later.
> > > If submission fails, clients get immediately notified. If transmission
> > > fails (which is now internal to the subsystem) it is reported to the
> > > client by a callback.
> > > If the API was called mbox_submit_message (which it actually is)
> > > instead of mbox_send_message, there would be no confusion.
> > > We can argue how good/bad the current implementation is, but the fact
> > > is that it is here. And I am reluctant to cause churn without good
> > > reason.
> > > Again, as I said, any, _legal_, setup scenario will help me come over
> > > my reluctance.
> > >
> > > Thanks
> > > Jassi
> >
> > Hi Jassi, can we continue discussing this issue from where we left off last
> > time?
> >
> Long passionate essays are difficult to read, so I haven't yet. A
> simple description of some setup that you think is not supported, will
> keep the discussion focused.
> If your platform is supported but you think the api is not clear,
> updates to the documentation are welcome

Sorry that it was hard for you to read. The long form was to explain what is
misaligned and problematic with data and examples for better understanding
because your previous long essays did not make much sense to me. Please go
through it and let me know if anything is unclear to you. In the mean time, I
will prepare a new version of patch with some update to the API doc.

Thanks,
Joonwon Kang


^ permalink raw reply

* Re: [PATCH v3 2/2] mailbox: Make mbox_send_message() return error code when tx fails
From: Jassi Brar @ 2026-04-18  2:50 UTC (permalink / raw)
  To: Joonwon Kang
  Cc: akpm, angelogioacchino.delregno, jonathanh, linux-arm-kernel,
	linux-kernel, linux-mediatek, linux-tegra, matthias.bgg, stable,
	thierry.reding
In-Reply-To: <20260417084335.2092188-1-joonwonkang@google.com>

On Fri, Apr 17, 2026 at 3:43 AM Joonwon Kang <joonwonkang@google.com> wrote:
>
> > On Fri, Apr 3, 2026 at 10:19 AM Joonwon Kang <joonwonkang@google.com> wrote:
> > >
> > > > On Thu, Apr 2, 2026 at 12:07 PM Joonwon Kang <joonwonkang@google.com> wrote:
> > > > >
> > > > > When the mailbox controller failed transmitting message, the error code
> > > > > was only passed to the client's tx done handler and not to
> > > > > mbox_send_message(). For this reason, the function could return a false
> > > > > success. This commit resolves the issue by introducing the tx status and
> > > > > checking it before mbox_send_message() returns.
> > > > >
> > > > Can you please share the scenario when this becomes necessary? This
> > > > can potentially change the ground underneath some clients, so we have
> > > > to be sure this is really useful.
> > >
> > > I would say the problem here is generic enough to apply to all the cases where
> > > the send result needs to be checked. Since the return value of the send API is
> > > not the real send result, any users who believe that this blocking send API
> > > will return the real send result could fall for that. For example, users may
> > > think the send was successful even though it was not actually. I believe it is
> > > uncommon that users have to register a callback solely to get the send result
> > > even though they are using the blocking send API already. Also, I guess there
> > > is no special reason why only the mailbox send API should work this way among
> > > other typical blocking send APIs. For these reasons, this patch makes the send
> > > API return the real send result. This way, users will not need to register the
> > > redundant callback and I think the return value will align with their common
> > > expectation.
> > >
> > Clients submit a message into the Mailbox subsystem to be sent out to
> > the remote side which can happen immediately or later.
> > If submission fails, clients get immediately notified. If transmission
> > fails (which is now internal to the subsystem) it is reported to the
> > client by a callback.
> > If the API was called mbox_submit_message (which it actually is)
> > instead of mbox_send_message, there would be no confusion.
> > We can argue how good/bad the current implementation is, but the fact
> > is that it is here. And I am reluctant to cause churn without good
> > reason.
> > Again, as I said, any, _legal_, setup scenario will help me come over
> > my reluctance.
> >
> > Thanks
> > Jassi
>
> Hi Jassi, can we continue discussing this issue from where we left off last
> time?
>
Long passionate essays are difficult to read, so I haven't yet. A
simple description of some setup that you think is not supported, will
keep the discussion focused.
If your platform is supported but you think the api is not clear,
updates to the documentation are welcome

Thanks,
Jassi


^ permalink raw reply

* Re: [PATCH 1/2] arm64: dts: imx8mq: Correct MIPI CSI clocks
From: Sebastian Krzyszkowiak @ 2026-04-18  1:12 UTC (permalink / raw)
  To: robh, krzk+dt, conor+dt, Frank.Li, s.hauer, festevam, shawnguo,
	martin.kepplinger, Robby Cai
  Cc: kernel, devicetree, imx, linux-arm-kernel, linux-kernel
In-Reply-To: <20260417110200.753678-2-robby.cai@nxp.com>

On piątek, 17 kwietnia 2026 13:01:59 czas środkowoeuropejski letni Robby Cai 
wrote:
> CSI capture may intermittently fail due to mismatched clock rates. The
> previous configuration violated the timing requirement stated in the
> i.MX8MQ Reference Manual:
> 
>   "The frequency of clk must be exactly equal to or greater than the RX
>    byte clock coming from the RX DPHY."
> 
> Update the clock configuration to ensure that the CSI core clock rate is
> equal to or greater than the incoming DPHY byte clock. The updated clock
> ratios are consistent with those used in NXP's downstream BSP.

I believe this is a misreading of the docs.

IMX8MQ_CLK_CSIX_PHY_REF refers to the UI pixel clock (clk_ui), not the RX DPHY 
byte clock. All this change would do is to break streaming with more than 100 
Mpixels per second / 1064 Mbps per MIPI lane.

As mentioned in the reference manual:

"The frequency of clk_ui must be such that the data received on the data_out 
output is greater than or equal to the total bandwidth of the physical MIPI 
interface. Clk_ui has no relationship requirement with regards to ‘clk’ other 
than the bandwidth requirement mentioned previously."

> Fixes: bcadd5f66c2a ("arm64: dts: imx8mq: add mipi csi phy and csi bridge
> descriptions") Cc: stable@vger.kernel.org
> Signed-off-by: Robby Cai <robby.cai@nxp.com>
> ---
>  arch/arm64/boot/dts/freescale/imx8mq.dtsi | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/freescale/imx8mq.dtsi
> b/arch/arm64/boot/dts/freescale/imx8mq.dtsi index
> 6a25e219832c..165716d08e64 100644
> --- a/arch/arm64/boot/dts/freescale/imx8mq.dtsi
> +++ b/arch/arm64/boot/dts/freescale/imx8mq.dtsi
> @@ -1377,7 +1377,7 @@ mipi_csi1: csi@30a70000 {
>  				assigned-clocks = <&clk 
IMX8MQ_CLK_CSI1_CORE>,
>  				    <&clk 
IMX8MQ_CLK_CSI1_PHY_REF>,
>  				    <&clk IMX8MQ_CLK_CSI1_ESC>;
> -				assigned-clock-rates = 
<266000000>, <333000000>, <66000000>;
> +				assigned-clock-rates = 
<133000000>, <100000000>, <66000000>;
>  				assigned-clock-parents = <&clk 
IMX8MQ_SYS1_PLL_266M>,
>  					<&clk 
IMX8MQ_SYS2_PLL_1000M>,
>  					<&clk 
IMX8MQ_SYS1_PLL_800M>;
> @@ -1429,7 +1429,7 @@ mipi_csi2: csi@30b60000 {
>  				assigned-clocks = <&clk 
IMX8MQ_CLK_CSI2_CORE>,
>  				    <&clk 
IMX8MQ_CLK_CSI2_PHY_REF>,
>  				    <&clk IMX8MQ_CLK_CSI2_ESC>;
> -				assigned-clock-rates = 
<266000000>, <333000000>, <66000000>;
> +				assigned-clock-rates = 
<133000000>, <100000000>, <66000000>;
>  				assigned-clock-parents = <&clk 
IMX8MQ_SYS1_PLL_266M>,
>  					<&clk 
IMX8MQ_SYS2_PLL_1000M>,
>  					<&clk 
IMX8MQ_SYS1_PLL_800M>;






^ permalink raw reply

* Re: [PATCH v2 0/4] usb: dwc3: xilinx: Add Versal2 MMI USB 3.2 controller support
From: Thinh Nguyen @ 2026-04-18  0:33 UTC (permalink / raw)
  To: Pandey, Radhey Shyam
  Cc: Thinh Nguyen, Radhey Shyam Pandey, gregkh@linuxfoundation.org,
	robh@kernel.org, krzk+dt@kernel.org, conor+dt@kernel.org,
	michal.simek@amd.com, p.zabel@pengutronix.de,
	linux-usb@vger.kernel.org, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, git@amd.com
In-Reply-To: <faf2421e-0d12-424d-abf8-ad490f5421ff@amd.com>

On Mon, Apr 13, 2026, Pandey, Radhey Shyam wrote:
> > On Tue, Mar 31, 2026, Radhey Shyam Pandey wrote:
> > > This series introduces support for the Multi-Media Integrated (MMI) USB
> > > 3.2 Dual-Role Device (DRD) controller on Xilinx Versal2 platforms.
> > > 
> > > The controller supports SSP(10-Gbps), SuperSpeed, high-speed, full-speed
> > > and low-speed operation modes.
> > > 
> > > USB2 and USB3 PHY support Physical connectivity via the Type-C
> > > connectivity. DWC3 wrapper IP IO space is in SLCR so reg is made
> > > optional.
> > > 
> > > The driver is required for the clock, reset and platform specific
> > > initialization (coherency/TX_DEEMPH etc). In this initial version typec
> > > reversibility is not implemented and it is assumed that USB3 PHY TCA mux
> > > programming is done by MMI configuration data object (CDOs) and TI PD
> > > controller is configured using external tiva programmer on VEK385
> > > evaluation board.
> > > 
> > > Changes for v2:
> > > - DT binding: fix MHz spacing (SI convention), reorder description
> > >    before $ref in xlnx,usb-syscon, restore zynqmp-dwc3 example and add
> > >    versal2-mmi-dwc3 example, fix node name for no-reg case, use 1/1
> > >    address/size configuration and lowercase hex in syscon offsets.
> > > - Split config struct refactoring (device_get_match_data,dwc3_xlnx_config)
> > >    into a separate preparatory patch.
> > > - Fix error message capitalization to lowercase per kernel convention.
> > > - Rename property snps,lcsr_tx_deemph to snps,lcsr-tx-deemph (hyphens).
> > > - Fix double space in comment and missing blank line in core.h.
> > > - Use platform data instead of of_device_is_compatible() check for
> > >    deemphasis support.
> > > 
> > > Link: https://urldefense.com/v3/__https://lore.kernel.org/all/20251119193036.2666877-1-radhey.shyam.pandey@amd.com/__;!!A4F2R9G_pg!YSeyY-bpQrMLqswAc1cWND5CSHvGFygPGMEMpR9amrRMnRFjYrFZktzbLzEzVZcQmOW34IUAfwRKHwy7B8p_ciUorWGJsA$
> > > 
> > > Radhey Shyam Pandey (4):
> > >    dt-bindings: usb: dwc3-xilinx: Add MMI USB support on Versal Gen2
> > >      platform
> > >    usb: dwc3: xilinx: Introduce dwc3_xlnx_config for per-platform data
> > >    usb: dwc3: xilinx: Add Versal2 MMI USB 3.2 controller support
> > >    usb: dwc3: xilinx: Add support to program MMI USB TX deemphasis
> > > 
> > >   .../devicetree/bindings/usb/dwc3-xilinx.yaml  | 70 ++++++++++++++-
> > >   drivers/usb/dwc3/core.c                       | 17 ++++
> > >   drivers/usb/dwc3/core.h                       |  8 ++
> > >   drivers/usb/dwc3/dwc3-xilinx.c                | 89 +++++++++++++++----
> > >   4 files changed, 166 insertions(+), 18 deletions(-)
> > > 
> > > 
> > > base-commit: 46b513250491a7bfc97d98791dbe6a10bcc8129d
> > > -- 
> > > 2.43.0
> > > 
> > Hi Radhey,
> > 
> > Do you have plans to convert dwc3-xilinx to using the new flatten model?
> > The change you have here fits better for the new glue model.
> Thanks Thinh for the review.
> 
> I have looked into the newly introduced flattened model introduced by
> commit 613a2e655d4d ("usb: dwc3: core: Expose core driver as library").
> Moving to that approach would require switching to the new DT binding
> and doing a large refactor.
> 
> Given this series is already implemented and under review,
> I suggest we get it merged first, then evaluate the flattened models
> benefits and limitations and plan a follow‑up migration if it still
> makes sense. If there are no objections, I'll send out v3.
> 

Sorry for the delay. I've provided some feedbacks to this series.

Thanks,
Thinh

^ permalink raw reply

* Re: [PATCH v2 2/4] usb: dwc3: xilinx: Introduce dwc3_xlnx_config for per-platform data
From: Thinh Nguyen @ 2026-04-18  0:32 UTC (permalink / raw)
  To: Radhey Shyam Pandey
  Cc: gregkh@linuxfoundation.org, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, michal.simek@amd.com, Thinh Nguyen,
	p.zabel@pengutronix.de, linux-usb@vger.kernel.org,
	devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, git@amd.com
In-Reply-To: <20260330190304.1841593-3-radhey.shyam.pandey@amd.com>

On Tue, Mar 31, 2026, Radhey Shyam Pandey wrote:
> Replace the direct pltfm_init function pointer in struct dwc3_xlnx with
> a const pointer to a new struct dwc3_xlnx_config. This groups
> per-platform configuration in one place and allows future patches to add
> platform-specific fields (e.g. tx_deemph) without growing dwc3_xlnx.
> 
> While at it, switch from of_match_node() to device_get_match_data() to
> simplify the match data lookup.
> 
> Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
> ---
> Changes for v2:
> - New patch, split from "Add Versal2 MMI USB 3.2 controller support".
> - Use device_get_match_data() instead of of_match_node().
> ---
>  drivers/usb/dwc3/dwc3-xilinx.c | 28 ++++++++++++++++++++--------
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/usb/dwc3/dwc3-xilinx.c b/drivers/usb/dwc3/dwc3-xilinx.c
> index f41b0da5e89d..bb59b56726e7 100644
> --- a/drivers/usb/dwc3/dwc3-xilinx.c
> +++ b/drivers/usb/dwc3/dwc3-xilinx.c
> @@ -12,6 +12,7 @@
>  #include <linux/clk.h>
>  #include <linux/of.h>
>  #include <linux/platform_device.h>
> +#include <linux/property.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/gpio/consumer.h>
>  #include <linux/of_platform.h>
> @@ -41,12 +42,18 @@
>  #define XLNX_USB_FPD_POWER_PRSNT		0x80
>  #define FPD_POWER_PRSNT_OPTION			BIT(0)
>  
> +struct dwc3_xlnx;
> +
> +struct dwc3_xlnx_config {
> +	int				(*pltfm_init)(struct dwc3_xlnx *data);
> +};
> +
>  struct dwc3_xlnx {
>  	int				num_clocks;
>  	struct clk_bulk_data		*clks;
>  	struct device			*dev;
>  	void __iomem			*regs;
> -	int				(*pltfm_init)(struct dwc3_xlnx *data);
> +	const struct dwc3_xlnx_config	*dwc3_config;
>  	struct phy			*usb3_phy;
>  };
>  
> @@ -241,14 +248,22 @@ static int dwc3_xlnx_init_zynqmp(struct dwc3_xlnx *priv_data)
>  	return ret;
>  }
>  
> +static const struct dwc3_xlnx_config zynqmp_config = {
> +	.pltfm_init = dwc3_xlnx_init_zynqmp,
> +};
> +
> +static const struct dwc3_xlnx_config versal_config = {
> +	.pltfm_init = dwc3_xlnx_init_versal,
> +};
> +
>  static const struct of_device_id dwc3_xlnx_of_match[] = {
>  	{
>  		.compatible = "xlnx,zynqmp-dwc3",
> -		.data = &dwc3_xlnx_init_zynqmp,
> +		.data = &zynqmp_config,
>  	},
>  	{
>  		.compatible = "xlnx,versal-dwc3",
> -		.data = &dwc3_xlnx_init_versal,
> +		.data = &versal_config,
>  	},
>  	{ /* Sentinel */ }
>  };
> @@ -284,7 +299,6 @@ static int dwc3_xlnx_probe(struct platform_device *pdev)
>  	struct dwc3_xlnx		*priv_data;
>  	struct device			*dev = &pdev->dev;
>  	struct device_node		*np = dev->of_node;
> -	const struct of_device_id	*match;
>  	void __iomem			*regs;
>  	int				ret;
>  
> @@ -296,9 +310,7 @@ static int dwc3_xlnx_probe(struct platform_device *pdev)
>  	if (IS_ERR(regs))
>  		return dev_err_probe(dev, PTR_ERR(regs), "failed to map registers\n");
>  
> -	match = of_match_node(dwc3_xlnx_of_match, pdev->dev.of_node);
> -
> -	priv_data->pltfm_init = match->data;
> +	priv_data->dwc3_config = device_get_match_data(dev);
>  	priv_data->regs = regs;
>  	priv_data->dev = dev;
>  
> @@ -314,7 +326,7 @@ static int dwc3_xlnx_probe(struct platform_device *pdev)
>  	if (ret)
>  		return ret;
>  
> -	ret = priv_data->pltfm_init(priv_data);
> +	ret = priv_data->dwc3_config->pltfm_init(priv_data);

Though this won't hit now, but we should check if dwc3_config exists
before accessing it.

BR,
Thinh

>  	if (ret)
>  		goto err_clk_put;
>  
> -- 
> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH v2 4/4] usb: dwc3: xilinx: Add support to program MMI USB TX deemphasis
From: Thinh Nguyen @ 2026-04-18  0:28 UTC (permalink / raw)
  To: Radhey Shyam Pandey
  Cc: gregkh@linuxfoundation.org, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, michal.simek@amd.com, Thinh Nguyen,
	p.zabel@pengutronix.de, linux-usb@vger.kernel.org,
	devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, git@amd.com
In-Reply-To: <20260330190304.1841593-5-radhey.shyam.pandey@amd.com>

On Tue, Mar 31, 2026, Radhey Shyam Pandey wrote:
> Introduces support for programming the 18-bit TX Deemphasis value that
> drives the pipe_TxDeemph signal, as defined in the PIPE4 specification.
> 
> The configured value is recommended by Synopsys and is intended for
> standard (non-compliance) operation. These Gen2 equalization settings
> have been validated through both internal and external compliance
> testing. By applying this setting, the stability of USB 3.2 enumeration
> is improved and now SuperSpeedPlus devices are consistently recognized as
> USB 3.2 Gen 2 by the MMI USB Host controller.
> 
> Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
> ---
> Changes for v2:
> - Don't use compatible check for deemphasis programming.
> - Rename property "snps,lcsr_tx_deemph" to "snps,lcsr-tx-deemph"
>   (hyphens per kernel convention).
> - Fix double space in LCSR_TX_DEEMPH register comment.
> - Add blank line between register offset define and "Bit fields" section.
> ---
>  drivers/usb/dwc3/core.c        | 17 +++++++++++++++++
>  drivers/usb/dwc3/core.h        |  8 ++++++++
>  drivers/usb/dwc3/dwc3-xilinx.c | 15 ++++++++++++---
>  3 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index 161a4d58b2ce..e678a53a90b3 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -646,6 +646,15 @@ static void dwc3_config_soc_bus(struct dwc3 *dwc)
>  		reg |= DWC3_GSBUSCFG0_REQINFO(dwc->gsbuscfg0_reqinfo);
>  		dwc3_writel(dwc, DWC3_GSBUSCFG0, reg);
>  	}
> +
> +	if (dwc->csr_tx_deemph_field_1 != DWC3_LCSR_TX_DEEMPH_UNSPECIFIED) {
> +		u32 reg;
> +
> +		reg = dwc3_readl(dwc, DWC3_LCSR_TX_DEEMPH);
> +		reg &= ~DWC3_LCSR_TX_DEEMPH_MASK(~0);
> +		reg |= DWC3_LCSR_TX_DEEMPH_MASK(dwc->csr_tx_deemph_field_1);
> +		dwc3_writel(dwc, DWC3_LCSR_TX_DEEMPH, reg);
> +	}
>  }
>  
>  static int dwc3_core_ulpi_init(struct dwc3 *dwc)
> @@ -1671,11 +1680,13 @@ static void dwc3_core_exit_mode(struct dwc3 *dwc)
>  static void dwc3_get_software_properties(struct dwc3 *dwc,
>  					 const struct dwc3_properties *properties)
>  {
> +	u32 csr_tx_deemph_field_1;
>  	struct device *tmpdev;
>  	u16 gsbuscfg0_reqinfo;
>  	int ret;
>  
>  	dwc->gsbuscfg0_reqinfo = DWC3_GSBUSCFG0_REQINFO_UNSPECIFIED;
> +	dwc->csr_tx_deemph_field_1 = DWC3_LCSR_TX_DEEMPH_UNSPECIFIED;
>  
>  	if (properties->gsbuscfg0_reqinfo !=
>  	    DWC3_GSBUSCFG0_REQINFO_UNSPECIFIED) {
> @@ -1693,6 +1704,12 @@ static void dwc3_get_software_properties(struct dwc3 *dwc,
>  					       &gsbuscfg0_reqinfo);
>  		if (!ret)
>  			dwc->gsbuscfg0_reqinfo = gsbuscfg0_reqinfo;
> +
> +		ret = device_property_read_u32(tmpdev,
> +					       "snps,lcsr-tx-deemph",
> +					       &csr_tx_deemph_field_1);
> +		if (!ret)
> +			dwc->csr_tx_deemph_field_1 = csr_tx_deemph_field_1;
>  	}
>  }
>  
> diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
> index a35b3db1f9f3..99874ad09730 100644
> --- a/drivers/usb/dwc3/core.h
> +++ b/drivers/usb/dwc3/core.h
> @@ -181,6 +181,8 @@
>  
>  #define DWC3_LLUCTL(n)		(0xd024 + ((n) * 0x80))
>  
> +#define DWC3_LCSR_TX_DEEMPH	0xd060
> +

This should be DWC3_LCSR_TX_DEEMPH(n) where n is the USB3 port number

>  /* Bit fields */
>  
>  /* Global SoC Bus Configuration INCRx Register 0 */
> @@ -198,6 +200,10 @@
>  #define DWC3_GSBUSCFG0_REQINFO(n)	(((n) & 0xffff) << 16)
>  #define DWC3_GSBUSCFG0_REQINFO_UNSPECIFIED	0xffffffff
>  
> +/* LCSR_TX_DEEMPH Register: setting TX deemphasis used in normal operation in gen2 */
> +#define DWC3_LCSR_TX_DEEMPH_MASK(n)		((n) & 0x3ffff)
> +#define DWC3_LCSR_TX_DEEMPH_UNSPECIFIED		0xffffffff
> +
>  /* Global Debug LSP MUX Select */
>  #define DWC3_GDBGLSPMUX_ENDBC		BIT(15)	/* Host only */
>  #define DWC3_GDBGLSPMUX_HOSTSELECT(n)	((n) & 0x3fff)
> @@ -1180,6 +1186,7 @@ struct dwc3_glue_ops {
>   * @wakeup_pending_funcs: Indicates whether any interface has requested for
>   *			 function wakeup in bitmap format where bit position
>   *			 represents interface_id.
> + * @csr_tx_deemph_field_1: stores TX deemphasis used in Gen2 operation.

How do you plan to apply this for the case of multiple USB3 ports. Only
to the first USB3 port0 or all of them? Document how you want to handle
this.

>   */
>  struct dwc3 {
>  	struct work_struct	drd_work;
> @@ -1417,6 +1424,7 @@ struct dwc3 {
>  	struct dentry		*debug_root;
>  	u32			gsbuscfg0_reqinfo;
>  	u32			wakeup_pending_funcs;
> +	u32			csr_tx_deemph_field_1;
>  };
>  
>  #define INCRX_BURST_MODE 0
> diff --git a/drivers/usb/dwc3/dwc3-xilinx.c b/drivers/usb/dwc3/dwc3-xilinx.c
> index f2dee28bdc65..44008856ee73 100644
> --- a/drivers/usb/dwc3/dwc3-xilinx.c
> +++ b/drivers/usb/dwc3/dwc3-xilinx.c
> @@ -41,11 +41,13 @@
>  #define PIPE_CLK_SELECT				0
>  #define XLNX_USB_FPD_POWER_PRSNT		0x80
>  #define FPD_POWER_PRSNT_OPTION			BIT(0)
> +#define XLNX_MMI_USB_TX_DEEMPH_DEF		0x8c45
>  
>  struct dwc3_xlnx;
>  
>  struct dwc3_xlnx_config {
>  	int				(*pltfm_init)(struct dwc3_xlnx *data);
> +	u32				tx_deemph;
>  	bool				map_resource;
>  };
>  
> @@ -284,6 +286,7 @@ static const struct dwc3_xlnx_config versal_config = {
>  
>  static const struct dwc3_xlnx_config versal2_config = {
>  	.pltfm_init = dwc3_xlnx_init_versal2,
> +	.tx_deemph = XLNX_MMI_USB_TX_DEEMPH_DEF,
>  };
>  
>  static const struct of_device_id dwc3_xlnx_of_match[] = {
> @@ -303,10 +306,12 @@ static const struct of_device_id dwc3_xlnx_of_match[] = {
>  };
>  MODULE_DEVICE_TABLE(of, dwc3_xlnx_of_match);
>  
> -static int dwc3_set_swnode(struct device *dev)
> +static int dwc3_set_swnode(struct dwc3_xlnx *priv_data)
>  {
> +	struct device *dev = priv_data->dev;
> +	const struct dwc3_xlnx_config *config = priv_data->dwc3_config;
>  	struct device_node *np = dev->of_node, *dwc3_np;
> -	struct property_entry props[2];
> +	struct property_entry props[3];
>  	int prop_idx = 0, ret = 0;
>  
>  	dwc3_np = of_get_compatible_child(np, "snps,dwc3");
> @@ -320,6 +325,10 @@ static int dwc3_set_swnode(struct device *dev)
>  	if (of_dma_is_coherent(dwc3_np))
>  		props[prop_idx++] = PROPERTY_ENTRY_U16("snps,gsbuscfg0-reqinfo",
>  						       0xffff);
> +	if (config->tx_deemph)

We should set the tx_deemph to the DWC3_LCSR_TX_DEEMPH_UNSPECIFIED by
default and check against that instead.

> +		props[prop_idx++] = PROPERTY_ENTRY_U32("snps,lcsr-tx-deemph",
> +						       config->tx_deemph);
> +
>  	of_node_put(dwc3_np);
>  
>  	if (prop_idx)
> @@ -368,7 +377,7 @@ static int dwc3_xlnx_probe(struct platform_device *pdev)
>  	if (ret)
>  		goto err_clk_put;
>  
> -	ret = dwc3_set_swnode(dev);
> +	ret = dwc3_set_swnode(priv_data);
>  	if (ret)
>  		goto err_clk_put;
>  
> -- 
> 2.43.0
> 

BR,
Thinh

^ permalink raw reply

* [PATCH v2 19/19] Add standalone crypto kernel module technical documentation
From: Jay Wang @ 2026-04-18  0:20 UTC (permalink / raw)
  To: Herbert Xu, David S . Miller, linux-crypto, Masahiro Yamada,
	linux-kbuild
  Cc: Jay Wang, Vegard Nossum, Nicolai Stange, Ilia Okomin,
	Hazem Mohamed Abuelfotoh, Bjoern Doebel, Martin Pohlack,
	Benjamin Herrenschmidt, Nathan Chancellor, Nicolas Schier,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Luis Chamberlain,
	Petr Pavlu, Daniel Gomez, Sami Tolvanen, David Howells,
	David Woodhouse, Jarkko Sakkinen, Ignat Korchagin, Lukas Wunner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	linux-arm-kernel, x86, linux-modules
In-Reply-To: <20260418002032.2877-1-wanjay@amazon.com>

Technical guide covering implementation details and usage of the
standalone crypto kernel module feature.

Signed-off-by: Jay Wang <wanjay@amazon.com>
---
 crypto/fips140/README | 404 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 404 insertions(+)
 create mode 100644 crypto/fips140/README

diff --git a/crypto/fips140/README b/crypto/fips140/README
new file mode 100644
index 0000000000000..52488d5726d25
--- /dev/null
+++ b/crypto/fips140/README
@@ -0,0 +1,404 @@
+## 1. Introduction
+
+Amazon Linux is releasing a new kernel feature that converts the previously built-in kernel crypto subsystem into a standalone kernel module. This module becomes the carrier of the kernel crypto subsystem and can be loaded at early boot to provide the same functionality as the original built-in crypto. The primary motivation for this modularization is to streamline Federal Information Processing Standards (FIPS) validation, a critical cryptographic certification for cloud computing users doing business with the U.S. government.
+ 
+In a bit more detail, previously, FIPS certification was tied to the entire kernel image, meaning non-crypto updates could potentially invalidate certification. With this feature, FIPS certification is tied only to the crypto module. Therefore, once the module is certified, loading this certified module on newer kernels automatically makes those kernels FIPS-certified. As a result, this approach can save re-certification costs and 12-18 months of waiting time by reducing the need for repeated FIPS re-certification cycles.
+
+This document provides technical details on how this feature is designed and implemented for users or developers who are interested in developing upon it, and is organized as follows:
+- Section 2 - Getting Started: Quick start on how to enable the feature
+- Section 3 - Workflow Overview: Changes this feature brings to build and runtime
+- Section 4 - Design Implementation Details: Technical deep-dive into each component
+- Section 5 - Customizing and Extending Crypto Module: How to select crypto to be included and extend to new crypto/architectures
+- Section 6 - Reusing a Certified Module in Practice: Different reuse and maintenance strategies and their tradeoffs
+
+## 2. Getting Started
+
+This section provides a quick start guide for developers on how to enable, compile and use the standalone cryptography module feature.
+
+### 2.1 Basic Configuration
+
+The feature is controlled by a single configuration option:
+```
+CONFIG_CRYPTO_FIPS140_EXTMOD=y
+```
+What it does: When enabled, automatically redirects a set of cryptographic algorithms from the main kernel into a standalone module `crypto/fips140/fips140.ko`. The cryptographic algorithms that are redirected need to satisfy all the following conditions, otherwise the cryptography will remain in its original form:
+1. Must be configured as built-in (i.e., `CONFIG_CRYPTO_*=y`). This means cryptography already configured as modular (i.e., `CONFIG_CRYPTO_*=m`) are not redirected as they are already modularized.
+2. Must be among a list, which can be customized by developers as described in Section 5.
+
+When disabled, the kernel behaves as before.
+
+### 2.2 Build Process
+
+Once `CONFIG_CRYPTO_FIPS140_EXTMOD=y` is set, no additional steps are required. The standalone module will be built automatically as part of the standard kernel build process:
+```
+make -j$(nproc)
+# or
+make vmlinux
+```
+**What happens automatically (No user action required):**
+1. Build the module as `crypto/fips140/fips140.ko`
+2. The cryptography module will be loaded at boot time
+3. All kernel cryptographic services will provide the same functionality as before (i.e., prior to introducing this new feature) once boot completes.
+
+### 2.3 Advanced Configuration Options
+
+**Using External Cryptography Module:**
+```
+CONFIG_CRYPTO_FIPS140_EXTMOD_SOURCE=y
+```
+By default, `CONFIG_CRYPTO_FIPS140_EXTMOD_SOURCE` is not set, meaning the freshly built cryptography module is used. Otherwise, the pre-built standalone cryptography module from `fips140_build/crypto/fips140/fips140.ko` and modular cryptography such as `fips140_build/crypto/aes.ko` (need to manually place pre-built modules in these locations before the build) are included in kernel packaging (e.g., during `make modules_install`) and are used at later boot time.
+
+**Dual Version Support:**
+```
+CONFIG_CRYPTO_FIPS140_DUAL_VERSION=y
+```
+Encapsulate two versions of `fips140.ko` into kernel: one is freshly built for non-FIPS mode usage, another is pre-built specified by `fips140_build/crypto/fips140/fips140.ko` for FIPS mode usage. The appropriate version is selected and loaded at boot time based on boot time FIPS mode status.
+
+### 2.4 Verification
+
+To verify the feature is working, after install and boot with the new kernel:
+```
+# Check if fips140.ko module is loaded
+lsmod | grep fips140
+# Check if crypto algorithms are served by the fips140 module
+cat /proc/crypto | grep module | grep fips140
+```
+
+## 3. Workflow Overview
+
+This section provides an overview without delving into deep technical details of the changes the standalone cryptography module feature introduces. When this feature is enabled, it introduces changes to both the kernel build and booting process. 
+
+3.1 Build-Time Changes
+
+Kernel cryptography subsystem consists of both cryptography management infrastructure (e.g., `crypto/api.c`, `crypto/algapi.c`, etc), along with hundreds of different cryptography algorithms (e.g., `crypto/arc4.c`).
+
+**Traditional Build Process:**
+Traditionally, cryptography management infrastructure are always built-in to the kernel, while cryptographic algorithms can be configured to be built either as built-in (`CONFIG_CRYPTO_*=y`) or as separate modular (`CONFIG_CRYPTO_*=m`) `.ko` file depending on kernel configuration:
+As a result, the builtin cryptography management infrastructure and cryptographic algorithms are statically linked into the kernel binary:
+```
+cryptographic algorithms source files → compiled as .o objfiles →  linked into vmlinux → single kernel binary
+```
+**With Standalone Cryptography Module:**
+This feature automatically transforms the builtin cryptographic components into a standalone cryptography module, `fips140.ko`. To do so, it develops a new kernel build rule `crypto-objs-$(CONFIG_CRYPTO_*)` such that, once this build rule is applied to a cryptographic algorithm, such cryptographic algorithm will be automatically collected into the cryptography module if it is configured as built-in (i.e, `CONFIG_CRYPTO_*=y`), for example:
+```
+// in crypto/asymmetric_keys/Makefile
+- obj-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o
++ crypto-objs-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o
+```
+Such build change allows the modularization transformation to only affect selected cryptographic algorithms (i.e, where the `crypto-objs-$(CONFIG_CRYPTO_*`) is applied).
+
+Then, after the `fips140.ko` is generated, it will be embedded back into main kernel vmlinux as a replacement part. The purpose of this embedding, instead of traditionally putting the `fips140.ko` into filesystem, is a preparation to allow the module to be loaded early enough even before the filesystem is ready.
+
+The new build process is illustrated below.
+```
+cryptographic algorithms source files → compiled as .o objfiles → automatically collected and linked into fips140.ko → embedded fips140.ko into vmlinux as a replaceable binary
+```
+
+### 3.2 Runtime Changes
+
+**Traditional Boot Process:**
+The kernel initializes the cryptographic subsystem early during boot, executing each cryptographic initialization routine accordingly. These initialization routines may depend on other cryptographic components or other kernel subsystems, so their invocation follows a well-defined execution order to ensure they are initialized before their first use.
+```
+kernel starts → cryptography subsystem initialization → cryptography subsystem available → other components use cryptography
+```
+**With Standalone Cryptography Module:**
+At the start of kernel boot, compared to a regular kernel, the first major change introduced by this feature is that no cryptography services are initially available — since the entire cryptography subsystem has been decoupled from the main kernel.
+To ensure that the cryptography subsystem becomes available early enough (before the first kernel component that requires cryptography services), the standalone cryptography kernel module must be loaded at a very early stage, even before the filesystem becomes available.
+
+However, the regular module loading mechanism relies on placing kernel modules in the filesystem and loading them from there, which creates a chicken-and-egg problem — the cryptography module cannot be loaded until the filesystem is ready, yet some kernel components may require cryptography services even before that point.
+
+To address this, the second change introduced by this feature is that the cryptography kernel module is loaded directly from memory, leveraging the earlier compilation changes that embed the module binary into the main kernel image. Afterward, the feature includes a “plug-in” mechanism that connects the decoupled cryptography subsystem back to the main kernel, ensuring that kernel cryptography users can correctly locate and invoke the cryptography routine entry points.
+
+Finally, to ensure proper initialization, the feature guarantees that all cryptography algorithms and the cryptography management infra execute their initialization routines in the exact same order as they would if they were built-in.
+
+The process described above is illustrated below.
+```
+kernel starts → no cryptography available → load fips140.ko from memory → plug cryptography back to kernel → module initialization → cryptographic services available → other components use cryptography
+```
+
+## 4. Design Implementation Details
+
+While the earlier sections provide a holistic view of how this feature shapes the kernel, this section provides deeper design details on how these functionalities are realized. There are three key design components:
+1. A specialized compile rule that automatically compiles and collects all built-in cryptographic algorithm object files to generate the final module binary under arbitrary kernel configurations, and then embeds the generated binary into the main kernel image for early loading.
+2. A mechanism to convert interactions between the cryptography subsystem and the main kernel into a pluggable interface.
+3. A module loading and initialization process that ensures the cryptography subsystem is properly initialized as if it were built-in.
+
+### 4.1. Specialized Compilation System
+
+**Automatic Collection and Linking of Built-in Cryptographic Algorithm Objects:**
+The first step in generating the `fips140.ko` module is to compile and collect built-in cryptographic components (i.e., those specified by `CONFIG_CRYPTO_*=y`).
+Traditionally, the existing module build process requires all module components (e.g., source files) to reside in a single directory. However, this approach is not suitable for our case, where hundreds of cryptographic algorithm source files are scattered across multiple directories.
+
+A naïve approach would be to create a separate Makefile that duplicates the original build rules with adjusted paths.
+However, this method is not scalable due to the large number of cryptographic build rules, many of which are highly customized and can vary under different Kconfig settings, making such a separate Makefile even more complex.
+Moreover, this approach cannot ensure that built-in cryptographic algorithms are completely removed from the main kernel, which would result in redundant cryptographic code being included in both the kernel and the module.
+
+To tackle this challenge, we automated the object collection and linking process by introducing special build logic for the kernel cryptography subsystem.
+Specifically, to automatically collect cryptography object files while preserving their original compilation settings (such as flags, headers, and paths), we introduced a new compilation rule:
+```
+crypto-objs-y += *.o
+```
+This replaces the original `obj-y += *.o` rule in cryptography Makefiles later, for example:
+```
+// in crypto/asymmetric_keys/Makefile
+- obj-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o
++ crypto-objs-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o
+asymmetric_keys-y := \
+    asymmetric_type.o \
+    restrict.o \
+    signature.o
+```
+in the cryptography subsystem Makefiles, allowing most of the existing Makefile logic to be reused.
+As a result, when the standalone cryptography module feature is enabled, any cryptographic algorithm configured as built-in (for example, `crypto-objs-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o` where `CONFIG_ASYMMETRIC_KEY_TYPE=y`) will be automatically collected and linked into a single final object binary, `fips140.o`.
+During this process, a special compilation flag (`-DFIPS_MODULE=1`) is applied to instruct each object file to be compiled in a module-specific manner. This flag will later be used to generate the pluggable interface on both the main kernel side and the module side from the same source code.
+
+The implementation details are as follows: it follows a similar methodology used by the `obj-y` collection process for building `vmlinux.o`. The `crypto-objs-y` rule is placed in `scripts/Makefile.build`, which is executed by each directory Makefile to collect the corresponding crypto object files. Each directory then creates a `crypto-module.a` archive that contains all `crypto-objs-y += <object>.o` files under that directory. In the parent directories, these `crypto-module.a` archives are recursively included into the parent’s own `crypto-module.a`, and this process continues upward until the final `fips140.o` is generated.
+
+**A Separate Module Generation Pipeline for Building the Final Kernel Module from Linked Cryptographic Algorithm Object:**
+With the linked cryptographic algorithm object (i.e., `fips140.o`), the next step is to generate the final kernel module, `fips140.ko`.
+
+A direct approach would be to inject the `fips140.ko` module build into the existing modules generation pipeline (i.e., `make modules`) by providing our pre-generated `fips140.o`. However, we choose not to do this because it would create a circular make rule dependency (which is invalid in Makefiles and causes build failures), resulting in mutual dependencies between the modules and vmlinux targets (i.e., `modules:vmlinux` and `vmlinux:modules` at the same time).
+This happens for the following reasons:
+1. Since we will later embed `fips140.ko` into the final kernel image (as described in the next section), we must make vmlinux depend on `fips140.ko`. In other words: `vmlinux: fips140.ko`.
+2. When the kernel is built with `CONFIG_DEBUG_INFO_BTF_MODULES=y`, it requires: modules: vmlinux. This is because `CONFIG_DEBUG_INFO_BTF_MODULES=y` takes vmlinux as input to generate BTF info for the module, and inserts such info into the `.ko` module by default.
+3. If we choose to inject `fips140.ko` into make modules, this would create a make rule dependency: `fips140.ko: modules`. Combined with items 1 and 2, this eventually creates an invalid circular dependency between vmlinux and modules.
+
+Due to these reasons, the design choice is to use a separate make pipeline (defined as `fips140-ready` in the Makefile). This new pipeline reuses the same module generation scripts used by make modules but adds additional logic in `scripts/Makefile.{modfinal|modinst|modpost}` and `scripts/mod/modpost.c` to handle module symbol generation and verification correctly. 
+
+**A Seamless Process That Embeds the Generated Binary Into the Main Kernel Image for Early Loading:**
+As mentioned earlier, in order to load the standalone cryptography module early in the boot process—before the filesystem is ready—the module binary must be embedded into the final kernel image (i.e., vmlinux) so that it can be loaded directly from memory.
+We intend for this embedding process to be completely seamless and automatically triggered whenever vmlinux is built (i.e., during `make vmlinux`).
+
+To achieve this, the feature adds a Make dependency rule so that vmlinux depends on `fips140.ko`.
+It also modifies the vmlinux link rules (i.e., `arch/<arch>/kernel/vmlinux.lds.S`, `scripts/Makefile.vmlinux`, and `scripts/link-vmlinux.sh`) so that the generated module binary is finally combined with `vmlinux.o`.
+
+In addition, we allow multiple cryptography module binary versions (for example, a certified cryptography binary and a latest, up-to-date but uncertified one) to be embedded into the main kernel image to serve different user needs. This design allows regular (non-FIPS) users to benefit from the latest cryptographic updates, while FIPS-mode users continue to use the certified cryptography module.
+
+To support this, we introduce an optional configuration, `CONFIG_CRYPTO_FIPS140_DUAL_VERSION`. When enabled, this option allows two cryptography module versions to be embedded within a single kernel build and ensures that the appropriate module is selected and loaded at boot time based on the system’s FIPS mode status.
+
+### 4.2. Pluggable Interface Between the Built-in Cryptography Subsystem and the Main Kernel
+
+Although the module binary (`fips140.ko`) has been embedded into the final kernel image (`vmlinux`) as described in the previous section, it is not linked to the kernel in any way. This is because `fips140.ko` is embedded in a data-only manner, so the main kernel cannot directly call any functions or access any data defined in the module binary. Since the main kernel and modules can only interact through exported symbols (i.e., via `EXPORT_SYMBOL()`), this also applies to the crypto kernel module — the main kernel can only interact with the crypto functions and variables defined in the crypto module through exported symbols, meaning these functions and variables must also have their symbols exported in the module after they are moved from the main kernel to the module.
+
+However, simply making these crypto symbols symbol-exported in the module without additional handling would cause the kernel to fail to compile. This is because the existing kernel module symbol resolution mechanism is essentially one-way: it supports symbols defined in the main kernel and referenced by kernel modules. However, it does not support the reverse case — symbols defined in a kernel module but used by the main kernel — which is exactly the crypto module case, as there are many crypto users still residing in the main kernel. The reason is that compilation of the main kernel requires all symbol addresses to be known to achieve a successful linking phase.
+
+To address this, we introduce a pluggable interface to support this reverse-direction symbol resolution between crypto symbols defined in the module and referenced by the main kernel, by placing **address placeholders** at all crypto usage points in the main kernel. These address placeholders are initially set to NULL during compilation to provide a concrete address that satisfies the linking phase. Then, during runtime, once the cryptography kernel module is loaded, these placeholders are updated to the correct addresses before their first use in the main kernel. In the rest of this section, we first introduce this pluggable interface mechanism, and then explain how to apply it to the built-in cryptographic algorithms and variables.
+
+**The Pluggable Interface Mechanism:**
+There are two types of address holders used to achieve this pluggable interface:
+- Function addresses (the majority): We use a trampoline to redirect the original jump instruction to another location whose target destination is held by the value of a function pointer. To avoid additional security concerns, such as the function pointer being arbitrarily modified, these function pointers are made `__ro_after_init` to ensure they cannot be modified after kernel init. We implement this function-address placeholder as the `DEFINE_CRYPTO_FN_REDIRECT()` wrapper.
+- Variable addresses (the remaining smaller portion): For these, we use a pointer of the corresponding data type. We implement this address placeholder as the `DECLARE_CRYPTO_VAR()` and `DEFINE_CRYPTO_API_STUB()` wrappers:
+
+These wrappers are applied to each symbol-exported (i.e., `EXPORT_SYMBOL()`) cryptographic function and variable (details on how to apply them are described later). Once applied, the wrappers are compiled differently for the main kernel and for the built-in cryptographic algorithm source code—acting as the “outlet” and the “plug,” respectively—using different compilation flags (`-DFIPS_MODULE`) introduced by our customized build rules described earlier.
+
+As a result, the kernel can successfully compile even when the built-in cryptographic algorithms are removed, thanks to these address placeholders. At boot time, the placeholders initially hold NULL, but since no cryptography users exist at that stage, the kernel can still start booting correctly. After the cryptography module is loaded, the placeholders are dynamically updated to the correct addresses later (by `do_crypto_var()` and `do_crypto_fn()`, described in a later section).
+
+**Applying the Pluggable Interface Mechanism to Cryptographic Algorithms:**
+
+To apply these pluggable interface wrappers to a cryptographic algorithm and make them take effect, we follow the steps below (using `crypto/asymmetric_keys/asymmetric_type.c` as an example):
+1. **Apply `crypto-objs-y` compile rule to the cryptographic algorithm:**
+```
+// in crypto/asymmetric_keys/Makefile
+- obj-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o
++ crypto-objs-$(CONFIG_ASYMMETRIC_KEY_TYPE) += asymmetric_keys.o
+asymmetric_keys-y := \
+    asymmetric_type.o \
+    restrict.o \
+    signature.o
+```
+2. **Locate the communication point between the cryptographic algorithm and the main kernel:**
+
+The cryptography subsystem is designed such that most interactions between the main kernel and cryptographic algorithms occur through exported symbols using `EXPORT_SYMBOL()` wrappers.
+This kernel design exists because most cryptographic algorithm implementations must support both built-in and modular modes. 
+
+Consequently, the cryptographic functions and variables exported by `EXPORT_SYMBOL()` are a well-defined and identifiable interface between the cryptography subsystem and the main kernel: 
+```
+// in crypto/asymmetric_keys/asymmetric_type.c 
+//Exported cryptographic function:
+bool asymmetric_key_id_same(const struct asymmetric_key_id *kid1,
+                const struct asymmetric_key_id *kid2) {...}
+EXPORT_SYMBOL_GPL(asymmetric_key_id_same); 
+//Exported cryptographic variable:
+struct key_type key_type_asymmetric = {...};
+EXPORT_SYMBOL_GPL(key_type_asymmetric); 
+```
+
+3. **Redirect crypto symbol references in the main kernel to address placeholders:**
+
+With the placeholders in place, the remaining problem is directing the main kernel to use them rather than the original symbols. Since all crypto users must include the corresponding header files to obtain function and variable declarations, the headers are a natural place to perform this redirection. Each declaration is transformed using a macro that hooks it to the corresponding placeholder.
+
+For exported variable symbols (a small number, ~10 symbols), their declaration in the header file is replaced with the `DECLARE_CRYPTO_VAR()` wrapper to redirect variable access from a concrete address to a placeholder:
+```
+// in include/keys/asymmetric-type.h
+// for exported cryptographic variables:
+- struct key_type key_type_asymmetric;
++ DECLARE_CRYPTO_VAR(CONFIG_ASYMMETRIC_KEY_TYPE, key_type_asymmetric, struct key_type, );
++ #if defined(CONFIG_CRYPTO_FIPS140_EXTMOD) && !defined(FIPS_MODULE) && IS_BUILTIN(CONFIG_ASYMMETRIC_KEY_TYPE)
++ #define key_type_asymmetric (*((struct key_type*)CRYPTO_VAR_NAME(key_type_asymmetric)))
++ #endif 
+```
+By doing so, we can automatically force all cryptography users to go through the placeholders, because those users already include the same header file.
+The wrapper also takes the cryptographic algorithm Kconfig symbol as a parameter, so that when a cryptographic algorithm is built as a module (for example, `CONFIG_ASYMMETRIC_KEY_TYPE=m`), the original function declarations remain unchanged and are not affected.
+
+For exported function symbols (the majority, ~hundreds), a similar approach could be taken, but instead we use an automated method to redirect function address usage to placeholders during the kernel compilation process. This makes the crypto module implementation less intrusive to the kernel source tree, as no header file modifications are needed. To achieve this, a linker option `--wrap=<symbols-to-redirect>` is leveraged to rename all uses of crypto functions in the main kernel to dedicated trampolines that act as address placeholders. As a consequence, all references to crypto function symbols are automatically redirected to the address placeholders, avoiding mass intrusive changes to the mainline kernel source tree.
+
+4. **Add the address-placeholder definition wrappers into a dedicated file `fips140-var-redirect.c`:**
+
+After redirecting crypto users to use address placeholders, we also need to add the definitions of those address placeholders.
+
+For exported variable symbols (a small number, ~10 symbols), add the placeholder definition wrapper `DEFINE_CRYPTO_VAR_STUB` to a dedicated file `fips140-var-redirect.c`.
+```
+// in crypto/fips140/fips140-var-redirect.c
+// for exported cryptographic variables:
++ #undef key_type_asymmetric
++ DEFINE_CRYPTO_VAR_STUB(key_type_asymmetric);
++ #endif
+```
+This file will be compiled separately and acts as both the “outlet” and the “plug” for the main kernel and the cryptography module, respectively.
+
+For exported function symbols (the majority, ~hundreds), a similar wrapper `DEFINE_CRYPTO_FN_REDIRECT()` is used, but again, the application of this wrapper is automated, so there is no need to manually apply it. Instead, it is generated automatically by the script `crypto/fips140/gen-fips140-fn-redirect.sh` on every kernel build.
+
+We apply the above steps to both architecture-independent and architecture-specific cryptographic algorithms.
+
+### 4.3. Initialization Synchronization
+
+To ensure the embedded `fips140.ko` module binary provides the same cryptography functionality as the regular kernel, the kernel needs:
+1. A module loader to load the module binary directly from memory,
+2. A mechanism to plug the module back into the kernel by updating the address placeholders, and
+3. Correct cryptography subsystem initialization, as if the cryptographic algorithms were still built-in.
+
+**Directly Load Module Binary from Memory:**
+Regular modules are loaded from the filesystem and undergo signature verification on the module binary, which relies on cryptographic operations. However, since we have already fully decoupled the cryptography subsystem, we must skip this step for this `fips140.ko` module.
+To achieve this, we add a new loader function `load_crypto_module_mem()` that can load the module binary directly from memory at the designed address without checking the signature. Since the module binary is embedded into main kernel in an ELF section, as specified in the linker script:
+```
+// in arch/<arch>/kernel/vmlinux.lds.S
+    .fips140_embedded : AT(ADDR(.fips140_embedded) - LOAD_OFFSET) {
+        . = ALIGN(8);
+        _binary_fips140_ko_start = .;
+        KEEP(*(.fips140_module_data))
+        _binary_fips140_ko_end = .;
+    }
+```
+Therefore, the runtime memory address of the module can be accessed directly by the module loader to invoke the new loader function `load_crypto_module_mem()`.
+
+**Plug Back the Module by Updating Address Placeholder Values:**
+To update the address placeholders in the main kernel to the correct addresses matching the loaded module, after compilation the placeholder values are placed into dedicated key-value data structures, which reside in ELF sections `__crypto_fn_keys` and `__crypto_var_keys`.
+This can be seen from the definition of the placeholder's key-value data structure:
+```
+#define __CRYPTO_FN_KEY(sym)					\
+	extern void *__fips140_fn_ptr_##sym;			\
+	static struct _crypto_fn_key __##sym##_fn_key		\
+		__used						\
+		__section("__crypto_fn_keys")			\ // Place in a dedicated ELF Section
+		__aligned(__alignof__(struct _crypto_fn_key)) = {	\
+		.ptr = (void **)&__fips140_fn_ptr_##sym,		\
+		.func = (void *)&sym,				\
+	};
+
+#define DEFINE_CRYPTO_VAR_STUB(name) \
+    static struct crypto_var_key __crypto_##name##_var_key \
+        __used \
+        __section("__crypto_var_keys") \ // Place in a dedicated ELF Section
+        __aligned(__alignof__(struct crypto_var_key)) = \
+    { \
+        .ptr = &CRYPTO_VAR_NAME(name), \
+        .var = (void*)&name, \
+    };
+```
+The purpose of doing this is to allow the main kernel to quickly locate the placeholders and update them to the correct addresses. The update functions are defined as `do_crypto_var()` and `do_crypto_fn()`, which are executed at module load.
+
+As a result, all cryptography users in the main kernel can now call the cryptographic functions as if they were built-in.
+
+**Initialize Cryptography Subsystem as if it Were Built-in:**
+Cryptographic components must be properly initialized before use, and this initialization is typically achieved through dedicated initialization functions (e.g., `module_init(crypto_init_func)` or `late_initcall(crypto_init_func)`). Traditionally, these init functions are executed automatically as part of the kernel boot phase. However, now that they are moved to a crypto module, there needs to be a way to collect and execute them.
+
+To collect these init functions, the init wrappers (e.g., `module_init()` and `late_initcall`) are modified to automatically place the wrapped crypto init function into a dedicated list in the crypto module, for example:
+```
+// in include/linux/module.h
+#define subsys_initcall(fn) \
+	static initcall_t __used __section(".fips_initcall0") \ // a dedicated list
+		__fips_##fn = fn;
+
+#define module_init(initfn) \
+	static initcall_t __used __section(".fips_initcall1") \ // a dedicated list
+		__fips_##initfn = initfn;
+```
+By doing so, all init functions are now aggregated for execution by `run_initcalls()` at module load.
+
+Besides collecting these crypto init functions, rather than simply executing them, another key consideration is that their execution order often has strict requirements. In other words, for these collected crypto init functions, we must ensure that their initialization order is preserved as before because failure to follow the correct order can result in kernel panic.
+
+To address this, we introduce a synchronization mechanism between the main kernel and the module to ensure all cryptographic algorithms are executed in the correct kernel boot phase. In more details, we spawn the module initialization process `fips_loader_init()` as an async thread `fips140_sync_thread()`, in which we call `run_initcalls()` to execute the initialization calls of each cryptographic algorithm.
+Then, we introduce synchronization helpers such as `wait_until_fips140_level_sync(int level)` to ensure the initialization order of all cryptographic algorithms is synchronized with the main kernel.
+
+## 5. Customization and Extension of Cryptography Module
+
+This section describes how developers can customize which cryptographic algorithms are included in the standalone cryptography module, as well as extend this feature to other cryptographic algorithms or hardware architectures.
+
+### 5.1. Cryptography Selection Mechanism
+
+The feature automatically includes cryptographic algorithms that meet specific criteria:
+1. **Built-in Configuration**: Only cryptographic algorithms configured as `CONFIG_CRYPTO_*=y` are candidates for inclusion
+2. **Explicit Inclusion**: Cryptographic algorithms must be explicitly converted using the `crypto-objs-$(CONFIG__CRYPTO_*`) build rule
+
+### 5.2. Extend Support to New Cryptographic Algorithms
+
+To extend support to a new cryptographic algorithm in the standalone module, follow these steps:
+
+**Step 1: Update the Makefile**
+```
+# in crypto/[algorithm]/Makefile
+- obj-$(CONFIG_CRYPTO_ALGORITHM) += algorithm.o
++ crypto-objs-$(CONFIG_CRYPTO_ALGORITHM) += algorithm.o
+```
+For Architecture-Specific Cryptographic Algorithms:
+- Apply the `crypto-objs-` rule in the appropriate `arch/*/crypto/Makefile`
+
+**Step 2: Add Pluggable Interface Support**
+If the cryptographic algorithm has symbol-exported variables via `EXPORT_SYMBOL()`, add the pluggable interface wrappers. There is no need to manually apply wrappers for symbol-exported functions:
+```
+// Example: in include/keys/asymmetric-type.h
+// for exported cryptographic variables:
+- struct key_type key_type_asymmetric;
++ DECLARE_CRYPTO_VAR(CONFIG_ASYMMETRIC_KEY_TYPE, key_type_asymmetric, struct key_type, );
++ #if defined(CONFIG_CRYPTO_FIPS140_EXTMOD) && !defined(FIPS_MODULE) && IS_BUILTIN(CONFIG_ASYMMETRIC_KEY_TYPE)
++ #define key_type_asymmetric (*((struct key_type*)CRYPTO_VAR_NAME(key_type_asymmetric)))
++ #endif 
+```
+Then, add the corresponding stubs in `crypto/fips140/fips140-var-redirect.c`:
+```
++ #undef key_type_asymmetric
++ DEFINE_CRYPTO_VAR_STUB(key_type_asymmetric);
++ #endif
+```
+For Architecture-Specific Cryptographic Algorithms:
+- Include architecture-specific stubs in `arch/*/crypto/fips140/fips140-var-redirect.c`:
+
+### 5.3. Architecture-Specific Extensions
+
+**Extending to New Architectures:**
+Currently supported architectures are x86_64 and ARM64. To extend this feature to additional architectures:
+1. **Update Linker Scripts**: Add ELF sections in `arch/[new-arch]/kernel/vmlinux.lds.S`:
+```
+.fips140_embedded : AT(ADDR(.fips140_embedded) - LOAD_OFFSET) {
+    . = ALIGN(8);
+    _binary_fips140_ko_start = .;
+    KEEP(*(.fips140_module_data))
+    _binary_fips140_ko_end = .;
+}
+```
+2. **Create Architecture-Specific Files**: Set up `arch/[new-arch]/crypto/fips140/` directory with Makefile and `fips140-var-redirect.c` following the pattern used in x86_64 and ARM64.
+
+## 6. Reusing a certified module in practice
+
+With the crypto subsystem restructured as a loadable module, a previously certified module can be reused across kernel updates. Crypto development does not freeze in the meantime — updated modules can always be submitted for fresh certification. The reuse of an already-certified module is intended to simply bridge the gap until the new certification arrives, that is, using the old one before the new one gets certified. How much of the certified module to reuse, however, involves a tradeoff between crypto feature availability, certification turnaround time, and engineering effort.
+
+The most conservative option is no reuse at all: abandon the previous certification and submit the updated crypto module for a full FIPS certification. This allows the crypto subsystem to benefit from the latest upstream changes, but at the cost of the full 12-to-18-month waiting period, according to experiences from downstream distributions.
+
+At the opposite end, distributions can choose to reuse the certified module binary entirely on a newer kernel. This enables the new kernel to be validated on day one. The tradeoff is clear: besides forgoing crypto updates, engineers must ensure ABI compatibility between the updated kernel and the module.
+
+A middle ground is to reuse only the source code of the certified module, freezing it and recompiling against the updated main kernel. Since the source code remains unchanged, the new module can go through a Non-Security Relevant ([NSRL](https://csrc.nist.gov/csrc/media/Projects/cryptographic-module-validation-program/documents/fips%20140-3/FIPS-140-3-CMVP%20Management%20Manual.pdf)) process — a simpler FIPS re-certification that typically reduces the waiting time from 12–18 months down to 3–4 months. Compared to reusing the binary entirely, this option requires less engineering effort, since engineers need only maintain source-level API compatibility (i.e., by patching the main kernel source code) rather than binary-level ABI compatibility between the crypto module and the main kernel.
+
+In summary, converting the kernel crypto subsystem into a loadable module enables reuse of a certified module across kernel updates. Whether through binary reuse, source-code reuse, or fresh certification, different choices represent different tradeoffs, and distributions can balance crypto feature availability, certification turnaround time, and engineering effort according to their needs.
+
+---
+Written by Jay Wang <wanjay@amazon.com> <jay.wang.upstream@gmail.com>, Amazon Linux
\ No newline at end of file
-- 
2.47.3



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox