Linux-ARM-Kernel Archive on lore.kernel.org


* [PATCH v2 0/4] arm64: Add HOTPLUG_PARALLEL support for secondary CPUs
From: Jinjie Ruan @ 2026-06-18  9:24 UTC
  To: catalin.marinas, will, tsbogend, pjw, palmer, aou, alex, tglx,
	mingo, bp, dave.hansen, hpa, peterz, kees, nathan, linusw,
	jpoimboe, lukas.bulwahn, ryan.roberts, ojeda, maz, timothy.hayes,
	lpieralisi, thuth, menglong8.dong, oupton, yeoreum.yun,
	miko.lenczewski, broonie, kevin.brodsky, james.clark, tabba,
	mrigendra.chaubey, arnd, anshuman.khandual, x86, linux-kernel,
	linux-arm-kernel, linux-mips, linux-riscv, apatel, mhklinux
  Cc: ruanjinjie

Parallel secondary CPU bringup is already used by x86, MIPS, and RISC-V.
This series brings the same capability to the arm64 architecture.

Introduce CONFIG_PARALLEL_SMT_PRIMARY_FIRST so that architectures which do
not need it can opt out of the "primary SMT threads boot first" constraint.

Add a 'cpu' parameter to update_cpu_boot_status() to allow updating the
boot status at per-CPU granularity during parallel bringup.

Rework the global `secondary_data` accessed during early boot into
a per-CPU array `cpu_boot_data` to allow secondary CPUs to boot
in parallel.

Reuse the `__cpu_logical_map` array in the early boot code in head.S
to resolve each secondary CPU's logical ID concurrently.
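
For readers unfamiliar with the idea, resolving a logical CPU ID from a
hardware ID can be pictured as a lookup in a table indexed by logical CPU
number. The C sketch below is illustrative only: the table contents, the
`_sketch` names, and the linear scan are assumptions, and the real code is
assembly in head.S operating on `__cpu_logical_map`.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count */

/* Hypothetical stand-in for __cpu_logical_map[]: hardware ID per logical CPU. */
static const uint64_t logical_map_sketch[NR_CPUS_SKETCH] = {
	0x000, 0x001, 0x100, 0x101,
};

/*
 * Return the logical CPU number whose recorded hardware ID equals 'mpidr',
 * or -1 if no entry matches. The table is read-only here, so any number of
 * CPUs could run this lookup at the same time without locking.
 */
static int mpidr_to_cpuid_sketch(uint64_t mpidr)
{
	for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (logical_map_sketch[cpu] == mpidr)
			return (int)cpu;
	}
	return -1;
}
```

Because each CPU derives its own index from its own hardware ID, no shared
"next CPU to boot" variable is needed, which is what allows the lookups to
run concurrently.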

Changes in v2:
- Remove RFC.
- Add Tested-by.
- Fix AI review issues in [1].
- Add arch_cpuhp_init_parallel_bringup() to check for PSCI boot.
- Reuse `__cpu_logical_map` instead of a new array.
- Defer rcutree_report_cpu_starting() until after
  check_local_cpu_capabilities() to prevent a potential control CPU
  deadlock if an early capability check fails.
- Move the assembly in head.S to a macro called `mpidr_to_cpuid`.
- Add `SECONDARY_DATA_SHIFT` for `lsl` to access `cpu_boot_data`.
- Add an assert that sizeof(struct secondary_data) is a power of 2.
- Expand testing with more data collected from real hardware.
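
The shift and power-of-2 items above fit together: if the element size is a
power of 2, an array element can be addressed with a left shift instead of a
multiply, which is convenient in assembly. A minimal C sketch of that idea
follows; the struct layout and all `_SKETCH`/`_sketch` names are hypothetical
and do not reflect the real `struct secondary_data`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU boot record, 16 bytes on every ABI. */
struct boot_data_sketch {
	uint64_t task;
	int64_t status;
};

/* log2(sizeof(struct boot_data_sketch)); assumed value for this sketch. */
#define BOOT_DATA_SHIFT_SKETCH 4

/* Fail the build if the size stops being a power of 2 or the shift is stale. */
static_assert((sizeof(struct boot_data_sketch) &
	       (sizeof(struct boot_data_sketch) - 1)) == 0,
	      "element size must be a power of 2");
static_assert(sizeof(struct boot_data_sketch) == (1u << BOOT_DATA_SHIFT_SKETCH),
	      "shift must match element size");

static struct boot_data_sketch boot_data_sketch[4];

/* Address element 'cpu' as base + (cpu << shift), as an lsl-based access would. */
static struct boot_data_sketch *boot_data_for_sketch(unsigned int cpu)
{
	uintptr_t base = (uintptr_t)boot_data_sketch;

	return (struct boot_data_sketch *)
		(base + ((uintptr_t)cpu << BOOT_DATA_SHIFT_SKETCH));
}
```

The two asserts keep the shift constant and the struct definition from
drifting apart silently if a field is added later.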

[1] https://sashiko.dev/#/patchset/20260611133809.3854977-1-ruanjinjie%40huawei.com

Jinjie Ruan (4):
  cpu/hotplug: Introduce CONFIG_PARALLEL_SMT_PRIMARY_FIRST
  arm64: smp: Pass CPU ID to update_cpu_boot_status()
  arm64: smp: Defer RCU reporting until after local CPU capability
    checks
  arm64: Add HOTPLUG_PARALLEL support for secondary CPUs

 arch/Kconfig                    |  4 +++
 arch/arm64/Kconfig              |  1 +
 arch/arm64/include/asm/smp.h    | 17 ++++++++++---
 arch/arm64/kernel/asm-offsets.c |  4 +++
 arch/arm64/kernel/cpufeature.c  | 22 ++++++++--------
 arch/arm64/kernel/head.S        | 36 ++++++++++++++++++++++++++
 arch/arm64/kernel/smp.c         | 45 ++++++++++++++++++++++++++++-----
 arch/arm64/mm/context.c         |  4 +--
 arch/mips/Kconfig               |  1 +
 arch/riscv/Kconfig              |  1 +
 arch/x86/Kconfig                |  1 +
 kernel/cpu.c                    |  6 ++++-
 12 files changed, 119 insertions(+), 23 deletions(-)

-- 
2.34.1




* [PATCH v2 1/4] cpu/hotplug: Introduce CONFIG_PARALLEL_SMT_PRIMARY_FIRST
From: Jinjie Ruan @ 2026-06-18  9:24 UTC
  To: catalin.marinas, will, tsbogend, pjw, palmer, aou, alex, tglx,
	mingo, bp, dave.hansen, hpa, peterz, kees, nathan, linusw,
	jpoimboe, lukas.bulwahn, ryan.roberts, ojeda, maz, timothy.hayes,
	lpieralisi, thuth, menglong8.dong, oupton, yeoreum.yun,
	miko.lenczewski, broonie, kevin.brodsky, james.clark, tabba,
	mrigendra.chaubey, arnd, anshuman.khandual, x86, linux-kernel,
	linux-arm-kernel, linux-mips, linux-riscv, apatel, mhklinux
  Cc: ruanjinjie
In-Reply-To: <20260618092444.1316336-1-ruanjinjie@huawei.com>

During parallel CPU bringup, x86 requires primary SMT threads to boot
first to avoid siblings stopping during microcode updates. This constraint
is architecture-specific and unnecessary for other platforms
like arm64.

Introduce CONFIG_PARALLEL_SMT_PRIMARY_FIRST to decouple this constraint.
Platforms requiring this temporal order (e.g., x86) can select it
in Kconfig. Other architectures (e.g., arm64) can leave it unselected
to entirely bypass the SMT branch via the preprocessor.
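
As a rough illustration of the effect, the sketch below models CPU sets as
bitmasks and computes which CPUs would be kicked in the first phase. The
macro `SKETCH_SMT_PRIMARY_FIRST` and the function name are hypothetical
stand-ins for the Kconfig symbol and the logic in kernel/cpu.c; this is not
the kernel code itself.

```c
#include <stdint.h>

/* Define this to model an architecture that selects the option. */
/* #define SKETCH_SMT_PRIMARY_FIRST 1 */

/*
 * 'mask' is the set of CPUs to bring up and 'primary' the set of primary
 * SMT threads. With the option selected, only primaries are kicked first
 * and the remaining siblings follow in a second phase. Without it, the
 * primary-first branch is compiled out and every CPU is kicked at once.
 */
static uint32_t first_kick_mask_sketch(uint32_t mask, uint32_t primary)
{
#ifdef SKETCH_SMT_PRIMARY_FIRST
	return mask & primary;
#else
	(void)primary;	/* unused when the ordering constraint is disabled */
	return mask;
#endif
}
```

With the macro left undefined, as an arm64-like configuration would, the
primary mask is never consulted.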

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 arch/Kconfig       | 4 ++++
 arch/mips/Kconfig  | 1 +
 arch/riscv/Kconfig | 1 +
 arch/x86/Kconfig   | 1 +
 kernel/cpu.c       | 6 +++++-
 5 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index e86880045158..0365d2df2659 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -102,6 +102,10 @@ config HOTPLUG_PARALLEL
 	bool
 	select HOTPLUG_SPLIT_STARTUP
 
+config PARALLEL_SMT_PRIMARY_FIRST
+	bool
+	depends on HOTPLUG_PARALLEL
+
 config GENERIC_IRQ_ENTRY
 	bool
 
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 4364f3dba688..84e11ac0cf71 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -642,6 +642,7 @@ config EYEQ
 	select MIPS_CPU_SCACHE
 	select MIPS_GIC
 	select MIPS_L1_CACHE_SHIFT_7
+	select PARALLEL_SMT_PRIMARY_FIRST if HOTPLUG_PARALLEL
 	select PCI_DRIVERS_GENERIC
 	select SMP_UP if SMP
 	select SWAP_IO_SPACE
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index d235396c4514..0cc49aecc841 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -210,6 +210,7 @@ config RISCV
 	select OF
 	select OF_EARLY_FLATTREE
 	select OF_IRQ
+	select PARALLEL_SMT_PRIMARY_FIRST if HOTPLUG_PARALLEL
 	select PCI_DOMAINS_GENERIC if PCI
 	select PCI_ECAM if (ACPI && PCI)
 	select PCI_MSI if PCI
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3f7cb01d69d..3ad4115ad051 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -314,6 +314,7 @@ config X86
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
 	select NEED_SG_DMA_LENGTH
 	select NUMA_MEMBLKS			if NUMA
+	select PARALLEL_SMT_PRIMARY_FIRST	if HOTPLUG_PARALLEL
 	select PCI_DOMAINS			if PCI
 	select PCI_LOCKLESS_CONFIG		if PCI
 	select PERF_EVENTS
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..7ef8cdf4d239 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1792,6 +1792,7 @@ static int __init parallel_bringup_parse_param(char *arg)
 }
 early_param("cpuhp.parallel", parallel_bringup_parse_param);
 
+#ifdef CONFIG_PARALLEL_SMT_PRIMARY_FIRST
 #ifdef CONFIG_HOTPLUG_SMT
 static inline bool cpuhp_smt_aware(void)
 {
@@ -1811,7 +1812,8 @@ static inline const struct cpumask *cpuhp_get_primary_thread_mask(void)
 {
 	return cpu_none_mask;
 }
-#endif
+#endif /* CONFIG_HOTPLUG_SMT */
+#endif /* CONFIG_PARALLEL_SMT_PRIMARY_FIRST */
 
 bool __weak arch_cpuhp_init_parallel_bringup(void)
 {
@@ -1837,6 +1839,7 @@ static bool __init cpuhp_bringup_cpus_parallel(unsigned int ncpus)
 	if (!__cpuhp_parallel_bringup)
 		return false;
 
+#ifdef CONFIG_PARALLEL_SMT_PRIMARY_FIRST
 	if (cpuhp_smt_aware()) {
 		const struct cpumask *pmask = cpuhp_get_primary_thread_mask();
 		static struct cpumask tmp_mask __initdata;
@@ -1857,6 +1860,7 @@ static bool __init cpuhp_bringup_cpus_parallel(unsigned int ncpus)
 		cpumask_andnot(&tmp_mask, mask, pmask);
 		mask = &tmp_mask;
 	}
+#endif /* CONFIG_PARALLEL_SMT_PRIMARY_FIRST */
 
 	/* Bring the not-yet started CPUs up */
 	cpuhp_bringup_mask(mask, ncpus, CPUHP_BP_KICK_AP);
-- 
2.34.1




* Re: [PATCH v1 03/11] KVM: arm64: Use guard()/scoped_guard() in arm64 KVM EL1 code
From: Fuad Tabba @ 2026-06-18  9:24 UTC
  To: Marc Zyngier
  Cc: Oliver Upton, Will Deacon, Catalin Marinas, Quentin Perret,
	Vincent Donnefort, Sebastian Ene, Per Larsen, Suzuki K Poulose,
	Zenghui Yu, Joey Gouly, Steffen Eiden, Mark Rutland,
	Jonathan Cameron, Hyunwoo Kim, linux-arm-kernel, kvmarm,
	linux-kernel
In-Reply-To: <86jyrwrymb.wl-maz@kernel.org>

On Thu, 18 Jun 2026 at 10:23, Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 12 Jun 2026 07:59:17 +0100,
> tabba@google.com wrote:
> >
> > Convert the manual mutex_lock()/spin_lock() pairs in
> > arch/arm64/kvm/{pkvm,arm,mmu,reset,psci}.c to guard(mutex),
> > guard(spinlock) and scoped_guard(), dropping unlock-only goto labels in
> > favour of direct returns. Centralised cleanup gotos that still serve
> > other resources are preserved.
> >
> > reset.c uses scoped_guard() rather than guard() so the lock covers only
> > the small read/update window inside kvm_reset_vcpu(), leaving the rest
> > of the function outside the critical section.
>
> To be brutally honest, I don't think this sort of widespread change
> brings us anything. This is just churn.
>
> Sure, if you are reworking a particular bit of code that is goto-heavy
> for the purpose of error handling, this has the potential to clean up
> the code *while you are changing it*.
>
> But doing it for the sake of doing it? I think we have bigger fish to
> fry right now.

I understand what you mean. Would you like me to drop all of the guard
patches, or only those that go beyond the code changed in this series?

Thanks,
/fuad

>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.



* Re: [PATCH RFC v4 10/12] reset: zte: Add a zx297520v3 reset driver
From: Philipp Zabel @ 2026-06-18  9:24 UTC
  To: Stefan Dösinger, Michael Turquette, Stephen Boyd,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Brian Masney
  Cc: linux-clk, devicetree, linux-kernel, linux-arm-kernel
In-Reply-To: <20260616-zx29clk-v4-10-ca994bd22e9d@gmail.com>

On Tue, 2026-06-16 at 23:26 +0300, Stefan Dösinger wrote:
> This drives the auxiliary devices created by the clock driver.

Which auxiliary devices? Which clock driver?

> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
> ---
>  MAINTAINERS                          |   1 +
>  drivers/reset/Kconfig                |  11 ++
>  drivers/reset/Makefile               |   1 +
>  drivers/reset/reset-zte-zx297520v3.c | 224 +++++++++++++++++++++++++++++++++++
>  4 files changed, 237 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f1f0459b2c72..55bf0290343a 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3871,6 +3871,7 @@ F:	Documentation/devicetree/zte,zx297520v3-*
>  F:	arch/arm/boot/dts/zte/
>  F:	arch/arm/mach-zte/
>  F:	drivers/clk/zte/
> +F:	drivers/reset/reset-zte-zx297520v3.c
>  F:	include/dt-bindings/clock/zte,zx297520v3-clk.h
>  
>  ARM/ZYNQ ARCHITECTURE
> diff --git a/drivers/reset/Kconfig b/drivers/reset/Kconfig
> index d009eb0849a3..116dd23f1b8e 100644
> --- a/drivers/reset/Kconfig
> +++ b/drivers/reset/Kconfig
> @@ -404,6 +404,17 @@ config RESET_UNIPHIER_GLUE
>  	  on UniPhier SoCs. Say Y if you want to control reset signals
>  	  provided by the glue layer.
>  
> +config RESET_ZTE_ZX297520V3
> +	tristate "ZTE zx297520v3 Reset Driver"
> +	depends on (ARCH_ZTE || COMPILE_TEST)
> +	default CLK_ZTE_ZX297520V3
> +	select AUXILIARY_BUS
> +	help
> +	  This enables the reset controller for ZTE zx297520v3 SoCs. The reset
> +	  controller is part of the clock controller on this SoC. This driver
> +	  operates on an auxiliary device exposed by the clock driver. Enable
> +	  this driver if you plan to boot the kernel on a zx297520v3 based SoC.
> +
>  config RESET_ZYNQ
>  	bool "ZYNQ Reset Driver" if COMPILE_TEST
>  	default ARCH_ZYNQ
> diff --git a/drivers/reset/Makefile b/drivers/reset/Makefile
> index 3e52569bd276..9a8a48d44dc4 100644
> --- a/drivers/reset/Makefile
> +++ b/drivers/reset/Makefile
> @@ -50,5 +50,6 @@ obj-$(CONFIG_RESET_TI_TPS380X) += reset-tps380x.o
>  obj-$(CONFIG_RESET_TN48M_CPLD) += reset-tn48m.o
>  obj-$(CONFIG_RESET_UNIPHIER) += reset-uniphier.o
>  obj-$(CONFIG_RESET_UNIPHIER_GLUE) += reset-uniphier-glue.o
> +obj-$(CONFIG_RESET_ZTE_ZX297520V3) += reset-zte-zx297520v3.o
>  obj-$(CONFIG_RESET_ZYNQ) += reset-zynq.o
>  obj-$(CONFIG_RESET_ZYNQMP) += reset-zynqmp.o
> diff --git a/drivers/reset/reset-zte-zx297520v3.c b/drivers/reset/reset-zte-zx297520v3.c
> new file mode 100644
> index 000000000000..2022f4df2ebd
> --- /dev/null
> +++ b/drivers/reset/reset-zte-zx297520v3.c
> @@ -0,0 +1,224 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2026 Stefan Dösinger
> + */
> +#include <dt-bindings/clock/zte,zx297520v3-clk.h>
> +#include <linux/reset-controller.h>
> +#include <linux/platform_device.h>

What is this used for?

> +#include <linux/auxiliary_bus.h>
> +#include <linux/clk-provider.h>

What is this used for?

> +#include <linux/mfd/syscon.h>
> +#include <linux/regmap.h>
> +#include <linux/iopoll.h>
> +#include <linux/delay.h>
> +
> +struct zte_reset_reg {
> +	u32 mask, wait_mask;
> +	u16 reg;
> +};
> +
> +struct zte_reset_info {
> +	const struct zte_reset_reg *resets;
> +	unsigned int num;
> +};
> +
> +struct zte_reset {
> +	struct reset_controller_dev rcdev;
> +	struct regmap *map;
> +	const struct zte_reset_reg *resets;
> +};
> +
> +static inline struct zte_reset *to_zte_reset(struct reset_controller_dev *rcdev)
> +{
> +	return container_of(rcdev, struct zte_reset, rcdev);
> +}
> +
> +static int zx29_rst_assert(struct reset_controller_dev *rcdev, unsigned long id)
> +{
> +	struct zte_reset *rst = to_zte_reset(rcdev);
> +
> +	return regmap_clear_bits(rst->map, rst->resets[id].reg, rst->resets[id].mask);
> +}
> +
> +static int zx29_rst_deassert(struct reset_controller_dev *rcdev, unsigned long id)
> +{
> +	struct zte_reset *rst = to_zte_reset(rcdev);
> +	int res;
> +	u32 val;
> +
> +	res = regmap_set_bits(rst->map, rst->resets[id].reg, rst->resets[id].mask);
> +	if (res)
> +		return res;
> +
> +	/* This is a special case used only by USB reset */
> +	if (rst->resets[id].wait_mask) {
> +		return regmap_read_poll_timeout(rst->map, rst->resets[id].reg + 4, val,
> +						val & rst->resets[id].wait_mask, 1, 100);
> +	}
> +
> +	return 0;
> +}
> +
> +static int zx29_rst_status(struct reset_controller_dev *rcdev, unsigned long id)
> +{
> +	struct zte_reset *rst = to_zte_reset(rcdev);
> +	int res;
> +
> +	res = regmap_test_bits(rst->map, rst->resets[id].reg, rst->resets[id].mask);
> +	if (res < 0)
> +		return res;
> +
> +	return !res;
> +}
> +
> +static const struct reset_control_ops zx29_rst_ops = {
> +	.assert		= zx29_rst_assert,
> +	.deassert	= zx29_rst_deassert,
> +	.status		= zx29_rst_status,
> +};
> +
> +static const struct zte_reset_reg zx297520v3_top_resets[] = {
> +	/* This bit is set by ZTE's cpko.ko blob, it looks like a reset bit for the LTE DSP
> +	 * coprocessor. Clocks for it are in matrixclk.
> +	 */
> +	[ZX297520V3_ZSP_RESET]       = { .reg = 0x13c, .mask = BIT(0)            },
> +
> +	[ZX297520V3_UART0_RESET]     = { .reg = 0x78,  .mask = BIT(6)  | BIT(7)  },

Is this a single reset line controlled by two bits (do you know what
they are)? Or might these actually be two different reset controls that
are just always set together?

> +	[ZX297520V3_I2C0_RESET]      = { .reg = 0x74,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_RTC_RESET]       = { .reg = 0x74,  .mask = BIT(4)  | BIT(5)  },
> +	[ZX297520V3_TIMER_T08_RESET] = { .reg = 0x78,  .mask = BIT(4)  | BIT(5)  },
> +	[ZX297520V3_TIMER_T09_RESET] = { .reg = 0x78,  .mask = BIT(2)  | BIT(3)  },
> +	[ZX297520V3_PMM_RESET]       = { .reg = 0x74,  .mask = BIT(0)  | BIT(1)  },
> +
> +	/* I haven't found any clocks for GPIO. It probably wouldn't make much
> +	 * sense anyway. Only one reset bit per controller.
> +	 */
> +	[ZX297520V3_GPIO_RESET]      = { .reg =  0x74, .mask = BIT(3)            },
> +	[ZX297520V3_GPIO8_RESET]     = { .reg =  0x74, .mask = BIT(2)            },
> +
> +	[ZX297520V3_TIMER_T12_RESET] = { .reg =  0x74, .mask = BIT(6)  | BIT(7)  },
> +	[ZX297520V3_TIMER_T13_RESET] = { .reg =  0x7c, .mask = BIT(0)  | BIT(1)  },
> +	[ZX297520V3_TIMER_T14_RESET] = { .reg =  0x7c, .mask = BIT(2)  | BIT(3)  },
> +	[ZX297520V3_TIMER_T15_RESET] = { .reg =  0x74, .mask = BIT(10) | BIT(11) },
> +	[ZX297520V3_TIMER_T16_RESET] = { .reg =  0x7c, .mask = BIT(4)  | BIT(5)  },
> +	[ZX297520V3_TIMER_T17_RESET] = { .reg = 0x12c, .mask = BIT(0)  | BIT(1)  },
> +	[ZX297520V3_WDT_T18_RESET]   = { .reg =  0x74, .mask = BIT(12) | BIT(13) },
> +	[ZX297520V3_USIM1_RESET]     = { .reg =  0x74, .mask = BIT(14) | BIT(15) },
> +	[ZX297520V3_AHB_RESET]       = { .reg =  0x70, .mask = BIT(0)  | BIT(1)  },
> +
> +	/* USB reset. This is slightly special because it needs to wait for a ready bit after
> +	 * deasserting.
> +	 */
> +	[ZX297520V3_USB_RESET]      =  { .reg = 0x80,   .mask = BIT(3) | BIT(4) | BIT(5),
> +		.wait_mask = BIT(1)},

Same as above, are these actually three separate reset lines?

> +	[ZX297520V3_HSIC_RESET]      = { .reg = 0x80,   .mask = BIT(0) | BIT(1) | BIT(2),
> +		.wait_mask = BIT(0)},
> +};
> +
> +static const struct zte_reset_info zx297520v3_top_info = {
> +	.resets = zx297520v3_top_resets,
> +	.num = ARRAY_SIZE(zx297520v3_top_resets),
> +};
> +
> +static const struct zte_reset_reg zx297520v3_matrix_resets[] = {
> +	[ZX297520V3_CPU_RESET]       = { .reg =  0x28, .mask = BIT(1)            },
> +	[ZX297520V3_EDCP_RESET]      = { .reg =  0x68, .mask = BIT(0)            },
> +	[ZX297520V3_SD0_RESET]       = { .reg =  0x58, .mask = BIT(1)            },
> +	[ZX297520V3_SD1_RESET]       = { .reg =  0x58, .mask = BIT(0)            },
> +	[ZX297520V3_NAND_RESET]      = { .reg =  0x58, .mask = BIT(4)            },
> +	[ZX297520V3_PDCFG_RESET]     = { .reg =  0x94, .mask = BIT(20)           },
> +	[ZX297520V3_SSC_RESET]       = { .reg =  0x94, .mask = BIT(24)           },
> +	[ZX297520V3_GMAC_RESET]      = { .reg = 0x114, .mask = BIT(0)  | BIT(1)  },
> +	[ZX297520V3_VOU_RESET]       = { .reg = 0x16c, .mask = BIT(0)            },
> +};
> +
> +static const struct zte_reset_info zx297520v3_matrix_info = {
> +	.resets = zx297520v3_matrix_resets,
> +	.num = ARRAY_SIZE(zx297520v3_matrix_resets),
> +};
> +
> +static const struct zte_reset_reg zx297520v3_lsp_resets[] = {
> +	[ZX297520V3_TIMER_L1_RESET]  = { .reg = 0x04,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_WDT_L2_RESET]    = { .reg = 0x08,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_WDT_L3_RESET]    = { .reg = 0x0c,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_PWM_RESET]       = { .reg = 0x10,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_I2S0_RESET]      = { .reg = 0x14,  .mask = BIT(8)  | BIT(9)  },
> +	/* 0x18: Not writeable */
> +	[ZX297520V3_I2S1_RESET]      = { .reg = 0x1c,  .mask = BIT(8)  | BIT(9)  },
> +	/* 0x20: Not writeable */
> +	[ZX297520V3_QSPI_RESET]      = { .reg = 0x24,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_UART1_RESET]     = { .reg = 0x28,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_I2C1_RESET]      = { .reg = 0x2c,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_SPI0_RESET]      = { .reg = 0x30,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_TIMER_LB_RESET]  = { .reg = 0x34,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_TIMER_LC_RESET]  = { .reg = 0x38,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_UART2_RESET]     = { .reg = 0x3c,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_WDT_LE_RESET]    = { .reg = 0x40,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_TIMER_LF_RESET]  = { .reg = 0x44,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_SPI1_RESET]      = { .reg = 0x48,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_TIMER_L11_RESET] = { .reg = 0x4c,  .mask = BIT(8)  | BIT(9)  },
> +	[ZX297520V3_TDM_RESET]       = { .reg = 0x50,  .mask = BIT(8)  | BIT(9)  },
> +};
> +
> +static const struct zte_reset_info zx297520v3_lsp_info = {
> +	.resets = zx297520v3_lsp_resets,
> +	.num = ARRAY_SIZE(zx297520v3_lsp_resets),
> +};
> +
> +static int reset_zx297520v3_probe(struct auxiliary_device *adev,
> +				  const struct auxiliary_device_id *id)
> +{
> +	const struct zte_reset_info *drv_info;
> +	struct device *dev = &adev->dev;
> +	struct zte_reset *rst;
> +
> +	drv_info = (struct zte_reset_info *)id->driver_data;
> +
> +	rst = devm_kzalloc(dev, sizeof(*rst), GFP_KERNEL);
> +	if (!rst)
> +		return -ENOMEM;
> +
> +	rst->resets = drv_info->resets;
> +	rst->rcdev.owner = THIS_MODULE;
> +	rst->rcdev.nr_resets = drv_info->num;
> +	rst->rcdev.ops = &zx29_rst_ops;
> +	rst->rcdev.of_node = dev->of_node;
> +	rst->rcdev.dev = dev;
> +	rst->rcdev.of_reset_n_cells = 1;

No need to set of_reset_n_cells if of_xlate is not set. Here
reset_controller_register will use fwnode_n_cells and set it to 1
anyway.

> +
> +	rst->map = device_node_to_regmap(dev->of_node);
> +	if (IS_ERR(rst->map))
> +		return dev_err_probe(rdev, PTR_ERR(rst->map), "Cannot get parent syscon regmap\n");
> +
> +	return devm_reset_controller_register(dev, &rst->rcdev);
> +}
> +
> +static const struct auxiliary_device_id reset_zx297520v3_ids[] = {
> +	{
> +		.name = "clk_zte.zx297520v3_toprst",
> +		.driver_data = (kernel_ulong_t)&zx297520v3_top_info,
> +	},
> +	{
> +		.name = "clk_zte.zx297520v3_matrixrst",
> +		.driver_data = (kernel_ulong_t)&zx297520v3_matrix_info,
> +	},
> +	{
> +		.name = "clk_zte.zx297520v3_lsprst",
> +		.driver_data = (kernel_ulong_t)&zx297520v3_lsp_info,
> +	},
> +	{ },
> +};
> +

Drop this empty line.

> +MODULE_DEVICE_TABLE(auxiliary, reset_zx297520v3_ids);
> +
> +static struct auxiliary_driver reset_zx297520v3_drv = {
> +	.name = "zx297520v3_reset",
> +	.id_table = reset_zx297520v3_ids,
> +	.probe = reset_zx297520v3_probe,
> +};
> +

Drop this empty line.

> +module_auxiliary_driver(reset_zx297520v3_drv);
> +
> +MODULE_AUTHOR("Stefan Dösinger <stefandoesinger@gmail.com>");
> +MODULE_DESCRIPTION("ZTE zx297520v3 reset driver");
> +MODULE_LICENSE("GPL");

regards
Philipp



* Re: [PATCH v1 03/11] KVM: arm64: Use guard()/scoped_guard() in arm64 KVM EL1 code
From: Marc Zyngier @ 2026-06-18  9:23 UTC
  To: tabba
  Cc: Oliver Upton, Will Deacon, Catalin Marinas, Quentin Perret,
	Vincent Donnefort, Sebastian Ene, Per Larsen, Suzuki K Poulose,
	Zenghui Yu, Joey Gouly, Steffen Eiden, Mark Rutland,
	Jonathan Cameron, Hyunwoo Kim, linux-arm-kernel, kvmarm,
	linux-kernel
In-Reply-To: <20260612065925.755562-4-tabba@google.com>

On Fri, 12 Jun 2026 07:59:17 +0100,
tabba@google.com wrote:
> 
> Convert the manual mutex_lock()/spin_lock() pairs in
> arch/arm64/kvm/{pkvm,arm,mmu,reset,psci}.c to guard(mutex),
> guard(spinlock) and scoped_guard(), dropping unlock-only goto labels in
> favour of direct returns. Centralised cleanup gotos that still serve
> other resources are preserved.
> 
> reset.c uses scoped_guard() rather than guard() so the lock covers only
> the small read/update window inside kvm_reset_vcpu(), leaving the rest
> of the function outside the critical section.
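
For context on the pattern under discussion: the kernel's guard() helpers
build on the compiler's cleanup attribute, which runs a handler when a
variable goes out of scope. The userspace sketch below imitates the idea
with a toy lock; every name in it is hypothetical, it requires GCC or Clang,
and it is not the kernel's implementation.

```c
/* Toy lock that only tracks whether it is held; no threading involved. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l) { l->held++; }
static void toy_lock_release(struct toy_lock *l) { l->held--; }

/* Cleanup handler: receives a pointer to the guarded variable. */
static void toy_guard_exit(struct toy_lock **lp)
{
	toy_lock_release(*lp);
}

/* Take the lock now and release it when the enclosing scope ends. */
#define TOY_GUARD(lock)							\
	struct toy_lock *toy_guard_					\
		__attribute__((cleanup(toy_guard_exit), unused)) =	\
		(toy_lock_acquire((lock)), (lock))

static struct toy_lock demo_lock;
static int demo_val;

/* Both return paths drop the lock without an unlock-only goto label. */
static int add_positive_sketch(int delta)
{
	TOY_GUARD(&demo_lock);

	if (delta <= 0)
		return -1;	/* lock released here by the cleanup handler */

	demo_val += delta;
	return 0;		/* and here */
}
```

A scoped variant would wrap the same declaration in a nested block so the
lock covers only part of a function.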

To be brutally honest, I don't think this sort of widespread change
brings us anything. This is just churn.

Sure, if you are reworking a particular bit of code that is goto-heavy
for the purpose of error handling, this has the potential to clean up
the code *while you are changing it*.

But doing it for the sake of doing it? I think we have bigger fish to
fry right now.

	M.

-- 
Without deviation from the norm, progress is not possible.



* [PATCH v15 9/9] lib/tests: memcpy_kunit: add memcpy_mc() and memcpy_mc_large() test
From: Ruidong Tian @ 2026-06-18  9:21 UTC
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

memcpy_mc() is the Machine-Check safe memcpy variant that returns the
number of bytes NOT copied on a hardware memory error, or 0 on success.

Add two test cases modeled after the existing memcpy_test() and
memcpy_large_test() implementations.
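
To make the return convention concrete, here is a userspace sketch of a
caller's view. `memcpy_mc_sketch()` is a hypothetical stand-in that pretends
source bytes from index `poison_at` onward are unreadable; the extra
parameter and the static buffers exist only for illustration and are not
part of the real memcpy_mc() interface.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy up to 'len' bytes but stop at the simulated poisoned offset.
 * Returns the number of bytes NOT copied; 0 means the whole copy succeeded.
 */
static size_t memcpy_mc_sketch(void *dst, const void *src, size_t len,
			       size_t poison_at)
{
	size_t ok = len < poison_at ? len : poison_at;

	memcpy(dst, src, ok);
	return len - ok;
}

/* Caller side: turn the remainder into the count of bytes that arrived. */
static size_t bytes_copied_sketch(size_t len, size_t not_copied)
{
	return len - not_copied;
}

static const char sketch_src[8] = "abcdefg";
static char sketch_dst[8];
```

On clean memory the return value is always 0, which is the only path the
tests in this patch exercise.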

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 lib/tests/memcpy_kunit.c | 121 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 120 insertions(+), 1 deletion(-)

diff --git a/lib/tests/memcpy_kunit.c b/lib/tests/memcpy_kunit.c
index 812c1fa20fd9..87585fbe78c7 100644
--- a/lib/tests/memcpy_kunit.c
+++ b/lib/tests/memcpy_kunit.c
@@ -554,6 +554,121 @@ static void copy_mc_page_test(struct kunit *test)
 }
 #endif /* __HAVE_ARCH_COPY_MC_PAGE */
 
+#ifdef __HAVE_ARCH_MEMCPY_MC
+/*
+ * memcpy_mc() is a Machine-Check safe memcpy variant.
+ * Signature: int memcpy_mc(void *dst, const void *src, size_t len)
+ * Returns:   0 on success, or number of bytes NOT copied on MC error.
+ *
+ * In the normal (no-poison) path it must behave identically to memcpy()
+ * and always return 0.
+ */
+static void memcpy_mc_test(struct kunit *test)
+{
+#define TEST_OP "memcpy_mc"
+	struct some_bytes control = {
+		.data = { 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+		},
+	};
+	struct some_bytes zero = { };
+	struct some_bytes middle = {
+		.data = { 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x00, 0x00, 0x00, 0x00,
+			  0x00, 0x00, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+		},
+	};
+	struct some_bytes three = {
+		.data = { 0x00, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x00, 0x00, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			  0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
+			},
+	};
+	struct some_bytes dest = { };
+	unsigned long ret;
+	int count;
+	u8 *ptr;
+
+	/* Verify static initializers. */
+	check(control, 0x20);
+	check(zero, 0);
+	compare("static initializers", dest, zero);
+
+	/* Verify assignment. */
+	dest = control;
+	compare("direct assignment", dest, control);
+
+	/* Verify complete overwrite. */
+	ret = memcpy_mc(dest.data, zero.data, sizeof(dest.data));
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	compare("complete overwrite", dest, zero);
+
+	/* Verify middle overwrite: 7 bytes at offset 12. */
+	dest = control;
+	ret = memcpy_mc(dest.data + 12, zero.data, 7);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	compare("middle overwrite", dest, middle);
+
+	/* Verify zero-length copy is a no-op. */
+	dest = control;
+	ret = memcpy_mc(dest.data, zero.data, 0);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	compare("zero length", dest, control);
+
+	/* Verify argument side-effects aren't repeated. */
+	dest = control;
+	ptr = dest.data;
+	count = 1;
+	ret = memcpy_mc(ptr++, zero.data, count++);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	ptr += 8;
+	ret = memcpy_mc(ptr++, zero.data, count++);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+	compare("argument side-effects", dest, three);
+#undef TEST_OP
+}
+
+static void memcpy_mc_large_test(struct kunit *test)
+{
+	init_large(test);
+
+	/* Sweep 1..1024 bytes x shifting offset to cover all template paths. */
+	for (int bytes = 1; bytes <= ARRAY_SIZE(large_src); bytes++) {
+		for (int offset = 0; offset < ARRAY_SIZE(large_src); offset++) {
+			int right_zero_pos = offset + bytes;
+			int right_zero_size = ARRAY_SIZE(large_dst) - right_zero_pos;
+			int ret;
+
+			ret = memcpy_mc(large_dst + offset, large_src, bytes);
+			KUNIT_ASSERT_EQ_MSG(test, ret, 0,
+				"memcpy_mc returned %d with size %d at offset %d",
+				ret, bytes, offset);
+
+			/* No write before copy area. */
+			KUNIT_ASSERT_EQ_MSG(test,
+				memcmp(large_dst, large_zero, offset), 0,
+				"with size %d at offset %d", bytes, offset);
+			/* No write after copy area. */
+			KUNIT_ASSERT_EQ_MSG(test,
+				memcmp(&large_dst[right_zero_pos], large_zero,
+				       right_zero_size), 0,
+				"with size %d at offset %d", bytes, offset);
+			/* Byte-for-byte exact. */
+			KUNIT_ASSERT_EQ_MSG(test,
+				memcmp(large_dst + offset, large_src, bytes), 0,
+				"with size %d at offset %d", bytes, offset);
+
+			memset(large_dst + offset, 0, bytes);
+		}
+		cond_resched();
+	}
+}
+#endif /* __HAVE_ARCH_MEMCPY_MC */
+
 static struct kunit_case memcpy_test_cases[] = {
 	KUNIT_CASE(memset_test),
 	KUNIT_CASE(memcpy_test),
@@ -564,6 +679,10 @@ static struct kunit_case memcpy_test_cases[] = {
 	KUNIT_CASE(copy_page_test),
 #ifdef __HAVE_ARCH_COPY_MC_PAGE
 	KUNIT_CASE(copy_mc_page_test),
+#endif
+#ifdef __HAVE_ARCH_MEMCPY_MC
+	KUNIT_CASE(memcpy_mc_test),
+	KUNIT_CASE_SLOW(memcpy_mc_large_test),
 #endif
 	{}
 };
@@ -575,5 +694,5 @@ static struct kunit_suite memcpy_test_suite = {
 
 kunit_test_suite(memcpy_test_suite);
 
-MODULE_DESCRIPTION("test cases for memcpy(), memmove(), memset() and copy_page()");
+MODULE_DESCRIPTION("test cases for memcpy(), memmove(), memset(), copy_page() and memcpy_mc()");
 MODULE_LICENSE("GPL");
-- 
2.39.3




* [PATCH v15 8/9] lib/test: memcpy_kunit: add copy_page() and copy_mc_page() tests
From: Ruidong Tian @ 2026-06-18  9:21 UTC
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

Add KUnit tests for copy_page() and copy_mc_page(), modeled after
the existing memcpy_test() style. Each test uses a static page-aligned
src filled with random bytes with non-zero edges and a two-page dst,
then verifies byte-for-byte equality and that the adjacent page is
untouched.
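
The checking strategy can be sketched in plain userspace C. The 4096-byte
page size, the 0xA5 sentinel handling, and all `_sketch` names below are
assumptions made for illustration; the real tests use PAGE_SIZE and the
KUnit assertion macros.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE 4096u	/* assumed page size for this sketch */

static uint8_t sk_src[SKETCH_PAGE];
static uint8_t sk_dst[SKETCH_PAGE * 2];

/* Well-behaved copy of exactly one page. */
static void good_copy_sketch(void *d, const void *s)
{
	memcpy(d, s, SKETCH_PAGE);
}

/* Deliberately buggy copy that writes one byte past the page. */
static void overrun_copy_sketch(void *d, const void *s)
{
	memcpy(d, s, SKETCH_PAGE);
	((uint8_t *)d)[SKETCH_PAGE] = 0;
}

/* Returns 0 if the copy is exact and the adjacent page keeps its sentinel. */
static int check_page_copy_sketch(void (*copy)(void *, const void *))
{
	for (size_t i = 0; i < SKETCH_PAGE; i++)
		sk_src[i] = (uint8_t)(i * 31u + 7u);
	sk_src[0] |= 1;			/* force non-zero edges */
	sk_src[SKETCH_PAGE - 1] |= 1;

	memset(sk_dst, 0xA5, sizeof(sk_dst));	/* sentinel in both pages */
	copy(sk_dst, sk_src);

	if (memcmp(sk_dst, sk_src, SKETCH_PAGE) != 0)
		return -1;		/* content mismatch */
	for (size_t i = SKETCH_PAGE; i < 2 * SKETCH_PAGE; i++)
		if (sk_dst[i] != 0xA5)
			return -2;	/* overflow into adjacent page */
	return 0;
}
```

Filling the second page with a known pattern before the copy is what lets
the check distinguish "nothing was written there" from "zeros were written
there".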

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 lib/tests/memcpy_kunit.c | 67 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/lib/tests/memcpy_kunit.c b/lib/tests/memcpy_kunit.c
index d36933554e46..812c1fa20fd9 100644
--- a/lib/tests/memcpy_kunit.c
+++ b/lib/tests/memcpy_kunit.c
@@ -493,6 +493,67 @@ static void memmove_overlap_test(struct kunit *test)
 	}
 }
 
+/* --- Page-sized copy tests --- */
+
+static u8 page_src[PAGE_SIZE] __aligned(PAGE_SIZE);
+static u8 page_dst[PAGE_SIZE * 2] __aligned(PAGE_SIZE);
+static const u8 page_pattern[PAGE_SIZE] __aligned(PAGE_SIZE);
+
+static void init_page(struct kunit *test)
+{
+	/* Get many bit patterns. */
+	get_random_bytes(page_src, PAGE_SIZE);
+
+	/* Make sure we have non-zero edges. */
+	set_random_nonzero(test, &page_src[0]);
+	set_random_nonzero(test, &page_src[PAGE_SIZE - 1]);
+
+	memset(page_dst, 0xA5, ARRAY_SIZE(page_dst));
+	memset(page_pattern, 0xA5, PAGE_SIZE);
+}
+
+static void copy_page_test(struct kunit *test)
+{
+	init_page(test);
+
+	/* Copy. */
+	copy_page(page_dst, page_src);
+
+	/* Verify byte-for-byte exact. */
+	KUNIT_ASSERT_EQ_MSG(test,
+		memcmp(page_dst, page_src, PAGE_SIZE), 0,
+		"copy_page content mismatch with random data");
+
+	/* Verify no overflow into second page. */
+	KUNIT_ASSERT_EQ_MSG(test,
+		memcmp(page_dst + PAGE_SIZE, page_pattern, PAGE_SIZE), 0,
+		"copy_page overflow into adjacent page");
+}
+
+#ifdef __HAVE_ARCH_COPY_MC_PAGE
+static void copy_mc_page_test(struct kunit *test)
+{
+	int ret;
+
+	init_page(test);
+
+	/* Copy and check return value. */
+	ret = copy_mc_page(page_dst, page_src);
+	KUNIT_ASSERT_EQ_MSG(test, ret, 0,
+		"copy_mc_page returned %d on clean memory", ret);
+
+	/* Verify byte-for-byte exact. */
+	KUNIT_ASSERT_EQ_MSG(test,
+		memcmp(page_dst, page_src, PAGE_SIZE), 0,
+		"copy_mc_page content mismatch with random data");
+
+	/* Verify no overflow into second page. */
+	KUNIT_ASSERT_EQ_MSG(test,
+		memcmp(page_dst + PAGE_SIZE, page_pattern, PAGE_SIZE), 0,
+		"copy_mc_page overflow into adjacent page");
+}
+#endif /* __HAVE_ARCH_COPY_MC_PAGE */
+
 static struct kunit_case memcpy_test_cases[] = {
 	KUNIT_CASE(memset_test),
 	KUNIT_CASE(memcpy_test),
@@ -500,6 +561,10 @@ static struct kunit_case memcpy_test_cases[] = {
 	KUNIT_CASE_SLOW(memmove_test),
 	KUNIT_CASE_SLOW(memmove_large_test),
 	KUNIT_CASE_SLOW(memmove_overlap_test),
+	KUNIT_CASE(copy_page_test),
+#ifdef __HAVE_ARCH_COPY_MC_PAGE
+	KUNIT_CASE(copy_mc_page_test),
+#endif
 	{}
 };
 
@@ -510,5 +575,5 @@ static struct kunit_suite memcpy_test_suite = {
 
 kunit_test_suite(memcpy_test_suite);
 
-MODULE_DESCRIPTION("test cases for memcpy(), memmove(), and memset()");
+MODULE_DESCRIPTION("test cases for memcpy(), memmove(), memset() and copy_page()");
 MODULE_LICENSE("GPL");
-- 
2.39.3




* [PATCH v15 7/9] arm64: introduce copy_mc_to_kernel() implementation
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

From: Tong Tiangen <tongtiangen@huawei.com>

The copy_mc_to_kernel() helper is a memory copy implementation that handles
source exceptions. It can be used in memory copy scenarios that tolerate
hardware memory errors (e.g. pmem_read/dax_copy_to_iter).

Currently, only x86 and ppc support this helper. Add it for arm64 as
well, if ARCH_HAS_COPY_MC is defined, by implementing the copy_mc_to_kernel()
and memcpy_mc() functions.

Because no caller-saved GPR is available for saving "bytes not copied" in
memcpy(), memcpy_mc() is modeled on the implementation of copy_from_user().
In addition, the fixup of MOPS instructions is not considered at present.

[Ruidong: refactor memcpy_mc on top of the new memcpy implementation.]

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
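
The return convention described in the commit message (0 on success,
otherwise the number of bytes not copied) can be illustrated with a minimal
userspace sketch. This is not the arm64 implementation, which relies on
extable fixups for faulting loads. The names mock_copy_mc(),
mock_read_block() and the poison_off parameter are hypothetical and only
simulate a poisoned source offset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a machine-check-safe copy. A read at or past
 * source offset 'poison_off' is treated as a consumed hardware error.
 * Returns 0 on success, or the number of bytes not copied.
 */
static size_t mock_copy_mc(void *dst, const void *src, size_t len,
			   size_t poison_off)
{
	size_t ok = len < poison_off ? len : poison_off;

	assert(dst && src);
	memcpy(dst, src, ok);	/* only the clean prefix is copied */
	return len - ok;
}

/*
 * Hypothetical caller, loosely in the style of a pmem read path:
 * any non-zero remainder is reported as an I/O error.
 */
static int mock_read_block(void *dst, const void *src, size_t len,
			   size_t poison_off)
{
	return mock_copy_mc(dst, src, len, poison_off) ? -EIO : 0;
}
```

A caller in this style never inspects the destination after a non-zero
return; it only propagates the error.
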
 arch/arm64/include/asm/string.h  |   5 +
 arch/arm64/include/asm/uaccess.h |  17 +++
 arch/arm64/lib/Makefile          |   2 +-
 arch/arm64/lib/memcpy.S          | 251 +++----------------------------
 arch/arm64/lib/memcpy_mc.S       |  56 +++++++
 arch/arm64/lib/memcpy_template.S | 250 ++++++++++++++++++++++++++++++
 mm/kasan/shadow.c                |  12 ++
 7 files changed, 359 insertions(+), 234 deletions(-)
 create mode 100644 arch/arm64/lib/memcpy_mc.S
 create mode 100644 arch/arm64/lib/memcpy_template.S

diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3a3264ff47b9..2e81f6c00cdd 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t);
 extern void *memcpy(void *, const void *, __kernel_size_t);
 extern void *__memcpy(void *, const void *, __kernel_size_t);
 
+#define __HAVE_ARCH_MEMCPY_MC
+extern unsigned long memcpy_mc(void *, const void *, __kernel_size_t);
+extern unsigned long __memcpy_mc(void *, const void *, __kernel_size_t);
+
 #define __HAVE_ARCH_MEMMOVE
 extern void *memmove(void *, const void *, __kernel_size_t);
 extern void *__memmove(void *, const void *, __kernel_size_t);
@@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t cnt);
  */
 
 #define memcpy(dst, src, len) __memcpy(dst, src, len)
+#define memcpy_mc(dst, src, len) __memcpy_mc(dst, src, len)
 #define memmove(dst, src, len) __memmove(dst, src, len)
 #define memset(s, c, n) __memset(s, c, n)
 
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index b0c83a08dda9..93277eca2268 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -499,5 +499,22 @@ static inline size_t probe_subpage_writeable(const char __user *uaddr,
 }
 
 #endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/**
+ * copy_mc_to_kernel - memory copy that handles source exceptions
+ *
+ * @to:		destination address
+ * @from:	source address
+ * @size:	number of bytes to copy
+ *
+ * Return 0 for success, or bytes not copied.
+ */
+static inline unsigned long __must_check
+copy_mc_to_kernel(void *to, const void *from, unsigned long size)
+{
+	return memcpy_mc(to, from, size);
+}
+#define copy_mc_to_kernel copy_mc_to_kernel
+#endif
 
 #endif /* __ASM_UACCESS_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 1f4c3f743a20..a5820e6c33d4 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -7,7 +7,7 @@ lib-y		:= clear_user.o delay.o copy_from_user.o		\
 
 lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
 
-lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc_page.o
+lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc_page.o memcpy_mc.o
 
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 
diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
index 9b99106fb95f..ab48c5c798d1 100644
--- a/arch/arm64/lib/memcpy.S
+++ b/arch/arm64/lib/memcpy.S
@@ -15,247 +15,32 @@
  *
  */
 
-#define L(label) .L ## label
+	.macro ldrb1 reg, addr:vararg
+	ldrb  \reg, \addr
+	.endm
 
-#define dstin	x0
-#define src	x1
-#define count	x2
-#define dst	x3
-#define srcend	x4
-#define dstend	x5
-#define A_l	x6
-#define A_lw	w6
-#define A_h	x7
-#define B_l	x8
-#define B_lw	w8
-#define B_h	x9
-#define C_l	x10
-#define C_lw	w10
-#define C_h	x11
-#define D_l	x12
-#define D_h	x13
-#define E_l	x14
-#define E_h	x15
-#define F_l	x16
-#define F_h	x17
-#define G_l	count
-#define G_h	dst
-#define H_l	src
-#define H_h	srcend
-#define tmp1	x14
+	.macro ldr1 reg, addr:vararg
+	ldr   \reg, \addr
+	.endm
 
-/* This implementation handles overlaps and supports both memcpy and memmove
-   from a single entry point.  It uses unaligned accesses and branchless
-   sequences to keep the code small, simple and improve performance.
+	.macro ldp1 reg1, reg2, addr:vararg
+	ldp   \reg1, \reg2, \addr
+	.endm
 
-   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
-   copies of up to 128 bytes, and large copies.  The overhead of the overlap
-   check is negligible since it is only required for large copies.
-
-   Large copies use a software pipelined loop processing 64 bytes per iteration.
-   The destination pointer is 16-byte aligned to minimize unaligned accesses.
-   The loop tail is handled by always copying 64 bytes from the end.
-*/
-
-SYM_FUNC_START_LOCAL(__pi_memcpy_generic)
-	add	srcend, src, count
-	add	dstend, dstin, count
-	cmp	count, 128
-	b.hi	L(copy_long)
-	cmp	count, 32
-	b.hi	L(copy32_128)
-
-	/* Small copies: 0..32 bytes.  */
-	cmp	count, 16
-	b.lo	L(copy16)
-	ldp	A_l, A_h, [src]
-	ldp	D_l, D_h, [srcend, -16]
-	stp	A_l, A_h, [dstin]
-	stp	D_l, D_h, [dstend, -16]
-	ret
-
-	/* Copy 8-15 bytes.  */
-L(copy16):
-	tbz	count, 3, L(copy8)
-	ldr	A_l, [src]
-	ldr	A_h, [srcend, -8]
-	str	A_l, [dstin]
-	str	A_h, [dstend, -8]
-	ret
-
-	.p2align 3
-	/* Copy 4-7 bytes.  */
-L(copy8):
-	tbz	count, 2, L(copy4)
-	ldr	A_lw, [src]
-	ldr	B_lw, [srcend, -4]
-	str	A_lw, [dstin]
-	str	B_lw, [dstend, -4]
+	.macro ret1
 	ret
+	.endm
 
-	/* Copy 0..3 bytes using a branchless sequence.  */
-L(copy4):
-	cbz	count, L(copy0)
-	lsr	tmp1, count, 1
-	ldrb	A_lw, [src]
-	ldrb	C_lw, [srcend, -1]
-	ldrb	B_lw, [src, tmp1]
-	strb	A_lw, [dstin]
-	strb	B_lw, [dstin, tmp1]
-	strb	C_lw, [dstend, -1]
-L(copy0):
-	ret
-
-	.p2align 4
-	/* Medium copies: 33..128 bytes.  */
-L(copy32_128):
-	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [srcend, -32]
-	ldp	D_l, D_h, [srcend, -16]
-	cmp	count, 64
-	b.hi	L(copy128)
-	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstend, -32]
-	stp	D_l, D_h, [dstend, -16]
-	ret
-
-	.p2align 4
-	/* Copy 65..128 bytes.  */
-L(copy128):
-	ldp	E_l, E_h, [src, 32]
-	ldp	F_l, F_h, [src, 48]
-	cmp	count, 96
-	b.ls	L(copy96)
-	ldp	G_l, G_h, [srcend, -64]
-	ldp	H_l, H_h, [srcend, -48]
-	stp	G_l, G_h, [dstend, -64]
-	stp	H_l, H_h, [dstend, -48]
-L(copy96):
-	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	E_l, E_h, [dstin, 32]
-	stp	F_l, F_h, [dstin, 48]
-	stp	C_l, C_h, [dstend, -32]
-	stp	D_l, D_h, [dstend, -16]
-	ret
-
-	.p2align 4
-	/* Copy more than 128 bytes.  */
-L(copy_long):
-	/* Use backwards copy if there is an overlap.  */
-	sub	tmp1, dstin, src
-	cbz	tmp1, L(copy0)
-	cmp	tmp1, count
-	b.lo	L(copy_long_backwards)
-
-	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
-
-	ldp	D_l, D_h, [src]
-	and	tmp1, dstin, 15
-	bic	dst, dstin, 15
-	sub	src, src, tmp1
-	add	count, count, tmp1	/* Count is now 16 too large.  */
-	ldp	A_l, A_h, [src, 16]
-	stp	D_l, D_h, [dstin]
-	ldp	B_l, B_h, [src, 32]
-	ldp	C_l, C_h, [src, 48]
-	ldp	D_l, D_h, [src, 64]!
-	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	L(copy64_from_end)
-
-L(loop64):
-	stp	A_l, A_h, [dst, 16]
-	ldp	A_l, A_h, [src, 16]
-	stp	B_l, B_h, [dst, 32]
-	ldp	B_l, B_h, [src, 32]
-	stp	C_l, C_h, [dst, 48]
-	ldp	C_l, C_h, [src, 48]
-	stp	D_l, D_h, [dst, 64]!
-	ldp	D_l, D_h, [src, 64]!
-	subs	count, count, 64
-	b.hi	L(loop64)
-
-	/* Write the last iteration and copy 64 bytes from the end.  */
-L(copy64_from_end):
-	ldp	E_l, E_h, [srcend, -64]
-	stp	A_l, A_h, [dst, 16]
-	ldp	A_l, A_h, [srcend, -48]
-	stp	B_l, B_h, [dst, 32]
-	ldp	B_l, B_h, [srcend, -32]
-	stp	C_l, C_h, [dst, 48]
-	ldp	C_l, C_h, [srcend, -16]
-	stp	D_l, D_h, [dst, 64]
-	stp	E_l, E_h, [dstend, -64]
-	stp	A_l, A_h, [dstend, -48]
-	stp	B_l, B_h, [dstend, -32]
-	stp	C_l, C_h, [dstend, -16]
-	ret
-
-	.p2align 4
-
-	/* Large backwards copy for overlapping copies.
-	   Copy 16 bytes and then align dst to 16-byte alignment.  */
-L(copy_long_backwards):
-	ldp	D_l, D_h, [srcend, -16]
-	and	tmp1, dstend, 15
-	sub	srcend, srcend, tmp1
-	sub	count, count, tmp1
-	ldp	A_l, A_h, [srcend, -16]
-	stp	D_l, D_h, [dstend, -16]
-	ldp	B_l, B_h, [srcend, -32]
-	ldp	C_l, C_h, [srcend, -48]
-	ldp	D_l, D_h, [srcend, -64]!
-	sub	dstend, dstend, tmp1
-	subs	count, count, 128
-	b.ls	L(copy64_from_start)
-
-L(loop64_backwards):
-	stp	A_l, A_h, [dstend, -16]
-	ldp	A_l, A_h, [srcend, -16]
-	stp	B_l, B_h, [dstend, -32]
-	ldp	B_l, B_h, [srcend, -32]
-	stp	C_l, C_h, [dstend, -48]
-	ldp	C_l, C_h, [srcend, -48]
-	stp	D_l, D_h, [dstend, -64]!
-	ldp	D_l, D_h, [srcend, -64]!
-	subs	count, count, 64
-	b.hi	L(loop64_backwards)
-
-	/* Write the last iteration and copy 64 bytes from the start.  */
-L(copy64_from_start):
-	ldp	G_l, G_h, [src, 48]
-	stp	A_l, A_h, [dstend, -16]
-	ldp	A_l, A_h, [src, 32]
-	stp	B_l, B_h, [dstend, -32]
-	ldp	B_l, B_h, [src, 16]
-	stp	C_l, C_h, [dstend, -48]
-	ldp	C_l, C_h, [src]
-	stp	D_l, D_h, [dstend, -64]
-	stp	G_l, G_h, [dstin, 48]
-	stp	A_l, A_h, [dstin, 32]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstin]
-	ret
-SYM_FUNC_END(__pi_memcpy_generic)
-
-#ifdef CONFIG_AS_HAS_MOPS
+	.macro cpy1 dst, src, count
 	.arch_extension mops
-SYM_FUNC_START(__pi_memcpy)
-alternative_if_not ARM64_HAS_MOPS
-	b	__pi_memcpy_generic
-alternative_else_nop_endif
+	cpyp [\dst]!, [\src]!, \count!
+	cpym [\dst]!, [\src]!, \count!
+	cpye [\dst]!, [\src]!, \count!
+	.endm
 
-	mov	dst, dstin
-	cpyp	[dst]!, [src]!, count!
-	cpym	[dst]!, [src]!, count!
-	cpye	[dst]!, [src]!, count!
-	ret
+SYM_FUNC_START(__pi_memcpy)
+#include "memcpy_template.S"
 SYM_FUNC_END(__pi_memcpy)
-#else
-SYM_FUNC_ALIAS(__pi_memcpy, __pi_memcpy_generic)
-#endif
 
 SYM_FUNC_ALIAS(__memcpy, __pi_memcpy)
 EXPORT_SYMBOL(__memcpy)
diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S
new file mode 100644
index 000000000000..d9ce8279d91f
--- /dev/null
+++ b/arch/arm64/lib/memcpy_mc.S
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2012-2021, Arm Limited.
+ *
+ * Adapted from the original at:
+ * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+#include <asm/asm-uaccess.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ */
+
+	.macro ldrb1 reg, addr:vararg
+	KERNEL_SEA(9998f, ldrb  \reg, \addr)
+	.endm
+
+	.macro ldr1 reg, addr:vararg
+	KERNEL_SEA(9998f, ldr   \reg, \addr)
+	.endm
+
+	.macro ldp1 reg1, reg2, addr:vararg
+	KERNEL_SEA(9998f, ldp   \reg1, \reg2, \addr)
+	.endm
+
+	.macro ret1
+	mov	x0, #0
+	ret
+	.endm
+
+	.macro cpy1 dst, src, count
+	.arch_extension mops
+	KERNEL_SEA(9998f, cpyp [\dst]!, [\src]!, \count!)
+	KERNEL_SEA(9996f, cpym [\dst]!, [\src]!, \count!)
+	KERNEL_SEA(9996f, cpye [\dst]!, [\src]!, \count!)
+	.endm
+
+SYM_FUNC_START(__memcpy_mc)
+#include "memcpy_template.S"
+
+	// Exception fixups
+9996:	b.cs	9998f
+	// Registers are in Option A format
+	add	dst, dst, count
+9998:	sub	x0, dstend, dstin			// bytes not copied
+	ret
+SYM_FUNC_END(__memcpy_mc)
+
+EXPORT_SYMBOL(__memcpy_mc)
+SYM_FUNC_ALIAS_WEAK(memcpy_mc, __memcpy_mc)
+EXPORT_SYMBOL(memcpy_mc)
diff --git a/arch/arm64/lib/memcpy_template.S b/arch/arm64/lib/memcpy_template.S
new file mode 100644
index 000000000000..a8b496f8f651
--- /dev/null
+++ b/arch/arm64/lib/memcpy_template.S
@@ -0,0 +1,250 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2012-2021, Arm Limited.
+ *
+ * Adapted from the original at:
+ * https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ */
+
+#define L(label) .L ## label
+
+#define dstin	x0
+#define src	x1
+#define count	x2
+#define dst	x3
+#define srcend	x4
+#define dstend	x5
+#define A_l	x6
+#define A_lw	w6
+#define A_h	x7
+#define B_l	x8
+#define B_lw	w8
+#define B_h	x9
+#define C_l	x10
+#define C_lw	w10
+#define C_h	x11
+#define D_l	x12
+#define D_h	x13
+#define E_l	x14
+#define E_h	x15
+#define F_l	x16
+#define F_h	x17
+#define G_l	count
+#define G_h	dst
+#define H_l	src
+#define H_h	srcend
+#define tmp1	x14
+
+/* This implementation handles overlaps and supports both memcpy and memmove
+   from a single entry point.  It uses unaligned accesses and branchless
+   sequences to keep the code small, simple and improve performance.
+
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check is negligible since it is only required for large copies.
+
+   Large copies use a software pipelined loop processing 64 bytes per iteration.
+   The destination pointer is 16-byte aligned to minimize unaligned accesses.
+   The loop tail is handled by always copying 64 bytes from the end.
+*/
+
+	add	dstend, dstin, count
+
+#ifdef CONFIG_AS_HAS_MOPS
+alternative_if_not ARM64_HAS_MOPS
+	b	L(no_mops)
+alternative_else_nop_endif
+	mov	dst, dstin
+	cpy1	dst, src, count
+	ret1
+#endif
+
+L(no_mops):
+	add	srcend, src, count
+	cmp	count, 128
+	b.hi	L(copy_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldp1	A_l, A_h, [src]
+	ldp1	D_l, D_h, [srcend, -16]
+	stp	A_l, A_h, [dstin]
+	stp	D_l, D_h, [dstend, -16]
+	ret1
+
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
+	ldr1	A_l, [src]
+	ldr1	A_h, [srcend, -8]
+	str	A_l, [dstin]
+	str	A_h, [dstend, -8]
+	ret1
+
+	.p2align 3
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
+	ldr1	A_lw, [src]
+	ldr1	B_lw, [srcend, -4]
+	str	A_lw, [dstin]
+	str	B_lw, [dstend, -4]
+	ret1
+
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
+	lsr	tmp1, count, 1
+	ldrb1	A_lw, [src]
+	ldrb1	C_lw, [srcend, -1]
+	ldrb1	B_lw, [src, tmp1]
+	strb	A_lw, [dstin]
+	strb	B_lw, [dstin, tmp1]
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret1
+
+	.p2align 4
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp1	A_l, A_h, [src]
+	ldp1	B_l, B_h, [src, 16]
+	ldp1	C_l, C_h, [srcend, -32]
+	ldp1	D_l, D_h, [srcend, -16]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
+	ret1
+
+	.p2align 4
+	/* Copy 65..128 bytes.  */
+L(copy128):
+	ldp1	E_l, E_h, [src, 32]
+	ldp1	F_l, F_h, [src, 48]
+	cmp	count, 96
+	b.ls	L(copy96)
+	ldp1	G_l, G_h, [srcend, -64]
+	ldp1	H_l, H_h, [srcend, -48]
+	stp	G_l, G_h, [dstend, -64]
+	stp	H_l, H_h, [dstend, -48]
+L(copy96):
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	E_l, E_h, [dstin, 32]
+	stp	F_l, F_h, [dstin, 48]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
+	ret1
+
+	.p2align 4
+	/* Copy more than 128 bytes.  */
+L(copy_long):
+	/* Use backwards copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(copy0)
+	cmp	tmp1, count
+	b.lo	L(copy_long_backwards)
+
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+
+	ldp1	D_l, D_h, [src]
+	and	tmp1, dstin, 15
+	bic	dst, dstin, 15
+	sub	src, src, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldp1	A_l, A_h, [src, 16]
+	stp	D_l, D_h, [dstin]
+	ldp1	B_l, B_h, [src, 32]
+	ldp1	C_l, C_h, [src, 48]
+	ldp1	D_l, D_h, [src, 64]!
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(copy64_from_end)
+
+L(loop64):
+	stp	A_l, A_h, [dst, 16]
+	ldp1	A_l, A_h, [src, 16]
+	stp	B_l, B_h, [dst, 32]
+	ldp1	B_l, B_h, [src, 32]
+	stp	C_l, C_h, [dst, 48]
+	ldp1	C_l, C_h, [src, 48]
+	stp	D_l, D_h, [dst, 64]!
+	ldp1	D_l, D_h, [src, 64]!
+	subs	count, count, 64
+	b.hi	L(loop64)
+
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
+	ldp1	E_l, E_h, [srcend, -64]
+	stp	A_l, A_h, [dst, 16]
+	ldp1	A_l, A_h, [srcend, -48]
+	stp	B_l, B_h, [dst, 32]
+	ldp1	B_l, B_h, [srcend, -32]
+	stp	C_l, C_h, [dst, 48]
+	ldp1	C_l, C_h, [srcend, -16]
+	stp	D_l, D_h, [dst, 64]
+	stp	E_l, E_h, [dstend, -64]
+	stp	A_l, A_h, [dstend, -48]
+	stp	B_l, B_h, [dstend, -32]
+	stp	C_l, C_h, [dstend, -16]
+	ret1
+
+	.p2align 4
+
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align dst to 16-byte alignment.  */
+L(copy_long_backwards):
+	ldp1	D_l, D_h, [srcend, -16]
+	and	tmp1, dstend, 15
+	sub	srcend, srcend, tmp1
+	sub	count, count, tmp1
+	ldp1	A_l, A_h, [srcend, -16]
+	stp	D_l, D_h, [dstend, -16]
+	ldp1	B_l, B_h, [srcend, -32]
+	ldp1	C_l, C_h, [srcend, -48]
+	ldp1	D_l, D_h, [srcend, -64]!
+	sub	dstend, dstend, tmp1
+	subs	count, count, 128
+	b.ls	L(copy64_from_start)
+
+L(loop64_backwards):
+	stp	A_l, A_h, [dstend, -16]
+	ldp1	A_l, A_h, [srcend, -16]
+	stp	B_l, B_h, [dstend, -32]
+	ldp1	B_l, B_h, [srcend, -32]
+	stp	C_l, C_h, [dstend, -48]
+	ldp1	C_l, C_h, [srcend, -48]
+	stp	D_l, D_h, [dstend, -64]!
+	ldp1	D_l, D_h, [srcend, -64]!
+	subs	count, count, 64
+	b.hi	L(loop64_backwards)
+
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
+	ldp1	G_l, G_h, [src, 48]
+	stp	A_l, A_h, [dstend, -16]
+	ldp1	A_l, A_h, [src, 32]
+	stp	B_l, B_h, [dstend, -32]
+	ldp1	B_l, B_h, [src, 16]
+	stp	C_l, C_h, [dstend, -48]
+	ldp1	C_l, C_h, [src]
+	stp	D_l, D_h, [dstend, -64]
+	stp	G_l, G_h, [dstin, 48]
+	stp	A_l, A_h, [dstin, 32]
+	stp	B_l, B_h, [dstin, 16]
+	stp	C_l, C_h, [dstin]
+	ret1
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index d286e0a04543..da21a13151b9 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -79,6 +79,18 @@ void *memcpy(void *dest, const void *src, size_t len)
 }
 #endif
 
+#ifdef __HAVE_ARCH_MEMCPY_MC
+#undef memcpy_mc
+unsigned long memcpy_mc(void *dest, const void *src, size_t len)
+{
+	if (!kasan_check_range(src, len, false, _RET_IP_) ||
+	    !kasan_check_range(dest, len, true, _RET_IP_))
+		return len;
+
+	return __memcpy_mc(dest, src, len);
+}
+#endif
+
 void *__asan_memset(void *addr, int c, ssize_t len)
 {
 	if (!kasan_check_range(addr, len, true, _RET_IP_))
-- 
2.39.3




* [PATCH v15 6/9] arm64: support copy_mc_[user]_highpage()
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

From: Tong Tiangen <tongtiangen@huawei.com>

Currently, the kernel supports many scenarios that can tolerate memory errors
when copying a page[1~9], all of which are implemented by
copy_mc_[user]_highpage(). arm64 should also support this mechanism.

Due to MTE, arm64 needs its own architecture implementation of
copy_mc_[user]_highpage(). The macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
__HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.

Add a new helper, copy_mc_page(), which provides a page copy implementation
that is safe against hardware memory errors. The code logic of copy_mc_page()
is the same as copy_page(); the main difference is that the ldp instructions
of copy_mc_page() carry the fixup type EX_TYPE_KACCESS_SEA. Therefore, the
main logic is extracted to copy_page_template.S. In addition, the fixup of
MOPS instructions is not considered at present.

[Ruidong: add FEAT_MOPS support]

[1] commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
[2] commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
[3] commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
[4] commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
[5] commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
[6] commit 658be46520ce ("mm: support poison recovery from copy_present_page()")
[7] commit aa549f923f5e ("mm: support poison recovery from do_cow_fault()")
[8] commit f00b295b9b61 ("fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()")
[9] commit 060913999d7a ("mm: migrate: support poisoned recover from migrate folio")

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
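
The template approach described in the commit message (one copy loop body,
instantiated with either a plain or a fault-aware load hook) can be
sketched in userspace C. This is only an analogy for how
copy_page_template.S is shared between copy_page.S and copy_mc_page.S.
Every name here (SKETCH_PAGE_WORDS, sketch_poison_idx, LOAD_PLAIN,
LOAD_CHECKED, DEFINE_PAGE_COPY, sketch_copy_page, sketch_copy_mc_page) is
hypothetical, and the poison index merely simulates a load that takes a
synchronous external abort.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_WORDS 512	/* hypothetical page of 512 64-bit words */

/* Hypothetical poison marker: index of a word whose load "faults", or -1. */
static long sketch_poison_idx = -1;

/* Plain load hook: never fails, like a bare ldp. */
#define LOAD_PLAIN(out, p, i)	((out) = (p)[(i)], 0)

/* Checked load hook: reports a fault, like an ldp with an extable entry. */
#define LOAD_CHECKED(out, p, i) \
	((long)(i) == sketch_poison_idx ? -1 : ((out) = (p)[(i)], 0))

/* One loop body, instantiated once per load hook. */
#define DEFINE_PAGE_COPY(name, LOAD)					\
static int name(uint64_t *dst, const uint64_t *src)			\
{									\
	for (size_t i = 0; i < SKETCH_PAGE_WORDS; i++) {		\
		uint64_t v;						\
									\
		if (LOAD(v, src, i))					\
			return -EFAULT;	/* fixup path */		\
		dst[i] = v;						\
	}								\
	return 0;							\
}

DEFINE_PAGE_COPY(sketch_copy_page, LOAD_PLAIN)
DEFINE_PAGE_COPY(sketch_copy_mc_page, LOAD_CHECKED)
```

Both instances share the same loop, so a change to the copy logic only has
to be made once, and only the checked variant has a failure path.
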
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/asm-extable.h |  4 ++
 arch/arm64/include/asm/mte.h         |  9 ++++
 arch/arm64/include/asm/page.h        | 12 +++++
 arch/arm64/lib/Makefile              |  2 +
 arch/arm64/lib/copy_mc_page.S        | 44 +++++++++++++++
 arch/arm64/lib/copy_page.S           | 67 ++++-------------------
 arch/arm64/lib/copy_page_template.S  | 70 ++++++++++++++++++++++++
 arch/arm64/lib/mte.S                 | 29 ++++++++++
 arch/arm64/mm/copypage.c             | 80 ++++++++++++++++++++++++++++
 include/linux/highmem.h              |  8 +++
 11 files changed, 270 insertions(+), 56 deletions(-)
 create mode 100644 arch/arm64/lib/copy_mc_page.S
 create mode 100644 arch/arm64/lib/copy_page_template.S

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..831b20d45893 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -21,6 +21,7 @@ config ARM64
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_CC_PLATFORM
 	select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+	select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL
 	select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/arm64/include/asm/asm-extable.h b/arch/arm64/include/asm/asm-extable.h
index 8450ec5a3af6..9305ea77482a 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -10,6 +10,10 @@
 #define EX_TYPE_ACCESS_ERR_ZERO		2
 #define EX_TYPE_UACCESS_CPY		3
 #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD	4
+/*
+ * Kernel access: used in kernel context for both regular load/store
+ * instructions and MOPS (memory copy/set) instructions.
+ */
 #define EX_TYPE_KACCESS_SEA		5
 
 /* Data fields for EX_TYPE_ACCESS_ERR_ZERO */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 7f7b97e09996..a0b1757f4847 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -98,6 +98,11 @@ static inline bool try_page_mte_tagging(struct page *page)
 void mte_zero_clear_page_tags(void *addr);
 void mte_sync_tags(pte_t pte, unsigned int nr_pages);
 void mte_copy_page_tags(void *kto, const void *kfrom);
+
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+int mte_copy_mc_page_tags(void *kto, const void *kfrom);
+#endif
+
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
 void mte_cpu_setup(void);
@@ -134,6 +139,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
 {
 }
+static inline int mte_copy_mc_page_tags(void *kto, const void *kfrom)
+{
+	return 0;
+}
 static inline void mte_thread_init_user(void)
 {
 }
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index e25d0d18f6d7..5c4c9f974b68 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -29,6 +29,18 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+int copy_mc_page(void *to, const void *from);
+#define __HAVE_ARCH_COPY_MC_PAGE
+
+int copy_mc_highpage(struct page *to, struct page *from);
+#define __HAVE_ARCH_COPY_MC_HIGHPAGE
+
+int copy_mc_user_highpage(struct page *to, struct page *from,
+		unsigned long vaddr, struct vm_area_struct *vma);
+#define __HAVE_ARCH_COPY_MC_USER_HIGHPAGE
+#endif
+
 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 						unsigned long vaddr);
 #define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 448c917494f3..1f4c3f743a20 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -7,6 +7,8 @@ lib-y		:= clear_user.o delay.o copy_from_user.o		\
 
 lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
 
+lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_mc_page.o
+
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 
 obj-$(CONFIG_ARM64_MTE) += mte.o
diff --git a/arch/arm64/lib/copy_mc_page.S b/arch/arm64/lib/copy_mc_page.S
new file mode 100644
index 000000000000..f936e0c98611
--- /dev/null
+++ b/arch/arm64/lib/copy_mc_page.S
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <linux/linkage.h>
+#include <linux/const.h>
+#include <asm/assembler.h>
+#include <asm/page.h>
+#include <asm/cpufeature.h>
+#include <asm/alternative.h>
+#include <asm/asm-extable.h>
+#include <asm/asm-uaccess.h>
+
+/*
+ * Copy a page from src to dest (both are page aligned), tolerating memory errors
+ *
+ * Parameters:
+ *	x0 - dest
+ *	x1 - src
+ * Returns:
+ * 	x0 - Return 0 if copy success, or -EFAULT if anything goes wrong
+ *	     while copying.
+ */
+	.macro ldp1 reg1, reg2, ptr, val
+	KERNEL_SEA(9998f, ldp \reg1, \reg2, [\ptr, \val])
+	.endm
+
+	.macro cpy1 dst, src, count
+	.arch_extension mops
+	KERNEL_SEA(9998f, cpypwn [\dst]!, [\src]!, \count!)
+	KERNEL_SEA(9998f, cpymwn [\dst]!, [\src]!, \count!)
+	KERNEL_SEA(9998f, cpyewn [\dst]!, [\src]!, \count!)
+	.endm
+
+SYM_FUNC_START(__pi_copy_mc_page)
+#include "copy_page_template.S"
+
+	mov x0, #0
+	ret
+
+9998:	mov x0, #-EFAULT
+	ret
+
+SYM_FUNC_END(__pi_copy_mc_page)
+SYM_FUNC_ALIAS(copy_mc_page, __pi_copy_mc_page)
+EXPORT_SYMBOL(copy_mc_page)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index e6374e7e5511..e520777b5150 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -17,65 +17,20 @@
  *	x0 - dest
  *	x1 - src
  */
-SYM_FUNC_START(__pi_copy_page)
-#ifdef CONFIG_AS_HAS_MOPS
-	.arch_extension mops
-alternative_if_not ARM64_HAS_MOPS
-	b	.Lno_mops
-alternative_else_nop_endif
-
-	mov	x2, #PAGE_SIZE
-	cpypwn	[x0]!, [x1]!, x2!
-	cpymwn	[x0]!, [x1]!, x2!
-	cpyewn	[x0]!, [x1]!, x2!
-	ret
-.Lno_mops:
-#endif
-	ldp	x2, x3, [x1]
-	ldp	x4, x5, [x1, #16]
-	ldp	x6, x7, [x1, #32]
-	ldp	x8, x9, [x1, #48]
-	ldp	x10, x11, [x1, #64]
-	ldp	x12, x13, [x1, #80]
-	ldp	x14, x15, [x1, #96]
-	ldp	x16, x17, [x1, #112]
-
-	add	x0, x0, #256
-	add	x1, x1, #128
-1:
-	tst	x0, #(PAGE_SIZE - 1)
 
-	stnp	x2, x3, [x0, #-256]
-	ldp	x2, x3, [x1]
-	stnp	x4, x5, [x0, #16 - 256]
-	ldp	x4, x5, [x1, #16]
-	stnp	x6, x7, [x0, #32 - 256]
-	ldp	x6, x7, [x1, #32]
-	stnp	x8, x9, [x0, #48 - 256]
-	ldp	x8, x9, [x1, #48]
-	stnp	x10, x11, [x0, #64 - 256]
-	ldp	x10, x11, [x1, #64]
-	stnp	x12, x13, [x0, #80 - 256]
-	ldp	x12, x13, [x1, #80]
-	stnp	x14, x15, [x0, #96 - 256]
-	ldp	x14, x15, [x1, #96]
-	stnp	x16, x17, [x0, #112 - 256]
-	ldp	x16, x17, [x1, #112]
+	.macro ldp1 reg1, reg2, ptr, val
+	ldp \reg1, \reg2, [\ptr, \val]
+	.endm
 
-	add	x0, x0, #128
-	add	x1, x1, #128
-
-	b.ne	1b
-
-	stnp	x2, x3, [x0, #-256]
-	stnp	x4, x5, [x0, #16 - 256]
-	stnp	x6, x7, [x0, #32 - 256]
-	stnp	x8, x9, [x0, #48 - 256]
-	stnp	x10, x11, [x0, #64 - 256]
-	stnp	x12, x13, [x0, #80 - 256]
-	stnp	x14, x15, [x0, #96 - 256]
-	stnp	x16, x17, [x0, #112 - 256]
+	.macro cpy1 dst, src, count
+	.arch_extension mops
+	cpypwn [\dst]!, [\src]!, \count!
+	cpymwn [\dst]!, [\src]!, \count!
+	cpyewn [\dst]!, [\src]!, \count!
+	.endm
 
+SYM_FUNC_START(__pi_copy_page)
+#include "copy_page_template.S"
 	ret
 SYM_FUNC_END(__pi_copy_page)
 SYM_FUNC_ALIAS(copy_page, __pi_copy_page)
diff --git a/arch/arm64/lib/copy_page_template.S b/arch/arm64/lib/copy_page_template.S
new file mode 100644
index 000000000000..e5afbeaaad25
--- /dev/null
+++ b/arch/arm64/lib/copy_page_template.S
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012 ARM Ltd.
+ */
+
+/*
+ * Copy a page from src to dest (both are page aligned)
+ *
+ * Parameters:
+ *	x0 - dest
+ *	x1 - src
+ */
+dstin	.req	x0
+src	.req	x1
+
+#ifdef CONFIG_AS_HAS_MOPS
+alternative_if_not ARM64_HAS_MOPS
+	b	.Lno_mops
+alternative_else_nop_endif
+	mov	x2, #PAGE_SIZE
+	cpy1	dstin, src, x2
+	b	.Lexitfunc
+.Lno_mops:
+#endif
+
+	ldp1	x2, x3, x1, #0
+	ldp1	x4, x5, x1, #16
+	ldp1	x6, x7, x1, #32
+	ldp1	x8, x9, x1, #48
+	ldp1	x10, x11, x1, #64
+	ldp1	x12, x13, x1, #80
+	ldp1	x14, x15, x1, #96
+	ldp1	x16, x17, x1, #112
+
+	add	x0, x0, #256
+	add	x1, x1, #128
+1:
+	tst	x0, #(PAGE_SIZE - 1)
+
+	stnp	x2, x3, [x0, #-256]
+	ldp1	x2, x3, x1, #0
+	stnp	x4, x5, [x0, #16 - 256]
+	ldp1	x4, x5, x1, #16
+	stnp	x6, x7, [x0, #32 - 256]
+	ldp1	x6, x7, x1, #32
+	stnp	x8, x9, [x0, #48 - 256]
+	ldp1	x8, x9, x1, #48
+	stnp	x10, x11, [x0, #64 - 256]
+	ldp1	x10, x11, x1, #64
+	stnp	x12, x13, [x0, #80 - 256]
+	ldp1	x12, x13, x1, #80
+	stnp	x14, x15, [x0, #96 - 256]
+	ldp1	x14, x15, x1, #96
+	stnp	x16, x17, [x0, #112 - 256]
+	ldp1	x16, x17, x1, #112
+
+	add	x0, x0, #128
+	add	x1, x1, #128
+
+	b.ne	1b
+
+	stnp	x2, x3, [x0, #-256]
+	stnp	x4, x5, [x0, #16 - 256]
+	stnp	x6, x7, [x0, #32 - 256]
+	stnp	x8, x9, [x0, #48 - 256]
+	stnp	x10, x11, [x0, #64 - 256]
+	stnp	x12, x13, [x0, #80 - 256]
+	stnp	x14, x15, [x0, #96 - 256]
+	stnp	x16, x17, [x0, #112 - 256]
+.Lexitfunc:
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..1afe3ef1502c 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -80,6 +80,35 @@ SYM_FUNC_START(mte_copy_page_tags)
 	ret
 SYM_FUNC_END(mte_copy_page_tags)
 
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/*
+ * Copy tags from the source page to the destination page, tolerating memory errors
+ *   x0 - address of the destination page
+ *   x1 - address of the source page
+ * Returns:
+ *   x0 - Return 0 if copy success, or
+ *        -EFAULT if anything goes wrong while copying.
+ */
+SYM_FUNC_START(mte_copy_mc_page_tags)
+	mov	x2, x0
+	mov	x3, x1
+	multitag_transfer_size x5, x6
+1:
+KERNEL_SEA(2f, ldgm	x4, [x3])
+	stgm	x4, [x2]
+	add	x2, x2, x5
+	add	x3, x3, x5
+	tst	x2, #(PAGE_SIZE - 1)
+	b.ne	1b
+
+	mov x0, #0
+	ret
+
+2:	mov x0, #-EFAULT
+	ret
+SYM_FUNC_END(mte_copy_mc_page_tags)
+#endif
+
 /*
  * Read tags from a user buffer (one tag per byte) and set the corresponding
  * tags at the given kernel address. Used by PTRACE_POKEMTETAGS.
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index cd5912ba617b..c22918ed0f3c 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -72,3 +72,83 @@ void copy_user_highpage(struct page *to, struct page *from,
 	flush_dcache_page(to);
 }
 EXPORT_SYMBOL_GPL(copy_user_highpage);
+
+#ifdef CONFIG_ARCH_HAS_COPY_MC
+/*
+ * Return -EFAULT if anything goes wrong while copying the page or its MTE tags.
+ */
+int copy_mc_highpage(struct page *to, struct page *from)
+{
+	void *kto = page_address(to);
+	void *kfrom = page_address(from);
+	struct folio *src = page_folio(from);
+	struct folio *dst = page_folio(to);
+	unsigned int i, nr_pages;
+	int ret;
+
+	ret = copy_mc_page(kto, kfrom);
+	if (ret)
+		return -EFAULT;
+
+	if (kasan_hw_tags_enabled())
+		page_kasan_tag_reset(to);
+
+	if (!system_supports_mte())
+		return 0;
+
+	if (folio_test_hugetlb(src)) {
+		if (!folio_test_hugetlb_mte_tagged(src) ||
+		    from != folio_page(src, 0))
+			return 0;
+
+		WARN_ON_ONCE(!folio_try_hugetlb_mte_tagging(dst));
+
+		/*
+		 * Populate tags for all subpages.
+		 *
+		 * Don't assume the first page is head page since
+		 * huge page copy may start from any subpage.
+		 */
+		nr_pages = folio_nr_pages(src);
+		for (i = 0; i < nr_pages; i++) {
+			kfrom = page_address(folio_page(src, i));
+			kto = page_address(folio_page(dst, i));
+			ret = mte_copy_mc_page_tags(kto, kfrom);
+			if (ret)
+				return -EFAULT;
+		}
+		folio_set_hugetlb_mte_tagged(dst);
+	} else if (page_mte_tagged(from)) {
+		/* It's a new page, shouldn't have been tagged yet */
+		WARN_ON_ONCE(!try_page_mte_tagging(to));
+
+		ret = mte_copy_mc_page_tags(kto, kfrom);
+		if (ret)
+			return -EFAULT;
+		set_page_mte_tagged(to);
+	}
+	/*
+	 * memory_failure_queue() is not called here because on arm64
+	 * the firmware (GHES) has already reported the hardware memory
+	 * error and queued the page for memory_failure() handling via
+	 * ghes_do_memory_failure().
+	 */
+	return 0;
+}
+EXPORT_SYMBOL(copy_mc_highpage);
+
+int copy_mc_user_highpage(struct page *to, struct page *from,
+			unsigned long vaddr, struct vm_area_struct *vma)
+{
+	int ret;
+
+	ret = copy_mc_highpage(to, from);
+	if (ret)
+		return ret;
+
+	flush_dcache_page(to);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(copy_mc_user_highpage);
+#endif
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 18dc4aca4aa1..f168c9d4ad0e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -424,6 +424,7 @@ static inline void copy_highpage(struct page *to, struct page *from)
 #endif
 
 #ifdef copy_mc_to_kernel
+#ifndef __HAVE_ARCH_COPY_MC_USER_HIGHPAGE
 /*
  * If architecture supports machine check exception handling, define the
  * #MC versions of copy_user_highpage and copy_highpage. They copy a memory
@@ -449,7 +450,9 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 
 	return ret ? -EFAULT : 0;
 }
+#endif
 
+#ifndef __HAVE_ARCH_COPY_MC_HIGHPAGE
 static inline int copy_mc_highpage(struct page *to, struct page *from)
 {
 	unsigned long ret;
@@ -468,20 +471,25 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 
 	return ret ? -EFAULT : 0;
 }
+#endif
 #else
+#ifndef __HAVE_ARCH_COPY_MC_USER_HIGHPAGE
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 					unsigned long vaddr, struct vm_area_struct *vma)
 {
 	copy_user_highpage(to, from, vaddr, vma);
 	return 0;
 }
+#endif
 
+#ifndef __HAVE_ARCH_COPY_MC_HIGHPAGE
 static inline int copy_mc_highpage(struct page *to, struct page *from)
 {
 	copy_highpage(to, from);
 	return 0;
 }
 #endif
+#endif
 
 static inline void memcpy_page(struct page *dst_page, size_t dst_off,
 			       struct page *src_page, size_t src_off,
-- 
2.39.3



^ permalink raw reply related

* [PATCH v15 5/9] mm/hwpoison: return -EFAULT when copy fail in copy_mc_[user]_highpage()
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong, Jonathan Cameron, Mauro Carvalho Chehab
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

From: Tong Tiangen <tongtiangen@huawei.com>

Currently, copy_mc_[user]_highpage() returns zero on success or, on
failure, the number of bytes that were not copied.

While tracking the number of bytes not copied works fine for x86 and PPC,
doing the same on ARM64 is difficult because there is no caller-saved
register available in copy_page() (lib/copy_page.S) to hold "bytes not
copied", and the copy_mc_page() that follows will hit the same problem.

Callers of copy_mc_[user]_highpage() cannot do anything with the
remaining data (the page has hardware errors); they only check whether
the copy succeeded. Make the interface more generic by returning an
error code (-EFAULT) when the copy fails and zero on success.
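
The change in return convention can be sketched in user space as below.
This is only an illustration: every model_* name and MODEL_PAGE_SIZE is
hypothetical, and the "poison offset" argument stands in for a real
hardware error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/*
 * Hypothetical stand-in for an arch copy routine that reports the number
 * of bytes not copied. poison_at >= len means the source has no poison.
 */
static unsigned long model_copy_mc_to_kernel(size_t len, size_t poison_at)
{
	assert(len > 0);
	return poison_at < len ? (unsigned long)(len - poison_at) : 0;
}

/* New convention: collapse "bytes not copied" into 0 or -EFAULT. */
static int model_copy_mc_highpage(size_t poison_at)
{
	unsigned long ret = model_copy_mc_to_kernel(MODEL_PAGE_SIZE, poison_at);

	return ret ? -EFAULT : 0;
}

/* Caller side: test for non-zero rather than "> 0". */
static int model_caller_failed(size_t poison_at)
{
	return model_copy_mc_highpage(poison_at) ? 1 : 0;
}
```

With a negative error code, a caller that still compared the result
against "> 0" would silently treat a failed copy as success, which is why
the khugepaged call sites are switched to a plain non-zero test.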

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 include/linux/highmem.h | 8 ++++----
 mm/khugepaged.c         | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..18dc4aca4aa1 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -427,8 +427,8 @@ static inline void copy_highpage(struct page *to, struct page *from)
 /*
  * If architecture supports machine check exception handling, define the
  * #MC versions of copy_user_highpage and copy_highpage. They copy a memory
- * page with #MC in source page (@from) handled, and return the number
- * of bytes not copied if there was a #MC, otherwise 0 for success.
+ * page with #MC in source page (@from) handled, and return -EFAULT if there
+ * was a #MC, otherwise 0 for success.
  */
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 					unsigned long vaddr, struct vm_area_struct *vma)
@@ -447,7 +447,7 @@ static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 	if (ret)
 		memory_failure_queue(page_to_pfn(from), 0);
 
-	return ret;
+	return ret ? -EFAULT : 0;
 }
 
 static inline int copy_mc_highpage(struct page *to, struct page *from)
@@ -466,7 +466,7 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 	if (ret)
 		memory_failure_queue(page_to_pfn(from), 0);
 
-	return ret;
+	return ret ? -EFAULT : 0;
 }
 #else
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..cf1b78eed3c3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -810,7 +810,7 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
 			continue;
 		}
 		src_page = pte_page(pteval);
-		if (copy_mc_user_highpage(page, src_page, src_addr, vma) > 0) {
+		if (copy_mc_user_highpage(page, src_page, src_addr, vma)) {
 			result = SCAN_COPY_MC;
 			break;
 		}
@@ -2143,7 +2143,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		for (i = 0; i < nr_pages; i++) {
-			if (copy_mc_highpage(dst, folio_page(folio, i)) > 0) {
+			if (copy_mc_highpage(dst, folio_page(folio, i))) {
 				result = SCAN_COPY_MC;
 				goto rollback;
 			}
-- 
2.39.3



^ permalink raw reply related

* [PATCH v15 4/9] arm64: enable recover from synchronous external abort in kernel context
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

When the arm64 kernel handles a hardware memory error reported through a
synchronous notification (do_sea()) and the error was consumed within the
kernel, it currently panics. This is not optimal.

Take copy_from/to_user as an example: if an ld* instruction triggers a
memory error, even in kernel mode, only the associated process is
affected. Killing the user process and isolating the corrupt page is a
better choice.

Add a new fixup type, EX_TYPE_KACCESS_SEA, to identify instructions that
can recover from memory errors triggered by accesses to kernel memory.
This fixup type is used in __arch_copy_to_user(), so that the regular
copy_to_user() handles kernel memory errors.
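
The split between the ordinary fault path and the SEA path can be
modelled roughly as follows. This is a user-space sketch, not the kernel
code: the table contents, addresses and all model_* names are made up,
and the register updates done by the ACCESS_ERR_ZERO handler are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_ex_type {
	MODEL_EX_ACCESS_ERR_ZERO = 2,
	MODEL_EX_KACCESS_SEA = 5,
};

struct model_ex_entry {
	unsigned long insn;	/* address of the covered instruction */
	unsigned long fixup;	/* address to resume at */
	int type;
};

struct model_pt_regs {
	unsigned long pc;
};

/* Hypothetical exception table with one entry of each type. */
static const struct model_ex_entry model_table[] = {
	{ 0x1000, 0x1100, MODEL_EX_KACCESS_SEA },
	{ 0x2000, 0x2100, MODEL_EX_ACCESS_ERR_ZERO },
};

static const struct model_ex_entry *model_search(unsigned long pc)
{
	for (size_t i = 0; i < sizeof(model_table) / sizeof(model_table[0]); i++)
		if (model_table[i].insn == pc)
			return &model_table[i];
	return NULL;
}

/* Ordinary fault path: a KACCESS_SEA entry is deliberately not fixed up. */
static bool model_fixup_exception(struct model_pt_regs *regs)
{
	const struct model_ex_entry *ex = model_search(regs->pc);

	assert(regs != NULL);
	if (!ex || ex->type == MODEL_EX_KACCESS_SEA)
		return false;
	regs->pc = ex->fixup;
	return true;
}

/* SEA path: both entry types are allowed to resume at their fixup. */
static bool model_fixup_exception_me(struct model_pt_regs *regs)
{
	const struct model_ex_entry *ex = model_search(regs->pc);

	if (!ex)
		return false;
	regs->pc = ex->fixup;
	return true;
}
```

The point of the separate type is that a translation fault on the kernel
source buffer still oopses, while a hardware memory error on the same
instruction is recoverable.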

[Ruidong: modify subject and rename EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR to
EX_TYPE_KACCESS_SEA]

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 arch/arm64/include/asm/asm-extable.h |  5 +++++
 arch/arm64/include/asm/asm-uaccess.h |  4 ++++
 arch/arm64/include/asm/extable.h     |  1 +
 arch/arm64/lib/copy_to_user.S        | 10 +++++-----
 arch/arm64/mm/extable.c              | 28 ++++++++++++++++++++++++++
 arch/arm64/mm/fault.c                | 30 ++++++++++++++++++++--------
 6 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/asm-extable.h b/arch/arm64/include/asm/asm-extable.h
index 06b19023939b..8450ec5a3af6 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -10,6 +10,7 @@
 #define EX_TYPE_ACCESS_ERR_ZERO		2
 #define EX_TYPE_UACCESS_CPY		3
 #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD	4
+#define EX_TYPE_KACCESS_SEA		5
 
 /* Data fields for EX_TYPE_ACCESS_ERR_ZERO */
 #define EX_DATA_REG_ERR_SHIFT	0
@@ -76,6 +77,10 @@
 	__ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_CPY, \uaccess_is_write)
 	.endm
 
+	.macro          _asm_extable_kaccess_sea, insn, fixup
+	__ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_KACCESS_SEA, 0)
+	.endm
+
 #else /* __ASSEMBLER__ */
 
 #include <linux/stringify.h>
diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
index 12aa6a283249..27bf8edbf597 100644
--- a/arch/arm64/include/asm/asm-uaccess.h
+++ b/arch/arm64/include/asm/asm-uaccess.h
@@ -57,6 +57,10 @@ alternative_else_nop_endif
 	.endm
 #endif
 
+#define KERNEL_SEA(l, x...)			\
+9999:	x;					\
+	_asm_extable_kaccess_sea	9999b, l
+
 #define USER(l, x...)				\
 9999:	x;					\
 	_asm_extable_uaccess	9999b, l
diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index 9dc39612bdf5..47c851d7df4f 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -48,4 +48,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
 #endif /* !CONFIG_BPF_JIT */
 
 bool fixup_exception(struct pt_regs *regs, unsigned long esr);
+bool fixup_exception_me(struct pt_regs *regs);
 #endif
diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
index 819f2e3fc7a9..6103f5b0a2d0 100644
--- a/arch/arm64/lib/copy_to_user.S
+++ b/arch/arm64/lib/copy_to_user.S
@@ -20,7 +20,7 @@
  *	x0 - bytes not copied
  */
 	.macro ldrb1 reg, ptr, val
-	ldrb  \reg, [\ptr], \val
+	KERNEL_SEA(9998f, ldrb  \reg, [\ptr], \val)
 	.endm
 
 	.macro strb1 reg, ptr, val
@@ -28,7 +28,7 @@
 	.endm
 
 	.macro ldrh1 reg, ptr, val
-	ldrh  \reg, [\ptr], \val
+	KERNEL_SEA(9998f, ldrh  \reg, [\ptr], \val)
 	.endm
 
 	.macro strh1 reg, ptr, val
@@ -36,7 +36,7 @@
 	.endm
 
 	.macro ldr1 reg, ptr, val
-	ldr \reg, [\ptr], \val
+	KERNEL_SEA(9998f, ldr \reg, [\ptr], \val)
 	.endm
 
 	.macro str1 reg, ptr, val
@@ -44,7 +44,7 @@
 	.endm
 
 	.macro ldp1 reg1, reg2, ptr, val
-	ldp \reg1, \reg2, [\ptr], \val
+	KERNEL_SEA(9998f, ldp \reg1, \reg2, [\ptr], \val)
 	.endm
 
 	.macro stp1 reg1, reg2, ptr, val
@@ -74,7 +74,7 @@ SYM_FUNC_START(__arch_copy_to_user)
 9997:	cmp	dst, dstin
 	b.ne	9998f
 	// Before being absolutely sure we couldn't copy anything, try harder
-	ldrb	tmp1w, [srcin]
+KERNEL_SEA(9998f, ldrb	tmp1w, [srcin])
 USER(9998f, sttrb tmp1w, [dst])
 	add	dst, dst, #1
 9998:	sub	x0, end, dst			// bytes not copied
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 76b18780f1f9..20a7a9eeed94 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -109,7 +109,35 @@ bool fixup_exception(struct pt_regs *regs, unsigned long esr)
 		return ex_handler_uaccess_cpy(ex, regs, esr);
 	case EX_TYPE_LOAD_UNALIGNED_ZEROPAD:
 		return ex_handler_load_unaligned_zeropad(ex, regs);
+	/*
+	 * Kernel address faults (e.g. copy_to_user reading from kernel src).
+	 * Do not fixup here: a translation fault on a kernel address is a
+	 * kernel bug (e.g. NULL pointer dereference) and must oops.
+	 * Only SEA (hardware memory error) should be fixed up, which is
+	 * handled by fixup_exception_me() through the do_sea path.
+	 */
+	case EX_TYPE_KACCESS_SEA:
+		return false;
 	}
 
 	BUG();
 }
+
+bool fixup_exception_me(struct pt_regs *regs)
+{
+	const struct exception_table_entry *ex;
+
+	ex = search_exception_tables(instruction_pointer(regs));
+	if (!ex)
+		return false;
+
+	switch (ex->type) {
+	case EX_TYPE_ACCESS_ERR_ZERO:
+		return ex_handler_access_err_zero(ex, regs);
+	case EX_TYPE_KACCESS_SEA:
+		regs->pc = get_ex_fixup(ex);
+		return true;
+	}
+
+	return false;
+}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 0f3c5c7ca054..b775c0928a53 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -858,21 +858,35 @@ static int do_bad(unsigned long far, unsigned long esr, struct pt_regs *regs)
 	return 1; /* "fault" */
 }
 
+/*
+ * APEI claimed this as a firmware-first notification.
+ * Some processing deferred to task_work before ret_to_user().
+ */
+static int do_apei_claim_sea(struct pt_regs *regs)
+{
+	int ret;
+
+	ret = apei_claim_sea(regs);
+	if (ret)
+		return ret;
+
+	if (!user_mode(regs)) {
+		if (!fixup_exception_me(regs))
+			return -ENOENT;
+	}
+
+	return ret;
+}
+
 static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
 {
 	const struct fault_info *inf;
 	unsigned long siaddr;
 
-	inf = esr_to_fault_info(esr);
-
-	if (user_mode(regs) && apei_claim_sea(regs) == 0) {
-		/*
-		 * APEI claimed this as a firmware-first notification.
-		 * Some processing deferred to task_work before ret_to_user().
-		 */
+	if (do_apei_claim_sea(regs) == 0)
 		return 0;
-	}
 
+	inf = esr_to_fault_info(esr);
 	if (esr & ESR_ELx_FnV) {
 		siaddr = 0;
 	} else {
-- 
2.39.3



^ permalink raw reply related

* [PATCH v15 3/9] arm64: extable: merge UACCESS_ERR_ZERO and KACCESS_ERR_ZERO into ACCESS_ERR_ZERO
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

EX_TYPE_UACCESS_ERR_ZERO and EX_TYPE_KACCESS_ERR_ZERO have identical
handling in fixup_exception(): both unconditionally invoke
ex_handler_uaccess_err_zero() to set the error register to -EFAULT,
zero the destination register, and branch to the fixup address.

Merge them into a single EX_TYPE_ACCESS_ERR_ZERO to reduce redundancy
and renumber the subsequent types accordingly.

The _ASM_EXTABLE_UACCESS_ERR_ZERO and _ASM_EXTABLE_KACCESS_ERR_ZERO
helper macros are preserved as-is for caller readability, but both now
emit the unified EX_TYPE_ACCESS_ERR_ZERO type.
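
For readers unfamiliar with the data field shared by both macros, the
sketch below models how the error and zero register numbers are packed
and consumed. It is a user-space approximation: the model_* names are
hypothetical, the ERR field position (bits 4:0) and the ZERO shift (5)
come from the defines in this patch, while the ZERO field width and the
treatment of register 31 as a discarded write are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_REG_ERR_SHIFT	0
#define MODEL_REG_ERR_MASK	0x1fu		/* bits 4:0 */
#define MODEL_REG_ZERO_SHIFT	5
#define MODEL_REG_ZERO_MASK	(0x1fu << 5)	/* assumed bits 9:5 */

struct model_regs {
	int64_t x[31];
	uint64_t pc;
};

/* Pack the two register numbers the way the extable data field does. */
static uint16_t model_pack(unsigned int err, unsigned int zero)
{
	assert(err < 32 && zero < 32);
	return (uint16_t)((err << MODEL_REG_ERR_SHIFT) |
			  (zero << MODEL_REG_ZERO_SHIFT));
}

/* Set the error register to -EFAULT, clear the zero register, resume. */
static bool model_access_err_zero(uint16_t data, uint64_t fixup,
				  struct model_regs *regs)
{
	unsigned int err = (data & MODEL_REG_ERR_MASK) >> MODEL_REG_ERR_SHIFT;
	unsigned int zero = (data & MODEL_REG_ZERO_MASK) >> MODEL_REG_ZERO_SHIFT;

	if (err < 31)		/* 31 models the zero register: write dropped */
		regs->x[err] = -EFAULT;
	if (zero < 31)
		regs->x[zero] = 0;
	regs->pc = fixup;
	return true;
}
```

Nothing in this handling depends on whether the faulting access was to
user or kernel memory, which is what makes a single type sufficient.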

Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
---
 arch/arm64/include/asm/asm-extable.h | 15 +++++++--------
 arch/arm64/mm/extable.c              |  7 +++----
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/asm-extable.h b/arch/arm64/include/asm/asm-extable.h
index d67e2fdd1aee..06b19023939b 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -7,12 +7,11 @@
 
 #define EX_TYPE_NONE			0
 #define EX_TYPE_BPF			1
-#define EX_TYPE_UACCESS_ERR_ZERO	2
-#define EX_TYPE_KACCESS_ERR_ZERO	3
-#define EX_TYPE_UACCESS_CPY		4
-#define EX_TYPE_LOAD_UNALIGNED_ZEROPAD	5
+#define EX_TYPE_ACCESS_ERR_ZERO		2
+#define EX_TYPE_UACCESS_CPY		3
+#define EX_TYPE_LOAD_UNALIGNED_ZEROPAD	4
 
-/* Data fields for EX_TYPE_UACCESS_ERR_ZERO */
+/* Data fields for EX_TYPE_ACCESS_ERR_ZERO */
 #define EX_DATA_REG_ERR_SHIFT	0
 #define EX_DATA_REG_ERR		GENMASK(4, 0)
 #define EX_DATA_REG_ZERO_SHIFT	5
@@ -43,7 +42,7 @@
 
 #define _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, zero)		\
 	__ASM_EXTABLE_RAW(insn, fixup, 					\
-			  EX_TYPE_UACCESS_ERR_ZERO,			\
+			  EX_TYPE_ACCESS_ERR_ZERO,			\
 			  (						\
 			    EX_DATA_REG(ERR, err) |			\
 			    EX_DATA_REG(ZERO, zero)			\
@@ -96,7 +95,7 @@
 #define _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, zero)		\
 	__DEFINE_ASM_GPR_NUMS						\
 	__ASM_EXTABLE_RAW(#insn, #fixup, 				\
-			  __stringify(EX_TYPE_UACCESS_ERR_ZERO),	\
+			  __stringify(EX_TYPE_ACCESS_ERR_ZERO),	\
 			  "("						\
 			    EX_DATA_REG(ERR, err) " | "			\
 			    EX_DATA_REG(ZERO, zero)			\
@@ -105,7 +104,7 @@
 #define _ASM_EXTABLE_KACCESS_ERR_ZERO(insn, fixup, err, zero)		\
 	__DEFINE_ASM_GPR_NUMS						\
 	__ASM_EXTABLE_RAW(#insn, #fixup, 				\
-			  __stringify(EX_TYPE_KACCESS_ERR_ZERO),	\
+			  __stringify(EX_TYPE_ACCESS_ERR_ZERO),	\
 			  "("						\
 			    EX_DATA_REG(ERR, err) " | "			\
 			    EX_DATA_REG(ZERO, zero)			\
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 6e0528831cd3..76b18780f1f9 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -41,7 +41,7 @@ get_ex_fixup(const struct exception_table_entry *ex)
 	return ((unsigned long)&ex->fixup + ex->fixup);
 }
 
-static bool ex_handler_uaccess_err_zero(const struct exception_table_entry *ex,
+static bool ex_handler_access_err_zero(const struct exception_table_entry *ex,
 					struct pt_regs *regs)
 {
 	int reg_err = FIELD_GET(EX_DATA_REG_ERR, ex->data);
@@ -103,9 +103,8 @@ bool fixup_exception(struct pt_regs *regs, unsigned long esr)
 	switch (ex->type) {
 	case EX_TYPE_BPF:
 		return ex_handler_bpf(ex, regs);
-	case EX_TYPE_UACCESS_ERR_ZERO:
-	case EX_TYPE_KACCESS_ERR_ZERO:
-		return ex_handler_uaccess_err_zero(ex, regs);
+	case EX_TYPE_ACCESS_ERR_ZERO:
+		return ex_handler_access_err_zero(ex, regs);
 	case EX_TYPE_UACCESS_CPY:
 		return ex_handler_uaccess_cpy(ex, regs, esr);
 	case EX_TYPE_LOAD_UNALIGNED_ZEROPAD:
-- 
2.39.3



^ permalink raw reply related

* [PATCH v15 2/9] ACPI: APEI: GHES: use exception context to gate SIGBUS on poison consumption
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

When a GHES SEA (Synchronous External Abort) fires while the CPU
was executing in kernel mode, it typically means that kernel code
itself consumed a poisoned memory location -- e.g. copy_from_user()
/ copy_to_user() invoked from an ioctl() or write() syscall touched
a poisoned user page or page-cache page on behalf of the task.

The expected behaviour in that case is that the faulting kernel
helper returns via its extable fixup and the syscall returns an
error (e.g. -EFAULT) to user space. It is NOT appropriate to deliver
SIGBUS to the current task: the task did not directly dereference
the poisoned address, the kernel did on its behalf, and the kernel
is able to recover.

Up to now ghes_handle_memory_failure() unconditionally promoted any
synchronous recoverable memory error to MF_ACTION_REQUIRED, which
ends up delivering SIGBUS to current -- regardless of whether the
poison was consumed from user space or from inside the kernel on the
task's behalf. That kills tasks that should instead have seen a plain
syscall error.

To fix this, the execution mode in which the exception was taken
must be captured at the arch-level entry point, where pt_regs (and
hence user_mode(regs)) are still available. The estatus node that
later drains the error in IRQ / process context no longer has
access to the original regs.

Introduce:

    enum ghes_exec_ctx { ... };

and plumb the value all the way down to the queued estatus node:

 * Add an 'enum ghes_exec_ctx context' field to struct ghes_estatus_node
   and record it in ghes_in_nmi_queue_one_entry().
 * Extend ghes_notify_sea() and the internal
   ghes_in_nmi_spool_from_list() with an enum ghes_exec_ctx parameter.

Then consume the recorded context in ghes_handle_memory_failure()
for the GHES_SEV_RECOVERABLE / sync path:

    flags = sync && context == GHES_CTX_USER ? MF_ACTION_REQUIRED : 0;

i.e. MF_ACTION_REQUIRED (and thus SIGBUS via the task_work path) is
only raised for user-mode poison consumption. Synchronous errors
taken in kernel mode fall back to memory_failure_queue() with
flags=0, asynchronously isolating the poisoned page while letting
the faulting kernel helper's extable fixup return -EFAULT
to user space.

Paths that pass GHES_CTX_NA are unaffected:
sync is false for them, so flags stays 0 as before.
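
The gating rule can be summarised by the following user-space sketch.
The enum values mirror the ones added in this patch; the model_* names
and the numeric value of MODEL_MF_ACTION_REQUIRED are placeholders, not
the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

enum model_exec_ctx {
	MODEL_CTX_NA = -1,
	MODEL_CTX_KERNEL = 0,
	MODEL_CTX_USER = 1,
};

#define MODEL_MF_ACTION_REQUIRED	0x1	/* placeholder flag value */

/* Classify the exception context; have_regs == false models a NULL regs. */
static enum model_exec_ctx model_ctx(bool have_regs, bool user_mode)
{
	if (!have_regs)
		return MODEL_CTX_NA;
	return user_mode ? MODEL_CTX_USER : MODEL_CTX_KERNEL;
}

/* Only synchronous errors consumed in user mode demand immediate action. */
static int model_mf_flags(bool sync, enum model_exec_ctx ctx)
{
	assert(ctx >= MODEL_CTX_NA && ctx <= MODEL_CTX_USER);
	return (sync && ctx == MODEL_CTX_USER) ? MODEL_MF_ACTION_REQUIRED : 0;
}
```

A kernel-mode consumer therefore gets flags of 0: the page is still
queued for isolation, but no SIGBUS is forced on the current task.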

Signed-off-by: Ruidong Tian  <tianruidong@linux.alibaba.com>
---
 arch/arm64/kernel/acpi.c |  2 +-
 drivers/acpi/apei/ghes.c | 36 ++++++++++++++++++++----------------
 include/acpi/ghes.h      | 15 +++++++++++++--
 3 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c
index 5891f92c2035..fa74f32c6e8c 100644
--- a/arch/arm64/kernel/acpi.c
+++ b/arch/arm64/kernel/acpi.c
@@ -409,7 +409,7 @@ int apei_claim_sea(struct pt_regs *regs)
 	 */
 	local_daif_restore(DAIF_ERRCTX);
 	nmi_enter();
-	err = ghes_notify_sea();
+	err = ghes_notify_sea(GHES_CTX(regs));
 	nmi_exit();
 
 	/*
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3236a3ce79d6..2c39adfb584a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -529,7 +529,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 }
 
 static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
-				       int sev, bool sync)
+				       int sev, bool sync, enum ghes_exec_ctx context)
 {
 	int flags = -1;
 	int sec_sev = ghes_severity(gdata->error_severity);
@@ -543,7 +543,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
 		flags = MF_SOFT_OFFLINE;
 	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = sync ? MF_ACTION_REQUIRED : 0;
+		flags = sync && context == GHES_CTX_USER ? MF_ACTION_REQUIRED : 0;
 
 	if (flags != -1)
 		return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -552,10 +552,10 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
 }
 
 static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
-				     int sev, bool sync)
+				     int sev, bool sync, enum ghes_exec_ctx context)
 {
 	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-	int flags = sync ? MF_ACTION_REQUIRED : 0;
+	int flags = sync && context == GHES_CTX_USER ? MF_ACTION_REQUIRED : 0;
 	int length = gdata->error_data_length;
 	char error_type[120];
 	bool queued = false;
@@ -910,7 +910,8 @@ static void ghes_log_hwerr(int sev, guid_t *sec_type)
 }
 
 static void ghes_do_proc(struct ghes *ghes,
-			 const struct acpi_hest_generic_status *estatus)
+			 const struct acpi_hest_generic_status *estatus,
+			 enum ghes_exec_ctx context)
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
@@ -937,11 +938,11 @@ static void ghes_do_proc(struct ghes *ghes,
 			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
 
 			arch_apei_report_mem_error(sev, mem_err);
-			queued = ghes_handle_memory_failure(gdata, sev, sync);
+			queued = ghes_handle_memory_failure(gdata, sev, sync, context);
 		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			ghes_handle_aer(gdata);
 		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
+			queued = ghes_handle_arm_hw_error(gdata, sev, sync, context);
 		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
 			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
 
@@ -1190,7 +1191,7 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, estatus))
 			ghes_estatus_cache_add(ghes->generic, estatus);
 	}
-	ghes_do_proc(ghes, estatus);
+	ghes_do_proc(ghes, estatus, GHES_CTX_NA);
 
 out:
 	ghes_clear_estatus(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
@@ -1297,7 +1298,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 		len = cper_estatus_len(estatus);
 		node_len = GHES_ESTATUS_NODE_LEN(len);
 
-		ghes_do_proc(estatus_node->ghes, estatus);
+		ghes_do_proc(estatus_node->ghes, estatus, estatus_node->context);
 
 		if (!ghes_estatus_cached(estatus)) {
 			generic = estatus_node->generic;
@@ -1335,7 +1336,8 @@ static void ghes_print_queued_estatus(void)
 }
 
 static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
-				       enum fixed_addresses fixmap_idx)
+				       enum fixed_addresses fixmap_idx,
+				       enum ghes_exec_ctx context)
 {
 	struct acpi_hest_generic_status *estatus, tmp_header;
 	struct ghes_estatus_node *estatus_node;
@@ -1364,6 +1366,7 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 	if (!estatus_node)
 		return -ENOMEM;
 
+	estatus_node->context = context;
 	estatus_node->ghes = ghes;
 	estatus_node->generic = ghes->generic;
 	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
@@ -1398,14 +1401,15 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 }
 
 static int ghes_in_nmi_spool_from_list(struct list_head *rcu_list,
-				       enum fixed_addresses fixmap_idx)
+				       enum fixed_addresses fixmap_idx,
+				       enum ghes_exec_ctx context)
 {
 	int ret = -ENOENT;
 	struct ghes *ghes;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(ghes, rcu_list, list) {
-		if (!ghes_in_nmi_queue_one_entry(ghes, fixmap_idx))
+		if (!ghes_in_nmi_queue_one_entry(ghes, fixmap_idx, context))
 			ret = 0;
 	}
 	rcu_read_unlock();
@@ -1488,7 +1492,7 @@ static LIST_HEAD(ghes_sea);
  * Return 0 only if one of the SEA error sources successfully reported an error
  * record sent from the firmware.
  */
-int ghes_notify_sea(void)
+int ghes_notify_sea(enum ghes_exec_ctx context)
 {
 	static DEFINE_RAW_SPINLOCK(ghes_notify_lock_sea);
 	int rv;
@@ -1497,7 +1501,7 @@ int ghes_notify_sea(void)
 		return -ENOENT;
 
 	raw_spin_lock(&ghes_notify_lock_sea);
-	rv = ghes_in_nmi_spool_from_list(&ghes_sea, FIX_APEI_GHES_SEA);
+	rv = ghes_in_nmi_spool_from_list(&ghes_sea, FIX_APEI_GHES_SEA, context);
 	raw_spin_unlock(&ghes_notify_lock_sea);
 
 	return rv;
@@ -1552,7 +1556,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 		return ret;
 
 	raw_spin_lock(&ghes_notify_lock_nmi);
-	if (!ghes_in_nmi_spool_from_list(&ghes_nmi, FIX_APEI_GHES_NMI))
+	if (!ghes_in_nmi_spool_from_list(&ghes_nmi, FIX_APEI_GHES_NMI, GHES_CTX_NA))
 		ret = NMI_HANDLED;
 	raw_spin_unlock(&ghes_notify_lock_nmi);
 
@@ -1606,7 +1610,7 @@ static void ghes_nmi_init_cxt(void)
 static int __ghes_sdei_callback(struct ghes *ghes,
 				enum fixed_addresses fixmap_idx)
 {
-	if (!ghes_in_nmi_queue_one_entry(ghes, fixmap_idx)) {
+	if (!ghes_in_nmi_queue_one_entry(ghes, fixmap_idx, GHES_CTX_NA)) {
 		irq_work_queue(&ghes_proc_irq_work);
 
 		return 0;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8d7e5caef3f1..8460707ea4b0 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -33,10 +33,21 @@ struct ghes {
 	void __iomem *error_status_vaddr;
 };
 
+enum ghes_exec_ctx {
+	GHES_CTX_NA = -1,
+	GHES_CTX_KERNEL = 0,
+	GHES_CTX_USER = 1
+};
+
+#define GHES_CTX(regs)	((regs) ? (user_mode(regs) ? GHES_CTX_USER \
+						   : GHES_CTX_KERNEL) \
+				: GHES_CTX_NA)
+
 struct ghes_estatus_node {
 	struct llist_node llnode;
 	struct acpi_hest_generic *generic;
 	struct ghes *ghes;
+	enum ghes_exec_ctx context;
 };
 
 struct ghes_estatus_cache {
@@ -135,9 +146,9 @@ static inline void *acpi_hest_get_next(struct acpi_hest_generic_data *gdata)
 	     section = acpi_hest_get_next(section))
 
 #ifdef CONFIG_ACPI_APEI_SEA
-int ghes_notify_sea(void);
+int ghes_notify_sea(enum ghes_exec_ctx context);
 #else
-static inline int ghes_notify_sea(void) { return -ENOENT; }
+static inline int ghes_notify_sea(enum ghes_exec_ctx context) { return -ENOENT; }
 #endif
 
 struct notifier_block;
-- 
2.39.3



^ permalink raw reply related

* [PATCH v15 1/9] uaccess: add generic fallback version of copy_mc_to_user()
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong, Mauro Carvalho Chehab, Jonathan Cameron
In-Reply-To: <20260618092124.3901230-1-tianruidong@linux.alibaba.com>

From: Tong Tiangen <tongtiangen@huawei.com>

x86 and powerpc each have their own implementation of copy_mc_to_user().
Add a generic fallback in include/linux/uaccess.h to prepare for other
architectures to enable CONFIG_ARCH_HAS_COPY_MC.
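
The override pattern used here (an architecture defines a macro with the
function's own name, and the generic header only supplies a fallback when
that macro is absent) looks roughly like this in isolation. The model_*
functions are hypothetical user-space stand-ins, with memcpy() playing
the role of the real user copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for copy_to_user(): returns bytes not copied. */
static unsigned long model_copy_to_user(void *dst, const void *src,
					unsigned long cnt)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, cnt);
	return 0;
}

/*
 * An architecture with its own implementation would declare it earlier and
 * add "#define model_copy_mc_to_user model_copy_mc_to_user", which makes
 * the #ifndef below skip the generic fallback.
 */
#ifndef model_copy_mc_to_user
static inline unsigned long
model_copy_mc_to_user(void *dst, const void *src, unsigned long cnt)
{
	return model_copy_to_user(dst, src, cnt);
}
#endif
```

Because the fallback simply forwards to the regular copy, architectures
without machine-check-aware copies keep their existing behaviour.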

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/powerpc/include/asm/uaccess.h | 1 +
 arch/x86/include/asm/uaccess.h     | 1 +
 include/linux/uaccess.h            | 8 ++++++++
 3 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index e98c628e3899..073de098d45a 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -432,6 +432,7 @@ copy_mc_to_user(void __user *to, const void *from, unsigned long n)
 
 	return n;
 }
+#define copy_mc_to_user copy_mc_to_user
 #endif
 
 extern size_t copy_from_user_flushcache(void *dst, const void __user *src, size_t size);
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 3a0dd3c2b233..308b0854d1d5 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -496,6 +496,7 @@ copy_mc_to_kernel(void *to, const void *from, unsigned len);
 
 unsigned long __must_check
 copy_mc_to_user(void __user *to, const void *from, unsigned len);
+#define copy_mc_to_user copy_mc_to_user
 #endif
 
 /*
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 56328601218c..13b4a3a15437 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -250,6 +250,14 @@ copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
 }
 #endif
 
+#ifndef copy_mc_to_user
+static inline unsigned long __must_check
+copy_mc_to_user(void __user *dst, const void *src, unsigned long cnt)
+{
+	return copy_to_user(dst, src, cnt);
+}
+#endif
+
 static __always_inline void pagefault_disabled_inc(void)
 {
 	current->pagefault_disabled++;
-- 
2.39.3



^ permalink raw reply related

* [PATCH v15 0/8] arm64: add ARCH_HAS_COPY_MC support
From: Ruidong Tian @ 2026-06-18  9:21 UTC (permalink / raw)
  To: catalin.marinas, will, rafael, tony.luck, guohanjun, mchehab,
	xueshuai, tongtiangen, james.morse, robin.murphy, andreyknvl,
	dvyukov, vincenzo.frascino, mpe, npiggin, ryabinin.a.a, glider,
	christophe.leroy, aneesh.kumar, naveen.n.rao, tglx, mingo
  Cc: linux-arm-kernel, linux-mm, linuxppc-dev, linux-kernel, kasan-dev,
	tianruidong

This series continues Tong Tiangen's work on arm64 ARCH_HAS_COPY_MC
support. We have encountered the same problem, and looking ahead,
large-memory ARM machines will suffer more from this class of issues,
which motivates us to push this feature upstream.

Problem
=========
As memory capacity and density increase, so does the probability of memory
errors. Growing server RAM sizes and densities in data centers and clouds
have led to more uncorrectable memory errors.

A growing number of kernel paths can now tolerate memory errors, such as
COW[1,2,8,9], KSM copy[3], coredump copy[4], khugepaged[5,6], uaccess copy[7],
and page migration[10,11].

Solution
=========

This patchset introduces a new processing framework on ARM64, which enables
ARM64 to support error recovery in the above scenarios, and more scenarios
can be expanded based on this in the future.

On arm64, memory errors are handled in do_sea(), which distinguishes two cases:
 1. If the memory error was consumed in user mode, the user process is
    killed and the faulty page is isolated.
 2. If the memory error was consumed in kernel mode, the kernel panics.

For case 2, an unconditional panic is not always the best choice, because
some scenarios can be handled without one. Take uaccess: if the access fails
due to a memory error, only the user process is affected, so returning an
error to the caller and isolating the user page with the hardware memory
error is a better choice.
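
The two cases above, plus the recoverable kernel-mode case this series
argues for, can be modelled as a small decision function. This is only a
userspace sketch: `enum sea_action`, `struct sea_ctx` and `classify_sea()`
are hypothetical names chosen for illustration and are not the code in
arch/arm64/mm/fault.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum sea_action {
	SEA_KILL_TASK_ISOLATE_PAGE,	/* case 1: error consumed in user mode */
	SEA_FIXUP_RETURN_ERROR,		/* kernel mode, access is marked recoverable */
	SEA_PANIC,			/* case 2: kernel mode, no recovery possible */
};

struct sea_ctx {
	bool user_mode;		/* was the error consumed while in user mode? */
	bool has_mc_fixup;	/* does the faulting access have a recovery fixup? */
};

static enum sea_action classify_sea(const struct sea_ctx *ctx)
{
	assert(ctx);

	if (ctx->user_mode)
		return SEA_KILL_TASK_ISOLATE_PAGE;
	if (ctx->has_mc_fixup)
		return SEA_FIXUP_RETURN_ERROR;
	return SEA_PANIC;
}
```

In this model, a uaccess copy is simply a kernel-mode access with
`has_mc_fixup` set, so it ends in an error return instead of a panic.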

[1]  commit d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline")
[2]  commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
[3]  commit 6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")
[4]  commit 245f09226893 ("mm: hwpoison: coredump: support recovery from dump_user_range()")
[5]  commit 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory")
[6]  commit 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
[7]  commit 278b917f8cb9 ("x86/mce: Add _ASM_EXTABLE_CPY for copy user access")
[8]  commit 658be46520ce ("mm: support poison recovery from copy_present_page()")
[9]  commit aa549f923f5e ("mm: support poison recovery from do_cow_fault()")
[10] commit f00b295b9b61 ("fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()")
[11] commit 060913999d7a ("mm: migrate: support poisoned recover from migrate folio")

------------------
Test result:

Tested on Kunpeng 920.

1. copy_page() and copy_mc_page() basic function tests pass, and the
   disassembly remains the same before and after the refactor.

2. copy_to/from_user() accessing a kernel NULL pointer raises a translation
   fault, dumps an error message and then calls die(); test passes.

3. Test following scenarios: copy_from_user(), get_user(), COW.

   Before the patch: a hardware memory error triggers a panic.
   After the patch:  a hardware memory error does not trigger a panic.

   Testing steps:
   step1: start a user process.
   step2: poison (einj) a page of the user process.
   step3: the user process accesses the poisoned page in kernel mode, which
          triggers an SEA.
   step4: the kernel does not panic; only the user process is killed and the
          poisoned page is isolated. (Before the patch, the kernel panics in
          do_sea().)

   The above tests can also be reproduced using ras-tools with an extra patch[1],
   which provides einj-based injection and validation for all MC-safe recovery paths.

   MM subsystem (hwpoison recovery via copy_mc_[user_]highpage / copy_mc_to_kernel):

     einj_mem_uc cow_anon            # wp_page_copy
     einj_mem_uc cow_anon_pinned     # copy_present_page (DMA-pinned)
     einj_mem_uc cow_hugetlb         # hugetlb CoW
     einj_mem_uc cow_private_filemap # do_cow_fault
     einj_mem_uc khugepaged_anon     # MADV_COLLAPSE anon
     einj_mem_uc khugepaged_file     # MADV_COLLAPSE file
     einj_mem_uc move_pages_numa     # migrate_folio
     einj_mem_uc migrate_pages_numa  # migrate_pages cross-node
     einj_mem_uc mbind_move          # mbind MPOL_MF_MOVE
     einj_mem_uc migrate_hugetlb     # hugetlbfs_migrate_folio
     einj_mem_uc coredump            # dump_page_copy -> copy_mc_to_kernel

   uaccess (copy_from_user direction):

     einj_mem_uc pwrite_uc           # pwrite(2)
     einj_mem_uc writev_uc           # writev(2)
     einj_mem_uc send_uc             # send(2) AF_UNIX
     einj_mem_uc sendmsg_uc          # sendmsg(2)
     einj_mem_uc setsockopt_uc       # setsockopt(2)
     einj_mem_uc netlink_send_uc     # AF_NETLINK sendto(2)
     einj_mem_uc msgsnd_uc           # SysV msgsnd(2)
     einj_mem_uc mq_send_uc          # POSIX mq_send(3)
     einj_mem_uc semop_uc            # semop(2)
     einj_mem_uc process_vm_writev_uc # process_vm_writev(2)

   Repo: https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
   [1] https://lore.kernel.org/all/20260617015211.3962419-1-tianruidong@linux.alibaba.com/

------------------

Benefits
=========
According to Huawei's statistics from their storage products, memory errors
triggered in kernel-mode by COW and page cache read (uaccess) scenarios
account for more than 50%. With this patchset deployed, all kernel panics
caused by COW and page cache memory errors are eliminated. 
Alibaba Cloud has also observed memory errors occurring in uaccess contexts.

Since V14:
1. Added additional recoverable scenarios (copy_present_page, do_cow_fault,
   hugetlbfs_migrate_folio, migrate_folio) to the description, as reminded
   by Kefeng Wang.
2. Renamed EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR to EX_TYPE_KACCESS_SEA.
3. Applied review comments from Xueshuai Liu.

Since V13:
1. Changed MC-safe functions to return an error rather than kill the user
   process. When a user program invokes a syscall and the kernel encounters
   a memory error during uaccess, killing the process is unexpected; the
   syscall should return an error.
2. Added FEAT_MOPS support for the copy_page_mc paths.
3. Refactored copy_page() and memcpy() on top of the shared memcpy_template,
   reducing duplicated assembly code.
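
Item 1 above can be pictured with a small userspace model. This is a sketch
under assumptions, not the series' implementation: `model_mc_copy()` and
`model_syscall_write()` are hypothetical names, and the poisoned byte is
simulated with an offset parameter rather than a real hardware error. The
copy returns the number of bytes not copied, following the copy_from_user()
convention, and the caller turns a short copy into -EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for an MC-safe copy: copies up to the byte at
 * poison_off (when that offset lies inside the buffer) and returns the
 * number of bytes NOT copied. poison_off >= n means "no poison".
 */
static size_t model_mc_copy(void *dst, const void *src, size_t n,
			    size_t poison_off)
{
	size_t ok = poison_off < n ? poison_off : n;

	assert(dst && src);
	memcpy(dst, src, ok);
	return n - ok;
}

/* Hypothetical syscall body: a short copy becomes -EFAULT for the caller. */
static long model_syscall_write(void *kbuf, const void *ubuf, size_t n,
				size_t poison_off)
{
	if (model_mc_copy(kbuf, ubuf, n, poison_off))
		return -EFAULT;
	return (long)n;
}
```

The calling process keeps running and sees an ordinary error code, which is
the behaviour the item describes.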

Since v12:
Thanks to the suggestions of Jonathan, Mark, and Mauro, the following modifications
were made:
1. Rebase to latest kernel version.
2. Patch1, add Jonathan's and Mauro's Reviewed-by.
3. Patch2, modified do_apei_claim_sea() according to Mark's and Jonathan's suggestions,
   and improved the commit message according to Mark's suggestions (added a description of
   the impact on regular copy_to_user()).
4. Patch3, improved the commit message according to Mauro's suggestions and added Jonathan's
   Reviewed-by.
5. Patch4, modified copy_mc_user_highpage() and improved the commit message according to
   Jonathan's suggestions (no functional changes).
6. Patch5, improved the commit message according to Mauro's suggestions.
7. Patch4/5, FEAT_MOPS is added to the code logic. Currently, no fixup is performed
   for the MOPS instructions.
8. Remove patch6 in v12 according to Jonathan's suggestions.

Since v11:
1. Rebase to latest kernel version 6.9-rc1.
2. Add patch 5, since the problem described in "Since V10 Besides 3" has
   been solved in a50026bdb867 ('iov_iter: get rid of 'copy_mc' flag').
3. Add the benefit of applying the patch set to our company to the description of patch0.

Since V10:
 According to Mark's suggestion:
 1. Merge V10's patch2 and patch3 into V11's patch2.
 2. Patch2(V11): use a new fixup_type for ld* in copy_to_user(), so that fatal
    issues (NULL kernel pointer access) are no longer fixed up incorrectly.
 3. Patch2(V11): refactor the logic of do_sea().
 4. Patch4(V11): remove duplicate assembly logic and remove do_mte().

 Besides:
 1. Patch2(V11): remove the fixup for st* instructions, since st* generally
    does not trigger memory errors.
 2. Split part of the logic of patch2(V11) into patch5(V11); for details,
    see patch5(V11)'s commit message.
 3. Remove patch6(v10) "arm64: introduce copy_mc_to_kernel() implementation".
    During modification, some problems were found that cannot be solved in
    a short period. The patch will be released after the problems are
    solved.
 4. Add test results to this patch.
 5. Modify the patchset title: do not use "machine check" and remove "-next".

Since V9:
 1. Rebase to latest kernel version 6.8-rc2.
 2. Add patch 6/6 to support copy_mc_to_kernel().

Since V8:
 1. Rebase to latest kernel version and fix typos in some of the patches.
 2. According to the suggestion of Catalin, I attempted to modify the
    return value of function copy_mc_[user]_highpage() to bytes not copied.
    During the modification process, I found that it would be more
    reasonable to return -EFAULT when copy error occurs (referring to the
    newly added patch 4). 

    For ARM64, the implementation of copy_mc_[user]_highpage() needs to
    consider MTE. Considering the scenario where data copying is successful
    but the MTE tag copying fails, it is also not reasonable to return
    bytes not copied.
 3. Considering the recent addition of machine check safe support for
    multiple scenarios, modify commit message for patch 5 (patch 4 for V8).

Since V7:
 Patches supporting recovery from poison consumption in the cow scenario
 already exist[1]. Therefore, supporting the cow scenario on arm64 only
 requires modifying the relevant code under arch/.
 [1] https://lore.kernel.org/lkml/20221031201029.102123-1-tony.luck@intel.com/

Since V6:
 Resend patches that are not merged into the mainline in V6.

Since V5:
 1. Add patch2/3 to add uaccess assembly helpers.
 2. Optimize the implementation logic of arm64_do_kernel_sea() in patch8.
 3. Remove kernel access fixup in patch9.
 All suggestions are from Mark.

Since V4:
 1. According to Michael's suggestion, add patch5.
 2. According to Mark's suggestion, restructure the arm64 extable, then
 adapt machine check safe support on top of it.
 3. According to Mark's suggestion, support machine check safe in do_mte() in
 the cow scenario.
 4. In V4, two patches were merged into -next, so V5 does not resend these
 two patches.

Since V3:
 1. According to Robin's suggestion, directly modify user_ldst and
 user_ldp in asm-uaccess.h and modify mte.S.
 2. Add new macro USER_MC in asm-uaccess.h, used in copy_from_user.S
 and copy_to_user.S.
 3. According to Robin's suggestion, use macros in copy_page_mc.S to
 simplify code.
 4. According to Kefeng's suggestion, modify powerpc code in patch1.
 5. According to Kefeng's suggestion, modify mm/extable.c and do some code
 optimization.

Since V2:
 1. According to Mark's suggestion, all uaccess can now recover from
    memory errors.
 2. The page cache read scenario is also supported as part of uaccess
    (copy_to_user()), and the code duplication problem is also solved.
    Thanks to Robin for the suggestion.
 3. According to Mark's suggestion, update the commit message of patch 2/5.
 4. According to Borislav's suggestion, update the commit message of patch 1/5.

Since V1:
 1. Consistent with PPC/x86, use CONFIG_ARCH_HAS_COPY_MC instead of
    ARM64_UCE_KERNEL_RECOVERY.
 2. Add two new scenarios, cow and page cache reading.
 3. Fix two small bugs (the first two patches).

V1 in here:
https://lore.kernel.org/lkml/20220323033705.3966643-1-tongtiangen@huawei.com/

Ruidong Tian (5):
  ACPI: APEI: GHES: use exception context to gate SIGBUS on poison
    consumption
  arm64: extable: merge UACCESS_ERR_ZERO and KACCESS_ERR_ZERO into
    ACCESS_ERR_ZERO
  arm64: enable recover from synchronous external abort in kernel
    context
  lib/test: memcpy_kunit: add copy_page() and copy_mc_page() tests
  lib/tests: memcpy_kunit: add memcpy_mc() and memcpy_mc_large() test

Tong Tiangen (4):
  uaccess: add generic fallback version of copy_mc_to_user()
  mm/hwpoison: return -EFAULT when copy fail in
    copy_mc_[user]_highpage()
  arm64: support copy_mc_[user]_highpage()
  arm64: introduce copy_mc_to_kernel() implementation

 arch/arm64/Kconfig                   |   1 +
 arch/arm64/include/asm/asm-extable.h |  24 ++-
 arch/arm64/include/asm/asm-uaccess.h |   4 +
 arch/arm64/include/asm/extable.h     |   1 +
 arch/arm64/include/asm/mte.h         |   9 +
 arch/arm64/include/asm/page.h        |  12 ++
 arch/arm64/include/asm/string.h      |   5 +
 arch/arm64/include/asm/uaccess.h     |  17 ++
 arch/arm64/kernel/acpi.c             |   2 +-
 arch/arm64/lib/Makefile              |   2 +
 arch/arm64/lib/copy_mc_page.S        |  44 +++++
 arch/arm64/lib/copy_page.S           |  67 ++-----
 arch/arm64/lib/copy_page_template.S  |  70 ++++++++
 arch/arm64/lib/copy_to_user.S        |  10 +-
 arch/arm64/lib/memcpy.S              | 251 ++-------------------------
 arch/arm64/lib/memcpy_mc.S           |  56 ++++++
 arch/arm64/lib/memcpy_template.S     | 250 ++++++++++++++++++++++++++
 arch/arm64/lib/mte.S                 |  29 ++++
 arch/arm64/mm/copypage.c             |  80 +++++++++
 arch/arm64/mm/extable.c              |  35 +++-
 arch/arm64/mm/fault.c                |  30 +++-
 arch/powerpc/include/asm/uaccess.h   |   1 +
 arch/x86/include/asm/uaccess.h       |   1 +
 drivers/acpi/apei/ghes.c             |  36 ++--
 include/acpi/ghes.h                  |  15 +-
 include/linux/highmem.h              |  16 +-
 include/linux/uaccess.h              |   8 +
 lib/tests/memcpy_kunit.c             | 186 +++++++++++++++++++-
 mm/kasan/shadow.c                    |  12 ++
 mm/khugepaged.c                      |   4 +-
 30 files changed, 938 insertions(+), 340 deletions(-)
 create mode 100644 arch/arm64/lib/copy_mc_page.S
 create mode 100644 arch/arm64/lib/copy_page_template.S
 create mode 100644 arch/arm64/lib/memcpy_mc.S
 create mode 100644 arch/arm64/lib/memcpy_template.S

-- 
2.39.3

^ permalink raw reply

* [PATCH v7 2/3] arm64: dts: imx95: Add dma, intr, aer and pme interrupts for PCIe
From: hongxing.zhu @ 2026-06-18  9:20 UTC (permalink / raw)
  To: robh, krzk+dt, conor+dt, bhelgaas, frank.li, l.stach, lpieralisi,
	kwilczynski, mani, s.hauer, kernel, festevam
  Cc: linux-pci, linux-arm-kernel, devicetree, imx, linux-kernel,
	Richard Zhu
In-Reply-To: <20260618092100.3669556-1-hongxing.zhu@oss.nxp.com>

From: Richard Zhu <hongxing.zhu@nxp.com>

The current PCIe device tree configuration only defines the MSI
interrupt, which is sufficient for basic PCIe operation but limits
advanced functionality.

Add the following interrupt lines to pcie0 and pcie1 nodes:
- dma: DMA interrupt for PCIe DMA operations
- intr: General controller events and link state changes
- aer: Advanced Error Reporting interrupt
- pme: Power Management Event interrupt

This enables enhanced PCIe features and capabilities that were
previously unavailable due to missing interrupt definitions.

Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
---
 arch/arm64/boot/dts/freescale/imx95.dtsi | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/boot/dts/freescale/imx95.dtsi b/arch/arm64/boot/dts/freescale/imx95.dtsi
index 3e35c956a4d7a..1a9803f967901 100644
--- a/arch/arm64/boot/dts/freescale/imx95.dtsi
+++ b/arch/arm64/boot/dts/freescale/imx95.dtsi
@@ -1945,8 +1945,12 @@ pcie0: pcie@4c300000 {
 			bus-range = <0x00 0xff>;
 			num-lanes = <1>;
 			num-viewport = <8>;
-			interrupts = <GIC_SPI 310 IRQ_TYPE_LEVEL_HIGH>;
-			interrupt-names = "msi";
+			interrupts = <GIC_SPI 310 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 311 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 310 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 310 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 310 IRQ_TYPE_LEVEL_HIGH>;
+			interrupt-names = "msi", "dma", "intr", "aer", "pme";
 			#interrupt-cells = <1>;
 			interrupt-map-mask = <0 0 0 0x7>;
 			interrupt-map = <0 0 0 1 &gic 0 0 GIC_SPI 306 IRQ_TYPE_LEVEL_HIGH>,
@@ -2020,8 +2024,12 @@ pcie1: pcie@4c380000 {
 			bus-range = <0x00 0xff>;
 			num-lanes = <1>;
 			num-viewport = <8>;
-			interrupts = <GIC_SPI 316 IRQ_TYPE_LEVEL_HIGH>;
-			interrupt-names = "msi";
+			interrupts = <GIC_SPI 316 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 317 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 316 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 316 IRQ_TYPE_LEVEL_HIGH>,
+				     <GIC_SPI 316 IRQ_TYPE_LEVEL_HIGH>;
+			interrupt-names = "msi", "dma", "intr", "aer", "pme";
 			#interrupt-cells = <1>;
 			interrupt-map-mask = <0 0 0 0x7>;
 			interrupt-map = <0 0 0 1 &gic 0 0 GIC_SPI 312 IRQ_TYPE_LEVEL_HIGH>,
-- 
2.34.1



^ permalink raw reply related

* [PATCH v7 3/3] PCI: imx6: Add root port reset to support link recovery
From: hongxing.zhu @ 2026-06-18  9:21 UTC (permalink / raw)
  To: robh, krzk+dt, conor+dt, bhelgaas, frank.li, l.stach, lpieralisi,
	kwilczynski, mani, s.hauer, kernel, festevam
  Cc: linux-pci, linux-arm-kernel, devicetree, imx, linux-kernel,
	Richard Zhu
In-Reply-To: <20260618092100.3669556-1-hongxing.zhu@oss.nxp.com>

From: Richard Zhu <hongxing.zhu@nxp.com>

The PCIe link can go down due to various unexpected circumstances. Add
root port reset support to enable link recovery for the i.MX PCIe
controller when the optional "intr" interrupt is present.

When a link down event occurs, reset the root port by: uninitializing the
PCIe controller, re-initializing it, and restarting the link.

On i.MX95 platforms, link events and PME share the same interrupt line.
The link event interrupt cannot use a threaded-only IRQ handler because
the PME driver uses request_irq() with only the IRQF_SHARED flag set,
which requires a primary handler.

To handle this shared interrupt scenario, register a primary interrupt
handler with IRQF_SHARED for link events and manipulate the link event
enable bits to ensure the shared interrupt source triggers only one
handler at a time.

Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
---
 drivers/pci/controller/dwc/pci-imx6.c | 132 ++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/drivers/pci/controller/dwc/pci-imx6.c b/drivers/pci/controller/dwc/pci-imx6.c
index 773ab65b2afac..3de70f41b0b85 100644
--- a/drivers/pci/controller/dwc/pci-imx6.c
+++ b/drivers/pci/controller/dwc/pci-imx6.c
@@ -79,6 +79,11 @@
 #define IMX95_SID_MASK				GENMASK(5, 0)
 #define IMX95_MAX_LUT				32
 
+#define IMX95_LINK_INT_CTRL_STS			0x1040
+#define IMX95_PE0_INT_STS			0x10e8
+#define IMX95_LINK_DOWN_INT_STS			BIT(11)
+#define IMX95_LINK_DOWN_INT_EN			BIT(10)
+
 #define IMX95_PCIE_RST_CTRL			0x3010
 #define IMX95_PCIE_COLD_RST			BIT(0)
 
@@ -126,6 +131,8 @@ enum imx_pcie_variants {
 #define IMX_PCIE_MAX_INSTANCES	2
 
 struct imx_pcie;
+static int imx_pcie_reset_root_port(struct pci_host_bridge *bridge,
+				    struct pci_dev *pdev);
 
 struct imx_pcie_drvdata {
 	enum imx_pcie_variants variant;
@@ -158,6 +165,7 @@ struct imx_pcie {
 	bool			supports_clkreq;
 	bool			enable_ext_refclk;
 	struct regmap		*iomuxc_gpr;
+	int			lnk_intr;
 	u16			msi_ctrl;
 	u32			controller_id;
 	struct reset_control	*pciephy_reset;
@@ -1394,6 +1402,13 @@ static int imx_pcie_host_init(struct dw_pcie_rp *pp)
 
 	imx_setup_phy_mpll(imx_pcie);
 
+	/*
+	 * Callback invoked by PCI core when link down is detected and
+	 * recovery is needed.
+	 */
+	if (pp->bridge)
+		pp->bridge->reset_root_port = imx_pcie_reset_root_port;
+
 	return 0;
 
 err_phy_off:
@@ -1661,6 +1676,9 @@ static int imx_pcie_suspend_noirq(struct device *dev)
 	if (!(imx_pcie->drvdata->flags & IMX_PCIE_FLAG_SUPPORTS_SUSPEND))
 		return 0;
 
+	if (imx_pcie->lnk_intr > 0)
+		regmap_clear_bits(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS,
+				  IMX95_LINK_DOWN_INT_EN);
 	imx_pcie_msi_save_restore(imx_pcie, true);
 	if (imx_check_flag(imx_pcie, IMX_PCIE_FLAG_HAS_LUT))
 		imx_pcie_lut_save(imx_pcie);
@@ -1711,6 +1729,9 @@ static int imx_pcie_resume_noirq(struct device *dev)
 	if (imx_check_flag(imx_pcie, IMX_PCIE_FLAG_HAS_LUT))
 		imx_pcie_lut_restore(imx_pcie);
 	imx_pcie_msi_save_restore(imx_pcie, false);
+	if (imx_pcie->lnk_intr > 0)
+		regmap_set_bits(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS,
+				IMX95_LINK_DOWN_INT_EN);
 
 	return 0;
 }
@@ -1720,6 +1741,86 @@ static const struct dev_pm_ops imx_pcie_pm_ops = {
 				  imx_pcie_resume_noirq)
 };
 
+static irqreturn_t imx_pcie_lnk_irq_isr(int irq, void *priv)
+{
+	struct imx_pcie *imx_pcie = priv;
+	struct dw_pcie *pci = imx_pcie->pci;
+	struct device *dev = pci->dev;
+	u32 val;
+
+	regmap_read(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS, &val);
+	if (val & IMX95_LINK_DOWN_INT_STS) {
+		dev_dbg(dev, "PCIe link down detected, initiating recovery\n");
+		/* Clear link down interrupt status by writing 1b'1 to it */
+		regmap_set_bits(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS,
+				IMX95_LINK_DOWN_INT_STS);
+		if (!(val & IMX95_LINK_DOWN_INT_EN))
+			return IRQ_NONE;
+		regmap_clear_bits(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS,
+				  IMX95_LINK_DOWN_INT_EN);
+
+		return IRQ_WAKE_THREAD;
+	}
+
+	regmap_read(imx_pcie->iomuxc_gpr, IMX95_PE0_INT_STS, &val);
+	if (unlikely(val))
+		regmap_write(imx_pcie->iomuxc_gpr, IMX95_PE0_INT_STS, val);
+
+	return IRQ_NONE;
+}
+
+static irqreturn_t imx_pcie_lnk_irq_thread(int irq, void *priv)
+{
+	struct imx_pcie *imx_pcie = priv;
+	struct dw_pcie *pci = imx_pcie->pci;
+	struct dw_pcie_rp *pp = &pci->pp;
+	struct pci_dev *port;
+
+	for_each_pci_bridge(port, pp->bridge->bus)
+		if (pci_pcie_type(port) == PCI_EXP_TYPE_ROOT_PORT)
+			pci_host_handle_link_down(port);
+
+	regmap_set_bits(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS,
+			IMX95_LINK_DOWN_INT_EN);
+
+	return IRQ_HANDLED;
+}
+
+static int imx_pcie_reset_root_port(struct pci_host_bridge *bridge,
+				    struct pci_dev *pdev)
+{
+	struct pci_bus *bus = bridge->bus;
+	struct dw_pcie_rp *pp = bus->sysdata;
+	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
+	struct imx_pcie *imx_pcie = to_imx_pcie(pci);
+	int ret;
+
+	imx_pcie_msi_save_restore(imx_pcie, true);
+	if (imx_check_flag(imx_pcie, IMX_PCIE_FLAG_HAS_LUT))
+		imx_pcie_lut_save(imx_pcie);
+	imx_pcie_stop_link(pci);
+	imx_pcie_host_exit(pp);
+
+	ret = imx_pcie_host_init(pp);
+	if (ret) {
+		dev_err(pci->dev, "Failed to re-init PCIe\n");
+		return ret;
+	}
+	ret = dw_pcie_setup_rc(pp);
+	if (ret)
+		return ret;
+
+	imx_pcie_start_link(pci);
+	dw_pcie_wait_for_link(pci);
+
+	if (imx_check_flag(imx_pcie, IMX_PCIE_FLAG_HAS_LUT))
+		imx_pcie_lut_restore(imx_pcie);
+	imx_pcie_msi_save_restore(imx_pcie, false);
+
+	dev_dbg(pci->dev, "Root port reset completed\n");
+	return 0;
+}
+
 static int imx_pcie_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -1919,15 +2020,46 @@ static int imx_pcie_probe(struct platform_device *pdev)
 			val |= PCI_MSI_FLAGS_ENABLE;
 			dw_pcie_writew_dbi(pci, offset + PCI_MSI_FLAGS, val);
 		}
+
+		/* Get link event irq if it is present */
+		imx_pcie->lnk_intr = platform_get_irq_byname_optional(pdev, "intr");
+		if (imx_pcie->lnk_intr == -EPROBE_DEFER) {
+			ret = -EPROBE_DEFER;
+			goto err_host_deinit;
+		}
+		if (imx_pcie->lnk_intr > 0) {
+			ret = devm_request_threaded_irq(dev, imx_pcie->lnk_intr,
+							imx_pcie_lnk_irq_isr,
+							imx_pcie_lnk_irq_thread,
+							IRQF_SHARED,
+							"lnk", imx_pcie);
+			if (ret) {
+				dev_err_probe(dev, ret,
+					      "unable to request LNK IRQ\n");
+				goto err_host_deinit;
+			}
+
+			regmap_set_bits(imx_pcie->iomuxc_gpr,
+					IMX95_LINK_INT_CTRL_STS,
+					IMX95_LINK_DOWN_INT_EN);
+		}
 	}
 
 	return 0;
+
+err_host_deinit:
+	dw_pcie_host_deinit(&pci->pp);
+
+	return ret;
 }
 
 static void imx_pcie_shutdown(struct platform_device *pdev)
 {
 	struct imx_pcie *imx_pcie = platform_get_drvdata(pdev);
 
+	if (imx_pcie->lnk_intr > 0)
+		regmap_clear_bits(imx_pcie->iomuxc_gpr, IMX95_LINK_INT_CTRL_STS,
+				  IMX95_LINK_DOWN_INT_EN);
 	/* bring down link, so bootloader gets clean state in case of reboot */
 	imx_pcie_assert_core_reset(imx_pcie);
 	imx_pcie_assert_perst(imx_pcie, true);
-- 
2.34.1



^ permalink raw reply related

* [PATCH v7 1/3] dt-bindings: imx6q-pcie: Add optional intr/aer/pme interrupts for i.MX95
From: hongxing.zhu @ 2026-06-18  9:20 UTC (permalink / raw)
  To: robh, krzk+dt, conor+dt, bhelgaas, frank.li, l.stach, lpieralisi,
	kwilczynski, mani, s.hauer, kernel, festevam
  Cc: linux-pci, linux-arm-kernel, devicetree, imx, linux-kernel,
	Richard Zhu, Frank Li
In-Reply-To: <20260618092100.3669556-1-hongxing.zhu@oss.nxp.com>

From: Richard Zhu <hongxing.zhu@nxp.com>

The i.MX95 PCIe controller introduces three additional dedicated hardware
interrupt lines for specific events:
- intr: general controller events
- aer: Advanced Error Reporting events
- pme: Power Management Events

These interrupts are optional on i.MX95. PCIe basic functionality
(enumeration, configuration, and data transfer) works correctly without
them, as the controller can operate using only the existing msi interrupt.

Earlier i.MX PCIe variants (imx6q, imx6sx, imx6qp, imx7d, imx8mm, imx8mp,
imx8mq, imx8q) do not have these three dedicated interrupt lines.

Update the binding to allow up to 5 interrupts for i.MX95, while
restricting earlier variants to a maximum of 2 interrupts using
conditional constraints (if/then schema). This ensures the schema
accurately reflects the hardware capabilities of each SoC variant.

Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
---
 .../bindings/pci/fsl,imx6q-pcie.yaml          | 25 +++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml b/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml
index e8b8131f5f23b..4f56e8e4f1008 100644
--- a/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml
+++ b/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml
@@ -58,12 +58,18 @@ properties:
     items:
       - description: builtin MSI controller.
       - description: builtin DMA controller.
+      - description: PCIe event interrupt.
+      - description: builtin AER SPI standalone interrupt line.
+      - description: builtin PME SPI standalone interrupt line.
 
   interrupt-names:
     minItems: 1
     items:
       - const: msi
       - const: dma
+      - const: intr
+      - const: aer
+      - const: pme
 
   reset-gpio:
     deprecated: true
@@ -249,6 +255,25 @@ allOf:
             - const: ref
             - const: extref  # Optional
 
+  - if:
+      properties:
+        compatible:
+          enum:
+            - fsl,imx6q-pcie
+            - fsl,imx6sx-pcie
+            - fsl,imx6qp-pcie
+            - fsl,imx7d-pcie
+            - fsl,imx8mm-pcie
+            - fsl,imx8mp-pcie
+            - fsl,imx8mq-pcie
+            - fsl,imx8q-pcie
+    then:
+      properties:
+        interrupts:
+          maxItems: 2
+        interrupt-names:
+          maxItems: 2
+
 unevaluatedProperties: false
 
 examples:
-- 
2.34.1



^ permalink raw reply related

* [PATCH v7 0/3] Add root port reset to support link recovery
From: hongxing.zhu @ 2026-06-18  9:20 UTC (permalink / raw)
  To: robh, krzk+dt, conor+dt, bhelgaas, frank.li, l.stach, lpieralisi,
	kwilczynski, mani, s.hauer, kernel, festevam
  Cc: linux-pci, linux-arm-kernel, devicetree, imx, linux-kernel

Based on the following patch-set[1] issued by Mani.
Add support for resetting the Root Port for i.MX PCIe to enable link recovery.

[1] [PATCH v8 0/5] PCI: Add support for resetting the Root Ports in a platform specific way

PCIe links can go down due to various unexpected circumstances. This patch series
adds root port reset support for link recovery on i.MX PCIe controllers when the
optional "intr" interrupt is present.

When a link down event is detected, the root port reset uninitializes and
reinitializes the PCIe controller, then restarts the PCIe link.

On i.MX95 platforms, link events and PME share the same interrupt line.
Link event interrupts cannot use only an IRQ thread handler because the PME
driver uses request_irq() to bind the PME interrupt directly with only the
IRQF_SHARED flag set.

To address this, we register one handler with IRQF_SHARED for link event
interrupts and manipulate the enable bits of link events to ensure the same
interrupt source is triggered only once at a time.
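
The handshake between the primary handler and the IRQ thread can be sketched
as a userspace model. It follows the flow of imx_pcie_lnk_irq_isr() and
imx_pcie_lnk_irq_thread() in patch 3/3, but everything here is a simplified
assumption: the register is a plain variable, the `model_*` names are
hypothetical, and the status acknowledgement is modelled as a direct bit
clear instead of a hardware write-1-to-clear.

```c
#include <assert.h>
#include <stdint.h>

#define LINK_DOWN_STS	(1u << 11)	/* mirrors IMX95_LINK_DOWN_INT_STS */
#define LINK_DOWN_EN	(1u << 10)	/* mirrors IMX95_LINK_DOWN_INT_EN */

enum model_irqreturn {
	MODEL_IRQ_NONE,
	MODEL_IRQ_WAKE_THREAD,
	MODEL_IRQ_HANDLED,
};

/* Primary handler: runs for every interrupt on the shared line. */
static enum model_irqreturn model_primary(uint32_t *reg)
{
	uint32_t val;

	assert(reg);
	val = *reg;
	if (!(val & LINK_DOWN_STS))
		return MODEL_IRQ_NONE;	/* not a link event: leave it to PME */

	*reg &= ~LINK_DOWN_STS;		/* acknowledge the event */
	if (!(val & LINK_DOWN_EN))
		return MODEL_IRQ_NONE;	/* recovery is already in progress */

	*reg &= ~LINK_DOWN_EN;		/* mask until the thread has finished */
	return MODEL_IRQ_WAKE_THREAD;
}

/* Thread handler: the root port reset would run here, then unmask. */
static enum model_irqreturn model_thread(uint32_t *reg)
{
	assert(reg);
	*reg |= LINK_DOWN_EN;
	return MODEL_IRQ_HANDLED;
}
```

A second link-down event that arrives while the thread is still recovering
is acknowledged but does not wake the thread again, because the enable bit
is clear until the thread sets it.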

Additionally, this series adds 'intr', 'aer', and 'pme' interrupt entries to
the i.MX6Q PCIe binding to support PCIe event-based interrupts for general
controller events, Advanced Error Reporting, and Power Management Events
respectively.

Changes in v7:
- Remove the redundant maxItems setting of the interrupts property.
- Update the driver code according to the sashiko reviews.

Changes in v6:
- Use conditional constraints (if/then schema) to specify that these three
optional interrupts are only valid for the i.MX95 variant, while other
variants like imx6q should not have them.
- Change lnk_intr data type from u32 to int to properly handle negative
error codes returned by platform_get_irq_byname_optional().
- Replace platform_get_irq_byname() with platform_get_irq_byname_optional()
to suppress unnecessary error messages when the optional link event IRQ is
not present in the device tree.
- To avoid inadvertently clearing the pending W1C status bit, clear the W1C
bit first, then call regmap_clear_bits().
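
The hazard behind the last item can be shown with a small register model.
This is a sketch under assumptions: `model_reg_write()` and
`model_clear_bits()` are hypothetical helpers, the bit positions merely
mirror the driver's defines, and the read-modify-write behaviour is modelled
in the style of regmap_clear_bits() rather than taken from the regmap code.

```c
#include <assert.h>
#include <stdint.h>

#define STS_W1C	(1u << 11)	/* write-1-to-clear status bit */
#define EN_RW	(1u << 10)	/* ordinary read/write enable bit */

/* Hypothetical register model: a written 1 clears STS, EN is stored as-is. */
static void model_reg_write(uint32_t *reg, uint32_t val)
{
	uint32_t sts;

	assert(reg);
	/* A pending status bit survives only if it is written as 0. */
	sts = *reg & STS_W1C & ~val;
	*reg = (val & EN_RW) | sts;
}

/* Read-modify-write helper: read, drop the given bits, write back. */
static void model_clear_bits(uint32_t *reg, uint32_t bits)
{
	assert(reg);
	model_reg_write(reg, *reg & ~bits);
}
```

Calling `model_clear_bits(&reg, EN_RW)` while the status bit is pending
writes back the 1 that was read from it, so the status is cleared as a side
effect. Handling the status bit explicitly before touching the enable bit
makes that acknowledgement deliberate instead of accidental.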

Changes in v5:
- Update the commit message of the first dt-binding patch for clarity.
- Add explicit comment explaining that writing 1 to IMX95_LINK_DOWN_INT_STS
clears the bit

Changes in v4:
- Make the three newly added interrupts optional.

Changes in v3:
- Don't add a new if:block; Drop the maxItems constraint of the interrupts
  property for i.MX95 PCIe.
- Add constraints for the interrupts property for other variants.
- Regarding the ABI break: add descriptions explaining why these new
  interrupts are mandatory and required by i.MX95 PCIe.

Changes in v2:
- Constrain the three newly added interrupt entries to be valid only for the
  i.MX95 variant using conditional schemas.

[PATCH v7 1/3] dt-bindings: imx6q-pcie: Add optional intr/aer/pme
[PATCH v7 2/3] arm64: dts: imx95: Add dma, intr, aer and pme
[PATCH v7 3/3] PCI: imx6: Add root port reset to support link

Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml |  25 +++++++++++++++++
arch/arm64/boot/dts/freescale/imx95.dtsi                  |  16 ++++++++---
drivers/pci/controller/dwc/pci-imx6.c                     | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 169 insertions(+), 4 deletions(-)

^ permalink raw reply

* Re: [PATCH net] net: ethernet: ti: icssg: guard PA stat lookups
From: Simon Horman @ 2026-06-18  9:10 UTC (permalink / raw)
  To: Philippe Schenker
  Cc: netdev, Philippe Schenker, danishanwar, rogerq, linux-arm-kernel,
	stable, Andrew Lunn, David Carlier, David S. Miller, Eric Dumazet,
	Jacob Keller, Jakub Kicinski, Kevin Hao, Meghana Malladi,
	Paolo Abeni, Vadim Fedorenko, linux-kernel
In-Reply-To: <20260616143642.1972071-1-dev@pschenker.ch>

On Tue, Jun 16, 2026 at 04:35:34PM +0200, Philippe Schenker wrote:
> From: Philippe Schenker <philippe.schenker@impulsing.ch>
> 
> icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name()
> with FW PA stat names regardless of whether the PA stats block is
> present on the hardware.  emac_get_stat_by_name() already guards the
> PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer
> is NULL the lookup falls through to netdev_err() and returns -EINVAL.
> Because ndo_get_stats64 is polled regularly by the networking stack
> this produces thousands of log entries of the form:
> 
>   icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR
> 
> A secondary consequence is that the int(-EINVAL) return value is
> implicitly widened to a near-ULLONG_MAX unsigned value when accumulated
> into the __u64 fields of rtnl_link_stats64, silently corrupting the
> rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`.
> 
> Every other PA-aware code path in the driver is already guarded with
> the same `if (emac->prueth->pa_stats)` check.  Apply the same guard
> here.
> 
> Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats")

nit: no blank line between tags

> 
> Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch>
> 
> Cc: danishanwar@ti.com
> Cc: rogerq@kernel.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: stable@vger.kernel.org

Reviewed-by: Simon Horman <horms@kernel.org>
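
For reference, the widening described in the quoted commit message can be
reproduced in plain userspace C. This is a minimal sketch, not driver code:
accumulate_stat() is a hypothetical helper that stands in for adding the int
return value of a stats lookup to a 64-bit counter.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for accumulating an int-returning stats lookup
 * into a __u64-style counter. The int operand is converted to the
 * unsigned 64-bit type before the addition, so a negative errno wraps
 * around modulo 2^64.
 */
static uint64_t accumulate_stat(uint64_t counter, int stat)
{
	return counter + stat;
}
```

Calling accumulate_stat(0, -EINVAL) yields 2^64 - EINVAL, which is the
near-ULLONG_MAX counter value the commit message refers to.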



^ permalink raw reply

* Re: [RFC PATCH v2 1/3] mm/huge_memory: make persistent huge zero folio read-only
From: David Hildenbrand (Arm) @ 2026-06-18  9:06 UTC (permalink / raw)
  To: Xueyuan Chen
  Cc: dave.hansen, akpm, linux-mm, linux-kernel, linux-arm-kernel, x86,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, luto, peterz,
	hpa, ljs, liam, vbabka, rppt, surenb, mhocko, ziy, baolin.wang,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, yang, jannh
In-Reply-To: <20260617141547.144275-1-xueyuan.chen21@gmail.com>

On 6/17/26 16:15, Xueyuan Chen wrote:
> 
> On Wed, Jun 17, 2026 at 01:50:08PM +0200, David Hildenbrand (Arm) wrote:
> 
> Hi, David
> 
>> Yes, kerneldoc please.
> 
> Ack.
> 
>>
>> We're adjusting the directmap, remapping a r/w page to be r/o. I think we should
>> be very clear about which transition we expect+support.
>>
>> Also, I rather hate the "set_memory" naming scheme ... "set_direct_map" is
>> clearer. Anyhow ...
>>
>> Now we are throwing a "arch_make_pages_*" into the mix.
>>
>> Should it really contain the "arch"?
>> Should it really contain the "make" ?
>>
>> Why can't we just reuse set_memory_ro and pass address+nr_pages? (highmem check?
>> Could that be moved in there?)
>>
>> Or do we want a "change_direct_map_ro()" / "remap_direct_map_ro" interface?
>>
>>
> 
> How about naming it int set_direct_map_ro(struct page *page, unsigned nr)?

To distinguish it from "set_memory*" cruft, maybe best to use "remap" or
"adjust" instead.

-- 
Cheers,

David


^ permalink raw reply

* [PATCH] iommu/io-pgtable-arm: Add support for contiguous hint bit
From: Vijayanand Jitta @ 2026-06-18  9:02 UTC (permalink / raw)
  To: Joerg Roedel (AMD), Will Deacon, Robin Murphy
  Cc: linux-arm-msm, iommu, linux-kernel, linux-arm-kernel,
	Prakash Gupta, Vijayanand Jitta

From: Prakash Gupta <prakash.gupta@oss.qualcomm.com>

Add support for the contiguous hint (CONT) bit in ARM LPAE page tables.
When a set of consecutive PTEs map a naturally-aligned contiguous block
of memory, the CONT bit can be set on all entries in the group to allow
the hardware to combine them into a single TLB entry, improving TLB
utilization.

The contiguous hint sizes per granule are:

  Page Size | CONT PTE |  PMD  | CONT PMD
  ----------+----------+-------+---------
      4K    |   64K    |   2M  |   32M
     16K    |    2M    |  32M  |    1G
     64K    |    2M    | 512M  |   16G

Contiguous hint sizes are advertised in pgsize_bitmap, analogous to
how the CPU MMU advertises them via hugetlb hstates, so that IOMMU API
users (e.g. __iommu_dma_alloc_pages()) can align allocations to these
sizes and benefit from the TLB optimization automatically.
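
As an illustration of the table above, the contiguous sizes can be derived
from the granule alone. This is a userspace sketch under stated assumptions:
the helper names are hypothetical, the helpers are keyed by granule rather
than by block size as in the patch, and descriptors are assumed to be 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Entries per contiguous group at the last (PTE) level, keyed by granule. */
static unsigned int cont_ptes(uint64_t granule)
{
	switch (granule) {
	case 1ULL << 12: return 16;	/* 4K  * 16  = 64K */
	case 1ULL << 14: return 128;	/* 16K * 128 = 2M  */
	case 1ULL << 16: return 32;	/* 64K * 32  = 2M  */
	default:	 return 1;	/* no contiguous hint */
	}
}

/* Entries per contiguous group one level up (PMD), keyed by granule. */
static unsigned int cont_pmds(uint64_t granule)
{
	switch (granule) {
	case 1ULL << 12: return 16;	/* 2M   * 16 = 32M */
	case 1ULL << 14: return 32;	/* 32M  * 32 = 1G  */
	case 1ULL << 16: return 32;	/* 512M * 32 = 16G */
	default:	 return 1;	/* no contiguous hint */
	}
}

/* A table holds granule / 8 descriptors, so one PMD entry spans this much. */
static uint64_t pmd_size(uint64_t granule)
{
	assert(granule >= 8);
	return granule * (granule / 8);
}

/* Bits that would be OR-ed into a pgsize_bitmap for this granule. */
static uint64_t cont_size_bitmap(uint64_t granule)
{
	return cont_ptes(granule) * granule |
	       cont_pmds(granule) * pmd_size(granule);
}
```

For a 4K granule, cont_size_bitmap() returns SZ_64K | SZ_32M, matching the
first row of the table.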

Support is gated behind CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT, which
provides a compile-time opt-out for hardware affected by SMMU errata
related to the contiguous bit.

On the mapping side, __arm_lpae_map() detects when the requested size
matches a contiguous range at the next level, sets the CONT bit on all
PTEs in the group, then recurses with the base block size and an
adjusted pgcount.
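
The size/pgcount adjustment described above can be sketched in isolation as
follows. The struct and function names are hypothetical, and the
num_cont > 1 guard is a choice made for this sketch rather than something
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What gets passed down to the next level of the (hypothetical) walk. */
struct map_request {
	size_t size;	/* size of each mapping */
	size_t pgcount;	/* number of mappings of that size */
	bool cont;	/* whether the CONT bit should be set */
};

/*
 * If the caller asked for whole contiguous groups, rewrite the request in
 * units of the base block size and scale pgcount by the group length.
 * Otherwise pass the request through unchanged.
 */
static struct map_request split_cont_request(size_t size, size_t pgcount,
					     size_t block_size,
					     unsigned int num_cont)
{
	struct map_request req = { size, pgcount, false };

	assert(block_size > 0 && num_cont > 0);

	if (num_cont > 1 && size == (size_t)num_cont * block_size) {
		req.size = block_size;
		req.pgcount = pgcount * num_cont;
		req.cont = true;
	}
	return req;
}
```

With a 4K granule, a request for two 64K mappings becomes 32 mappings of 4K
with the CONT bit set; a plain 4K request is left as it is.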

On the unmapping side, the CONT bit is cleared from all PTEs in the
affected contiguous group before any individual entry is invalidated,
following the Break-Before-Make requirement of the architecture.
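
The range of entries whose CONT bit has to be cleared can be worked out with
plain index arithmetic. The sketch below mirrors the round_down()/round_up()
logic on table indices only; struct cont_span and cont_clear_span() are
hypothetical names, and no page table or TLB maintenance is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* First table index and number of entries to strip the CONT bit from. */
struct cont_span {
	size_t first;
	size_t count;
};

/*
 * idx:         table index of the first entry being unmapped
 * num_entries: number of entries being unmapped
 * num_cont:    entries per contiguous group at this level
 */
static struct cont_span cont_clear_span(size_t idx, size_t num_entries,
					size_t num_cont)
{
	struct cont_span span;
	size_t offset;

	assert(num_cont > 0);

	/* Round the start down to the contiguous group boundary. */
	span.first = idx - idx % num_cont;

	/* Include the leading entries, then round up to whole groups. */
	offset = idx - span.first;
	span.count = (offset + num_entries + num_cont - 1) / num_cont * num_cont;

	return span;
}
```

For example, with 16-entry groups, unmapping 4 entries starting at index 30
touches two groups, so the span starts at index 16 and covers 32 entries.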

Tested on QEMU (arm64/SMMUv3) with iommu_map()/iommu_unmap() of
contiguous hint sizes; verified the CONT bit is correctly set on map
and cleared on unmap via page table walk.

Co-developed-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com>
Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com>
Signed-off-by: Prakash Gupta <prakash.gupta@oss.qualcomm.com>
---
 drivers/iommu/Kconfig          |  16 +++
 drivers/iommu/io-pgtable-arm.c | 216 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 226 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 6e07bd69467a3..1c514361c5c9e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -50,6 +50,22 @@ config IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST
 
 	  If unsure, say N here.
 
+config IOMMU_IO_PGTABLE_CONTIG_HINT
+	bool "Enable contiguous hint"
+	depends on IOMMU_IO_PGTABLE_LPAE
+	default y
+	help
+	  Enable contiguous hint (CONT bit) support for the ARM LPAE page
+	  table allocator. Contiguous hint sizes are advertised in the
+	  pgsize_bitmap so that IOMMU API users can align allocations to
+	  these sizes and benefit from improved TLB utilization, analogous
+	  to how the CPU MMU advertises contiguous sizes via hugetlb.
+
+	  Disabling this option provides a compile-time opt-out for
+	  hardware affected by SMMU errata related to the contiguous bit.
+
+	  If unsure, say Y here.
+
 config IOMMU_IO_PGTABLE_ARMV7S
 	bool "ARMv7/v8 Short Descriptor Format"
 	select IOMMU_IO_PGTABLE
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 476c0e25631af..9fc60520177f1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -86,6 +86,21 @@
 /* Software bit for solving coherency races */
 #define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
 
+/* PTE Contiguous Bit */
+#define ARM_LPAE_PTE_CONT		(((arm_lpae_iopte)1) << 52)
+
+/*
+ * CONTIG HINT SUPPORT TABLE
+ *
+ *---------------------------------------------------
+ *| Page Size | CONT PTE |  PMD  | CONT PMD |  PUD  |
+ *---------------------------------------------------
+ *|     4K    |   64K    |   2M  |    32M   |   1G  |
+ *|    16K    |    2M    |  32M  |     1G   |       |
+ *|    64K    |    2M    | 512M  |    16G   |       |
+ *---------------------------------------------------
+ */
+
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_AP_RDONLY_BIT	7
@@ -453,6 +468,111 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
 	return old;
 }
 
+#ifdef CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT
+static inline int arm_lpae_cont_ptes(unsigned long size)
+{
+	if (size == SZ_4K)
+		return 16;
+	if (size == SZ_16K)
+		return 128;
+	if (size == SZ_64K)
+		return 32;
+	return 1;
+}
+
+static inline unsigned long arm_lpae_cont_pte_size(unsigned long size)
+{
+	return arm_lpae_cont_ptes(size) * size;
+}
+
+static inline int arm_lpae_cont_pmds(unsigned long size)
+{
+	if (size == SZ_2M)
+		return 16;
+	if (size == SZ_32M)
+		return 32;
+	if (size == SZ_512M)
+		return 32;
+	return 1;
+}
+
+static inline unsigned long arm_lpae_cont_pmd_size(unsigned long size)
+{
+	return arm_lpae_cont_pmds(size) * size;
+}
+
+static unsigned long arm_lpae_get_cont_sizes(struct io_pgtable_cfg *cfg)
+{
+	unsigned long pg_size, pmd_size;
+	int pg_shift, bits_per_level;
+
+	if (!cfg->pgsize_bitmap)
+		return 0;
+
+	pg_shift = __ffs(cfg->pgsize_bitmap);
+	bits_per_level = pg_shift - ilog2(sizeof(arm_lpae_iopte));
+	pg_size = (1UL << pg_shift);
+	pmd_size = (pg_size << bits_per_level);
+
+	return (arm_lpae_cont_pte_size(pg_size) | arm_lpae_cont_pmd_size(pmd_size));
+}
+
+static u32 arm_lpae_find_num_cont(struct arm_lpae_io_pgtable *data, int lvl)
+{
+	if (lvl == ARM_LPAE_MAX_LEVELS - 2)
+		return arm_lpae_cont_pmds(ARM_LPAE_BLOCK_SIZE(lvl, data));
+	else if (lvl == ARM_LPAE_MAX_LEVELS - 1)
+		return arm_lpae_cont_ptes(ARM_LPAE_BLOCK_SIZE(lvl, data));
+	else
+		return 1;
+}
+
+static u32 arm_lpae_check_num_cont(struct arm_lpae_io_pgtable *data, size_t size, int lvl)
+{
+	int num_cont;
+
+	num_cont = arm_lpae_find_num_cont(data, lvl);
+	if (size == num_cont * ARM_LPAE_BLOCK_SIZE(lvl, data))
+		return num_cont;
+	else
+		return 1;
+}
+
+static bool arm_lpae_pte_is_contiguous_range(struct arm_lpae_io_pgtable *data,
+					     unsigned long size,
+					     int lvl, u32 *num_cont)
+{
+	unsigned long block_size;
+
+	*num_cont = arm_lpae_find_num_cont(data, lvl);
+	block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+
+	return (size == ((*num_cont) * block_size));
+}
+#else
+static unsigned long arm_lpae_get_cont_sizes(struct io_pgtable_cfg *cfg)
+{
+	return 0;
+}
+
+static u32 arm_lpae_find_num_cont(struct arm_lpae_io_pgtable *data, int lvl)
+{
+	return 1;
+}
+
+static u32 arm_lpae_check_num_cont(struct arm_lpae_io_pgtable *data, size_t size, int lvl)
+{
+	return 1;
+}
+
+static bool arm_lpae_pte_is_contiguous_range(struct arm_lpae_io_pgtable *data,
+					     unsigned long size,
+					     int lvl, u32 *num_cont)
+{
+	return false;
+}
+#endif
+
 static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 			  phys_addr_t paddr, size_t size, size_t pgcount,
 			  arm_lpae_iopte prot, int lvl, arm_lpae_iopte *ptep,
@@ -463,6 +583,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 	size_t tblsz = ARM_LPAE_GRANULE(data);
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 	int ret = 0, num_entries, max_entries, map_idx_start;
+	u32 num_cont = 1;
 
 	/* Find our entry at the current level */
 	map_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
@@ -505,6 +626,24 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		return -EEXIST;
 	}
 
+	if (arm_lpae_pte_is_contiguous_range(data, size, lvl + 1, &num_cont)) {
+		size_t ct_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+
+		/* Set cont bit */
+		prot |= ARM_LPAE_PTE_CONT;
+
+		/*
+		 * Since size here would be of CONT_PTE or CONT_PMD (e.g. SZ_64K/SZ_32M
+		 * in case of 4K PAGE_SIZE), but actual mappings are in multiples of
+		 * SZ_4K/SZ_2M, call __arm_lpae_map with ct_size and update pgcount
+		 * accordingly by num_cont * pgcount.
+		 */
+		ret = __arm_lpae_map(data, iova, paddr, ct_size,
+				     num_cont * pgcount,
+				     prot, lvl + 1, cptep, gfp, mapped);
+		return ret;
+	}
+
 	/* Rinse, repeat */
 	return __arm_lpae_map(data, iova, paddr, size, pgcount, prot, lvl + 1,
 			      cptep, gfp, mapped);
@@ -653,6 +792,48 @@ static void arm_lpae_free_pgtable(struct io_pgtable *iop)
 	kfree(data);
 }
 
+#ifdef CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT
+static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
+				unsigned long iova, int lvl,
+				arm_lpae_iopte *ptep, size_t num_entries)
+{
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	u32 num_cont = arm_lpae_find_num_cont(data, lvl);
+	arm_lpae_iopte *cont_ptep;
+	arm_lpae_iopte *cont_ptep_start;
+	unsigned long cont_iova;
+	int offset, itr;
+
+	cont_ptep = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
+	cont_iova = round_down(iova,
+			       ARM_LPAE_BLOCK_SIZE(lvl, data) * num_cont);
+	cont_ptep += ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
+	cont_ptep_start = cont_ptep;
+
+	/*
+	 * iova may not be aligned to the contiguous group boundary; include
+	 * any leading entries so round_up() covers all overlapping groups.
+	 */
+	offset = ARM_LPAE_LVL_IDX(iova, lvl, data) -
+		 ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
+	num_entries = round_up(offset + num_entries, num_cont);
+
+	for (itr = 0; itr < num_entries; itr++) {
+		WRITE_ONCE(*cont_ptep, READ_ONCE(*cont_ptep) & ~ARM_LPAE_PTE_CONT);
+		cont_ptep++;
+	}
+
+	if (!cfg->coherent_walk)
+		__arm_lpae_sync_pte(cont_ptep_start, num_entries, cfg);
+}
+#else
+static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
+				unsigned long iova, int lvl,
+				arm_lpae_iopte *ptep, size_t num_entries)
+{
+}
+#endif
+
 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 			       struct iommu_iotlb_gather *gather,
 			       unsigned long iova, size_t size, size_t pgcount,
@@ -660,7 +841,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 {
 	arm_lpae_iopte pte;
 	struct io_pgtable *iop = &data->iop;
-	int i = 0, num_entries, max_entries, unmap_idx_start;
+	int i = 0, num_cont = 1, num_entries, max_entries, unmap_idx_start;
 
 	/* Something went horribly wrong and we ran out of page table */
 	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
@@ -675,9 +856,15 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 	}
 
 	/* If the size matches this level, we're in the right place */
-	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+	if (size == ARM_LPAE_BLOCK_SIZE(lvl, data) ||
+	    (size == arm_lpae_find_num_cont(data, lvl) *
+		     ARM_LPAE_BLOCK_SIZE(lvl, data))) {
+		size_t pte_size;
+
 		max_entries = arm_lpae_max_entries(unmap_idx_start, data);
-		num_entries = min_t(int, pgcount, max_entries);
+		num_cont = arm_lpae_check_num_cont(data, size, lvl);
+		num_entries = min_t(int, num_cont * pgcount, max_entries);
+		pte_size = size / num_cont;
 
 		/* Find and handle non-leaf entries */
 		for (i = 0; i < num_entries; i++) {
@@ -687,11 +874,27 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 				break;
 			}
 
+			/*
+			 * Break-Before-Make: before invalidating any leaf
+			 * entry, clear the CONT bit from every entry in the
+			 * contiguous group(s) and flush the TLB, as required
+			 * by the architecture.  arm_lpae_cont_clear() covers
+			 * the full [iova, iova + num_entries * pte_size) range
+			 * via round_up(), so subsequent entries read back
+			 * CONT=0 and skip this block.
+			 */
+			if (pte & ARM_LPAE_PTE_CONT) {
+				arm_lpae_cont_clear(data, iova, lvl, ptep, num_entries);
+				io_pgtable_tlb_flush_walk(iop, iova,
+							  num_entries * pte_size,
+							  ARM_LPAE_GRANULE(data));
+			}
+
 			if (!iopte_leaf(pte, lvl, iop->fmt)) {
 				__arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
 
 				/* Also flush any partial walks */
-				io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
+				io_pgtable_tlb_flush_walk(iop, iova + i * pte_size, pte_size,
 							  ARM_LPAE_GRANULE(data));
 				__arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
 			}
@@ -702,9 +905,9 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 
 		if (gather && !iommu_iotlb_gather_queued(gather))
 			for (int j = 0; j < i; j++)
-				io_pgtable_tlb_add_page(iop, gather, iova + j * size, size);
+				io_pgtable_tlb_add_page(iop, gather, iova + j * pte_size, pte_size);
 
-		return i * size;
+		return i * pte_size;
 	} else if (iopte_leaf(pte, lvl, iop->fmt)) {
 		WARN_ONCE(true, "Unmap of a partial large IOPTE is not allowed");
 		return 0;
@@ -943,6 +1146,7 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg)
 	}
 
 	cfg->pgsize_bitmap &= page_sizes;
+	cfg->pgsize_bitmap |= arm_lpae_get_cont_sizes(cfg);
 	cfg->ias = min(cfg->ias, max_addr_bits);
 	cfg->oas = min(cfg->oas, max_addr_bits);
 }

---
base-commit: 4fa3f5fabb30bf00d7475d5a33459ea83d639bf9
change-id: 20260618-iommu_contig_hint-71ae491fbb52

Best regards,
--  
Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com>



^ permalink raw reply related

* [PATCH 2/3] KVM: arm64: Remove unreachable early checks in pkvm_init_host_vm()
From: Fuad Tabba @ 2026-06-18  9:01 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Catalin Marinas, Will Deacon
  Cc: Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Vincent Donnefort, Keir Fraser, Hyunwoo Kim, Fuad Tabba,
	linux-arm-kernel, kvmarm, linux-kernel
In-Reply-To: <20260618090128.3913688-1-tabba@google.com>

pkvm_init_host_vm() runs once from kvm_arch_init_vm(), while the VM is
still being allocated and is not yet reachable by another thread. Both
early checks therefore test impossible state: is_created is still false
(it is only set on first vCPU run) and the handle is still zero (this
function is what reserves it). Neither branch can be taken.

Remove them.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/pkvm.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 053e4f733e4b..67b90a58fbea 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -230,13 +230,6 @@ int pkvm_init_host_vm(struct kvm *kvm, unsigned long type)
 	int ret;
 	bool protected = type & KVM_VM_TYPE_ARM_PROTECTED;
 
-	if (pkvm_hyp_vm_is_created(kvm))
-		return -EINVAL;
-
-	/* VM is already reserved, no need to proceed. */
-	if (kvm->arch.pkvm.handle)
-		return 0;
-
 	/* Reserve the VM in hyp and obtain a hyp handle for the VM. */
 	ret = kvm_call_hyp_nvhe(__pkvm_reserve_vm);
 	if (ret < 0)
-- 
2.54.0.1189.g8c84645362-goog



^ permalink raw reply related

* [PATCH 3/3] KVM: arm64: Drop redundant READ_ONCE() in pkvm_hyp_vm_is_created()
From: Fuad Tabba @ 2026-06-18  9:01 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Catalin Marinas, Will Deacon
  Cc: Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Vincent Donnefort, Keir Fraser, Hyunwoo Kim, Fuad Tabba,
	linux-arm-kernel, kvmarm, linux-kernel
In-Reply-To: <20260618090128.3913688-1-tabba@google.com>

is_created is written under config_lock. Every concurrent reader is
serialised against that write: pkvm_create_hyp_vm() under config_lock,
and the memslot path (kvm_arch_prepare_memory_region) via slots_lock,
which the creation writer also holds. The teardown-path accesses have no
concurrent writer. The read is therefore serialised, and the READ_ONCE()
is unnecessary.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/pkvm.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index 67b90a58fbea..008766273912 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -185,7 +185,11 @@ static int __pkvm_create_hyp_vm(struct kvm *kvm)
 
 bool pkvm_hyp_vm_is_created(struct kvm *kvm)
 {
-	return READ_ONCE(kvm->arch.pkvm.is_created);
+	/*
+	 * Serialised by config_lock/slots_lock, or by VM lifecycle at
+	 * teardown, so a plain read suffices.
+	 */
+	return kvm->arch.pkvm.is_created;
 }
 
 int pkvm_create_hyp_vm(struct kvm *kvm)
-- 
2.54.0.1189.g8c84645362-goog



^ permalink raw reply related

* [PATCH 1/3] KVM: arm64: Drop the unused EL2-side is_created write
From: Fuad Tabba @ 2026-06-18  9:01 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Catalin Marinas, Will Deacon
  Cc: Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Vincent Donnefort, Keir Fraser, Hyunwoo Kim, Fuad Tabba,
	linux-arm-kernel, kvmarm, linux-kernel
In-Reply-To: <20260618090128.3913688-1-tabba@google.com>

init_pkvm_hyp_vm() sets is_created on the EL2-private VM struct, but the
hypervisor never reads it: pkvm_hyp_vm_is_created() and every other
consumer operate on the host's struct kvm, a distinct allocation from
the EL2-private copy. The field is write-only at EL2.

Remove the store; host-side is_created tracking is unaffected.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/hyp/nvhe/pkvm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c
index eb1c10120f9f..30dd4b2afc26 100644
--- a/arch/arm64/kvm/hyp/nvhe/pkvm.c
+++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c
@@ -433,7 +433,6 @@ static void init_pkvm_hyp_vm(struct kvm *host_kvm, struct pkvm_hyp_vm *hyp_vm,
 	hyp_vm->host_kvm = host_kvm;
 	hyp_vm->kvm.created_vcpus = nr_vcpus;
 	hyp_vm->kvm.arch.pkvm.is_protected = READ_ONCE(host_kvm->arch.pkvm.is_protected);
-	hyp_vm->kvm.arch.pkvm.is_created = true;
 	hyp_vm->kvm.arch.flags = 0;
 	pkvm_init_features_from_host(hyp_vm, host_kvm);
 
-- 
2.54.0.1189.g8c84645362-goog



^ permalink raw reply related
