Linux-ARM-Kernel Archive on lore.kernel.org
* [PATCH 0/2] Add dynamic CSU register sysfs interface
From: Ronak Jain @ 2026-04-08 11:42 UTC
  To: michal.simek, senthilnathan.thangaraj
  Cc: linux-kernel, linux-arm-kernel, ronak.jain

This patch series adds support for exposing CSU registers through a
sysfs interface. The implementation uses dynamic discovery via the
PM_QUERY_DATA firmware API to determine available registers at
runtime, making the interface flexible and maintainable without
requiring kernel changes when firmware capabilities evolve.

Background:

The ZynqMP platform has several CSU registers that are useful for
system configuration and debugging. Previously, accessing these
registers required direct memory access or custom tools. This series
provides a standardized sysfs interface that leverages existing
firmware APIs for secure access.

Key Features:

- Dynamic register discovery using PM_QUERY_DATA API
  * PM_QID_GET_NODE_COUNT: Query number of available registers
  * PM_QID_GET_NODE_NAME: Query register names by index
- Automatic sysfs attribute creation under csu_registers/ group
- Read operations via existing IOCTL_READ_REG firmware API
- Write operations via existing IOCTL_MASK_WRITE_REG firmware API
- Firmware-enforced access control for read-only registers
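The discovery flow listed above (query the register count, then fetch each name by index) can be sketched in plain userspace C. This is a minimal sketch under stated assumptions, not the driver's code. The firmware is mocked by a local table, and `fw_query_node_count()`, `fw_query_node_name()`, `struct csu_reg` and `discover_csu_regs()` are hypothetical names that stand in for the PM_QID_GET_NODE_COUNT and PM_QID_GET_NODE_NAME queries.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the firmware's register table. In the real
 * driver these answers come from PM_QUERY_DATA, not from a local array.
 */
static const char *const fw_reg_names[] = { "multiboot", "idcode", "pcap-status" };
#define FW_NR_REGS (sizeof(fw_reg_names) / sizeof(fw_reg_names[0]))

/* Hypothetical mock of a PM_QID_GET_NODE_COUNT query. */
static int fw_query_node_count(unsigned int *count)
{
	*count = FW_NR_REGS;
	return 0;
}

/* Hypothetical mock of a PM_QID_GET_NODE_NAME query for one index. */
static int fw_query_node_name(unsigned int index, char *buf, size_t len)
{
	if (index >= FW_NR_REGS)
		return -EINVAL;
	/* A negative or too-large return from snprintf() means truncation/error. */
	if ((size_t)snprintf(buf, len, "%s", fw_reg_names[index]) >= len)
		return -ENAMETOOLONG;
	return 0;
}

struct csu_reg {
	char name[32];		/* would become the sysfs file name */
	unsigned int index;	/* firmware index used for later reads/writes */
};

/* Ask for the count first, then fetch each register name by index. */
static int discover_csu_regs(struct csu_reg *regs, unsigned int max,
			     unsigned int *nr_found)
{
	unsigned int count, i;
	int ret;

	ret = fw_query_node_count(&count);
	if (ret)
		return ret;
	if (count > max)
		return -E2BIG;

	for (i = 0; i < count; i++) {
		ret = fw_query_node_name(i, regs[i].name, sizeof(regs[i].name));
		if (ret)
			return ret;
		regs[i].index = i;
	}
	*nr_found = count;
	return 0;
}
```

Because the list of names comes entirely from the firmware answers, adding a register on the firmware side would, under this model, add a file without touching the loop.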

Currently Supported Registers:

- multiboot (CSU_MULTI_BOOT): Boot mode configuration
- idcode (CSU_IDCODE): Device identification (read-only)
- pcap-status (CSU_PCAP_STATUS): PCAP status (read-only)

The sysfs interface is available at:
  /sys/devices/platform/firmware:zynqmp-firmware/csu_registers/

Usage Examples:

Reading a register:
  # cat /sys/devices/platform/firmware:zynqmp-firmware/csu_registers/idcode

Writing a register (mask and value in hex):
  # echo "0xFFFFFFFF 0x0" > /sys/devices/platform/firmware:zynqmp-firmware/csu_registers/multiboot
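A write therefore carries a mask and a value. Assuming IOCTL_MASK_WRITE_REG follows the usual masked-write rule (only bits set in the mask are replaced by the corresponding value bits), the effect of a write can be modelled in userspace C as below. `parse_mask_value()` and `mask_write()` are hypothetical helpers for illustration, not functions from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical parser for the "<mask> <value>" string written to a
 * csu_registers/ file. "%x" accepts an optional 0x/0X prefix, and the
 * trailing newline added by echo is skipped.
 */
static int parse_mask_value(const char *buf, uint32_t *mask, uint32_t *value)
{
	unsigned int m, v;

	if (sscanf(buf, "%x %x", &m, &v) != 2)
		return -EINVAL;
	*mask = m;
	*value = v;
	return 0;
}

/* Assumed masked-write rule: only bits set in @mask take @value's bits. */
static uint32_t mask_write(uint32_t old, uint32_t mask, uint32_t value)
{
	return (old & ~mask) | (value & mask);
}
```

Under this rule a mask of 0xFFFFFFFF replaces the whole register, while a mask of 0xFFFFFFF (seven F digits) would leave the top nibble of the old value untouched.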


Testing:

- Verified register read operations return correct values
- Verified write operations update registers correctly
- Verified read-only registers reject write attempts
- Verified dynamic discovery works with different firmware versions


Ronak Jain (2):
  Documentation: ABI: add sysfs interface for ZynqMP CSU registers
  firmware: zynqmp: Add dynamic CSU register discovery and sysfs
    interface

 .../ABI/stable/sysfs-driver-firmware-zynqmp   |  33 +++
 MAINTAINERS                                   |  10 +
 drivers/firmware/xilinx/Makefile              |   2 +-
 drivers/firmware/xilinx/zynqmp-csu-reg.c      | 249 ++++++++++++++++++
 drivers/firmware/xilinx/zynqmp-csu-reg.h      |  18 ++
 drivers/firmware/xilinx/zynqmp.c              |   6 +
 include/linux/firmware/xlnx-zynqmp.h          |   4 +-
 7 files changed, 320 insertions(+), 2 deletions(-)
 create mode 100644 drivers/firmware/xilinx/zynqmp-csu-reg.c
 create mode 100644 drivers/firmware/xilinx/zynqmp-csu-reg.h

-- 
2.34.1




* [PATCH 1/2] Documentation: ABI: add sysfs interface for ZynqMP CSU registers
From: Ronak Jain @ 2026-04-08 11:42 UTC
  To: michal.simek, senthilnathan.thangaraj
  Cc: linux-kernel, linux-arm-kernel, ronak.jain
In-Reply-To: <20260408114244.2852015-1-ronak.jain@amd.com>

Document the new sysfs interface that exposes Configuration Security
Unit (CSU) registers through the zynqmp-firmware driver.

The interface is available under:

  /sys/devices/platform/firmware:zynqmp-firmware/csu_registers/

The CSU registers are discovered at boot time using the PM_QUERY_DATA
firmware API. The following registers are currently supported:

  - multiboot     (CSU_MULTI_BOOT)
  - idcode        (CSU_IDCODE, read-only)
  - pcap-status   (CSU_PCAP_STATUS, read-only)

Read operations use the existing IOCTL_READ_REG firmware interface,
while write operations use IOCTL_MASK_WRITE_REG.

Access control is enforced by the firmware. Write attempts to
read-only registers are rejected by firmware even though the sysfs file
permissions allow writes.

Document the ABI entry accordingly.

Signed-off-by: Ronak Jain <ronak.jain@amd.com>
---
 .../ABI/stable/sysfs-driver-firmware-zynqmp   | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp b/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp
index c3fec3c835af..f537f7d9bb55 100644
--- a/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp
+++ b/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp
@@ -254,3 +254,36 @@ Description:
 		The expected result is 500.
 
 Users:		Xilinx
+
+What:		/sys/devices/platform/firmware\:zynqmp-firmware/csu_registers/*
+Date:		March 2026
+KernelVersion:	7.1
+Contact:	"Ronak Jain" <ronak.jain@amd.com>
+Description:
+		Read/Write CSU (Configuration Security Unit) registers.
+
+		This interface provides dynamic access to CSU registers that are
+		discovered from the firmware at boot time using the PM_QUERY_DATA API.
+
+		The supported registers are:
+
+		- multiboot: CSU_MULTI_BOOT register
+		- idcode: CSU_IDCODE register (read-only)
+		- pcap-status: CSU_PCAP_STATUS register (read-only)
+
+		Read operations use the existing IOCTL_READ_REG API.
+		Write operations use the existing IOCTL_MASK_WRITE_REG API.
+
+		The firmware enforces access control: read-only registers reject
+		write attempts even though the sysfs permissions show write access.
+
+		Usage for reading::
+
+		    # cat /sys/devices/platform/firmware\:zynqmp-firmware/csu_registers/multiboot
+		    # cat /sys/devices/platform/firmware\:zynqmp-firmware/csu_registers/idcode
+
+		Usage for writing (mask and value are in hexadecimal)::
+
+		    # echo 0xFFFFFFFF 0x0 > /sys/devices/platform/firmware\:zynqmp-firmware/csu_registers/multiboot
+
+Users:		Xilinx/AMD
-- 
2.34.1




* [PATCH v2] KVM: arm64: Reject non compliant SMCCC function calls in pKVM
From: Sebastian Ene @ 2026-04-08 11:41 UTC
  To: catalin.marinas, kvmarm, linux-arm-kernel, linux-kernel,
	android-kvm
  Cc: joey.gouly, korneld, maz, mrigendra.chaubey, oupton, perlarsen,
	sebastianene, suzuki.poulose, will, yuzenghui

Prevent the propagation of a function-id that has any of its upper 32
bits set, since such an ID is not compliant with the SMCCC spec and can
alias function-ids already handled by the existing decoders (e.g. an SMC
with 0xffffffffc4000012 would be decoded as a PSCI reset call). Instead,
make it clear that we don't support it and return
SMCCC_RET_NOT_SUPPORTED.
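The aliasing is easiest to see by splitting the example value. The minimal userspace sketch below takes the field layout from the SMC Calling Convention (bit 31 fast call, bit 30 SMC64, bits 29:24 owning entity, bits 15:0 function number); the helper names are hypothetical. Dropping the upper word of 0xffffffffc4000012 leaves 0xc4000012, an SMC64 fast call with owning entity 4 (standard secure services, which include PSCI) and function number 0x12, the PSCI SYSTEM_RESET2 number. Any decoder that only looks at the low 32 bits would therefore accept it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SMCCC function ID fields for 32-bit function IDs. */
#define SMCCC_FAST_CALL		(1U << 31)
#define SMCCC_SMC64		(1U << 30)
#define SMCCC_OWNER_SHIFT	24
#define SMCCC_OWNER_MASK	0x3fU
#define SMCCC_FUNC_MASK		0xffffU

/* A compliant function ID fits entirely in the low 32 bits of X0. */
static bool smccc_id_is_compliant(uint64_t x0)
{
	return (x0 >> 32) == 0;
}

/* Owning entity number, bits 29:24. */
static unsigned int smccc_owner(uint32_t id)
{
	return (id >> SMCCC_OWNER_SHIFT) & SMCCC_OWNER_MASK;
}

/* Function number within the owning entity, bits 15:0. */
static unsigned int smccc_func_num(uint32_t id)
{
	return id & SMCCC_FUNC_MASK;
}
```

The `upper_32_bits(func_id)` test in the patch corresponds to `smccc_id_is_compliant()` returning false.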

Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
NOTE: This is based on linux-next (next-20260407) to avoid a minor
conflict with a previously submitted patch (commit cf6348af645b).

Changelog:

v1 -> v2:
* dropped the changes to the function signature that were accepting
  64-bit function-ids.
* applied Mark's suggestion to make it clear that we don't accept non
  standard SMCCC calls.
* revised commit message & updated the title.

Link to v1: 
https://lore.kernel.org/all/20260401123201.389906-1-sebastianene@google.com/

---
 arch/arm64/kvm/hyp/nvhe/hyp-main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 73f2e0221e70..cca4b07c8d61 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -805,6 +805,10 @@ static void handle_host_smc(struct kvm_cpu_context *host_ctxt)
 	}
 
 	func_id &= ~ARM_SMCCC_CALL_HINTS;
+	if (upper_32_bits(func_id)) {
+		cpu_reg(host_ctxt, 0) = SMCCC_RET_NOT_SUPPORTED;
+		goto exit_skip_instr;
+	}
 
 	handled = kvm_host_psci_handler(host_ctxt, func_id);
 	if (!handled)
-- 
2.53.0.1213.gd9a14994de-goog




* Re: [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag
From: Dev Jain @ 2026-04-08 11:36 UTC
  To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
	will, akpm, urezki
  Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <20260408025115.27368-8-baohua@kernel.org>



On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> For vmap(), detect pages with the same page_shift and map them in
> batches, avoiding the pgtable zigzag caused by per-page mapping.
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---

In patch 4 you eliminate the page table rewalk, in patch 5 you
reintroduce it, and in this patch you eliminate it again. So please
just squash this into #5.

>  mm/vmalloc.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6643ec0288cd..3c3b7217693a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3551,6 +3551,8 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
>  		pgprot_t prot, struct page **pages)
>  {
>  	unsigned int count = (end - addr) >> PAGE_SHIFT;
> +	unsigned int prev_shift = 0, idx = 0;
> +	unsigned long map_addr = addr;
>  	int err;
>  
>  	err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> @@ -3562,15 +3564,29 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
>  		unsigned int shift = PAGE_SHIFT +
>  			get_vmap_batch_order(pages, count - i, i);
>  
> -		err = vmap_range_noflush(addr, addr + (1UL << shift),
> -				page_to_phys(pages[i]), prot, shift);
> -		if (err)
> -			goto out;
> +		if (!i)
> +			prev_shift = shift;
> +
> +		if (shift != prev_shift) {
> +			err = vmap_small_pages_range_noflush(map_addr, addr,
> +					prot, pages + idx,
> +					min(prev_shift, PMD_SHIFT));
> +			if (err)
> +				goto out;
> +			prev_shift = shift;
> +			map_addr = addr;
> +			idx = i;
> +		}
>  
>  		addr += 1UL << shift;
>  		i += 1U << (shift - PAGE_SHIFT);
>  	}
>  
> +	/* Remaining */
> +	if (map_addr < end)
> +		err = vmap_small_pages_range_noflush(map_addr, end,
> +				prot, pages + idx, min(prev_shift, PMD_SHIFT));
> +
>  out:
>  	flush_cache_vmap(addr, end);
>  	return err;

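The coalescing in the quoted patch is a run-length grouping: walk the batches, close the current run when the shift changes, and map whatever remains after the loop. Below is a minimal userspace sketch of only that grouping step. Each array entry stands for one batch's shift (PAGE_SHIFT plus the order from get_vmap_batch_order()); `struct shift_run` and `coalesce_runs()` are hypothetical, and no page tables are touched.

```c
#include <assert.h>
#include <stddef.h>

/* One maximal run of consecutive batches that share the same shift. */
struct shift_run {
	size_t start;		/* index of the first batch in the run */
	size_t len;		/* number of batches in the run */
	unsigned int shift;	/* shift shared by every batch in the run */
};

/*
 * Group shifts[0..n) into maximal same-shift runs. @runs must have room
 * for n entries (the worst case, when every batch differs from its
 * neighbour). Returns the number of runs written.
 */
static size_t coalesce_runs(const unsigned int *shifts, size_t n,
			    struct shift_run *runs)
{
	size_t nr = 0, start = 0, i;

	for (i = 1; i <= n; i++) {
		/* Close the open run on a shift change or at the end. */
		if (i == n || shifts[i] != shifts[start]) {
			runs[nr].start = start;
			runs[nr].len = i - start;
			runs[nr].shift = shifts[start];
			nr++;
			start = i;
		}
	}
	return nr;
}
```

Each run would correspond roughly to one mapping call in the quoted loop. Folding the "remaining" flush into the last iteration is only a stylistic choice here; the patch keeps it as a separate call after the loop, which groups the batches the same way.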



* [PATCH v2 5/5] KVM: arm64: selftests: Add GICv2 IGROUPR writability test
From: David Woodhouse @ 2026-04-08 11:30 UTC
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
	Shuah Khan, David Woodhouse, Raghavendra Rao Ananta, Eric Auger,
	Kees Cook, Arnd Bergmann, Nathan Chancellor, linux-arm-kernel,
	kvmarm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <20260408113256.2095505-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Test that GICv2 IGROUPR writability is consistently gated by the IIDR
implementation revision for both guest and userspace paths:

  Default (no IIDR write): implementation_rev defaults to 3, groups
    writable from both guest and userspace.
  Rev 1: IGROUPR reads as zero (group 0), writes ignored from both
    guest and userspace.
  Rev 2: IGROUPR is writable from both guest and userspace.

This test requires GICv2 emulation support (GICv3 with GICv2 compat
CPU interface) and will be skipped on hardware without it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/arm64/vgic_group_v2.c       | 168 ++++++++++++++++++
 2 files changed, 169 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/arm64/vgic_group_v2.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index df729a70124f..878d7cb92555 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -178,6 +178,7 @@ TEST_GEN_PROGS_arm64 += arm64/vgic_init
 TEST_GEN_PROGS_arm64 += arm64/vgic_irq
 TEST_GEN_PROGS_arm64 += arm64/vgic_lpi_stress
 TEST_GEN_PROGS_arm64 += arm64/vgic_group_iidr
+TEST_GEN_PROGS_arm64 += arm64/vgic_group_v2
 TEST_GEN_PROGS_arm64 += arm64/vpmu_counter_access
 TEST_GEN_PROGS_arm64 += arm64/no-vgic-v3
 TEST_GEN_PROGS_arm64 += arm64/idreg-idst
diff --git a/tools/testing/selftests/kvm/arm64/vgic_group_v2.c b/tools/testing/selftests/kvm/arm64/vgic_group_v2.c
new file mode 100644
index 000000000000..6d4bad44bae7
--- /dev/null
+++ b/tools/testing/selftests/kvm/arm64/vgic_group_v2.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * vgic_group_v2.c - Test GICv2 IGROUPR behaviour across IIDR revisions
+ *
+ * Validate that the GICD_IIDR implementation revision controls GICv2
+ * IGROUPR writability for both guest and userspace:
+ *   Default (no IIDR write): groups writable (implementation_rev defaults to 3)
+ *   Rev 1: IGROUPR reads as zero (group 0), writes ignored
+ *   Rev 2: IGROUPR is guest and userspace configurable
+ */
+#include <linux/sizes.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "gic.h"
+#include "gic_v3.h"
+#include "vgic.h"
+
+#define NR_IRQS		64
+
+#define V2_DIST_BASE	0x8000000ULL
+#define V2_CPU_BASE	0x8010000ULL
+#define V2_DIST_GVA	((volatile void *)V2_DIST_BASE)
+
+#define SPI_IGROUPR	(GICD_IGROUPR + (32 / 32) * 4)
+
+static uint64_t shared_rev;
+static uint64_t guest_result;
+
+static void guest_code(void)
+{
+	uint32_t before, after;
+
+	before = readl(V2_DIST_GVA + SPI_IGROUPR);
+	writel(0x5a5a5a5a, V2_DIST_GVA + SPI_IGROUPR);
+	after = readl(V2_DIST_GVA + SPI_IGROUPR);
+
+	guest_result = ((uint64_t)before << 32) | after;
+	GUEST_DONE();
+}
+
+static int create_v2_gic(struct kvm_vm *vm)
+{
+	uint32_t nr_irqs = NR_IRQS;
+	uint64_t addr;
+	int gic_fd;
+
+	gic_fd = __kvm_create_device(vm, KVM_DEV_TYPE_ARM_VGIC_V2);
+	if (gic_fd < 0)
+		return gic_fd;
+
+	addr = V2_DIST_BASE;
+	kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_ADDR,
+			    KVM_VGIC_V2_ADDR_TYPE_DIST, &addr);
+	addr = V2_CPU_BASE;
+	kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_ADDR,
+			    KVM_VGIC_V2_ADDR_TYPE_CPU, &addr);
+
+	virt_map(vm, V2_DIST_BASE, V2_DIST_BASE,
+		 vm_calc_num_guest_pages(vm->mode, SZ_64K));
+	virt_map(vm, V2_CPU_BASE, V2_CPU_BASE,
+		 vm_calc_num_guest_pages(vm->mode, SZ_64K));
+
+	kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_NR_IRQS,
+			    0, &nr_irqs);
+	return gic_fd;
+}
+
+static void run_test(int set_iidr_rev)
+{
+	struct kvm_vcpu *vcpus[1];
+	struct kvm_vm *vm;
+	struct ucall uc;
+	uint32_t before, after, igroupr, iidr;
+	int gic_fd;
+	bool expect_writable;
+
+	if (set_iidr_rev >= 0)
+		pr_info("Testing GICv2 IIDR revision %d\n", set_iidr_rev);
+	else
+		pr_info("Testing GICv2 IIDR default (no write)\n");
+
+	test_disable_default_vgic();
+	vm = vm_create_with_vcpus(1, guest_code, vcpus);
+
+	gic_fd = create_v2_gic(vm);
+	TEST_REQUIRE(gic_fd >= 0);
+
+	if (set_iidr_rev >= 0) {
+		kvm_device_attr_get(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+				    GICD_IIDR, &iidr);
+		iidr &= ~GICD_IIDR_REVISION_MASK;
+		iidr |= set_iidr_rev << GICD_IIDR_REVISION_SHIFT;
+		kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+				    GICD_IIDR, &iidr);
+	}
+
+	kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_CTRL,
+			    KVM_DEV_ARM_VGIC_CTRL_INIT, NULL);
+
+	/*
+	 * Default (no IIDR write) gets implementation_rev=3 from vgic_init(),
+	 * so groups should be writable. Rev 1 = not writable. Rev 2+ = writable.
+	 */
+	expect_writable = (set_iidr_rev != 1);
+
+	/* Test userspace IGROUPR write */
+	igroupr = 0xa5a5a5a5;
+	kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+			    SPI_IGROUPR, &igroupr);
+	igroupr = 0;
+	kvm_device_attr_get(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+			    SPI_IGROUPR, &igroupr);
+
+	if (expect_writable)
+		TEST_ASSERT(igroupr == 0xa5a5a5a5,
+			    "Userspace write should succeed: got 0x%08x", igroupr);
+	else
+		TEST_ASSERT(igroupr == 0x00000000,
+			    "Userspace write should be ignored: got 0x%08x", igroupr);
+
+	/* Reset IGROUPR to 0 via userspace for rev 2+ before guest test */
+	if (expect_writable) {
+		igroupr = 0;
+		kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+				    SPI_IGROUPR, &igroupr);
+	}
+
+	/* Test guest IGROUPR write */
+	sync_global_to_guest(vm, guest_result);
+	vcpu_run(vcpus[0]);
+
+	switch (get_ucall(vcpus[0], &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+		break;
+	case UCALL_DONE:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall %lu", uc.cmd);
+	}
+
+	sync_global_from_guest(vm, guest_result);
+	before = guest_result >> 32;
+	after = guest_result & 0xffffffff;
+
+	TEST_ASSERT(before == 0x00000000,
+		    "Initial IGROUPR should be 0 (group 0): got 0x%08x", before);
+
+	if (expect_writable)
+		TEST_ASSERT(after == 0x5a5a5a5a,
+			    "Guest write should succeed: got 0x%08x", after);
+	else
+		TEST_ASSERT(after == 0x00000000,
+			    "Guest write should be ignored: got 0x%08x", after);
+
+	close(gic_fd);
+	kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	run_test(-1);  /* default */
+	run_test(1);   /* rev 1 */
+	run_test(2);   /* rev 2 */
+	return 0;
+}
-- 
2.51.0




* [PATCH v2 4/5] KVM: arm64: vgic: Remove v2_groups_user_writable and use IIDR revision directly
From: David Woodhouse @ 2026-04-08 11:30 UTC
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
	Shuah Khan, David Woodhouse, Raghavendra Rao Ananta, Eric Auger,
	Kees Cook, Arnd Bergmann, Nathan Chancellor, linux-arm-kernel,
	kvmarm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <20260408113256.2095505-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

The v2_groups_user_writable flag was introduced to gate GICv2 userspace
IGROUPR writes until userspace explicitly wrote the IIDR, signalling
awareness of the group semantics. However, the guest write path through
vgic_mmio_write_group() was never gated by this flag, allowing a GICv2
guest to modify interrupt groups regardless of whether userspace had
opted in.

Rather than adding the same flag check to the guest path, remove the
flag entirely and make both guest and userspace IGROUPR writability
follow the IIDR implementation revision directly. Groups are writable
when the revision is >= 2, which is the case when userspace explicitly
sets the IIDR to revision 2 or 3. When userspace does not write the
IIDR, vgic_init() defaults to KVM_VGIC_IMP_REV_LATEST (currently 3),
so the behaviour is unchanged for userspace that doesn't set the IIDR.

This also fixes the inconsistency where a GICv2 guest could write
IGROUPR even when the IIDR had not been explicitly set by userspace.

Fixes: d53c2c29ae0d ("KVM: arm/arm64: vgic: Allow configuration of interrupt groups")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/arm64/kvm/vgic/vgic-mmio-v2.c | 16 +++++-----------
 include/kvm/arm_vgic.h             |  3 ---
 2 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v2.c b/arch/arm64/kvm/vgic/vgic-mmio-v2.c
index e5714f7fd2ec..e5fc673a1ea9 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v2.c
@@ -84,21 +84,15 @@ static int vgic_mmio_uaccess_write_v2_misc(struct kvm_vcpu *vcpu,
 			return -EINVAL;
 
 		/*
-		 * If we observe a write to GICD_IIDR we know that userspace
-		 * has been updated and has had a chance to cope with older
-		 * kernels (VGICv2 IIDR.Revision == 0) incorrectly reporting
-		 * interrupts as group 1, and therefore we now allow groups to
-		 * be user writable.  Doing this by default would break
-		 * migration from old kernels to new kernels with legacy
-		 * userspace.
+		 * Allow userspace to select the GICv2 IIDR revision.
+		 * Group writability follows the revision directly:
+		 * groups are guest/user writable for revision >= 2.
 		 */
 		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, val);
 		switch (reg) {
+		case KVM_VGIC_IMP_REV_1:
 		case KVM_VGIC_IMP_REV_2:
 		case KVM_VGIC_IMP_REV_3:
-			vcpu->kvm->arch.vgic.v2_groups_user_writable = true;
-			fallthrough;
-		case KVM_VGIC_IMP_REV_1:
 			dist->implementation_rev = reg;
 			return 0;
 		default:
@@ -114,7 +108,7 @@ static int vgic_mmio_uaccess_write_v2_group(struct kvm_vcpu *vcpu,
 					    gpa_t addr, unsigned int len,
 					    unsigned long val)
 {
-	if (vcpu->kvm->arch.vgic.v2_groups_user_writable)
+	if (vgic_get_implementation_rev(vcpu) >= KVM_VGIC_IMP_REV_2)
 		vgic_mmio_write_group(vcpu, addr, len, val);
 
 	return 0;
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 90fb6cd3c91c..cdfab2c20877 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -253,9 +253,6 @@ struct vgic_dist {
 #define KVM_VGIC_IMP_REV_3	3 /* GICv3 GICR_CTLR.{IW,CES,RWP} */
 #define KVM_VGIC_IMP_REV_LATEST	KVM_VGIC_IMP_REV_3
 
-	/* Userspace can write to GICv2 IGROUPR */
-	bool			v2_groups_user_writable;
-
 	/* Do injected MSIs require an additional device ID? */
 	bool			msis_require_devid;
 
-- 
2.51.0




* [PATCH v2 2/5] KVM: arm64: vgic: Allow userspace to set IIDR revision 1
From: David Woodhouse @ 2026-04-08 11:30 UTC
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
	Shuah Khan, David Woodhouse, Raghavendra Rao Ananta, Eric Auger,
	Kees Cook, Arnd Bergmann, Nathan Chancellor, linux-arm-kernel,
	kvmarm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <20260408113256.2095505-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Allow userspace to select GICD_IIDR revision 1, which restores the
original pre-d53c2c29ae0d ("KVM: arm/arm64: vgic: Allow configuration
of interrupt groups") behaviour where interrupt groups are not
guest-configurable.

When revision 1 is selected:
 - GICv2: IGROUPR reads as zero (group 0), writes are ignored
 - GICv3: IGROUPR reads as all-ones (group 1), writes are ignored
 - v2_groups_user_writable is not set

This is implemented by checking the implementation revision in
vgic_mmio_write_group() and suppressing writes when the revision is
below 2. The read side needs no change since the per-IRQ group reset
values already match the expected behaviour.

Note that d53c2c29ae0d wired guest IGROUPR writes directly to
vgic_mmio_write_group() without any revision check, while only gating
the userspace write path via v2_groups_user_writable. This meant a
guest could modify interrupt groups even at revision 1, which was
never intended. The write_group revision check fixes both the guest
and GICv3 userspace paths.

Fixes: d53c2c29ae0d ("KVM: arm/arm64: vgic: Allow configuration of interrupt groups")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/arm64/kvm/vgic/vgic-mmio-v2.c | 3 +++
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 4 ++++
 arch/arm64/kvm/vgic/vgic-mmio.c    | 4 ++++
 include/kvm/arm_vgic.h             | 1 +
 4 files changed, 12 insertions(+)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v2.c b/arch/arm64/kvm/vgic/vgic-mmio-v2.c
index 0643e333db35..e5714f7fd2ec 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v2.c
@@ -20,6 +20,7 @@
  * Revision 1: Report GICv2 interrupts as group 0 instead of group 1
  * Revision 2: Interrupt groups are guest-configurable and signaled using
  * 	       their configured groups.
+ * Revision 3: GICv2 behaviour is unchanged from revision 2.
  */
 
 static unsigned long vgic_mmio_read_v2_misc(struct kvm_vcpu *vcpu,
@@ -96,6 +97,8 @@ static int vgic_mmio_uaccess_write_v2_misc(struct kvm_vcpu *vcpu,
 		case KVM_VGIC_IMP_REV_2:
 		case KVM_VGIC_IMP_REV_3:
 			vcpu->kvm->arch.vgic.v2_groups_user_writable = true;
+			fallthrough;
+		case KVM_VGIC_IMP_REV_1:
 			dist->implementation_rev = reg;
 			return 0;
 		default:
diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 5913a20d8301..0130db71cfc9 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -74,8 +74,11 @@ bool vgic_supports_direct_sgis(struct kvm *kvm)
 /*
  * The Revision field in the IIDR have the following meanings:
  *
+ * Revision 1: Interrupt groups are not guest-configurable.
+ * 	       IGROUPR reads as all-ones (group 1), writes ignored.
  * Revision 2: Interrupt groups are guest-configurable and signaled using
  * 	       their configured groups.
+ * Revision 3: GICR_CTLR.{IR,CES} are advertised.
  */
 
 static unsigned long vgic_mmio_read_v3_misc(struct kvm_vcpu *vcpu,
@@ -196,6 +199,7 @@ static int vgic_mmio_uaccess_write_v3_misc(struct kvm_vcpu *vcpu,
 
 		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, val);
 		switch (reg) {
+		case KVM_VGIC_IMP_REV_1:
 		case KVM_VGIC_IMP_REV_2:
 		case KVM_VGIC_IMP_REV_3:
 			dist->implementation_rev = reg;
diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c
index a573b1f0c6cb..4fbe0ad22adf 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio.c
@@ -73,6 +73,10 @@ void vgic_mmio_write_group(struct kvm_vcpu *vcpu, gpa_t addr,
 	int i;
 	unsigned long flags;
 
+	/* Revision 1 and below: groups are not guest-configurable. */
+	if (vgic_get_implementation_rev(vcpu) < KVM_VGIC_IMP_REV_2)
+		return;
+
 	for (i = 0; i < len * 8; i++) {
 		struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, intid + i);
 
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index f2eafc65bbf4..90fb6cd3c91c 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -248,6 +248,7 @@ struct vgic_dist {
 
 	/* Implementation revision as reported in the GICD_IIDR */
 	u32			implementation_rev;
+#define KVM_VGIC_IMP_REV_1	1 /* GICv2 interrupts as group 0 */
 #define KVM_VGIC_IMP_REV_2	2 /* GICv2 restorable groups */
 #define KVM_VGIC_IMP_REV_3	3 /* GICv3 GICR_CTLR.{IW,CES,RWP} */
 #define KVM_VGIC_IMP_REV_LATEST	KVM_VGIC_IMP_REV_3
-- 
2.51.0




* [PATCH v2 0/5] KVM: arm64: vgic: Fix IIDR revision handling and add revision 1
From: David Woodhouse @ 2026-04-08 11:30 UTC
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
	Shuah Khan, David Woodhouse, Raghavendra Rao Ananta, Eric Auger,
	Kees Cook, Arnd Bergmann, Nathan Chancellor, linux-arm-kernel,
	kvmarm, linux-kernel, kvm, linux-kselftest

The uaccess write handlers for GICD_IIDR extract the revision field
from the wrong variable, making it impossible for userspace to actually
change the implementation revision. Fix that.

Additionally, allow userspace to select IIDR revision 1, restoring the
behaviour from before commit d53c2c29ae0d ("KVM: arm/arm64: vgic: Allow
configuration of interrupt groups") where interrupt groups are not
guest-configurable. This is needed by hypervisors that were reverting
that commit to preserve the original guest-visible semantics, and to
allow for a safely controlled deployment of the new behaviour.

For GICv2, kill the v2_groups_user_writable flag and make the behaviour
depend directly on the IIDR. The existing default behaviour, which sets
the IIDR to revision 3 and allows the groups to be written by the
*guest* but not by userspace, was just weird and almost certainly not
intentional. (New in v2 posting.)

Tested on Graviton 3 (Neoverse-V1) metal for GICv3 selftests, and
under QEMU TCG with GICv2 emulation for GICv2 selftests.

v2:
 • Fixed -Wdiscarded-qualifiers warning from 0-day bot.
 • Remove GICv2 v2_groups_user_writable flag and just use IIDR.
 • Address Marc's review feedback (no special cases in read_group,
   other minor cleanups).

v1: https://lore.kernel.org/all/20260407210949.2076251-1-dwmw2@infradead.org/

David Woodhouse (5):
      KVM: arm64: vgic: Fix IIDR revision field extracted from wrong value
      KVM: arm64: vgic: Allow userspace to set IIDR revision 1
      KVM: arm64: selftests: Add vgic IIDR revision test
      KVM: arm64: vgic: Remove v2_groups_user_writable and use IIDR revision directly
      KVM: arm64: selftests: Add GICv2 IGROUPR writability test

 arch/arm64/kvm/vgic/vgic-mmio-v2.c                 |  18 +--
 arch/arm64/kvm/vgic/vgic-mmio-v3.c                 |   6 +-
 arch/arm64/kvm/vgic/vgic-mmio.c                    |   4 +
 include/kvm/arm_vgic.h                             |   4 +-
 tools/testing/selftests/kvm/Makefile.kvm           |   2 +
 .../testing/selftests/kvm/arm64/vgic_group_iidr.c  | 118 +++++++++++++++
 tools/testing/selftests/kvm/arm64/vgic_group_v2.c  | 168 +++++++++++++++++++++
 7 files changed, 306 insertions(+), 14 deletions(-)




* [PATCH v2 1/5] KVM: arm64: vgic: Fix IIDR revision field extracted from wrong value
From: David Woodhouse @ 2026-04-08 11:30 UTC
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
	Shuah Khan, David Woodhouse, Raghavendra Rao Ananta, Eric Auger,
	Kees Cook, Arnd Bergmann, Nathan Chancellor, linux-arm-kernel,
	kvmarm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <20260408113256.2095505-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

The uaccess write handlers for GICD_IIDR in both GICv2 and GICv3
extract the revision field from 'reg' (the current IIDR value read back
from the emulated distributor) instead of 'val' (the value userspace is
trying to write). This means userspace can never actually change the
implementation revision — the extracted value is always the current one.

Fix the FIELD_GET to use 'val' so that userspace can select a different
revision for migration compatibility.
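Why the source operand matters can be shown with a small userspace model of the corrected handler. This is a minimal sketch under assumptions: the Revision field is taken to occupy GICD_IIDR bits [15:12] (matching the kernel's GICD_IIDR_REVISION_SHIFT/MASK), unsupported revisions are assumed to be rejected with -EINVAL, and revisions 1 to 3 are accepted as the full series proposes. `iidr_revision()` and `iidr_uaccess_write()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed GICD_IIDR.Revision position: bits [15:12]. */
#define IIDR_REVISION_SHIFT	12
#define IIDR_REVISION_MASK	(0xfU << IIDR_REVISION_SHIFT)

/* Userspace stand-in for FIELD_GET() on the Revision field. */
static unsigned int iidr_revision(uint32_t iidr)
{
	return (iidr & IIDR_REVISION_MASK) >> IIDR_REVISION_SHIFT;
}

/*
 * Model of the fixed uaccess IIDR write: only the Revision field may
 * differ from the current value @cur, and the new revision is taken
 * from the value being written (@val), not from @cur.
 */
static int iidr_uaccess_write(uint32_t cur, uint32_t val, unsigned int *rev)
{
	unsigned int new_rev;

	if ((cur ^ val) & ~IIDR_REVISION_MASK)
		return -EINVAL;

	new_rev = iidr_revision(val);
	switch (new_rev) {
	case 1:
	case 2:
	case 3:
		*rev = new_rev;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Extracting from `cur` instead, as the old code did, always yields the revision already in effect, so the switch could never select a different one.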

Fixes: 49a1a2c70a7f ("KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/arm64/kvm/vgic/vgic-mmio-v2.c | 2 +-
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v2.c b/arch/arm64/kvm/vgic/vgic-mmio-v2.c
index 406845b3117c..0643e333db35 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v2.c
@@ -91,7 +91,7 @@ static int vgic_mmio_uaccess_write_v2_misc(struct kvm_vcpu *vcpu,
 		 * migration from old kernels to new kernels with legacy
 		 * userspace.
 		 */
-		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, reg);
+		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, val);
 		switch (reg) {
 		case KVM_VGIC_IMP_REV_2:
 		case KVM_VGIC_IMP_REV_3:
diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 89edb84d1ac6..5913a20d8301 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -194,7 +194,7 @@ static int vgic_mmio_uaccess_write_v3_misc(struct kvm_vcpu *vcpu,
 		if ((reg ^ val) & ~GICD_IIDR_REVISION_MASK)
 			return -EINVAL;
 
-		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, reg);
+		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, val);
 		switch (reg) {
 		case KVM_VGIC_IMP_REV_2:
 		case KVM_VGIC_IMP_REV_3:
-- 
2.51.0




* [PATCH v2 3/5] KVM: arm64: selftests: Add vgic IIDR revision test
From: David Woodhouse @ 2026-04-08 11:30 UTC
  To: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
	Shuah Khan, David Woodhouse, Raghavendra Rao Ananta, Eric Auger,
	Kees Cook, Arnd Bergmann, Nathan Chancellor, linux-arm-kernel,
	kvmarm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <20260408113256.2095505-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Test that the GICD_IIDR implementation revision correctly controls
guest-visible behaviour for GICv3:

  Revision 1: IGROUPR reads as all-ones (group 1), writes are ignored.
              GICR_CTLR.{IR,CES} not advertised.
  Revision 2: IGROUPR is guest-configurable (read/write).
              GICR_CTLR.{IR,CES} not advertised.
  Revision 3: IGROUPR is guest-configurable (read/write).
              GICR_CTLR.{IR,CES} advertised.

For each revision, the test sets the IIDR via KVM_DEV_ARM_VGIC_GRP_DIST_REGS
before initializing the vGIC, then runs a guest that verifies the
expected IGROUPR and GICR_CTLR behaviour.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/arm64/vgic_group_iidr.c     | 118 ++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/arm64/vgic_group_iidr.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 6471fa214a9f..df729a70124f 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -177,6 +177,7 @@ TEST_GEN_PROGS_arm64 += arm64/vcpu_width_config
 TEST_GEN_PROGS_arm64 += arm64/vgic_init
 TEST_GEN_PROGS_arm64 += arm64/vgic_irq
 TEST_GEN_PROGS_arm64 += arm64/vgic_lpi_stress
+TEST_GEN_PROGS_arm64 += arm64/vgic_group_iidr
 TEST_GEN_PROGS_arm64 += arm64/vpmu_counter_access
 TEST_GEN_PROGS_arm64 += arm64/no-vgic-v3
 TEST_GEN_PROGS_arm64 += arm64/idreg-idst
diff --git a/tools/testing/selftests/kvm/arm64/vgic_group_iidr.c b/tools/testing/selftests/kvm/arm64/vgic_group_iidr.c
new file mode 100644
index 000000000000..0073ccc19e92
--- /dev/null
+++ b/tools/testing/selftests/kvm/arm64/vgic_group_iidr.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * vgic_group_iidr.c - Test IGROUPR behaviour across IIDR revisions
+ *
+ * Validate that the GICD_IIDR implementation revision controls
+ * IGROUPR semantics (and, for Rev 3, GICR_CTLR.{IR,CES}) for GICv3:
+ *   Rev 1: IGROUPR reads as all-ones (group 1), writes ignored
+ *   Rev 2+: IGROUPR is guest-configurable (read/write)
+ */
+#include <linux/sizes.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "gic.h"
+#include "gic_v3.h"
+#include "vgic.h"
+
+#define NR_IRQS		128
+#define SPI_IGROUPR	(GICD_IGROUPR + (32 / 32) * 4) /* intids 32-63 */
+
+static uint64_t shared_rev;
+
+static void guest_code(void)
+{
+	uint32_t val;
+
+	val = readl(GICD_BASE_GVA + SPI_IGROUPR);
+
+	if (shared_rev == 1) {
+		/* Rev 1: all group 1, guest writes must be ignored */
+		GUEST_ASSERT_EQ(val, 0xffffffff);
+		writel(0x0, GICD_BASE_GVA + SPI_IGROUPR);
+		val = readl(GICD_BASE_GVA + SPI_IGROUPR);
+		GUEST_ASSERT_EQ(val, 0xffffffff);
+		writel(0x55aa55aa, GICD_BASE_GVA + SPI_IGROUPR);
+		val = readl(GICD_BASE_GVA + SPI_IGROUPR);
+		GUEST_ASSERT_EQ(val, 0xffffffff);
+	} else {
+		/* Rev 2/3: guest-configurable */
+		writel(0xa5a5a5a5, GICD_BASE_GVA + SPI_IGROUPR);
+		val = readl(GICD_BASE_GVA + SPI_IGROUPR);
+		GUEST_ASSERT_EQ(val, 0xa5a5a5a5);
+		writel(0x0, GICD_BASE_GVA + SPI_IGROUPR);
+		val = readl(GICD_BASE_GVA + SPI_IGROUPR);
+		GUEST_ASSERT_EQ(val, 0x0);
+	}
+
+	/* Rev 3: GICR_CTLR advertises IR and CES. Rev 1/2: it does not. */
+	val = readl(GICR_BASE_GVA + GICR_CTLR);
+	if (shared_rev >= 3)
+		GUEST_ASSERT_EQ(val & (GICR_CTLR_IR | GICR_CTLR_CES), GICR_CTLR_IR | GICR_CTLR_CES);
+	else
+		GUEST_ASSERT(!(val & (GICR_CTLR_IR | GICR_CTLR_CES)));
+
+	GUEST_DONE();
+}
+
+static void run_test(int rev)
+{
+	struct kvm_vcpu *vcpus[1];
+	struct kvm_vm *vm;
+	struct ucall uc;
+	uint32_t iidr;
+	int gic_fd;
+
+	pr_info("Testing IIDR revision %d\n", rev);
+
+	test_disable_default_vgic();
+	vm = vm_create_with_vcpus(1, guest_code, vcpus);
+
+	gic_fd = __vgic_v3_setup(vm, 1, NR_IRQS);
+	TEST_ASSERT(gic_fd >= 0, "Failed to create vGICv3");
+
+	/* Set the requested IIDR revision before init. */
+	kvm_device_attr_get(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+			    GICD_IIDR, &iidr);
+	iidr &= ~GICD_IIDR_REVISION_MASK;
+	iidr |= rev << GICD_IIDR_REVISION_SHIFT;
+	kvm_device_attr_set(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+			    GICD_IIDR, &iidr);
+
+	__vgic_v3_init(gic_fd);
+
+	/* Verify the revision was applied. */
+	kvm_device_attr_get(gic_fd, KVM_DEV_ARM_VGIC_GRP_DIST_REGS,
+			    GICD_IIDR, &iidr);
+	TEST_ASSERT(((iidr & GICD_IIDR_REVISION_MASK) >> GICD_IIDR_REVISION_SHIFT) == rev,
+		    "IIDR revision readback: expected %d, got %d",
+		    rev, (iidr & GICD_IIDR_REVISION_MASK) >> GICD_IIDR_REVISION_SHIFT);
+
+	/* Tell the guest which revision we set. */
+	sync_global_to_guest(vm, shared_rev);
+	shared_rev = rev;
+	sync_global_to_guest(vm, shared_rev);
+
+	vcpu_run(vcpus[0]);
+	switch (get_ucall(vcpus[0], &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+		break;
+	case UCALL_DONE:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall %lu", uc.cmd);
+	}
+
+	close(gic_fd);
+	kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	run_test(1);
+	run_test(2);
+	run_test(3);
+	return 0;
+}
-- 
2.51.0

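For readers following the IIDR handling, the sketch below models in user space the revision read-modify-write that run_test() performs before __vgic_v3_init(), and the FIELD_GET() fix in the first patch hunk, which decodes the revision from the value being written (`val`) rather than from the current register (`reg`). It assumes the revision field is GICD_IIDR bits [15:12], matching GICD_IIDR_REVISION_SHIFT/MASK in the GICv3 headers; the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed to mirror the GICv3 header definitions: Revision is IIDR[15:12]. */
#define GICD_IIDR_REVISION_SHIFT	12
#define GICD_IIDR_REVISION_MASK		(0xfU << GICD_IIDR_REVISION_SHIFT)

/* Hypothetical helper: extract the implementation revision from an IIDR value. */
static uint32_t iidr_get_revision(uint32_t iidr)
{
	return (iidr & GICD_IIDR_REVISION_MASK) >> GICD_IIDR_REVISION_SHIFT;
}

/* Hypothetical helper: replace only the revision field, keeping all other bits. */
static uint32_t iidr_set_revision(uint32_t iidr, uint32_t rev)
{
	iidr &= ~GICD_IIDR_REVISION_MASK;
	iidr |= (rev << GICD_IIDR_REVISION_SHIFT) & GICD_IIDR_REVISION_MASK;
	return iidr;
}
```

Decoding `val` is the right choice because the preceding `(reg ^ val) & ~GICD_IIDR_REVISION_MASK` check only guarantees that the two values agree outside the revision field; the revision userspace is asking for exists only in `val`.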

* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
From: Dev Jain @ 2026-04-08 11:22 UTC
  To: Barry Song
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <CAGsJ_4xCtFe=5ofj4FW6cqu-fgR+K9BM7FPZRdAWOGP3YKtNzQ@mail.gmail.com>



On 08/04/26 10:42 am, Barry Song wrote:
> On Wed, Apr 8, 2026 at 12:20 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
>>> In many cases, the pages passed to vmap() may include high-order
>>> pages allocated with __GFP_COMP flags. For example, the system heap
>>> often allocates pages in descending order: order 8, then 4, then 0.
>>> Currently, vmap() iterates over every page individually—even pages
>>> inside a high-order block are handled one by one.
>>>
>>> This patch detects high-order pages and maps them as a single
>>> contiguous block whenever possible.
>>>
>>> An alternative would be to implement a new API, vmap_sg(), but that
>>> change seems to be large in scope.
>>>
>>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>>> ---
>>
>> Coincidentally, I was working on the same thing :)
> 
> Interesting, thanks — at least I’ve got one good reviewer :-)
> 
>>
>> We have a use case regarding Arm TRBE and SPE aux buffers.
>>
>> I'll take a look at your patches later, but my implementation is the
> 
> Yes. Please.
> 
> 
>> following, if you have any comments. I have squashed the patches into
>> a single diff.
> 
> Thanks very much, Dev. What you’ve done is quite similar to
> patches 5/8 and 6/8, although the code differs somewhat.
> 
>>
>>
>>
>> From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
>> From: Dev Jain <dev.jain@arm.com>
>> Date: Thu, 26 Feb 2026 16:21:29 +0530
>> Subject: [PATCH] arm64/perf: map AUX buffer with large pages
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  .../hwtracing/coresight/coresight-etm-perf.c  |  3 +-
>>  drivers/hwtracing/coresight/coresight-trbe.c  |  3 +-
>>  drivers/perf/arm_spe_pmu.c                    |  5 +-
>>  mm/vmalloc.c                                  | 86 ++++++++++++++++---
>>  4 files changed, 79 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
>> index 72017dcc3b7f1..e90a430af86bb 100644
>> --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
>> +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
>> @@ -984,7 +984,8 @@ int __init etm_perf_init(void)
>>
>>         etm_pmu.capabilities            = (PERF_PMU_CAP_EXCLUSIVE |
>>                                            PERF_PMU_CAP_ITRACE |
>> -                                          PERF_PMU_CAP_AUX_PAUSE);
>> +                                          PERF_PMU_CAP_AUX_PAUSE |
>> +                                          PERF_PMU_CAP_AUX_PREFER_LARGE);
>>
>>         etm_pmu.attr_groups             = etm_pmu_attr_groups;
>>         etm_pmu.task_ctx_nr             = perf_sw_context;
>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
>> index 1511f8eb95afb..74e6ad891e236 100644
>> --- a/drivers/hwtracing/coresight/coresight-trbe.c
>> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
>> @@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
>>         for (i = 0; i < nr_pages; i++)
>>                 pglist[i] = virt_to_page(pages[i]);
>>
>> -       buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
>> +       buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
>> +                        VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
>>         if (!buf->trbe_base) {
>>                 kfree(pglist);
>>                 kfree(buf);
>> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
>> index dbd0da1116390..90c349fd66b2c 100644
>> --- a/drivers/perf/arm_spe_pmu.c
>> +++ b/drivers/perf/arm_spe_pmu.c
>> @@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
>>         for (i = 0; i < nr_pages; ++i)
>>                 pglist[i] = virt_to_page(pages[i]);
>>
>> -       buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
>> +       buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
>>         if (!buf->base)
>>                 goto out_free_pglist;
>>
>> @@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
>>         spe_pmu->pmu = (struct pmu) {
>>                 .module = THIS_MODULE,
>>                 .parent         = &spe_pmu->pdev->dev,
>> -               .capabilities   = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
>> +               .capabilities   = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
>> +                                 PERF_PMU_CAP_AUX_PREFER_LARGE,
>>                 .attr_groups    = arm_spe_pmu_attr_groups,
>>                 /*
>>                  * We hitch a ride on the software context here, so that
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 61caa55a44027..8482463d41203 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>                 pgprot_t prot, struct page **pages, unsigned int page_shift)
>>  {
>>         unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>> -
>> +       unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
>>         WARN_ON(page_shift < PAGE_SHIFT);
>>
>>         if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>>                         page_shift == PAGE_SHIFT)
>>                 return vmap_small_pages_range_noflush(addr, end, prot, pages);
>>
>> -       for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>> +       for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
>>                 int err;
>>
>>                 err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>> @@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>
>>                 addr += 1UL << page_shift;
>>         }
>> -
>> -       return 0;
>> +       if (IS_ALIGNED(nr, step))
>> +               return 0;
>> +       return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
>>  }
>>
>>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> @@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
>>  }
>>  EXPORT_SYMBOL(vunmap);
>>
>> +static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
>> +{
>> +       if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>> +               return PMD_SHIFT;
>> +
>> +       return arch_vmap_pte_supported_shift(size);
>> +}
>> +
>> +static inline int __vmap_huge(struct page **pages, pgprot_t prot,
>> +               unsigned long addr, unsigned int count)
>> +{
>> +       unsigned int i = 0;
>> +       unsigned int shift;
>> +       unsigned long nr;
>> +
>> +       while (i < count) {
>> +               nr = num_pages_contiguous(pages + i, count - i);
>> +               shift = vm_shift(prot, nr << PAGE_SHIFT);
>> +               if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
>> +                                    pgprot_nx(prot), pages + i, shift) < 0) {
>> +                       return 1;
>> +               }
> 
> One observation on my side is that the performance gain is somewhat
> offset by page table zigzagging caused by what you are doing here:
> iterating over each memory segment with vmap_pages_range().

I recall observing this problem about half a year back, and I wrote
code similar to what you did in patch 3, but I didn't observe any
performance improvement. I think that was because I was testing
vmalloc, where most of the cost lies in the page allocation.

So it looks like this is indeed a benefit for vmap.

> 
> In patch 3/8, I enhanced vmap_small_pages_range_noflush() to
> avoid repeated pgd → p4d → pud → pmd → pte traversals for page
> shifts other than PAGE_SHIFT. This improves performance for
> vmalloc as well as vmap(). Then, in patch 7/8, I adopt the new
> vmap_small_pages_range_noflush() and eliminate the iteration.
> 
>> +               i += nr;
>> +               addr += (nr << PAGE_SHIFT);
>> +       }
>> +       return 0;
>> +}
>> +
>> +static unsigned long max_contiguous_stride_order(struct page **pages,
>> +               pgprot_t prot, unsigned int count)
>> +{
>> +       unsigned long max_shift = PAGE_SHIFT;
>> +       unsigned int i = 0;
>> +
>> +       while (i < count) {
>> +               unsigned long nr = num_pages_contiguous(pages + i, count - i);
>> +               unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
>> +
>> +               max_shift = max(max_shift, shift);
>> +               i += nr;
>> +       }
>> +       return max_shift;
>> +}
>> +
>>  /**
>>   * vmap - map an array of pages into virtually contiguous space
>>   * @pages: array of page pointers
>> @@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
>>                 return NULL;
>>
>>         size = (unsigned long)count << PAGE_SHIFT;
>> -       area = get_vm_area_caller(size, flags, __builtin_return_address(0));
>> +       if (flags & VM_ALLOW_HUGE_VMAP) {
>> +               /* determine from page array, the max alignment */
>> +               unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
>> +
>> +               area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
>> +                                         VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
>> +                                         GFP_KERNEL, __builtin_return_address(0));
>> +       } else {
>> +               area = get_vm_area_caller(size, flags, __builtin_return_address(0));
>> +       }
>>         if (!area)
>>                 return NULL;
>>
>>         addr = (unsigned long)area->addr;
>> -       if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>> -                               pages, PAGE_SHIFT) < 0) {
>> -               vunmap(area->addr);
>> -               return NULL;
>> +
>> +       if (flags & VM_ALLOW_HUGE_VMAP) {
>> +               if (__vmap_huge(pages, prot, addr, count)) {
>> +                       vunmap(area->addr);
>> +                       return NULL;
>> +               }
>> +       } else {
>> +               if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>> +                                       pages, PAGE_SHIFT) < 0) {
>> +                       vunmap(area->addr);
>> +                       return NULL;
>> +               }
>>         }
>>
>>         if (flags & VM_MAP_PUT_PAGES) {
>> @@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
>>                  * their allocations due to apply_to_page_range not
>>                  * supporting them.
>>                  */
>> -
>> -               if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>> -                       shift = PMD_SHIFT;
>> -               else
>> -                       shift = arch_vmap_pte_supported_shift(size);
>> +               shift = vm_shift(prot, size);
> 
> What I actually did is different. In patches 1/8 and 2/8, I
> extended the arm64 levels to support N * CONT_PTE, and let the
> final PTE mapping use the maximum possible batch after avoiding
> zigzag. This further improves all orders greater than CONT_PTE.
> 
> Thanks
> Barry

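To make the batching idea discussed in this thread concrete, here is a minimal user-space sketch that splits a page array (modelled as an array of PFNs) into physically contiguous runs and picks a mapping shift for each run, in the spirit of num_pages_contiguous() and the vm_shift() helper in the quoted diff. It assumes arm64 with 4 KiB pages, 64 KiB contiguous-PTE blocks and 2 MiB PMDs, assumes PMD mappings are allowed, and deliberately ignores the virtual/physical alignment checks the real code also needs; all names are hypothetical.

```c
#include <stddef.h>

/* Assumed arm64 4K-granule geometry. */
#define PAGE_SHIFT	12
#define CONT_PTE_SHIFT	16	/* 16 x 4 KiB = 64 KiB */
#define PMD_SHIFT	21	/* 2 MiB */

/* Length of the physically contiguous run at the start of pfns[] (count > 0). */
static size_t run_length(const unsigned long *pfns, size_t count)
{
	size_t n = 1;

	while (n < count && pfns[n] == pfns[0] + n)
		n++;
	return n;
}

/* Largest shift a run of nr pages could use; alignment is not checked here. */
static unsigned int run_shift(size_t nr)
{
	unsigned long size = (unsigned long)nr << PAGE_SHIFT;

	if (size >= (1UL << PMD_SHIFT))
		return PMD_SHIFT;
	if (size >= (1UL << CONT_PTE_SHIFT))
		return CONT_PTE_SHIFT;
	return PAGE_SHIFT;
}

/*
 * Walk the array run by run, as __vmap_huge() does, and record each run's
 * shift in out[] (which must hold up to count entries). Returns the number
 * of runs found.
 */
static size_t split_runs(const unsigned long *pfns, size_t count,
			 unsigned int *out)
{
	size_t i = 0, runs = 0;

	while (i < count) {
		size_t nr = run_length(pfns + i, count - i);

		out[runs++] = run_shift(nr);
		i += nr;
	}
	return runs;
}
```

Mapping each run with its own vmap_pages_range() call is what causes the repeated page-table walks discussed above; this sketch only models the per-run size decision, not the walk itself.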

* RE: [PATCH v4 net-next 13/14] net: dsa: netc: initialize buffer bool table and implement flow-control
From: Wei Fang @ 2026-04-08 11:16 UTC
  To: Jakub Kicinski
  Cc: Claudiu Manoil, Vladimir Oltean, Clark Wang,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, f.fainelli@gmail.com, Frank Li,
	chleroy@kernel.org, horms@kernel.org, linux@armlinux.org.uk,
	andrew@lunn.ch, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org, imx@lists.linux.dev
In-Reply-To: <20260403011745.1795759-1-kuba@kernel.org>

> > +static u32 netc_get_buffer_pool_num(struct netc_switch *priv)
> > +{
> > +	/* The BPCAPR register is a read only register, the hardware
> > +	 * guarantees that the BPCAPR_NUM_BP field will not be 0.
> > +	 */
> > +	return netc_base_rd(&priv->regs, NETC_BPCAPR) & BPCAPR_NUM_BP;
> > +}
> [ ... ]
> > +static void netc_ipv_to_buffer_pool_mapping(struct netc_switch *priv)
> > +{
> > +	int bp_per_port = priv->num_bp / priv->info->num_ports;
> > +	int q = NETC_IPV_NUM / bp_per_port;
> > +	int r = NETC_IPV_NUM % bp_per_port;
> 
> If the hardware register read in netc_get_buffer_pool_num() returns a value
> smaller than priv->info->num_ports (for instance, during a bus error or
> reset state), bp_per_port will evaluate to 0.
> 
> Can this lead to a division by zero panic here?
> 
> Should priv->num_bp be validated against the number of ports before
> performing this calculation to ensure the driver fails gracefully if the
> hardware returns an unexpected value?

If the hardware were that unstable and prone to such malfunctions, it would
be terrible, but I don't believe it is. The AI review is being a bit too nitpicky.
Anyway, I will add this check to avoid AI review warnings.

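A minimal sketch of the kind of check being promised, assuming it runs before netc_ipv_to_buffer_pool_mapping() and that the caller fails probe on error. The helper name and the choice of -EINVAL are assumptions for illustration, not the driver's code.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive the per-port buffer pool count and refuse
 * values that would leave a port without a pool, so that later divisions
 * by bp_per_port cannot divide by zero.
 */
static int netc_calc_bp_per_port(uint32_t num_bp, uint32_t num_ports,
				 uint32_t *bp_per_port)
{
	if (!num_ports || num_bp < num_ports)
		return -EINVAL;

	*bp_per_port = num_bp / num_ports;
	return 0;
}
```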

* Re: [PATCH 0/3] arm-smmu-v3: Add PMCG child support and update PMU MMIO mapping
From: Robin Murphy @ 2026-04-08 11:15 UTC
  To: Peng Fan (OSS), Will Deacon, Joerg Roedel, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Mark Rutland
  Cc: linux-arm-kernel, iommu, devicetree, linux-kernel,
	linux-perf-users, Peng Fan
In-Reply-To: <20260408-smmu-perf-v1-0-d75dac96e828@nxp.com>

On 2026-04-08 8:51 am, Peng Fan (OSS) wrote:
> This patch series adds proper support for describing and probing the
> Arm SMMU v3 PMCG (Performance Monitor Control Group) as a child node of
> the SMMU in Devicetree, and updates the relevant drivers accordingly.
> 
> The SMMU v3 architecture allows an optional PMCG block, typically
> associated with TCUs, to be implemented within the SMMU register
> address space. For example, the mmu700 PMCG is at offset 0x2000 of
> TCU page 0.

But what's wrong with the existing binding? Especially given that it 
even has an upstream user already:

https://git.kernel.org/torvalds/c/aef9703dcbf8

> Patch 1 updates the SMMU v3 Devicetree binding to allow PMCG child nodes,
> referencing the existing arm,smmu-v3-pmcg binding.
> 
> Patch 2 updates the arm-smmu-v3 driver to populate platform devices for
> child nodes described in DT once the SMMU probe succeeds.
> 
> Patch 3 updates the SMMUv3 PMU driver to correctly handle MMIO mapping when
> PMCG is described as a child node. The PMCG registers occupy a sub-region
> of the parent SMMU MMIO window, which is already requested by the SMMU

That has not been the case since 52f3fab0067d ("iommu/arm-smmu-v3: Don't 
reserve implementation defined register space") nearly 6 years ago, 
where the whole purpose was to support Arm's PMCG implementation 
properly. What kernel is this based on?

Thanks,
Robin.

> Signed-off-by: Peng Fan <peng.fan@nxp.com>
> ---
> Peng Fan (3):
>        dt-bindings: iommu: arm-smmu-v3: Allow PMU child nodes
>        iommu/arm-smmu-v3: Populate PMU child devices from Devicetree
>        perf/arm-smmuv3: Avoid double-requesting shared SMMU MMIO for PMCG
> 
>   .../devicetree/bindings/iommu/arm,smmu-v3.yaml        | 10 ++++++++++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c           |  3 +++
>   drivers/perf/arm_smmuv3_pmu.c                         | 19 ++++++++++++++++---
>   3 files changed, 29 insertions(+), 3 deletions(-)
> ---
> base-commit: f3e6330d7fe42b204af05a2dbc68b379e0ad179e
> change-id: 20260408-smmu-perf-754367fe66c8
> 
> Best regards,


* RE: [PATCH v4 net-next 12/14] net: dsa: netc: add more basic functions support
From: Wei Fang @ 2026-04-08 11:09 UTC
  To: Jakub Kicinski
  Cc: Claudiu Manoil, Vladimir Oltean, Clark Wang,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	pabeni@redhat.com, robh@kernel.org, krzk+dt@kernel.org,
	conor+dt@kernel.org, f.fainelli@gmail.com, Frank Li,
	chleroy@kernel.org, horms@kernel.org, linux@armlinux.org.uk,
	andrew@lunn.ch, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org, imx@lists.linux.dev
In-Reply-To: <20260403011743.1795726-1-kuba@kernel.org>

> > +static int netc_add_standalone_vlan_entry(struct netc_switch *priv)
> > +{
> > +	u32 bitmap_stg = VFT_STG_ID(0) | netc_available_port_bitmap(priv);
> 
> Does putting all standalone ports into a single shared VLAN 0 break the DSA
> isolation requirements?
> 
> If a user manually adds a static FDB entry to a standalone port, it is
> programmed into VLAN 0. Because all standalone ports reside in this VLAN, a
> frame arriving on one standalone port with a matching destination MAC might
> be hardware-switched directly to another standalone port, bypassing the CPU.

I think an FDB entry can only be added to the CPU port when the port is a
standalone port rather than a bridge port, so this case cannot occur.

> 
> [ ... ]
> 
> > +static int netc_port_set_fdb_entry(struct netc_port *np,
> > +				   const unsigned char *addr, u16 vid)
> > +{
> 
> [ ... ]
> 
> > +	/* If the entry already exists, but not on this port, we need to
> > +	 * update the port bitmap. In general, it should only be valid
> > +	 * for multicast or broadcast address.
> > +	 */
> > +	port_bitmap ^= BIT(port);
> > +	entry->cfge.port_bitmap = cpu_to_le32(port_bitmap);
> 
> Can this XOR operation cause multi-port floods during unicast MAC migrations?
> 
> When a unicast MAC address moves to a new port, the bridging layer issues an
> ADD for the new port. The XOR operation leaves the old port bit set alongside
> the new port bit. Does the driver need to differentiate between unicast and
> multicast addresses and explicitly overwrite the bitmap for unicast entries?

I think that when a unicast MAC address moves to a new port, the old FDB
entry should be removed first, rather than only adding a new entry. If the
AI review is referring to the new entry being added before the old one is
deleted, then this problem will indeed occur, and it is unavoidable.

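For comparison, the reviewer's suggestion of treating unicast and multicast entries differently could look roughly like the user-space sketch below. The function names and the BIT() definition are local to the sketch; the kernel would use is_multicast_ether_addr(). Note also that `^=` toggles the bit and would clear it if it were already set, whereas `|=` is idempotent.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)	(1U << (n))

/* Group (multicast/broadcast) addresses have the I/G bit set in octet 0. */
static bool addr_is_multicast(const unsigned char *addr)
{
	return addr[0] & 0x01;
}

/*
 * Hypothetical update rule: a multicast entry gains this port, while a
 * unicast entry is moved so that it points at exactly one port.
 */
static uint32_t fdb_new_port_bitmap(uint32_t port_bitmap,
				    const unsigned char *addr, int port)
{
	if (addr_is_multicast(addr))
		return port_bitmap | BIT(port);

	return BIT(port);
}
```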
> 
> [ ... ]
> 
> > +static int netc_port_del_fdb_entry(struct netc_port *np,
> > +				   const unsigned char *addr, u16 vid)
> > +{
> > +	struct netc_switch *priv = np->switch_priv;
> > +	struct ntmp_user *ntmp = &priv->ntmp;
> > +	struct netc_fdb_entry *entry;
> > +	int port = np->dp->index;
> > +	u32 port_bitmap;
> > +	int err = 0;
> > +
> > +	mutex_lock(&priv->fdbt_lock);
> > +
> > +	entry = netc_lookup_fdb_entry(priv, addr, vid);
> > +	if (unlikely(!entry))
> > +		goto unlock_fdbt;
> 
> Will this prevent dynamically learned FDB entries from being deleted?
> 
> Since the CPU port is configured with MLO_HW, the switch hardware

Currently we only support single port mode, so only static FDB entries
are supported. MLO_HW should be removed from the code; I will improve
the logic.

> dynamically
> learns MAC addresses, but they are not added to the software priv->fdb_list.
> If a user attempts to delete a dynamic entry, the software lookup will fail
> here and return 0 (success), leaving the stale dynamically learned entry
> active in hardware.
> 
> [ ... ]
> 
> > +static int netc_port_change_mtu(struct dsa_switch *ds,
> > +				int port, int mtu)
> > +{
> > +	u32 max_frame_size = mtu + VLAN_ETH_HLEN + ETH_FCS_LEN;
> > +	struct netc_port *np = NETC_PORT(ds, port);
> > +
> > +	if (dsa_is_cpu_port(ds, port))
> > +		max_frame_size += NETC_TAG_MAX_LEN;
> 
> Does this double-count the DSA tag overhead for the CPU port?
> 
> The DSA core computes the CPU port's MTU by adding the tagger overhead
> before calling the switch's .port_change_mtu() function. If the maximum MTU of 9582
> bytes is requested, the CPU port receives an MTU of 9596. Adding
> NETC_TAG_MAX_LEN again here results in 9628 bytes, which exceeds the
> hardware limit of 9600 (NETC_MAX_FRAME_LEN) and might lead to truncation.

This does look like a bug; I will fix it.

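Following the reviewer's reasoning, the fix amounts to no longer adding NETC_TAG_MAX_LEN for the CPU port, because the DSA core has already folded the tagger overhead into that port's MTU. A sketch of the resulting calculation, using the standard kernel values of VLAN_ETH_HLEN (18) and ETH_FCS_LEN (4); the helper name is hypothetical, and the check against the hardware frame limit is left out.

```c
#include <stdint.h>

#define VLAN_ETH_HLEN	18	/* 14-byte Ethernet header + 4-byte 802.1Q tag */
#define ETH_FCS_LEN	4

/*
 * Hypothetical helper: maximum frame size for a port MTU. The same formula
 * is used for user and CPU ports, since the MTU the DSA core passes for the
 * CPU port already includes the tagger overhead.
 */
static uint32_t netc_max_frame_size(int mtu)
{
	return (uint32_t)mtu + VLAN_ETH_HLEN + ETH_FCS_LEN;
}
```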
> 
> [ ... ]
> 
> > +static int netc_port_fdb_add(struct dsa_switch *ds, int port,
> > +			     const unsigned char *addr, u16 vid,
> > +			     struct dsa_db db)
> > +{
> > +	struct netc_port *np = NETC_PORT(ds, port);
> > +
> > +	/* Currently, we only support standalone port mode, so all VLANs
> > +	 * should be converted to NETC_STANDALONE_PVID.
> > +	 */
> > +	return netc_port_set_fdb_entry(np, addr, NETC_STANDALONE_PVID);
> > +}
> 
> Is it safe to silently override user-requested VLAN IDs to VLAN 0?
> 
> If a user explicitly attempts to configure a MAC address for a specific VLAN,
> the hardware is programmed for VLAN 0 and returns success. Should this
> instead check if the requested vid is not 0 and not NETC_STANDALONE_PVID,
> and return -EOPNOTSUPP to inform the networking stack?

Okay, I will add a check.

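A minimal sketch of the check the reviewer describes, assuming NETC_STANDALONE_PVID is 0 (as the review's "VLAN 0" implies) and that the guard runs at the top of netc_port_fdb_add(). With that value the two comparisons coincide; the second only matters if the PVID ever changes. The helper name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: the standalone PVID is VLAN 0, as the review text implies. */
#define NETC_STANDALONE_PVID	0

/*
 * Hypothetical guard: accept only VID 0 or the standalone PVID, and report
 * anything else as unsupported instead of silently programming the entry
 * into the standalone VLAN.
 */
static int netc_check_fdb_vid(uint16_t vid)
{
	if (vid != 0 && vid != NETC_STANDALONE_PVID)
		return -EOPNOTSUPP;
	return 0;
}
```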
> 
> [ ... ]
> 
> > +static void netc_port_set_host_flood(struct dsa_switch *ds, int port,
> > +				     bool uc, bool mc)
> > +{
> > +	struct netc_port *np = NETC_PORT(ds, port);
> > +
> > +	if (np->uc == uc && np->mc == mc)
> > +		return;
> > +
> > +	/* IPFT does not support in-place updates to the KEYE element,
> > +	 * so we need to delete the old IPFT entry and then add a new
> > +	 * one.
> > +	 */
> > +	if (np->host_flood)
> > +		netc_port_remove_host_flood(np);
> > +
> > +	if (netc_port_add_host_flood_rule(np, uc, mc))
> > +		dev_err(ds->dev, "Failed to add host flood rule on port %d\n",
> > +			port);
> > +}
> 
> Could this permanently lose the host flood configuration if the new rule
> addition fails?
> 
> If the call to netc_port_add_host_flood_rule() fails (for example, due to
> -ENOMEM), the function logs an error and returns, leaving the port entirely
> without a host flood rule. Should the driver revert to the previous
> configuration on failure to avoid blackholing traffic meant for the CPU?

Okay, I will improve it.

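One way to avoid losing the host flood rule is to remember the old configuration and try to reinstall it when the new rule cannot be added. The user-space model below sketches that rollback pattern. struct port_flood, add_rule() and remove_rule() are hypothetical stand-ins for the driver's state, netc_port_add_host_flood_rule() and netc_port_remove_host_flood(), and the fail_adds counter exists only to simulate failures. The rollback is best-effort, since it can fail for the same reason.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one port's host flood rule; not the driver's struct. */
struct port_flood {
	bool installed;
	bool uc, mc;
	int fail_adds;	/* test hook: number of upcoming adds that fail */
};

/* Stand-in for netc_port_add_host_flood_rule(). */
static int add_rule(struct port_flood *p, bool uc, bool mc)
{
	if (p->fail_adds > 0) {
		p->fail_adds--;
		return -ENOMEM;
	}
	p->installed = true;
	p->uc = uc;
	p->mc = mc;
	return 0;
}

/* Stand-in for netc_port_remove_host_flood(). */
static void remove_rule(struct port_flood *p)
{
	p->installed = false;
}

/* Replace the rule; if the new one cannot be added, reinstall the old one. */
static int set_host_flood(struct port_flood *p, bool uc, bool mc)
{
	bool had_rule = p->installed;
	bool old_uc = p->uc, old_mc = p->mc;
	int err;

	if (had_rule && old_uc == uc && old_mc == mc)
		return 0;

	if (had_rule)
		remove_rule(p);

	err = add_rule(p, uc, mc);
	if (err && had_rule)
		add_rule(p, old_uc, old_mc);	/* best-effort rollback */

	return err;
}
```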

* Re: [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes
From: Dev Jain @ 2026-04-08 11:08 UTC
  To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
	will, akpm, urezki
  Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <20260408025115.27368-4-baohua@kernel.org>



On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> vmap_small_pages_range_noflush() provides a clean interface by taking
> struct page **pages and mapping them via direct PTE iteration. This
> avoids the page table zigzag seen when using

"Zigzag" is ambiguous. Just say "page table rewalk". Also please
elaborate on why the rewalk is happening currently.

> vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
> 
> Extend it to support larger page_shift values, and add PMD- and
> contiguous-PTE mappings as well.

So we can drop the "small" here since now it supports larger chunks
as well.

Also, at this point the code you add is a no-op, since you pass PAGE_SHIFT.
Let us just squash patch 4 into this one. As it stands, this patch looks odd:
it keeps the page-table-rewalk algorithm while adding the very functionality
meant to avoid it.

> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/vmalloc.c | 54 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 42 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 57eae99d9909..5bf072297536 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -524,8 +524,9 @@ void vunmap_range(unsigned long addr, unsigned long end)
>  
>  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
> +	unsigned int steps = 1;
>  	int err = 0;
>  	pte_t *pte;
>  
> @@ -543,6 +544,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	do {
>  		struct page *page = pages[*nr];
>  
> +		steps = 1;
>  		if (WARN_ON(!pte_none(ptep_get(pte)))) {
>  			err = -EBUSY;
>  			break;
> @@ -556,9 +558,24 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  			break;
>  		}
>  
> +#ifdef CONFIG_HUGETLB_PAGE
> +		if (shift != PAGE_SHIFT) {
> +			unsigned long pfn = page_to_pfn(page), size;
> +
> +			size = arch_vmap_pte_range_map_size(addr, end, pfn, shift);
> +			if (size != PAGE_SIZE) {
> +				steps = size >> PAGE_SHIFT;
> +				pte_t entry = pfn_pte(pfn, prot);
> +
> +				entry = arch_make_huge_pte(entry, ilog2(size), 0);
> +				set_huge_pte_at(&init_mm, addr, pte, entry, size);
> +				continue;
> +			}
> +		}
> +#endif
> +
>  		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> -		(*nr)++;
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += steps, *nr += steps, addr += PAGE_SIZE * steps, addr != end);
>  
>  	lazy_mmu_mode_disable();
>  	*mask |= PGTBL_PTE_MODIFIED;
> @@ -568,7 +585,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -578,7 +595,20 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> +
> +		if (shift == PMD_SHIFT) {
> +			struct page *page = pages[*nr];
> +			phys_addr_t phys_addr = page_to_phys(page);
> +
> +			if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> +						shift)) {
> +				*mask |= PGTBL_PMD_MODIFIED;
> +				*nr += 1 << (shift - PAGE_SHIFT);
> +				continue;
> +			}
> +		}
> +
> +		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
>  			return -ENOMEM;
>  	} while (pmd++, addr = next, addr != end);
>  	return 0;
> @@ -586,7 +616,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>  
>  static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -596,7 +626,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = pud_addr_end(addr, end);
> -		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> +		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
>  			return -ENOMEM;
>  	} while (pud++, addr = next, addr != end);
>  	return 0;
> @@ -604,7 +634,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>  
>  static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
>  	p4d_t *p4d;
>  	unsigned long next;
> @@ -614,14 +644,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = p4d_addr_end(addr, end);
> -		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> +		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
>  			return -ENOMEM;
>  	} while (p4d++, addr = next, addr != end);
>  	return 0;
>  }
>  
>  static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> -		pgprot_t prot, struct page **pages)
> +		pgprot_t prot, struct page **pages, unsigned int shift)
>  {
>  	unsigned long start = addr;
>  	pgd_t *pgd;
> @@ -636,7 +666,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_bad(*pgd))
>  			mask |= PGTBL_PGD_MODIFIED;
> -		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> +		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
>  		if (err)
>  			break;
>  	} while (pgd++, addr = next, addr != end);
> @@ -665,7 +695,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>  
>  	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>  			page_shift == PAGE_SHIFT)
> -		return vmap_small_pages_range_noflush(addr, end, prot, pages);
> +		return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
>  
>  	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>  		int err;
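In the hunk above, a successful PMD mapping consumes `1 << (shift - PAGE_SHIFT)` entries of `pages[]` at once (512 with 4K pages). Any chunk that cannot take a PMD falls through to the PTE level. Below is a minimal userspace model of that bookkeeping, not kernel code. `walk_pmd_range()`, `pmd_addr_end_sketch()` and `struct walk_stats` are hypothetical names, and the 4K-granule constants are assumed. The "aligned and fully covered" test is a simplified stand-in for whatever `vmap_try_huge_pmd()` really checks.

```c
#include <assert.h>

/* Assumed arm64 4K-granule geometry, for illustration only. */
#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PMD_SIZE   (1UL << PMD_SHIFT)

/* Hypothetical bookkeeping for one walk. */
struct walk_stats {
	int nr;         /* index into pages[], advanced like *nr in the patch */
	int pmd_leaves; /* chunks mapped with a single PMD entry */
	int pte_ranges; /* chunks handed to the PTE level */
};

/* Simplified pmd_addr_end(): next PMD boundary, clamped to end (no wrap handling). */
unsigned long pmd_addr_end_sketch(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & ~(PMD_SIZE - 1);

	return boundary < end ? boundary : end;
}

/*
 * Model of the vmap_pages_pmd_range() loop: a PMD-aligned chunk that covers
 * a whole PMD stands in for vmap_try_huge_pmd() succeeding.
 */
void walk_pmd_range(unsigned long addr, unsigned long end, unsigned int shift,
		    struct walk_stats *s)
{
	unsigned long next;

	/* The model only handles non-empty, page-aligned ranges. */
	assert(addr < end);
	assert(((addr | end) & ((1UL << PAGE_SHIFT) - 1)) == 0);

	do {
		next = pmd_addr_end_sketch(addr, end);

		if (shift == PMD_SHIFT && (addr & (PMD_SIZE - 1)) == 0 &&
		    next - addr == PMD_SIZE) {
			s->pmd_leaves++;
			s->nr += 1 << (shift - PAGE_SHIFT);
			continue; /* jumps to the loop condition, as in the kernel */
		}

		/* PTE fallback: one pages[] entry per base page. */
		s->pte_ranges++;
		s->nr += (int)((next - addr) >> PAGE_SHIFT);
	} while (addr = next, addr != end);
}
```

For example, walking 4 MiB + 64 KiB from a PMD-aligned address with `shift == PMD_SHIFT` yields two PMD leaves and one PTE range, and `nr` ends at 1040 (512 + 512 + 16).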




* Re: [PATCH 1/2] coresight: etm4x: fix inconsistencies with sysfs configration
From: Yeoreum Yun @ 2026-04-08 11:02 UTC
  To: Leo Yan
  Cc: coresight, linux-arm-kernel, linux-kernel, suzuki.poulose,
	mike.leach, james.clark, alexander.shishkin
In-Reply-To: <20260407143028.GM356832@e132581.arm.com>

Hi Leo,

[...]

> As Suzuki suggested in another reply, we need to extract capabilities
> into a separate structure.  I'd also extract status related registers
> into a new structure:
>
>   struct etm4_cap {
>       int nr_ss_cmp;
>       bool pe_comparator;    // TRCSSCSRn.PC
>       bool dv_comparator;    // TRCSSCSRn.DV
>       bool da_comparator;    // TRCSSCSRn.DA
>       bool inst_comparator;  // TRCSSCSRn.INST
>
>       int ns_ex_level;
>       int nr_pe;
>       int nr_pe_cmp;
>       int nr_resource;
>       ...
>   }
>
>   struct etm4_status_reg {
>       u32 ss_status[ETM_MAX_SS_CMP];
>       u32 cntr_val[ETMv4_MAX_CNTR];
>   }

Hmm, I don't think cntr_val needs to be moved into
etm4_status_reg, since the counter values are configurable via sysfs.

BTW, looking at etmv4_config, I think the only parts that are really
capabilities are:
  - ss_status
  - s_ex_level

So I think it would be okay to include all of this information in
struct etm4_cap rather than in a dedicated etm4_status_reg structure.
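Leo's proposed struct etm4_cap names its comparator flags after TRCSSCSR<n> fields. The rough sketch below shows how those read-only capability bits could be decoded separately from the run-time STATUS/PENDING bits of the same register. The bit positions (INST = 0, DA = 1, DV = 2, PC = 3, PENDING = 30, STATUS = 31) are assumed from the ETMv4 TRCSSCSR<n> layout and should be checked against the architecture spec. All names here are hypothetical, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed TRCSSCSR<n> bit positions (verify against the ETMv4 spec). */
#define SSCSR_INST    (1u << 0)
#define SSCSR_DA      (1u << 1)
#define SSCSR_DV      (1u << 2)
#define SSCSR_PC      (1u << 3)
#define SSCSR_PENDING (1u << 30)
#define SSCSR_STATUS  (1u << 31)

/* Hypothetical per-comparator slice of struct etm4_cap. */
struct etm4_ss_cap {
	bool pe_comparator;   /* TRCSSCSRn.PC */
	bool dv_comparator;   /* TRCSSCSRn.DV */
	bool da_comparator;   /* TRCSSCSRn.DA */
	bool inst_comparator; /* TRCSSCSRn.INST */
};

/* Decode the read-only capability bits of one TRCSSCSR<n> value. */
struct etm4_ss_cap etm4_decode_sscsr(uint32_t sscsr)
{
	struct etm4_ss_cap cap = {
		.pe_comparator   = (sscsr & SSCSR_PC) != 0,
		.dv_comparator   = (sscsr & SSCSR_DV) != 0,
		.da_comparator   = (sscsr & SSCSR_DA) != 0,
		.inst_comparator = (sscsr & SSCSR_INST) != 0,
	};

	return cap;
}

/* Extract the run-time status bits, which change while tracing. */
uint32_t etm4_sscsr_status_bits(uint32_t sscsr)
{
	return sscsr & (SSCSR_STATUS | SSCSR_PENDING);
}
```

Keeping the decode separate from the status mask makes it explicit which bits would need saving if PENDING is to survive an enable/disable/enable cycle in sysfs mode.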

BTW, is it required to preserve TRCSSCSR<n>.PENDING in sysfs mode across
a re-enabled session (enable->disable->enable), given that it is always
cleared in perf mode?

Thanks!

--
Sincerely,
Yeoreum Yun



* Re: [RFC V1 00/16] arm64/mm: Enable 128 bit page table entries
From: Ryan Roberts @ 2026-04-08 11:01 UTC
  To: Anshuman Khandual, David Hildenbrand (Arm), linux-arm-kernel
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Lorenzo Stoakes,
	Andrew Morton, Mike Rapoport, Linu Cherian, linux-kernel,
	linux-mm
In-Reply-To: <8d2c9ecb-ae33-42f2-a8ed-66b3286b9286@arm.com>

On 08/04/2026 11:53, Anshuman Khandual wrote:
> On 07/04/26 8:14 PM, David Hildenbrand (Arm) wrote:
>> On 2/24/26 06:11, Anshuman Khandual wrote:
>>> FEAT_D128 is a new Arm architecture feature adding support for the VMSAv9-128
>>> translation system. FEAT_D128 is an optional feature from Armv9.3 onwards.
>>> So with this feature, arm64 platforms could have two different translation
>>> systems, VMSAv8-64 and VMSAv9-128, either of which can be selectively enabled.
>>>
>>> FEAT_D128 adds 128-bit page table entries, thus supporting larger physical
>>> and virtual address ranges while also expanding the room available for more
>>> MMU management feature bits, both for HW and SW.
>>>
>>> This series has been split into two parts: generic MM changes, followed by
>>> arm64 platform changes, finally enabling D128 with a new config ARM64_D128.
>>>
>>> READ_ONCE() on page table entries gets routed via level-specific pxdp_get()
>>> helpers, which platforms can then override when required. On the arm64
>>> platform, these accessors help ensure that page table accesses are performed
>>> atomically when reading 128-bit page table entries.
>>>
>>> All ARM64_VA_BITS and ARM64_PA_BITS combinations for all page sizes are now
>>> supported on both the D64 and D128 translation regimes, although the new
>>> 56-bit VA space is not yet supported. Similarly, FEAT_D128 skip level is not
>>> currently supported.
>>>
>>> The basic page table geometry changes with D128, as there are now fewer
>>> entries per level. Please refer to the following table for leaf entry sizes:
>>>
>>>                     D64              D128
>>> ------------------------------------------------
>>> | PAGE_SIZE |   PMD  |  PUD  |   PMD  |   PUD  |
>>> -----------------------------|-----------------|
>>> |     4K    |    2M  |  1G   |    1M  |  256M  |
>>> |    16K    |   32M  | 64G   |   16M  |   16G  |
>>> |    64K    |  512M  |  4T   |  256M  |    1T  |
>>> ------------------------------------------------
>>>
>>
>> Interesting. That means user space will have it even harder to optimize
>> for THP sizes.
>>
>> What's the effect on cont-pte? Do they still span the same number of
>> entries and there is effectively no change?
> 
> The numbers are the same for the 4K base page size but will need
> some changes for the 16K and 64K base page sizes. Something that
> got missed in this series; will fix it.

Really - I thought the contiguous sizes were the same for D128 as they are for
D64? What's the difference? Perhaps it's different for level 2, but for level 3,
I'm pretty sure it remains:

PAGE_SIZE	CONT_SIZE	NR_PTES		CONT_ORDER
4K		64K		16		4
16K		2M		128		7
64K		2M		32		5
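The rows in the table above follow from two relations: CONT_SIZE = PAGE_SIZE × NR_PTES and CONT_ORDER = log2(NR_PTES). The small sketch below recomputes the rows from the listed NR_PTES values. The helper names are hypothetical, and the sketch only checks the arithmetic; it says nothing about whether D128 keeps these NR_PTES values.

```c
#include <assert.h>

/* Hypothetical helper: log2 of a power of two. */
unsigned int ilog2_pow2(unsigned long v)
{
	unsigned int r = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v >>= 1)
		r++;
	return r;
}

struct cont_geom {
	unsigned long cont_size; /* bytes covered by one contiguous PTE run */
	unsigned int cont_order; /* folio order of that run */
};

/* Derive CONT_SIZE and CONT_ORDER from PAGE_SIZE and NR_PTES. */
struct cont_geom cont_geom_for(unsigned long page_size, unsigned long nr_ptes)
{
	struct cont_geom g = {
		.cont_size = page_size * nr_ptes,
		.cont_order = ilog2_pow2(nr_ptes),
	};

	return g;
}
```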

Thanks,
Ryan

> 
>>
>>> From the arm64 kernel feature perspective, KVM, KASAN and UNMAP_KERNEL_AT_EL0
>>> are also not currently supported.
>>>
>>> Open Questions:
>>>
>>> - Do we need to support UNMAP_KERNEL_AT_EL0 with D128?
>>> - Do we need to emulate traditional D64 sizes at PUD, PMD level with D128?
>>
>> It would certainly make user space interaction easier. But then, user
>> space already has to consider various PMD sizes (and is better off
>> querying /sys/kernel/mm/transparent_hugepage/hpage_pmd_size instead of
>> hardcoding it). s390x, for example, also has a 1M PMD size.
>>
>> I guess with "emulating" you mean something simple like always
>> allocating order-1 page tables that effectively have the same number of
>> page table entries?
> 
> Yeah - thought something similar.
> 
>>
>> That would be an option, but I recall that the pte_map_* infrastructure
>> currently expects that leaf page tables only ever span a single page.
>>
>> So it wouldn't really give us a lot of easy benefit I guess.
> 
> Right. So we probably need to figure out what other benefits this
> might bring besides the user space facing interactions you have
> mentioned earlier.




* Re: [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
From: Barry Song @ 2026-04-08 11:00 UTC
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <02b0464a-13f6-4e0c-ad69-0f494bfacbf4@arm.com>

On Wed, Apr 8, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> > For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> > we can batch CONT_PTE settings instead of handling them individually.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> >  arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> > index a42c05cf5640..bf31c11ebd3b 100644
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> >               contig_ptes = CONT_PTES;
> >               break;
> >       default:
> > +             if (size < CONT_PMD_SIZE && size > 0 &&
> > +                             IS_ALIGNED(size, CONT_PTE_SIZE)) {
>
> Nit: Having the lower bound check before the upper bound is more natural
> to read, so this should be size > 0 && size < CONT_PMD_SIZE (i.e. written
> the other way around).

Thanks very much for reviewing, Dev. As we discussed in patch 0/8,
this should be PMD_SIZE, not CONT_PMD_SIZE. I will use
size > 0 && size < PMD_SIZE in the next version.

>
> Also IS_ALIGNED needs to go below size.

Sure, thanks!

>
>
> > +                     contig_ptes = size >> PAGE_SHIFT;
> > +                     *pgsize = PAGE_SIZE;
> > +                     break;
> > +             }
> >               WARN_ON(!__hugetlb_valid_size(size));
> >       }
> >
> > @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
> >       case CONT_PTE_SIZE:
> >               return pte_mkcont(entry);
> >       default:
> > +             if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> > +                             IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> > +                     return pte_mkcont(entry);

Here it should be pagesize > 0 && pagesize < PMD_SIZE as well :-)

> > +
> >               break;
> >       }
> >       pr_warn("%s: unrecognized huge page size 0x%lx\n",
>

Best Regards
Barry



* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
From: Dev Jain @ 2026-04-08 10:55 UTC
  To: Barry Song
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <CAGsJ_4xqxmHWBahN-yX10XcEwaHpvypCkwWDLHMY_1P_SzCeMg@mail.gmail.com>



On 08/04/26 4:21 pm, Barry Song wrote:
> On Wed, Apr 8, 2026 at 5:14 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
>>> This patchset accelerates ioremap, vmalloc, and vmap when the memory
>>> is physically fully or partially contiguous. Two techniques are used:
>>>
>>> 1. Avoid page table zigzag when setting PTEs/PMDs for multiple memory
>>>    segments
>>> 2. Use batched mappings wherever possible in both vmalloc and ARM64
>>>    layers
>>>
>>> Patches 1–2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
>>> CONT-PTE regions instead of just one.
>>>
>>> Patches 3–4 extend vmap_small_pages_range_noflush() to support page
>>> shifts other than PAGE_SHIFT. This allows mapping multiple memory
>>> segments for vmalloc() without zigzagging page tables.
>>>
>>> Patches 5–8 add huge vmap support for contiguous pages. This not only
>>> improves performance but also enables PMD or CONT-PTE mapping for the
>>> vmapped area, reducing TLB pressure.
>>>
>>> Many thanks to Xueyuan Chen for his substantial testing efforts
>>> on RK3588 boards.
>>>
>>> On the RK3588 8-core ARM64 SoC, with tasks pinned to CPU2 and
>>> the performance CPUfreq policy enabled, Xueyuan’s tests report:
>>>
>>> * ioremap(1 MB): 1.2× faster
>>> * vmalloc(1 MB) mapping time (excluding allocation) with
>>>   VM_ALLOW_HUGE_VMAP: 1.5× faster
>>> * vmap(): 5.6× faster when memory includes some order-8 pages,
>>>   with no regression observed for order-0 pages
>>>
>>> Barry Song (Xiaomi) (8):
>>>   arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
>>>     setup
>>>   arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
>>>     CONT_PTE
>>>   mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger
>>>     page_shift sizes
>>>   mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings
>>>   mm/vmalloc: map contiguous pages in batches for vmap() if possible
>>>   mm/vmalloc: align vm_area so vmap() can batch mappings
>>>   mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable
>>>     zigzag
>>>   mm/vmalloc: Stop scanning for compound pages after encountering small
>>>     pages in vmap
>>>
>>>  arch/arm64/include/asm/vmalloc.h |   6 +-
>>>  arch/arm64/mm/hugetlbpage.c      |  10 ++
>>>  mm/vmalloc.c                     | 178 +++++++++++++++++++++++++------
>>>  3 files changed, 161 insertions(+), 33 deletions(-)
>>>
>>
>> On Linux VM on Apple M3, running mm-selftests:
> 
> Dev, thanks for your report. Sorry for the silly typo;
> Xueyuan’s vmalloc/vmap tests don’t trigger that case yet.
> 
> It should be fixed by:
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index bf31c11ebd3b..25b9fce1ec6a 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,7 +110,7 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>                 contig_ptes = CONT_PTES;
>                 break;
>         default:
> -               if (size < CONT_PMD_SIZE && size > 0 &&
> +               if (size < PMD_SIZE && size > 0 &&
>                                 IS_ALIGNED(size, CONT_PTE_SIZE)) {
>                         contig_ptes = size >> PAGE_SHIFT;
>                         *pgsize = PAGE_SIZE;
> @@ -365,7 +365,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
>         case CONT_PTE_SIZE:
>                 return pte_mkcont(entry);
>         default:
> -               if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> +               if (pagesize < PMD_SIZE && pagesize > 0 &&
>                                 IS_ALIGNED(pagesize, CONT_PTE_SIZE))
>                         return pte_mkcont(entry);

Yeah, indeed the problem was that a PMD-sized chunk was being treated as
512 PTEs rather than as 1 PMD. This fixes it.
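The sketch below shows why the CONT_PMD_SIZE bound misfired. It is a simplified userspace model of num_contig_ptes() with the proposed default branch and assumes the 4K-granule values (PMD_SIZE = 2M, CONT_PTE_SIZE = 64K, CONT_PMD_SIZE = 32M). The switch is assumed to have only the CONT_PMD_SIZE and CONT_PTE_SIZE cases seen in the patch context, and the WARN_ON() is omitted. `num_contig_ptes_model()` and its `upper_bound` parameter are hypothetical. With `upper_bound = CONT_PMD_SIZE`, a 2M PMD falls into the default branch and is reported as 512 PTEs of PAGE_SIZE. With `PMD_SIZE`, it stays a single entry.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed arm64 4K-granule constants. */
#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define CONT_PTES     16
#define CONT_PTE_SIZE (CONT_PTES * PAGE_SIZE)
#define PMD_SIZE      (1UL << 21)
#define CONT_PMDS     16
#define CONT_PMD_SIZE (CONT_PMDS * PMD_SIZE)
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/*
 * Model of num_contig_ptes() with the new default branch.
 * upper_bound is CONT_PMD_SIZE for the original RFC and PMD_SIZE for the fix.
 */
int num_contig_ptes_model(unsigned long size, size_t *pgsize,
			  unsigned long upper_bound)
{
	int contig_ptes = 1;

	assert(upper_bound == PMD_SIZE || upper_bound == CONT_PMD_SIZE);
	*pgsize = size;

	switch (size) {
	case CONT_PMD_SIZE:
		*pgsize = PMD_SIZE;
		contig_ptes = CONT_PMDS;
		break;
	case CONT_PTE_SIZE:
		*pgsize = PAGE_SIZE;
		contig_ptes = CONT_PTES;
		break;
	default:
		/* Lower bound first, as suggested in review. */
		if (size > 0 && size < upper_bound &&
		    IS_ALIGNED(size, CONT_PTE_SIZE)) {
			contig_ptes = (int)(size >> PAGE_SHIFT);
			*pgsize = PAGE_SIZE;
		}
		/* Otherwise (e.g. PMD_SIZE) the size stays a single entry. */
		break;
	}

	return contig_ptes;
}
```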

> 
>>
>>  ./run_vmtests.sh -t "hugetlb"
>>
>> TAP version 13
>> # -----------------------
>> # running ./hugepage-mmap
>> # -----------------------
>> # TAP version 13
>> # 1..1
>> # # Returned address is 0xffffe7c00000
>>
>>
>>
>> [   30.884630] kernel BUG at mm/page_table_check.c:86!
>> [   30.884701] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
>> [   30.886803] Modules linked in:
>> [   30.887217] CPU: 3 UID: 0 PID: 1869 Comm: hugepage-mmap Not tainted 7.0.0-rc5+ #86 PREEMPT
>> [   30.888218] Hardware name: linux,dummy-virt (DT)
>> [   30.889413] pstate: a1400005 (NzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>> [   30.889901] pc : page_table_check_clear.part.0+0x128/0x1a0
>> [   30.890337] lr : page_table_check_clear.part.0+0x7c/0x1a0
>> [   30.890714] sp : ffff800084da3ad0
>> [   30.890946] x29: ffff800084da3ad0 x28: 0000000000000001 x27: 0010000000000001
>> [   30.891434] x26: 0040000000000040 x25: ffffa06bb8fb9000 x24: 00000000ffffffff
>> [   30.891932] x23: 0000000000000001 x22: 0000000000000000 x21: ffffa06bb8997810
>> [   30.892514] x20: 0000000000113e39 x19: 0000000000113e38 x18: 0000000000000000
>> [   30.893007] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>> [   30.893500] x14: ffffa06bb7013780 x13: 0000fffff7f90fff x12: 0000000000000000
>> [   30.893990] x11: 1fffe0001a1282c1 x10: ffff0000d094160c x9 : ffffa06bb568a858
>> [   30.894479] x8 : ffff5f95c8474000 x7 : 0000000000000000 x6 : ffff00017fffc500
>> [   30.894973] x5 : ffff000191208fc0 x4 : 0000000000000000 x3 : 0000000000004000
>> [   30.895449] x2 : 0000000000000000 x1 : 00000000ffffffff x0 : ffff0000c071f1b8
>> [   30.895875] Call trace:
>> [   30.896027]  page_table_check_clear.part.0+0x128/0x1a0 (P)
>> [   30.896369]  page_table_check_clear+0xc8/0x138
>> [   30.896776]  __page_table_check_ptes_set+0xe4/0x1e8
>> [   30.897073]  __set_ptes_anysz+0x2e4/0x308
>> [   30.897327]  set_huge_pte_at+0xec/0x210
>> [   30.897561]  hugetlb_no_page+0x1ec/0x8e0
>> [   30.897807]  hugetlb_fault+0x188/0x740
>> [   30.898036]  handle_mm_fault+0x294/0x2c0
>> [   30.898283]  do_page_fault+0x120/0x748
>> [   30.898539]  do_translation_fault+0x68/0x90
>> [   30.898796]  do_mem_abort+0x4c/0xa8
>> [   30.899011]  el0_da+0x2c/0x90
>> [   30.899205]  el0t_64_sync_handler+0xd0/0xe8
>> [   30.899461]  el0t_64_sync+0x198/0x1a0
>> [   30.899688] Code: 91001021 b8f80022 51000441 36fffd41 (d4210000)
>> [   30.900053] ---[ end trace 0000000000000000 ]---
>>
>>
>>
>> The bug is at
>>
>> BUG_ON(atomic_dec_return(&ptc->file_map_count) < 0);
>>
>> My tree is mm-unstable, commit 3fa44141e0bb.
>>
> 
> Thanks
> Barry




* Re: [RFC V1 00/16] arm64/mm: Enable 128 bit page table entries
From: Anshuman Khandual @ 2026-04-08 10:53 UTC
  To: David Hildenbrand (Arm), linux-arm-kernel
  Cc: Catalin Marinas, Will Deacon, Ryan Roberts, Mark Rutland,
	Lorenzo Stoakes, Andrew Morton, Mike Rapoport, Linu Cherian,
	linux-kernel, linux-mm
In-Reply-To: <a77b39a4-c6dd-4a8a-8ea4-2bdc31bd3601@kernel.org>

On 07/04/26 8:14 PM, David Hildenbrand (Arm) wrote:
> On 2/24/26 06:11, Anshuman Khandual wrote:
>> FEAT_D128 is a new Arm architecture feature adding support for the VMSAv9-128
>> translation system. FEAT_D128 is an optional feature from Armv9.3 onwards.
>> So with this feature, arm64 platforms could have two different translation
>> systems, VMSAv8-64 and VMSAv9-128, either of which can be selectively enabled.
>>
>> FEAT_D128 adds 128-bit page table entries, thus supporting larger physical
>> and virtual address ranges while also expanding the room available for more
>> MMU management feature bits, both for HW and SW.
>>
>> This series has been split into two parts: generic MM changes, followed by
>> arm64 platform changes, finally enabling D128 with a new config ARM64_D128.
>>
>> READ_ONCE() on page table entries gets routed via level-specific pxdp_get()
>> helpers, which platforms can then override when required. On the arm64
>> platform, these accessors help ensure that page table accesses are performed
>> atomically when reading 128-bit page table entries.
>>
>> All ARM64_VA_BITS and ARM64_PA_BITS combinations for all page sizes are now
>> supported on both the D64 and D128 translation regimes, although the new
>> 56-bit VA space is not yet supported. Similarly, FEAT_D128 skip level is not
>> currently supported.
>>
>> The basic page table geometry changes with D128, as there are now fewer
>> entries per level. Please refer to the following table for leaf entry sizes:
>>
>>                     D64              D128
>> ------------------------------------------------
>> | PAGE_SIZE |   PMD  |  PUD  |   PMD  |   PUD  |
>> -----------------------------|-----------------|
>> |     4K    |    2M  |  1G   |    1M  |  256M  |
>> |    16K    |   32M  | 64G   |   16M  |   16G  |
>> |    64K    |  512M  |  4T   |  256M  |    1T  |
>> ------------------------------------------------
>>
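The leaf sizes in this table follow from the entry size. A translation table at these levels occupies one page, so it holds PAGE_SIZE / 8 entries with D64 and PAGE_SIZE / 16 entries with D128. A PMD leaf covers PAGE_SIZE × entries, and a PUD leaf covers that size × entries again. The helper names in this arithmetic check are hypothetical.

```c
#include <assert.h>

/* Entries in one page-sized translation table (8-byte D64 or 16-byte D128). */
unsigned long long entries_per_table(unsigned long long page_size,
				     unsigned int entry_bytes)
{
	assert(entry_bytes == 8 || entry_bytes == 16);
	return page_size / entry_bytes;
}

/* Size mapped by a PMD leaf entry. */
unsigned long long pmd_leaf_size(unsigned long long page_size,
				 unsigned int entry_bytes)
{
	return page_size * entries_per_table(page_size, entry_bytes);
}

/* Size mapped by a PUD leaf entry. */
unsigned long long pud_leaf_size(unsigned long long page_size,
				 unsigned int entry_bytes)
{
	return pmd_leaf_size(page_size, entry_bytes) *
	       entries_per_table(page_size, entry_bytes);
}
```

For example, `pud_leaf_size(16384, 16)` returns 16 GiB, matching the 16K/D128 row.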
> 
> Interesting. That means user space will have it even harder to optimize
> for THP sizes.
> 
> What's the effect on cont-pte? Do they still span the same number of
> entries and there is effectively no change?

The numbers are the same for the 4K base page size but will need
some changes for the 16K and 64K base page sizes. Something that
got missed in this series; will fix it.

> 
>> From the arm64 kernel feature perspective, KVM, KASAN and UNMAP_KERNEL_AT_EL0
>> are also not currently supported.
>>
>> Open Questions:
>>
>> - Do we need to support UNMAP_KERNEL_AT_EL0 with D128?
>> - Do we need to emulate traditional D64 sizes at PUD, PMD level with D128?
> 
> It would certainly make user space interaction easier. But then, user
> space already has to consider various PMD sizes (and is better off
> querying /sys/kernel/mm/transparent_hugepage/hpage_pmd_size instead of
> hardcoding it). s390x, for example, also has a 1M PMD size.
>
> I guess with "emulating" you mean something simple like always
> allocating order-1 page tables that effectively have the same number of
> page table entries?

Yeah - thought something similar.

> 
> That would be an option, but I recall that the pte_map_* infrastructure
> currently expects that leaf page tables only ever span a single page.
>
> So it wouldn't really give us a lot of easy benefit I guess.

Right. So we probably need to figure out what other benefits this
might bring besides the user space facing interactions you have
mentioned earlier.
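As noted above, user space is better off querying /sys/kernel/mm/transparent_hugepage/hpage_pmd_size than hardcoding a PMD size, especially if D128 changes that size. Below is a minimal sketch of such a query. `read_hpage_pmd_size()` is a hypothetical helper that takes the path as a parameter so it can also be pointed at a test file. It returns 0 when the file is missing or unparsable, for example on a kernel built without THP.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a THP PMD size (in bytes) from a sysfs-style file such as
 * /sys/kernel/mm/transparent_hugepage/hpage_pmd_size.
 * Returns 0 if the file cannot be opened or parsed.
 */
unsigned long read_hpage_pmd_size(const char *path)
{
	unsigned long size = 0;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%lu", &size) != 1)
		size = 0;
	fclose(f);
	return size;
}
```

A caller would pass "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" and, if the result is non-zero, align its mappings to that size instead of assuming 2M.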



* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
From: Barry Song @ 2026-04-08 10:51 UTC
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <1e7427c6-b6e5-4a3a-a600-bef9ac2bf3e0@arm.com>

On Wed, Apr 8, 2026 at 5:14 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> > This patchset accelerates ioremap, vmalloc, and vmap when the memory
> > is physically fully or partially contiguous. Two techniques are used:
> >
> > 1. Avoid page table zigzag when setting PTEs/PMDs for multiple memory
> >    segments
> > 2. Use batched mappings wherever possible in both vmalloc and ARM64
> >    layers
> >
> > Patches 1–2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> > CONT-PTE regions instead of just one.
> >
> > Patches 3–4 extend vmap_small_pages_range_noflush() to support page
> > shifts other than PAGE_SHIFT. This allows mapping multiple memory
> > segments for vmalloc() without zigzagging page tables.
> >
> > Patches 5–8 add huge vmap support for contiguous pages. This not only
> > improves performance but also enables PMD or CONT-PTE mapping for the
> > vmapped area, reducing TLB pressure.
> >
> > Many thanks to Xueyuan Chen for his substantial testing efforts
> > on RK3588 boards.
> >
> > On the RK3588 8-core ARM64 SoC, with tasks pinned to CPU2 and
> > the performance CPUfreq policy enabled, Xueyuan’s tests report:
> >
> > * ioremap(1 MB): 1.2× faster
> > * vmalloc(1 MB) mapping time (excluding allocation) with
> >   VM_ALLOW_HUGE_VMAP: 1.5× faster
> > * vmap(): 5.6× faster when memory includes some order-8 pages,
> >   with no regression observed for order-0 pages
> >
> > Barry Song (Xiaomi) (8):
> >   arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
> >     setup
> >   arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
> >     CONT_PTE
> >   mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger
> >     page_shift sizes
> >   mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings
> >   mm/vmalloc: map contiguous pages in batches for vmap() if possible
> >   mm/vmalloc: align vm_area so vmap() can batch mappings
> >   mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable
> >     zigzag
> >   mm/vmalloc: Stop scanning for compound pages after encountering small
> >     pages in vmap
> >
> >  arch/arm64/include/asm/vmalloc.h |   6 +-
> >  arch/arm64/mm/hugetlbpage.c      |  10 ++
> >  mm/vmalloc.c                     | 178 +++++++++++++++++++++++++------
> >  3 files changed, 161 insertions(+), 33 deletions(-)
> >
>
> On Linux VM on Apple M3, running mm-selftests:

Dev, thanks for your report. Sorry for the silly typo;
Xueyuan’s vmalloc/vmap tests don’t trigger that case yet.

It should be fixed by:

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index bf31c11ebd3b..25b9fce1ec6a 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,7 +110,7 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
                contig_ptes = CONT_PTES;
                break;
        default:
-               if (size < CONT_PMD_SIZE && size > 0 &&
+               if (size < PMD_SIZE && size > 0 &&
                                IS_ALIGNED(size, CONT_PTE_SIZE)) {
                        contig_ptes = size >> PAGE_SHIFT;
                        *pgsize = PAGE_SIZE;
@@ -365,7 +365,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
        case CONT_PTE_SIZE:
                return pte_mkcont(entry);
        default:
-               if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
+               if (pagesize < PMD_SIZE && pagesize > 0 &&
                                IS_ALIGNED(pagesize, CONT_PTE_SIZE))
                        return pte_mkcont(entry);

>
>  ./run_vmtests.sh -t "hugetlb"
>
> TAP version 13
> # -----------------------
> # running ./hugepage-mmap
> # -----------------------
> # TAP version 13
> # 1..1
> # # Returned address is 0xffffe7c00000
>
>
>
> [   30.884630] kernel BUG at mm/page_table_check.c:86!
> [   30.884701] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
> [   30.886803] Modules linked in:
> [   30.887217] CPU: 3 UID: 0 PID: 1869 Comm: hugepage-mmap Not tainted 7.0.0-rc5+ #86 PREEMPT
> [   30.888218] Hardware name: linux,dummy-virt (DT)
> [   30.889413] pstate: a1400005 (NzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [   30.889901] pc : page_table_check_clear.part.0+0x128/0x1a0
> [   30.890337] lr : page_table_check_clear.part.0+0x7c/0x1a0
> [   30.890714] sp : ffff800084da3ad0
> [   30.890946] x29: ffff800084da3ad0 x28: 0000000000000001 x27: 0010000000000001
> [   30.891434] x26: 0040000000000040 x25: ffffa06bb8fb9000 x24: 00000000ffffffff
> [   30.891932] x23: 0000000000000001 x22: 0000000000000000 x21: ffffa06bb8997810
> [   30.892514] x20: 0000000000113e39 x19: 0000000000113e38 x18: 0000000000000000
> [   30.893007] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> [   30.893500] x14: ffffa06bb7013780 x13: 0000fffff7f90fff x12: 0000000000000000
> [   30.893990] x11: 1fffe0001a1282c1 x10: ffff0000d094160c x9 : ffffa06bb568a858
> [   30.894479] x8 : ffff5f95c8474000 x7 : 0000000000000000 x6 : ffff00017fffc500
> [   30.894973] x5 : ffff000191208fc0 x4 : 0000000000000000 x3 : 0000000000004000
> [   30.895449] x2 : 0000000000000000 x1 : 00000000ffffffff x0 : ffff0000c071f1b8
> [   30.895875] Call trace:
> [   30.896027]  page_table_check_clear.part.0+0x128/0x1a0 (P)
> [   30.896369]  page_table_check_clear+0xc8/0x138
> [   30.896776]  __page_table_check_ptes_set+0xe4/0x1e8
> [   30.897073]  __set_ptes_anysz+0x2e4/0x308
> [   30.897327]  set_huge_pte_at+0xec/0x210
> [   30.897561]  hugetlb_no_page+0x1ec/0x8e0
> [   30.897807]  hugetlb_fault+0x188/0x740
> [   30.898036]  handle_mm_fault+0x294/0x2c0
> [   30.898283]  do_page_fault+0x120/0x748
> [   30.898539]  do_translation_fault+0x68/0x90
> [   30.898796]  do_mem_abort+0x4c/0xa8
> [   30.899011]  el0_da+0x2c/0x90
> [   30.899205]  el0t_64_sync_handler+0xd0/0xe8
> [   30.899461]  el0t_64_sync+0x198/0x1a0
> [   30.899688] Code: 91001021 b8f80022 51000441 36fffd41 (d4210000)
> [   30.900053] ---[ end trace 0000000000000000 ]---
>
>
>
> The bug is at
>
> BUG_ON(atomic_dec_return(&ptc->file_map_count) < 0);
>
> My tree is mm-unstable, commit 3fa44141e0bb.
>

Thanks
Barry



* Re: [PATCH] arm64: dts: imx93-9x9-qsb: Add tianma,tm050rdh03 panel
From: Frank Li @ 2026-04-08 10:50 UTC
  To: Liu Ying
  Cc: Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, imx, linux-arm-kernel,
	devicetree, linux-kernel
In-Reply-To: <5ce48659-2c6c-4c60-a8e8-9031bdbaa2a3@nxp.com>

On Wed, Apr 08, 2026 at 04:40:37PM +0800, Liu Ying wrote:
> On Wed, Apr 08, 2026 at 04:28:40AM -0400, Frank Li wrote:
...
> >>>>>
> >>>>> Is it possible to apply this overlay file and the kd50g21-40nt-a1 overlay file
> >>>>> to imx93-9x9-qsb.dtb, so that we needn't create a dtsi?
> >>>>
> >>>> I'm sorry, I don't get your question here.
> >>>> Anyway, the DT overlays are needed, because the 40-pin EXP/PRI interface on
> >>>> the i.MX93 9x9 QSB board can not only connect to a DPI panel adapter board
> >>>> but also to an audio hat[2], and maybe more.  The newly introduced .dtsi
> >>>> file just aims to avoid duplicated code.
> >>>
> >>> I mean applying two overlay files to the dtb:
> >>>
> >>> imx93-9x9-qsb-tianma-tm050rdh03-dtbs += imx93-9x9-qsb.dtb imx93-9x9-qsb-ontat-kd50g21-40nt-a1.dtbo imx93-9x9-qsb-tianma-tm050rdh03.dtbo
>
> This ...
>
> >>>
> >>> In imx93-9x9-qsb-tianma-tm050rdh03.dtbo, only include
> >>> &{/} {
> >>> 	panel {
> >>> 		compatible = "tianma,tm050rdh03";
> >>> 		enable-gpios = <&pcal6524 8 GPIO_ACTIVE_HIGH>;
> >>> 	};
> >>> };
> >>
> >> If a user wants to use imx93-9x9-qsb.dtb and the DT overlay blob
> >> imx93-9x9-qsb-tianma-tm050rdh03.dtbo to enable the tianma,tm050rdh03
> >> DPI panel, then it won't work unless the user also applies
> >> imx93-9x9-qsb-ontat-kd50g21-40nt-a1.dtbo, right?
> >>
> >>>
> >
> > Yes, imx93-9x9-qsb-tianma-tm050rdh03.dtb is already created, with both
> > overlay files applied.
>
> .... indicates that imx93-9x9-qsb-tianma-tm050rdh03.dtb is generated by
> applying both imx93-9x9-qsb-ontat-kd50g21-40nt-a1.dtbo and
> imx93-9x9-qsb-tianma-tm050rdh03.dtbo to imx93-9x9-qsb.dtb.
> However, imx93-9x9-qsb-tianma-tm050rdh03.dtbo (a DT overlay blob) just contains
> the panel node, which means that a user _cannot_ enable the tianma,tm050rdh03
> DPI panel by only applying it to imx93-9x9-qsb.dtb, unless the user also
> applies imx93-9x9-qsb-ontat-kd50g21-40nt-a1.dtbo.  That's why the .dtsi
> file is needed.

What's the problem if we require the user to do that? The Makefile already
creates the final imx93-9x9-qsb-tianma-tm050rdh03.dtb.

Does any user really apply a dtso manually without using the kernel's Makefile?

>
> >
> > Can the same board be used for imx91 or other EVK boards?
>
> Yes, both tianma,tm050rdh03 and ontat,kd50g21-40nt-a1 DPI panels can be
> connected to i.MX91/93 11x11 EVK and 9x9 QSB boards.

Is it possible to use one overlay file for all imx91/imx93 boards?

Frank
>
> >
> > Frank
> >
> >>> Frank
> >>>>
> >>>> [2] https://www.nxp.com/design/design-center/development-boards-and-designs/mx93aud-hat-audio-board:MX93AUD-HAT
> >>>>
> >>>>>
> >>>>> Frank
> >>>>>>
> >>>>>> ---
> >>>>>> base-commit: 816f193dd0d95246f208590924dd962b192def78
> >>>>>> change-id: 20260407-tianma-tm050rdh03-imx93-9x9-qsb-6e4bbbde3d08
> >>>>>>
> >>>>>> Best regards,
> >>>>>> --
> >>>>>> Liu Ying <victor.liu@nxp.com>
> >>>>>>
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Liu Ying
> >>
> >> --
> >> Regards,
> >> Liu Ying
>
> --
> Regards,
> Liu Ying



* Re: [PATCH v13 2/7] qcom-tgu: Add TGU driver
From: Konrad Dybcio @ 2026-04-08 10:42 UTC
  To: Songwei Chai, andersson, alexander.shishkin, mike.leach,
	suzuki.poulose, james.clark, krzk+dt, conor+dt
  Cc: linux-kernel, linux-arm-kernel, linux-arm-msm, coresight,
	devicetree, gregkh
In-Reply-To: <20260402092838.341295-3-songwei.chai@oss.qualcomm.com>

On 4/2/26 11:28 AM, Songwei Chai wrote:
> Add driver to support device TGU (Trigger Generation Unit).
> TGU is a Data Engine which can be utilized to sense a plurality of
> signals and create a trigger into the CTI or generate interrupts to
> processors. Add probe/enable/disable functions for tgu.
> 
> Signed-off-by: Songwei Chai <songwei.chai@oss.qualcomm.com>
> ---

Acked-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>

Konrad



* Re: [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
From: Dev Jain @ 2026-04-08 10:32 UTC
  To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
	will, akpm, urezki
  Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
	Xueyuan.chen21
In-Reply-To: <20260408025115.27368-2-baohua@kernel.org>



On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> we can batch CONT_PTE settings instead of handling them individually.
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index a42c05cf5640..bf31c11ebd3b 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>  		contig_ptes = CONT_PTES;
>  		break;
>  	default:
> +		if (size < CONT_PMD_SIZE && size > 0 &&
> +				IS_ALIGNED(size, CONT_PTE_SIZE)) {

Nit: Having the lower bound check before the upper bound reads more
naturally, so this should be size > 0 && size < CONT_PMD_SIZE (i.e.,
written the other way around).

Also, the IS_ALIGNED() continuation line should be aligned below size.


> +			contig_ptes = size >> PAGE_SHIFT;
> +			*pgsize = PAGE_SIZE;
> +			break;
> +		}
>  		WARN_ON(!__hugetlb_valid_size(size));
>  	}
>  
> @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
>  	case CONT_PTE_SIZE:
>  		return pte_mkcont(entry);
>  	default:
> +		if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> +				IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> +			return pte_mkcont(entry);
> +
>  		break;
>  	}
>  	pr_warn("%s: unrecognized huge page size 0x%lx\n",

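To make the effect of the default-branch change easier to check, here is a minimal userspace model of num_contig_ptes() with the patch applied and the condition reordered as suggested in the nit. It is a sketch under stated assumptions, not the kernel function: it assumes a 4K granule (PAGE_SIZE 4K, CONT_PTE_SIZE 64K, PMD_SIZE 2M, CONT_PMD_SIZE 32M), omits the PUD_SIZE case, replaces WARN_ON() with a comment, and model_num_contig_ptes() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed arm64 geometry with a 4K granule; illustrative values,
 * not taken from the patch.
 */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)		/* 4K */
#define CONT_PTES	16
#define CONT_PTE_SIZE	(CONT_PTES * PAGE_SIZE)		/* 64K */
#define PMD_SIZE	(1UL << 21)			/* 2M */
#define CONT_PMDS	16
#define CONT_PMD_SIZE	(CONT_PMDS * PMD_SIZE)		/* 32M */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * Hypothetical userspace model of num_contig_ptes(): returns how many
 * entries map 'size' and stores the size each entry covers in *pgsize.
 */
static int model_num_contig_ptes(unsigned long size, size_t *pgsize)
{
	int contig_ptes = 1;

	assert(pgsize != NULL);
	*pgsize = size;

	switch (size) {
	case PMD_SIZE:
		contig_ptes = 1;
		break;
	case CONT_PMD_SIZE:
		*pgsize = PMD_SIZE;
		contig_ptes = CONT_PMDS;
		break;
	case CONT_PTE_SIZE:
		*pgsize = PAGE_SIZE;
		contig_ptes = CONT_PTES;
		break;
	default:
		/* Reviewed ordering: lower bound, upper bound, then alignment. */
		if (size > 0 && size < CONT_PMD_SIZE &&
		    IS_ALIGNED(size, CONT_PTE_SIZE)) {
			/* Map the range as size / PAGE_SIZE base-page PTEs,
			 * i.e. several back-to-back CONT_PTE blocks. */
			contig_ptes = size >> PAGE_SHIFT;
			*pgsize = PAGE_SIZE;
			break;
		}
		/* The kernel would WARN_ON(!__hugetlb_valid_size(size)) here. */
		break;
	}

	return contig_ptes;
}
```

Under these assumptions a 256K request falls through to the default branch and yields 64 base-page entries, while a 68K request (not a multiple of CONT_PTE_SIZE) still takes the warning path and keeps a single entry.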



* Re: [PATCH 2/3] KVM: arm64: vgic: Allow userspace to set IIDR revision 1
From: David Woodhouse @ 2026-04-08 10:32 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Paolo Bonzini, Shuah Khan,
	Raghavendra Rao Ananta, Eric Auger, Kees Cook, Arnd Bergmann,
	Nathan Chancellor, linux-arm-kernel, kvmarm, linux-kernel, kvm,
	linux-kselftest
In-Reply-To: <87wlyhc2g4.wl-maz@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 985 bytes --]

On Wed, 2026-04-08 at 08:54 +0100, Marc Zyngier wrote:
> 
> > @@ -93,6 +95,9 @@ static int vgic_mmio_uaccess_write_v2_misc(struct kvm_vcpu *vcpu,
> >   		 */
> >   		reg = FIELD_GET(GICD_IIDR_REVISION_MASK, val);
> >   		switch (reg) {
> > +		case KVM_VGIC_IMP_REV_1:
> > +			dist->implementation_rev = reg;
> > +			return 0;
> >   		case KVM_VGIC_IMP_REV_2:
> >   		case KVM_VGIC_IMP_REV_3:
> >   			vcpu->kvm->arch.vgic.v2_groups_user_writable = true;
> 
> nit: move the v1 handling down with a fallthrough in v2/v3 so that we
> don't duplicate the basic handling:

I think I actually want to rip out the v2_groups_user_writable flag
completely.

It was specifically added in order to allow the actual behaviour to be
inconsistent with the value in the IIDR.

But it doesn't actually stop the *guest* from writing; it only stops userspace.

I'll rip it out in a fourth patch when I resend the series (having
addressed your other comments).


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]


