[PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Linux-ARM-Kernel Archive on lore.kernel.org

* [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
@ 2026-06-18  8:47 Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

This patchset accelerates ioremap, vmalloc, and vmap when the memory
is physically fully or partially contiguous. Two techniques are used:

1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
   segments
2. Use batched mappings wherever possible in both vmalloc and ARM64
   layers

Besides accelerating the mapping path, this also enables large
mappings (PMD and cont-PTE) for vmap, which are currently not
supported.

Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
CONT-PTE regions instead of just one.

Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
mapping logic between the ioremap and vmalloc/vmap paths, handling both
CONT_PTE and regular PTE mappings. This prepares for the next patch.

Patch 4 extends the page table walk path to support page shifts other
than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
mappings. The function is renamed from vmap_small_pages_range_noflush()
to vmap_pages_range_noflush_walk().

Patches 5-6 add huge vmap support for contiguous pages, including
support for non-compound pages with pfn alignment verification.

On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
the performance CPUfreq policy enabled, benchmark results:

* ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
* vmalloc(1 MB) mapping time (excluding allocation) with
  VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53 us)
* vmap(100 MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)

Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.

Changes since v3:
- Squash vmap_pte_range() loop variable fix into patch 4 (patch 3, 4)
- Use shift >= PMD_SHIFT and fix *nr increment in
  vmap_pages_pmd_range() (patch 4)
- Pass page_shift directly without capping at PMD_SHIFT (patch 4, 5)
- Add vm_shift() helper and pass pgprot_t to get_vmap_batch_order()
  (patch 5)
- Use min(order, __ffs(pfn)) for graceful pfn alignment degradation,
  replacing IS_ALIGNED check (patch 5)
- Remove irrelevant ioremap_max_page_shift early-exit (patch 5)
- Add __get_vm_area_node_aligned_caller() wrapper, rename to
  vmap_get_aligned_vm_area() (patch 6)

Changes since v2:
- Use __fls instead of fls in arch_vmap_pte_range_map_size (patch 2)
- Add WARN_ON checks in vmap_pages_pmd_range (patch 4)
- Fix flush_cache_vmap to use saved start address instead of the
  already-advanced addr (patch 5)
- Rename __vmap_huge() to vmap_batched() (patch 5)
- Add caller parameter and unroll while(1) loop (patch 5)
- Squash patch 7 into patch 5 (stop scanning for compound pages after
  encountering small pages)

Changes since v1:
- Fix condition order and use PMD_SIZE instead of CONT_PMD_SIZE in
  patch 1 (Dev Jain)
- Squash patch 3+4 and patch 5+7 (Dev Jain)
- Replace "zigzag" with "page table rewalk" in commit messages
  (Dev Jain)
- Rename vmap_small_pages_range_noflush() to
  vmap_pages_range_noflush_walk() (Dev Jain)
- Extract vmap_set_ptes() as a new patch to consolidate PTE mapping
  logic between vmap_pte_range() and vmap_pages_pte_range(), handling
  both CONT_PTE and regular mappings (Mike Rapoport)
- Support non-compound pages in get_vmap_batch_order() by falling
  back to physical contiguity scanning with pfn alignment check
  (Dev Jain, Uladzislau Rezki)
- In get_vmap_batch_order(), filter out orders that the architecture
  cannot batch by checking arch_vmap_pte_supported_shift() directly.
  This avoids overhead for orders 1-3 on ARM64 CONT_PTE with 4K
  pages. (patch 5)

Barry Song (Xiaomi) (5):
  arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
    setup
  arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
    CONT_PTE
  mm/vmalloc: Extend page table walk to support larger page_shift sizes
    and eliminate page table rewalk
  mm/vmalloc: map contiguous pages in batches for vmap() if possible
  mm/vmalloc: align vm_area so vmap() can batch mappings

Wen Jiang (1):
  mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic

 arch/arm64/include/asm/vmalloc.h |   6 +-
 arch/arm64/mm/hugetlbpage.c      |  10 ++
 mm/vmalloc.c                     | 247 +++++++++++++++++++++++++------
 3 files changed, 213 insertions(+), 50 deletions(-)

-- 
2.34.1




* [PATCH v4 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
@ 2026-06-18  8:47 ` Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

For sizes that are a multiple of CONT_PTE_SIZE and smaller than PMD_SIZE,
we can set up several CONT_PTE groups in a single call.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf56408..c4d8b226126cb 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
 		contig_ptes = CONT_PTES;
 		break;
 	default:
+		if (size > 0 && size < PMD_SIZE &&
+				IS_ALIGNED(size, CONT_PTE_SIZE)) {
+			contig_ptes = size >> PAGE_SHIFT;
+			*pgsize = PAGE_SIZE;
+			break;
+		}
 		WARN_ON(!__hugetlb_valid_size(size));
 	}
 
@@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
 	case CONT_PTE_SIZE:
 		return pte_mkcont(entry);
 	default:
+		if (pagesize > 0 && pagesize < PMD_SIZE &&
+				IS_ALIGNED(pagesize, CONT_PTE_SIZE))
+			return pte_mkcont(entry);
+
 		break;
 	}
 	pr_warn("%s: unrecognized huge page size 0x%lx\n",
-- 
2.34.1
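
The size test added to both default cases above can be tried out in
userspace. This is only a minimal sketch, assuming the arm64 4K-granule
geometry (16 PTEs per contiguous group, 2 MiB PMD); the helper names
is_multi_cont_pte_size() and multi_cont_pte_count() are hypothetical
and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed arm64 4K-granule geometry; other granules use other values. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define CONT_PTES	16
#define CONT_PTE_SIZE	(CONT_PTES * PAGE_SIZE)	/* 64 KiB */
#define PMD_SIZE	(1UL << 21)		/* 2 MiB */

/* Hypothetical helper: same condition as the new default-case check. */
static bool is_multi_cont_pte_size(unsigned long size)
{
	return size > 0 && size < PMD_SIZE && (size % CONT_PTE_SIZE) == 0;
}

/*
 * Hypothetical helper: number of PTEs covered by a qualifying size,
 * or 0 when the size does not qualify.
 */
static unsigned long multi_cont_pte_count(unsigned long size)
{
	unsigned long nr;

	if (!is_multi_cont_pte_size(size))
		return 0;

	nr = size >> PAGE_SHIFT;
	/* A qualifying size always covers whole contiguous groups. */
	assert(nr % CONT_PTES == 0);
	return nr;
}
```

Under these assumptions a 128 KiB size covers 32 PTEs (two contiguous
groups), while 96 KiB and PMD_SIZE itself are rejected by the check.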




* [PATCH v4 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
@ 2026-06-18  8:47 ` Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

Allow arch_vmap_pte_range_map_size to batch across multiple CONT_PTE
blocks, reducing both PTE setup and TLB flush iterations.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 arch/arm64/include/asm/vmalloc.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 4ec1acd3c1b34..787fd17b48e2c 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,8 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
 						unsigned long end, u64 pfn,
 						unsigned int max_page_shift)
 {
+	unsigned long size;
+
 	/*
 	 * If the block is at least CONT_PTE_SIZE in size, and is naturally
 	 * aligned in both virtual and physical space, then we can pte-map the
@@ -40,7 +42,9 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
 	if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
 		return PAGE_SIZE;
 
-	return CONT_PTE_SIZE;
+	size = min3(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
+	size = 1UL << __fls(size);
+	return size;
 }
 
 #define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
-- 
2.34.1
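
The new return value can be worked through in userspace. The sketch
below covers only the final size computation, not the alignment checks
that precede it, and assumes a 4K granule with a 2 MiB PMD.
batch_map_size(), min3_ul() and rounddown_pow2() are hypothetical
stand-ins for the kernel's min3() and 1UL << __fls().

```c
#include <assert.h>

/* Assumed 4K-granule geometry. */
#define PMD_SHIFT	21
#define PMD_SIZE	(1UL << PMD_SHIFT)

static unsigned long min3_ul(unsigned long a, unsigned long b,
			     unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/* Round down to a power of two, like 1UL << __fls(x) for x != 0. */
static unsigned long rounddown_pow2(unsigned long x)
{
	assert(x != 0);
	/* Clear the lowest set bit until only the highest one remains. */
	while (x & (x - 1))
		x &= x - 1;
	return x;
}

/*
 * Hypothetical helper: batch size for a range that already passed the
 * CONT_PTE alignment checks. The result is capped below PMD_SIZE.
 */
static unsigned long batch_map_size(unsigned long addr, unsigned long end,
				    unsigned int max_page_shift)
{
	unsigned long size;

	size = min3_ul(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
	return rounddown_pow2(size);
}
```

With these values, a 192 KiB range and max_page_shift = PMD_SHIFT gives
128 KiB, and a 4 MiB range gives 1 MiB, that is PMD_SIZE >> 1.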




* [PATCH v4 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
@ 2026-06-18  8:47 ` Wen Jiang
  2026-06-26 16:21   ` Uladzislau Rezki
  2026-06-18  8:47 ` [PATCH v4 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

Extract the common PTE mapping logic from vmap_pte_range() into a
shared helper vmap_set_ptes(). This handles both CONT_PTE and regular
PTE mappings in a single function, preparing for the next patch which
will extend vmap_pages_pte_range() to also use this helper.

The #ifdef CONFIG_HUGETLB_PAGE guard is moved inside vmap_set_ptes(),
so callers no longer need to handle the conditional compilation.

No functional change.

Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2c2f74a07f396..6660f240d27c9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -91,6 +91,35 @@ struct vfree_deferred {
 static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
 
 /*** Page table manipulation functions ***/
+
+/*
+ * Set PTE mappings for the given PFN. Try CONT_PTE mappings first when
+ * supported, otherwise fall back to PAGE_SIZE mappings.
+ *
+ * Return: mapping size.
+ */
+static __always_inline unsigned long vmap_set_ptes(pte_t *pte,
+		unsigned long addr, unsigned long end, u64 pfn,
+		pgprot_t prot, unsigned int max_page_shift)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	if (max_page_shift > PAGE_SHIFT) {
+		unsigned long size;
+
+		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
+		if (size != PAGE_SIZE) {
+			pte_t entry = pfn_pte(pfn, prot);
+
+			entry = arch_make_huge_pte(entry, ilog2(size), 0);
+			set_huge_pte_at(&init_mm, addr, pte, entry, size);
+			return size;
+		}
+	}
+#endif
+	set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+	return PAGE_SIZE;
+}
+
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			phys_addr_t phys_addr, pgprot_t prot,
 			unsigned int max_page_shift, pgtbl_mod_mask *mask)
@@ -119,19 +148,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			BUG();
 		}
 
-#ifdef CONFIG_HUGETLB_PAGE
-		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
-		if (size != PAGE_SIZE) {
-			pte_t entry = pfn_pte(pfn, prot);
-
-			entry = arch_make_huge_pte(entry, ilog2(size), 0);
-			set_huge_pte_at(&init_mm, addr, pte, entry, size);
-			pfn += PFN_DOWN(size);
-			continue;
-		}
-#endif
-		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-		pfn++;
+		size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
+		pfn += PFN_DOWN(size);
 	} while (pte += PFN_DOWN(size), addr += size, addr != end);
 
 	lazy_mmu_mode_disable();
-- 
2.34.1
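
To see how the returned mapping size drives the loop in
vmap_pte_range(), here is a rough userspace model. It assumes a 4K
granule with 64 KiB CONT_PTE blocks; pick_size() is a hypothetical and
simplified stand-in for what vmap_set_ptes() decides, and no page
tables are touched.

```c
#include <assert.h>

/* Assumed arm64 4K-granule geometry. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define CONT_PTE_SIZE	(16 * PAGE_SIZE)	/* 64 KiB */

/* Hypothetical: CONT_PTE_SIZE if the block fits and is aligned. */
static unsigned long pick_size(unsigned long addr, unsigned long end,
			       unsigned long pfn)
{
	if (end - addr < CONT_PTE_SIZE)
		return PAGE_SIZE;
	if (addr % CONT_PTE_SIZE)
		return PAGE_SIZE;
	if ((pfn << PAGE_SHIFT) % CONT_PTE_SIZE)
		return PAGE_SIZE;
	return CONT_PTE_SIZE;
}

/* Count loop iterations needed to cover [addr, end). */
static unsigned int count_map_steps(unsigned long addr, unsigned long end,
				    unsigned long pfn)
{
	unsigned int iterations = 0;

	assert(addr < end && (end - addr) % PAGE_SIZE == 0);
	do {
		unsigned long size = pick_size(addr, end, pfn);

		/* Advance pfn and addr by the size actually mapped. */
		pfn += size >> PAGE_SHIFT;
		addr += size;
		iterations++;
	} while (addr != end);

	return iterations;
}
```

In this model an aligned 80 KiB range takes 5 iterations (one 64 KiB
block plus four single pages) instead of 20 page-sized steps.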




* [PATCH v4 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (2 preceding siblings ...)
  2026-06-18  8:47 ` [PATCH v4 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
@ 2026-06-18  8:47 ` Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
provides a clean interface by taking struct page **pages and mapping them
via direct PTE iteration. This avoids the page table rewalk seen when
using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.

Extend it to support larger page_shift values, and add PMD- and
contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
since it now handles more than just small pages.

For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
iterate over pages one by one via vmap_range_noflush(), which would
otherwise lead to page table rewalk. The code is now unified with the
PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 81 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6660f240d27c9..253e017130e09 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -127,7 +127,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	u64 pfn;
 	struct page *page;
-	unsigned long size = PAGE_SIZE;
+	unsigned long size;
+	unsigned int steps;
 
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(end - addr)))
 		return -EINVAL;
@@ -149,8 +150,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		}
 
 		size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
-		pfn += PFN_DOWN(size);
-	} while (pte += PFN_DOWN(size), addr += size, addr != end);
+		steps = PFN_DOWN(size);
+	} while (pte += steps, pfn += steps, addr += size, addr != end);
 
 	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
@@ -542,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
+	unsigned long pfn, size;
+	unsigned int steps;
 	int err = 0;
 	pte_t *pte;
 
@@ -574,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			break;
 		}
 
-		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
-		(*nr)++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+		pfn = page_to_pfn(page);
+		size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
+		steps = PFN_DOWN(size);
+	} while (pte += steps, *nr += steps, addr += size, addr != end);
 
 	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
@@ -586,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 
 static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -596,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
+
+		if (shift >= PMD_SHIFT) {
+			struct page *page = pages[*nr];
+			phys_addr_t phys_addr;
+
+			if (WARN_ON(!page))
+				return -ENOMEM;
+			if (WARN_ON(!pfn_valid(page_to_pfn(page))))
+				return -EINVAL;
+
+			phys_addr = page_to_phys(page);
+
+			if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
+						shift)) {
+				*mask |= PGTBL_PMD_MODIFIED;
+				*nr += 1 << (PMD_SHIFT - PAGE_SHIFT);
+				continue;
+			}
+		}
+
+		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
@@ -604,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
 
 static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -614,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
@@ -622,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
 
 static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -632,14 +656,18 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
 }
 
-static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages)
+/*
+ * It can take an array of pages which are not all contiguous, but it
+ * may have contiguous chunks, as hinted by @shift.
+ */
+static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int shift)
 {
 	unsigned long start = addr;
 	pgd_t *pgd;
@@ -654,7 +682,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 		next = pgd_addr_end(addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
-		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
@@ -677,27 +705,12 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 		pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
-	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
 	WARN_ON(page_shift < PAGE_SHIFT);
 
-	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
-			page_shift == PAGE_SHIFT)
-		return vmap_small_pages_range_noflush(addr, end, prot, pages);
+	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
+		page_shift = PAGE_SHIFT;
 
-	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
-		int err;
-
-		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
-					page_to_phys(pages[i]), prot,
-					page_shift);
-		if (err)
-			return err;
-
-		addr += 1UL << page_shift;
-	}
-
-	return 0;
+	return vmap_pages_range_noflush_walk(addr, end, prot, pages, page_shift);
 }
 
 int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
-- 
2.34.1




* [PATCH v4 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (3 preceding siblings ...)
  2026-06-18  8:47 ` [PATCH v4 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
@ 2026-06-18  8:47 ` Wen Jiang
  2026-06-18  8:47 ` [PATCH v4 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

In many cases, the pages passed to vmap() may include high-order
pages. For example, the system heap often allocates pages in descending
order: order 8, then 4, then 0. Currently, vmap() iterates over every
page individually; even pages inside a high-order block are handled
one by one.

This patch detects physically contiguous pages (regardless of whether
they are compound or non-compound) by scanning with
num_pages_contiguous(), and maps them as a single contiguous block
whenever possible. The mapping order is determined by taking the
minimum of the contiguous page count and the pfn alignment, allowing
graceful degradation when pfn alignment is less than the contiguous
range.

Pages with the same page_shift are coalesced and mapped via
vmap_pages_range_noflush_walk() to avoid page table rewalk.

As users typically allocate memory in descending orders (e.g.
8 → 4 → 0), once an order-0 page is encountered, we stop scanning
for contiguous pages since subsequent pages are likely order-0 as well.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 253e017130e09..fffb885cb2158 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3545,6 +3545,89 @@ void vunmap(const void *addr)
 }
 EXPORT_SYMBOL(vunmap);
 
+static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
+{
+	if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
+		return PMD_SHIFT;
+
+	return arch_vmap_pte_supported_shift(size);
+}
+
+static inline int get_vmap_batch_order(struct page **pages,
+		pgprot_t prot, unsigned int max_steps, unsigned int idx)
+{
+	unsigned int nr_contig;
+	int order;
+
+	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
+		return 0;
+
+	nr_contig = num_pages_contiguous(&pages[idx], max_steps);
+	if (nr_contig < 2)
+		return 0;
+
+	order = ilog2(nr_contig);
+
+	/* Limit order by pfn alignment */
+	order = min_t(int, order, __ffs(page_to_pfn(pages[idx])));
+
+	if (vm_shift(prot, PAGE_SIZE << order) == PAGE_SHIFT)
+		return 0;
+
+	return order;
+}
+
+static int vmap_batched(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages)
+{
+	unsigned int count = (end - addr) >> PAGE_SHIFT;
+	unsigned int prev_shift = 0, idx = 0;
+	unsigned long start = addr, map_addr = addr;
+	int err;
+
+	err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
+						PAGE_SHIFT, GFP_KERNEL);
+	if (err)
+		goto out;
+
+	for (unsigned int i = 0; i < count; ) {
+		unsigned int shift = PAGE_SHIFT +
+			get_vmap_batch_order(pages, prot, count - i, i);
+
+		if (!i)
+			prev_shift = shift;
+
+		if (shift != prev_shift) {
+			err = vmap_pages_range_noflush_walk(map_addr, addr,
+					prot, pages + idx, prev_shift);
+			if (err)
+				goto out;
+			prev_shift = shift;
+			map_addr = addr;
+			idx = i;
+		}
+
+		/*
+		 * Once small pages are encountered, the remaining pages
+		 * are likely small as well.
+		 */
+		if (shift == PAGE_SHIFT)
+			break;
+
+		addr += 1UL << shift;
+		i += 1U << (shift - PAGE_SHIFT);
+	}
+
+	/* Remaining */
+	if (map_addr < end)
+		err = vmap_pages_range_noflush_walk(map_addr, end,
+				prot, pages + idx, prev_shift);
+
+out:
+	flush_cache_vmap(start, end);
+	return err;
+}
+
 /**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
@@ -3588,8 +3671,8 @@ void *vmap(struct page **pages, unsigned int count,
 		return NULL;
 
 	addr = (unsigned long)area->addr;
-	if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
-				pages, PAGE_SHIFT) < 0) {
+	if (vmap_batched(addr, addr + size, pgprot_nx(prot),
+				pages) < 0) {
 		vunmap(area->addr);
 		return NULL;
 	}
-- 
2.34.1
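
The order selection in get_vmap_batch_order() can be illustrated in
userspace. This sketch models only the min(ilog2(nr_contig),
__ffs(pfn)) step and leaves out the vm_shift() filter for
architecture support; batch_order(), ilog2_u() and lowest_set_bit()
are hypothetical stand-ins for the kernel helpers.

```c
#include <assert.h>

/* Floor of log2(x), like the kernel's ilog2() for x != 0. */
static unsigned int ilog2_u(unsigned int x)
{
	unsigned int r = 0;

	assert(x != 0);
	while (x >>= 1)
		r++;
	return r;
}

/* Index of the lowest set bit, like the kernel's __ffs() for x != 0. */
static unsigned int lowest_set_bit(unsigned long x)
{
	unsigned int r = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		r++;
	}
	return r;
}

/*
 * Hypothetical helper: mapping order for nr_contig physically
 * contiguous pages starting at pfn, limited by the pfn alignment.
 */
static unsigned int batch_order(unsigned long pfn, unsigned int nr_contig)
{
	unsigned int order, align;

	if (nr_contig < 2)
		return 0;

	order = ilog2_u(nr_contig);
	align = lowest_set_bit(pfn);
	return order < align ? order : align;
}
```

For 256 contiguous pages, a pfn of 0x1000 keeps order 8, a pfn of
0x1010 degrades to order 4, and an odd pfn falls back to order 0.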




* [PATCH v4 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (4 preceding siblings ...)
  2026-06-18  8:47 ` [PATCH v4 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
@ 2026-06-18  8:47 ` Wen Jiang
  2026-06-26 16:20   ` Uladzislau Rezki
  2026-06-25  2:57 ` [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Wen Jiang @ 2026-06-18  8:47 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

Try to align the vmap virtual address to PMD_SHIFT or a
larger PTE mapping size hinted by the architecture, so
contiguous pages can be batch-mapped when setting PMD or
PTE entries.

Add __get_vm_area_node_aligned_caller() as a wrapper over
__get_vm_area_node() to simplify repeated calls with fixed
arguments.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fffb885cb2158..bc9fa93e2bdc6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3628,6 +3628,41 @@ static int vmap_batched(unsigned long addr, unsigned long end,
 	return err;
 }
 
+static struct vm_struct *__get_vm_area_node_aligned_caller(unsigned long size,
+		unsigned long align, unsigned long flags, const void *caller)
+{
+	return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+			VMALLOC_START, VMALLOC_END,
+			NUMA_NO_NODE, GFP_KERNEL, caller);
+}
+
+static struct vm_struct *vmap_get_aligned_vm_area(unsigned long size,
+		unsigned long flags, const void *caller)
+{
+	struct vm_struct *vm_area;
+	unsigned int shift;
+
+	/* Try PMD alignment for large sizes */
+	if (size >= PMD_SIZE) {
+		vm_area = __get_vm_area_node_aligned_caller(size, PMD_SIZE,
+				flags, caller);
+		if (vm_area)
+			return vm_area;
+	}
+
+	/* Try CONT_PTE alignment */
+	shift = arch_vmap_pte_supported_shift(size);
+	if (shift > PAGE_SHIFT) {
+		vm_area = __get_vm_area_node_aligned_caller(size, 1UL << shift,
+				flags, caller);
+		if (vm_area)
+			return vm_area;
+	}
+
+	/* Fall back to page alignment */
+	return __get_vm_area_node_aligned_caller(size, PAGE_SIZE, flags, caller);
+}
+
 /**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
@@ -3666,7 +3701,7 @@ void *vmap(struct page **pages, unsigned int count,
 		return NULL;
 
 	size = (unsigned long)count << PAGE_SHIFT;
-	area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+	area = vmap_get_aligned_vm_area(size, flags, __builtin_return_address(0));
 	if (!area)
 		return NULL;
 
-- 
2.34.1




* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (5 preceding siblings ...)
  2026-06-18  8:47 ` [PATCH v4 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
@ 2026-06-25  2:57 ` Andrew Morton
  2026-06-25  6:37 ` Dev Jain
  2026-06-26 15:12 ` Leo Yan
  8 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, urezki, baohua,
	Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

On Thu, 18 Jun 2026 16:47:20 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:

> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:

Thanks.

> 1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
>    segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
>    layers
> 
> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.
> 
> Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
> 
> Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
> mapping logic between the ioremap and vmalloc/vmap paths, handling both
> CONT_PTE and regular PTE mappings. This prepares for the next patch.
> 
> Patch 4 extends the page table walk path to support page shifts other
> than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
> mappings. The function is renamed from vmap_small_pages_range_noflush()
> to vmap_pages_range_noflush_walk().
> 
> Patches 5-6 add huge vmap support for contiguous pages, including
> support for non-compound pages with pfn alignment verification.
> 
> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> the performance CPUfreq policy enabled, benchmark results:
> 
> * ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
> * vmalloc(1 MB) mapping time (excluding allocation) with
>   VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53us)
> * vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)

Nice.

> Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.

Indeed.

I see Dev had a good look at v3 - hopefully he (and Ulad) (and more ARM
folks) have time to go through this.

Is there any effect on anything other than arm64?  I'm wondering how
much testing these changes will really get in mm.git and linux-next.

How is our selftests coverage of these changes?  Is there some existing
selftest which will exercise these new features?

You diligently went through the Sashiko report against v3 (thanks). 
Please pass an eye across its v4 report, see if something new popped
up?
	https://sashiko.dev/#/patchset/20260618084726.1070022-1-jiangwen6@xiaomi.com




* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (6 preceding siblings ...)
  2026-06-25  2:57 ` [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
@ 2026-06-25  6:37 ` Dev Jain
  2026-06-26 11:09   ` Barry Song
  2026-06-26 15:12 ` Leo Yan
  8 siblings, 1 reply; 13+ messages in thread
From: Dev Jain @ 2026-06-25  6:37 UTC (permalink / raw)
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang,
	Ard Biesheuvel



On 18/06/26 2:17 pm, Wen Jiang wrote:
> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:
> 
> 1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
>    segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
>    layers
> 
> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.
> 
> Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
> 
> Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
> mapping logic between the ioremap and vmalloc/vmap paths, handling both
> CONT_PTE and regular PTE mappings. This prepares for the next patch.
> 
> Patch 4 extends the page table walk path to support page shifts other
> than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
> mappings. The function is renamed from vmap_small_pages_range_noflush()
> to vmap_pages_range_noflush_walk().
> 
> Patches 5-6 add huge vmap support for contiguous pages, including
> support for non-compound pages with pfn alignment verification.
> 
> On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> the performance CPUfreq policy enabled, benchmark results:
> 
> * ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
> * vmalloc(1 MB) mapping time (excluding allocation) with
>   VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53us)
> * vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)
> 
> Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.
> 

I am still a little nervous about doing vmap-huge by default.

We can play set_memory_* games on a vmap huge mapping partially, thus
forcing a pgtable split, and not all arches can handle a kernel pgtable
split.

For arm64, we can handle that with BBML2_NOABORT, but interestingly, in
change_memory_common, arch/arm64/mm/pageattr.c:

	area = find_vm_area((void *)addr);
	if (!area ||
	    ((unsigned long)kasan_reset_tag((void *)end) >
	     (unsigned long)kasan_reset_tag(area->addr) + area->size) ||
	    ((area->flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) != VM_ALLOC))
		return -EINVAL;

Even before my change fcf8dda8cc48, we were bailing out on

!(area->flags & VM_ALLOC))

So on arm64 we haven't been supporting set_memory_* for vmap memory at all, because
it has VM_MAP set and not VM_ALLOC. Although we have a contradictory comment above
this code so not sure if this was intentional:

"Let's restrict ourselves to mappings created by vmalloc (or vmap)."


So either there is no user in the kernel doing vmap + set_memory_* (looks like it
by doing an LLM scan), or it is not fatal for set_memory_* to fail.

But even if no one does it now, technically the API allows it.
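
The effect of the flags test quoted above can be modelled in user space. The following is a minimal sketch, not kernel code: the flag values are stand-ins assumed to mirror include/linux/vmalloc.h, and set_memory_allowed() and check_cases() are hypothetical helpers that reproduce only the flags part of the condition in change_memory_common().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values, assumed to match include/linux/vmalloc.h. */
#define VM_ALLOC            0x00000002UL  /* area created by vmalloc() */
#define VM_MAP              0x00000004UL  /* area created by vmap() */
#define VM_ALLOW_HUGE_VMAP  0x00000400UL  /* huge mappings permitted */

/*
 * Hypothetical helper: mirrors only the flags test quoted above.
 * Returns true when the area passes the test, false when
 * change_memory_common() would return -EINVAL.
 */
bool set_memory_allowed(unsigned long area_flags)
{
	return (area_flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) == VM_ALLOC;
}

/* Self-check of the three cases discussed in this mail. */
void check_cases(void)
{
	/* Plain vmalloc() area: accepted. */
	assert(set_memory_allowed(VM_ALLOC));
	/* vmalloc() area that may use huge mappings: rejected. */
	assert(!set_memory_allowed(VM_ALLOC | VM_ALLOW_HUGE_VMAP));
	/* vmap() area, which carries VM_MAP and not VM_ALLOC: rejected. */
	assert(!set_memory_allowed(VM_MAP));
}
```

Under these assumptions only a plain VM_ALLOC area passes, which matches the observation that set_memory_* on vmap memory is already refused on arm64.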

> 




* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-06-25  6:37 ` Dev Jain
@ 2026-06-26 11:09   ` Barry Song
  0 siblings, 0 replies; 13+ messages in thread
From: Barry Song @ 2026-06-26 11:09 UTC (permalink / raw)
  To: Dev Jain
  Cc: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang,
	Ard Biesheuvel

On Thu, Jun 25, 2026 at 2:37 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 18/06/26 2:17 pm, Wen Jiang wrote:
> > This patchset accelerates ioremap, vmalloc, and vmap when the memory
> > is physically fully or partially contiguous. Two techniques are used:
> >
> > 1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
> >    segments
> > 2. Use batched mappings wherever possible in both vmalloc and ARM64
> >    layers
> >
> > Besides accelerating the mapping path, this also enables large
> > mappings (PMD and cont-PTE) for vmap, which are currently not
> > supported.
> >
> > Patches 1-2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> > CONT-PTE regions instead of just one.
> >
> > Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
> > mapping logic between the ioremap and vmalloc/vmap paths, handling both
> > CONT_PTE and regular PTE mappings. This prepares for the next patch.
> >
> > Patch 4 extends the page table walk path to support page shifts other
> > than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
> > mappings. The function is renamed from vmap_small_pages_range_noflush()
> > to vmap_pages_range_noflush_walk().
> >
> > Patches 5-6 add huge vmap support for contiguous pages, including
> > support for non-compound pages with pfn alignment verification.
> >
> > On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
> > the performance CPUfreq policy enabled, benchmark results:
> >
> > * ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
> > * vmalloc(1 MB) mapping time (excluding allocation) with
> >   VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53us)
> > * vmap(100MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)
> >
> > Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.
> >
>
> I am still a little nervous about doing vmap-huge by default.
>
> We can play set_memory_* games on a vmap huge mapping partially, thus
> forcing a pgtable split, and not all arches can handle a kernel pgtable
> split.
>
> For arm64, we can handle that with BBML2_NOABORT, but interestingly, in
> change_memory_common, arch/arm64/mm/pageattr.c:
>
>         area = find_vm_area((void *)addr);
>         if (!area ||
>             ((unsigned long)kasan_reset_tag((void *)end) >
>              (unsigned long)kasan_reset_tag(area->addr) + area->size) ||
>             ((area->flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) != VM_ALLOC))
>                 return -EINVAL;
>
> Even before my change fcf8dda8cc48, we were bailing out on
>
> !(area->flags & VM_ALLOC))
>
> So on arm64 we haven't been supporting set_memory_* for vmap memory at all, because
> it has VM_MAP set and not VM_ALLOC. Although we have a contradictory comment above
> this code so not sure if this was intentional:
>
> "Let's restrict ourselves to mappings created by vmalloc (or vmap)."
>
>
> So either there is no user in the kernel doing vmap + set_memory_* (looks like it
> by doing an LLM scan), or it is not fatal for set_memory_* to fail.

Hi Dev,

The primary purpose of vmap() is to provide the CPU with a
virtual mapping to access memory used by device drivers,
followed by the appropriate cache synchronization with the
device when necessary.

Given that, I think it's technically quite questionable to use
set_memory_xxx() to change the page table attributes of a vmap
area, especially for only part of an existing vmap mapping.

>
> But even if no one does it now, technically the API allows it.

In case we ever run into the rather subtle case where someone
calls set_memory_xxx() on a vmap() mapping, are you
suggesting that VM_ALLOW_HUGE_VMAP should also apply to
vmap(), rather than only vmalloc()?
Something like the concept below?

index 14e5a6f6cc76..204770474c60 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3559,13 +3559,15 @@ static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
 }

 static inline int get_vmap_batch_order(struct page **pages,
-               pgprot_t prot, unsigned int max_steps, unsigned int idx)
+               unsigned long flags, pgprot_t prot, unsigned int max_steps, unsigned int idx)
 {
        unsigned int nr_contig;
        int order;

        if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
                return 0;
+       if (!(flags & VM_ALLOW_HUGE_VMAP))
+               return 0;

        nr_contig = num_pages_contiguous(&pages[idx], max_steps);
        if (nr_contig < 2)
@@ -3583,7 +3585,7 @@ static inline int get_vmap_batch_order(struct page **pages,
 }

 static int vmap_batched(unsigned long addr, unsigned long end,
-               pgprot_t prot, struct page **pages)
+               unsigned long flags, pgprot_t prot, struct page **pages)
 {
        unsigned int count = (end - addr) >> PAGE_SHIFT;
        unsigned int prev_shift = 0, idx = 0;
@@ -3597,7 +3599,7 @@ static int vmap_batched(unsigned long addr, unsigned long end,

        for (unsigned int i = 0; i < count; ) {
                unsigned int shift = PAGE_SHIFT +
-                       get_vmap_batch_order(pages, prot, count - i, i);
+                       get_vmap_batch_order(pages, flags, prot, count - i, i);

                if (!i)
                        prev_shift = shift;
@@ -3711,7 +3713,7 @@ void *vmap(struct page **pages, unsigned int count,
                return NULL;

        addr = (unsigned long)area->addr;
-       if (vmap_batched(addr, addr + size, pgprot_nx(prot),
+       if (vmap_batched(addr, addr + size, flags, pgprot_nx(prot),
                                pages) < 0) {
                vunmap(area->addr);
                return NULL;

I feel it's unnecessary, given that kvmalloc() already
always uses VM_ALLOW_HUGE_VMAP, which makes
set_memory_xxx() fail as you mentioned, and
kvmalloc() is already a generic memory allocation API.

        /*
         * kvmalloc() can always use VM_ALLOW_HUGE_VMAP,
         * since the callers already cannot assume anything
         * about the resulting pointer, and cannot play
         * protection games.
         */
        return __vmalloc_node_range_noprof(size, align, VMALLOC_START, VMALLOC_END,
                        flags, PAGE_KERNEL, allow_block ? VM_ALLOW_HUGE_VMAP:0,
                        node, __builtin_return_address(0));

Best Regards
Barry



* Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (7 preceding siblings ...)
  2026-06-25  6:37 ` Dev Jain
@ 2026-06-26 15:12 ` Leo Yan
  8 siblings, 0 replies; 13+ messages in thread
From: Leo Yan @ 2026-06-26 15:12 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang,
	Suzuki K Poulose, Mike Leach, James Clark, Tamas.Petz,
	Michiel.VanTol

On Thu, Jun 18, 2026 at 04:47:20PM +0800, Wen Jiang wrote:

> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.

I verified this series with large vmap() mappings for Arm trace buffer
units (TRBE and SPE), and the results are positive.

Arm trace buffer units use the CPU's page tables for address translation
when writing trace data to DRAM. The larger vmap() mapping granules
reduce TLB pressure, resulting in significantly fewer L2D TLB refills
and reduced L1D TLB refills. The decrease in dtlb_walk indicates that
fewer page table walks are required and that address translations are
more often satisfied by cached TLB entries.
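
As a rough illustration of why larger granules reduce TLB pressure, the sketch below counts how many distinct translations are needed to cover a buffer at a given mapping size. It is only arithmetic under stated assumptions: the sizes assume an arm64 4K granule (64K cont-PTE, 2M PMD block), translations_needed() is a hypothetical helper, and whether hardware keeps a cont-PTE range in a single TLB entry is implementation dependent.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed arm64 4K-granule sizes; other granules use different values. */
#define SZ_4K   (UINT64_C(1) << 12)  /* base page */
#define SZ_64K  (UINT64_C(1) << 16)  /* cont-PTE range (16 x 4K) */
#define SZ_2M   (UINT64_C(1) << 21)  /* PMD block */
#define SZ_1G   (UINT64_C(1) << 30)  /* AUX buffer size used with "-m ,1G" */

/*
 * Hypothetical helper: number of translations needed to cover
 * buf_size bytes when every mapping has size map_size (rounded up).
 */
uint64_t translations_needed(uint64_t buf_size, uint64_t map_size)
{
	assert(map_size != 0);
	return (buf_size + map_size - 1) / map_size;
}
```

For a 1G buffer this gives 262144 translations with 4K pages, 16384 with 64K ranges and 512 with 2M blocks. These are counts of distinct translations, not a prediction of the refill counters reported below.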

The detailed results are included below for reference.

Thanks for working on this, and here is my test tag:

Tested-by: Leo Yan <leo.yan@arm.com>

P.S. I applied a local change to set PERF_PMU_CAP_AUX_PREFER_LARGE in
the CoreSight and SPE drivers to allocate large memory chunks. This
change will be sent out once the MM changes are agreed on.


## Results with TRBE

Test command:

  taskset -c 2 perf stat -C 10 -e cycles:u,instructions:u,dtlb_walk:u,l1d_tlb:u,l1d_tlb_refill:u,l2d_tlb_refill:u \
    -- taskset -c 2 perf record -C 10 -m ,1G -e cs_etm// \
    -- taskset -c 10 ./sparse_branch_delay.elf

The benchmark was run 5 times. CPU10 was isolated and dedicated to
running the workload while collecting the TLB statistics.

Before this series:

 +----------------+--------+--------+--------+--------+--------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |
 +----------------+--------+--------+--------+--------+--------+----------+
 | dtlb_walk      |     63 |     75 |     62 |     73 |     69 |     68.4 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb        |   2093 |   2189 |   2237 |   2036 |   2086 |   2128.2 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb_refill |    154 |    153 |    150 |    165 |    157 |    155.8 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l2d_tlb_refill | 161325 | 161403 | 161432 | 161580 | 161439 | 161435.8 |
 +----------------+--------+--------+--------+--------+--------+----------+

After this series:

 +----------------+--------+--------+--------+--------+--------+----------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |    Diff. |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | dtlb_walk      |     67 |     59 |     60 |     58 |     53 |     59.4 |  -13.16% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb        |   6710 |   7120 |   6662 |   6626 |   6542 |   6732.0 | +216.32% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb_refill |    126 |    117 |    119 |    117 |    119 |    119.6 |  -23.23% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l2d_tlb_refill |    506 |    489 |    485 |    506 |    489 |    495.0 |  -99.69% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
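
The Avg. and Diff. columns appear to be the arithmetic mean of the five runs and the relative change of the "after" mean against the "before" mean. A minimal sketch of that calculation, with run_mean() and diff_percent() as hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be non-zero). */
double run_mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change of 'after' versus 'before', in percent. */
double diff_percent(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

With the dtlb_walk rows above, the means are 68.4 and 59.4, and the relative change is about -13.16%, matching the table.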

## Results with SPE

Test command:

  taskset -c 2 perf stat -C 10 -e cycles:u,instructions:u,dtlb_walk:u,l1d_tlb:u,l1d_tlb_refill:u,l2d_tlb_refill:u \
    -- taskset -c 2 perf record -C 10 -m ,512M -e arm_spe_0/ts_enable=1,pa_enable=1,period=64,min_latency=0/ \
    -- taskset -c 10 dd if=/dev/zero of=/dev/shm/dd_mem_test bs=1M count=1024 status=progress

The benchmark was run five times. CPU10 was isolated and dedicated to
running the workload while collecting the TLB statistics.

Before this series:

 +----------------+--------+--------+--------+--------+--------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |
 +----------------+--------+--------+--------+--------+--------+----------+
 | dtlb_walk      |   2090 |   1709 |   1679 |   1519 |   1555 |   1710.4 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb        | 254450 | 257227 | 252517 | 252535 | 254752 | 254296.2 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb_refill |  16023 |  16088 |  15944 |  15989 |  15956 |  16000.0 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l2d_tlb_refill |   5887 |   4204 |   3713 |   4556 |   5620 |   4796.0 |
 +----------------+--------+--------+--------+--------+--------+----------+

After this series:

 +----------------+--------+--------+--------+--------+--------+----------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |    Diff. |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | dtlb_walk      |   1111 |   1301 |   1229 |   1166 |   1771 |   1315.6 |  -23.08% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb        | 257462 | 257420 | 257241 | 259968 | 261324 | 258683.0 |   +1.73% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb_refill |  15954 |  15919 |  15948 |  15962 |  15968 |  15950.2 |   -0.31% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l2d_tlb_refill |   2672 |   2558 |   2801 |   2478 |   4147 |   2931.2 |  -38.88% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+



* Re: [PATCH v4 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
  2026-06-18  8:47 ` [PATCH v4 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
@ 2026-06-26 16:20   ` Uladzislau Rezki
  0 siblings, 0 replies; 13+ messages in thread
From: Uladzislau Rezki @ 2026-06-26 16:20 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

On Thu, Jun 18, 2026 at 04:47:26PM +0800, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> Try to align the vmap virtual address to PMD_SHIFT or a
> larger PTE mapping size hinted by the architecture, so
> contiguous pages can be batch-mapped when setting PMD or
> PTE entries.
> 
> Add __get_vm_area_node_aligned_caller() as a wrapper over
> __get_vm_area_node() to simplify repeated calls with fixed
> arguments.
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  mm/vmalloc.c | 37 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index fffb885cb2158..bc9fa93e2bdc6 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3628,6 +3628,41 @@ static int vmap_batched(unsigned long addr, unsigned long end,
>  	return err;
>  }
>  
> +static struct vm_struct *__get_vm_area_node_aligned_caller(unsigned long size,
> +		unsigned long align, unsigned long flags, const void *caller)
> +{
> +	return __get_vm_area_node(size, align, PAGE_SHIFT, flags,
> +			VMALLOC_START, VMALLOC_END,
> +			NUMA_NO_NODE, GFP_KERNEL, caller);
> +}
> +
> +static struct vm_struct *vmap_get_aligned_vm_area(unsigned long size,
> +		unsigned long flags, const void *caller)
> +{
> +	struct vm_struct *vm_area;
> +	unsigned int shift;
> +
> +	/* Try PMD alignment for large sizes */
> +	if (size >= PMD_SIZE) {
> +		vm_area = __get_vm_area_node_aligned_caller(size, PMD_SIZE,
> +				flags, caller);
> +		if (vm_area)
> +			return vm_area;
> +	}
> +
> +	/* Try CONT_PTE alignment */
> +	shift = arch_vmap_pte_supported_shift(size);
> +	if (shift > PAGE_SHIFT) {
> +		vm_area = __get_vm_area_node_aligned_caller(size, 1UL << shift,
> +				flags, caller);
> +		if (vm_area)
> +			return vm_area;
> +	}
> +
> +	/* Fall back to page alignment */
> +	return __get_vm_area_node_aligned_caller(size, PAGE_SIZE, flags, caller);
> +}
> +
>  /**
>   * vmap - map an array of pages into virtually contiguous space
>   * @pages: array of page pointers
> @@ -3666,7 +3701,7 @@ void *vmap(struct page **pages, unsigned int count,
>  		return NULL;
>  
>  	size = (unsigned long)count << PAGE_SHIFT;
> -	area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> +	area = vmap_get_aligned_vm_area(size, flags, __builtin_return_address(0));
>  	if (!area)
>  		return NULL;
>  
> -- 
> 2.34.1
> 
I did intensive random mapping/unmapping and have not noticed any issues.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki



* Re: [PATCH v4 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
  2026-06-18  8:47 ` [PATCH v4 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
@ 2026-06-26 16:21   ` Uladzislau Rezki
  0 siblings, 0 replies; 13+ messages in thread
From: Uladzislau Rezki @ 2026-06-26 16:21 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6, shanghaoqiang

On Thu, Jun 18, 2026 at 04:47:23PM +0800, Wen Jiang wrote:
> Extract the common PTE mapping logic from vmap_pte_range() into a
> shared helper vmap_set_ptes(). This handles both CONT_PTE and regular
> PTE mappings in a single function, preparing for the next patch which
> will extend vmap_pages_pte_range() to also use this helper.
> 
> The #ifdef CONFIG_HUGETLB_PAGE guard is moved inside vmap_set_ptes(),
> so callers no longer need to handle the conditional compilation.
> 
> No functional change.
> 
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  mm/vmalloc.c | 44 +++++++++++++++++++++++++++++++-------------
>  1 file changed, 31 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2c2f74a07f396..6660f240d27c9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -91,6 +91,35 @@ struct vfree_deferred {
>  static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
>  
>  /*** Page table manipulation functions ***/
> +
> +/*
> + * Set PTE mappings for the given PFN. Try CONT_PTE mappings first when
> + * supported, otherwise fall back to PAGE_SIZE mappings.
> + *
> + * Return: mapping size.
> + */
> +static __always_inline unsigned long vmap_set_ptes(pte_t *pte,
> +		unsigned long addr, unsigned long end, u64 pfn,
> +		pgprot_t prot, unsigned int max_page_shift)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	if (max_page_shift > PAGE_SHIFT) {
> +		unsigned long size;
> +
> +		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
> +		if (size != PAGE_SIZE) {
> +			pte_t entry = pfn_pte(pfn, prot);
> +
> +			entry = arch_make_huge_pte(entry, ilog2(size), 0);
> +			set_huge_pte_at(&init_mm, addr, pte, entry, size);
> +			return size;
> +		}
> +	}
> +#endif
> +	set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> +	return PAGE_SIZE;
> +}
> +
>  static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			phys_addr_t phys_addr, pgprot_t prot,
>  			unsigned int max_page_shift, pgtbl_mod_mask *mask)
> @@ -119,19 +148,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			BUG();
>  		}
>  
> -#ifdef CONFIG_HUGETLB_PAGE
> -		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
> -		if (size != PAGE_SIZE) {
> -			pte_t entry = pfn_pte(pfn, prot);
> -
> -			entry = arch_make_huge_pte(entry, ilog2(size), 0);
> -			set_huge_pte_at(&init_mm, addr, pte, entry, size);
> -			pfn += PFN_DOWN(size);
> -			continue;
> -		}
> -#endif
> -		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> -		pfn++;
> +		size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
> +		pfn += PFN_DOWN(size);
>  	} while (pte += PFN_DOWN(size), addr += size, addr != end);
>  
>  	lazy_mmu_mode_disable();
> -- 
> 2.34.1
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki



end of thread, other threads:[~2026-06-26 16:22 UTC | newest]

Thread overview: 13+ messages
2026-06-18  8:47 [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
2026-06-18  8:47 ` [PATCH v4 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
2026-06-18  8:47 ` [PATCH v4 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
2026-06-18  8:47 ` [PATCH v4 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
2026-06-26 16:21   ` Uladzislau Rezki
2026-06-18  8:47 ` [PATCH v4 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
2026-06-18  8:47 ` [PATCH v4 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
2026-06-18  8:47 ` [PATCH v4 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
2026-06-26 16:20   ` Uladzislau Rezki
2026-06-25  2:57 ` [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
2026-06-25  6:37 ` Dev Jain
2026-06-26 11:09   ` Barry Song
2026-06-26 15:12 ` Leo Yan
