[PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

Linux-mm Archive on lore.kernel.org

* [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
@ 2026-05-22  5:31 Wen Jiang
  2026-05-22  5:31 ` [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

From: jiangwen6 <jiangwen6@xiaomi.com>

This patchset accelerates ioremap, vmalloc, and vmap when the memory
is physically fully or partially contiguous. Two techniques are used:

1. Avoid page table rewalk when setting PTEs/PMDs for multiple memory
   segments
2. Use batched mappings wherever possible in both vmalloc and ARM64
   layers

Besides accelerating the mapping path, this also enables large
mappings (PMD and cont-PTE) for vmap, which are currently not
supported.

Patches 1-2 extend the ARM64 vmalloc CONT_PTE mapping so that a single
call can map multiple CONT_PTE blocks instead of just one.

Patch 3 extracts a common helper vmap_set_ptes() that consolidates PTE
mapping logic between the ioremap and vmalloc/vmap paths, handling both
CONT_PTE and regular PTE mappings. This prepares for the next patch.

Patch 4 extends the page table walk path to support page shifts other
than PAGE_SHIFT and eliminates the page table rewalk for huge vmalloc
mappings. The walker is renamed from vmap_small_pages_range_noflush()
to vmap_pages_range_noflush_walk().

Patches 5-6 add huge vmap support for contiguous pages, including
support for non-compound pages with pfn alignment verification.
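The per-index decision that get_vmap_batch_order() makes in patch 5 can be modelled in user space. This is a sketch, not the kernel code. It assumes arm64 with 4 KiB pages (CONT_PTE_SHIFT = 16). It replaces num_pages_contiguous() with a plain scan over a hypothetical pfns[] array and models arch_vmap_pte_supported_shift() as returning CONT_PTE_SHIFT only for sizes of at least 64 KiB. It also omits the CONFIG_HAVE_ARCH_HUGE_VMAP / ioremap_max_page_shift early exit. All model_* names are hypothetical, and __builtin_clz() (GCC/Clang) stands in for fls().

```c
#include <assert.h>

/* Assumed arm64 geometry with 4 KiB base pages. */
#define PAGE_SHIFT     12
#define CONT_PTE_SHIFT 16   /* 16 x 4 KiB = 64 KiB */

/* Stand-in for num_pages_contiguous(): length of the run of consecutive
 * PFNs starting at pfns[0], capped at max. */
static unsigned int model_nr_contig(const unsigned long *pfns, unsigned int max)
{
	unsigned int n = 1;

	assert(max >= 1);
	while (n < max && pfns[n] == pfns[0] + n)
		n++;
	return n;
}

/* Stand-in for arch_vmap_pte_supported_shift() on arm64 with 4 KiB pages. */
static unsigned int model_pte_supported_shift(unsigned long size)
{
	return size >= (1UL << CONT_PTE_SHIFT) ? CONT_PTE_SHIFT : PAGE_SHIFT;
}

/* Follows the same sequence of checks as get_vmap_batch_order(). */
static int model_batch_order(const unsigned long *pfns, unsigned int max_steps)
{
	unsigned int nr = model_nr_contig(pfns, max_steps);
	int order;

	if (nr < 2)
		return 0;
	/* fls(nr) - 1: index of the highest set bit */
	order = 8 * (int)sizeof(nr) - 1 - __builtin_clz(nr);
	/* Orders whose size is below CONT_PTE_SIZE cannot be batched. */
	if (model_pte_supported_shift((1UL << PAGE_SHIFT) << order) == PAGE_SHIFT)
		return 0;
	/* The first PFN must be aligned to the mapping order. */
	if (pfns[0] & ((1UL << order) - 1))
		return 0;
	return order;
}
```

Under these assumptions, a 4-page run (order 2, 16 KiB) is rejected because it is smaller than CONT_PTE_SIZE. A 16-page run is accepted as order 4 only when its first PFN is 16-aligned.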

On the RK3588 8-core ARM64 SoC, with tasks pinned to a little core and
the performance cpufreq governor enabled, the benchmark results are:

* ioremap(1 MB): 1.35x faster (3407 ns -> 2526 ns)
* vmalloc(1 MB) mapping time (excluding allocation) with
  VM_ALLOW_HUGE_VMAP: 1.42x faster (5.00 us -> 3.53 us)
* vmap(100 MB) with order-8 pages: 8.3x faster (1235 us -> 149 us)
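As a quick sanity check on the speedup figures above, each ratio is the old latency divided by the new one. Units cancel within each pair.

```latex
\begin{aligned}
\text{ioremap (1 MB)}: &\quad 3407 / 2526 \approx 1.349 \;\Rightarrow\; 1.35\times \\
\text{vmalloc (1 MB)}: &\quad 5.00 / 3.53 \approx 1.416 \;\Rightarrow\; 1.42\times \\
\text{vmap (100 MB)}:  &\quad 1235 / 149 \approx 8.289 \;\Rightarrow\; 8.3\times
\end{aligned}
```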

Many thanks to Xueyuan Chen for his testing efforts on RK3588 boards.

Changes since v2:
- Use __fls instead of fls in arch_vmap_pte_range_map_size (patch 2)
- Add WARN_ON checks in vmap_pages_pmd_range (patch 4)
- Fix flush_cache_vmap to use saved start address instead of the
  already-advanced addr (patch 5)
- Rename __vmap_huge() to vmap_batched() (patch 5)
- Add caller parameter and unroll while(1) loop (patch 5)
- Squash patch 7 into patch 5 (stop scanning for compound pages after
  encountering small pages)

Changes since v1:
- Fix condition order and use PMD_SIZE instead of CONT_PMD_SIZE in
  patch 1 (Dev Jain)
- Squash patch 3+4 and patch 5+7 (Dev Jain)
- Replace "zigzag" with "page table rewalk" in commit messages
  (Dev Jain)
- Rename vmap_small_pages_range_noflush() to
  vmap_pages_range_noflush_walk() (Dev Jain)
- Extract vmap_set_ptes() as a new patch to consolidate PTE mapping
  logic between vmap_pte_range() and vmap_pages_pte_range(), handling
  both CONT_PTE and regular mappings (Mike Rapoport)
- Support non-compound pages in get_vmap_batch_order() by falling
  back to physical contiguity scanning with pfn alignment check
  (Dev Jain, Uladzislau Rezki)
- In get_vmap_batch_order(), filter out orders that the architecture
  cannot batch by checking arch_vmap_pte_supported_shift() directly.
  This avoids useless work for orders 1-3, which are smaller than
  CONT_PTE_SIZE on ARM64 with 4K pages. (patch 5)

Barry Song (Xiaomi) (5):
  arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
    setup
  arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
    CONT_PTE
  mm/vmalloc: Extend page table walk to support larger page_shift sizes
    and eliminate page table rewalk
  mm/vmalloc: map contiguous pages in batches for vmap() if possible
  mm/vmalloc: align vm_area so vmap() can batch mappings

Wen Jiang (1):
  mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic

 arch/arm64/include/asm/vmalloc.h |   6 +-
 arch/arm64/mm/hugetlbpage.c      |  10 ++
 mm/vmalloc.c                     | 235 ++++++++++++++++++++++++-------
 3 files changed, 201 insertions(+), 50 deletions(-)

-- 
2.34.1


* [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
@ 2026-05-22  5:31 ` Wen Jiang
  2026-05-26  7:56   ` Dev Jain
  2026-05-22  5:31 ` [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE, we can
set all the CONT_PTE blocks in one batch instead of handling them one
block at a time.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf56408..c4d8b226126cb 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
 		contig_ptes = CONT_PTES;
 		break;
 	default:
+		if (size > 0 && size < PMD_SIZE &&
+				IS_ALIGNED(size, CONT_PTE_SIZE)) {
+			contig_ptes = size >> PAGE_SHIFT;
+			*pgsize = PAGE_SIZE;
+			break;
+		}
 		WARN_ON(!__hugetlb_valid_size(size));
 	}
 
@@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
 	case CONT_PTE_SIZE:
 		return pte_mkcont(entry);
 	default:
+		if (pagesize > 0 && pagesize < PMD_SIZE &&
+				IS_ALIGNED(pagesize, CONT_PTE_SIZE))
+			return pte_mkcont(entry);
+
 		break;
 	}
 	pr_warn("%s: unrecognized huge page size 0x%lx\n",
-- 
2.34.1
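The new default branch that patch 1 adds to num_contig_ptes() can be checked in isolation with a user-space model. The constants assume arm64 with 4 KiB pages (CONT_PTE_SIZE = 64 KiB, PMD_SIZE = 2 MiB). model_num_contig_ptes() is a hypothetical stand-in, not the kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed arm64 geometry with 4 KiB base pages. */
#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define CONT_PTE_SIZE (1UL << 16)   /* 16 PTEs x 4 KiB */
#define PMD_SIZE      (1UL << 21)

/*
 * Sketch of the predicate patch 1 adds: any size that is a whole number
 * of CONT_PTE blocks and smaller than PMD_SIZE is described as
 * (size / PAGE_SIZE) entries of PAGE_SIZE each. Returns 0 for sizes this
 * sketch does not cover (the real function handles those in its other
 * switch cases).
 */
static int model_num_contig_ptes(unsigned long size, size_t *pgsize)
{
	assert(pgsize != NULL);
	if (size > 0 && size < PMD_SIZE && size % CONT_PTE_SIZE == 0) {
		*pgsize = PAGE_SIZE;
		return (int)(size >> PAGE_SHIFT);
	}
	return 0;
}
```

With these constants, a 64 KiB size gives the usual 16 PTEs, and a 1 MiB size gives 256. arch_make_huge_pte() applies the same size test to decide whether to return pte_mkcont(entry).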




* [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
  2026-05-22  5:31 ` [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
@ 2026-05-22  5:31 ` Wen Jiang
  2026-05-27  5:43   ` Dev Jain
  2026-05-22  5:31 ` [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE hugepages,
reducing both PTE setup and TLB flush iterations.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 arch/arm64/include/asm/vmalloc.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 4ec1acd3c1b34..787fd17b48e2c 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,8 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
 						unsigned long end, u64 pfn,
 						unsigned int max_page_shift)
 {
+	unsigned long size;
+
 	/*
 	 * If the block is at least CONT_PTE_SIZE in size, and is naturally
 	 * aligned in both virtual and physical space, then we can pte-map the
@@ -40,7 +42,9 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
 	if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
 		return PAGE_SIZE;
 
-	return CONT_PTE_SIZE;
+	size = min3(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
+	size = 1UL << __fls(size);
+	return size;
 }
 
 #define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
-- 
2.34.1
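The size computation in patch 2 can be sketched in user space as follows. The early-return checks are reconstructed from the function's comment (the block must be at least CONT_PTE_SIZE and naturally aligned in both virtual and physical space), so they may not match the elided lines exactly. The constants assume arm64 with 4 KiB pages, and __builtin_clzl() (GCC/Clang) stands in for __fls().

```c
#include <assert.h>

/* Assumed arm64 geometry with 4 KiB base pages. */
#define PAGE_SHIFT     12
#define PAGE_SIZE      (1UL << PAGE_SHIFT)
#define CONT_PTE_SHIFT 16
#define CONT_PTE_SIZE  (1UL << CONT_PTE_SHIFT)
#define PMD_SIZE       (1UL << 21)

/* Like __fls(): 0-based index of the most significant set bit. */
static unsigned int model_fls_index(unsigned long x)
{
	assert(x != 0);   /* __fls(0) is undefined */
	return (unsigned int)(8 * sizeof(x) - 1 - __builtin_clzl(x));
}

static unsigned long min3ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/* phys is the physical address of the first page, i.e. PFN_PHYS(pfn). */
static unsigned long model_map_size(unsigned long addr, unsigned long end,
		unsigned long phys, unsigned int max_page_shift)
{
	unsigned long size;

	/* Fall back to a single page unless a CONT_PTE block fits and is aligned. */
	if (max_page_shift < CONT_PTE_SHIFT || end - addr < CONT_PTE_SIZE ||
	    (addr & (CONT_PTE_SIZE - 1)) || (phys & (CONT_PTE_SIZE - 1)))
		return PAGE_SIZE;

	/* Patch 2: the largest power of two bounded by the remaining range,
	 * the allowed shift and half a PMD. */
	size = min3ul(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
	return 1UL << model_fls_index(size);
}
```

For example, a 448 KiB tail (0x70000 bytes) yields a 256 KiB batch, and the next iteration maps the rest. Because sizes up to PMD_SIZE >> 1 now reach arch_make_huge_pte() and set_huge_pte_at(), this is presumably why patch 1 accepts any CONT_PTE multiple below PMD_SIZE.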




* [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
  2026-05-22  5:31 ` [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
  2026-05-22  5:31 ` [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
@ 2026-05-22  5:31 ` Wen Jiang
  2026-06-01 17:34   ` Uladzislau Rezki
  2026-05-22  5:31 ` [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

Extract the common PTE mapping logic from vmap_pte_range() into a
shared helper vmap_set_ptes(). This handles both CONT_PTE and regular
PTE mappings in a single function, preparing for the next patch which
will extend vmap_pages_pte_range() to also use this helper.

The #ifdef CONFIG_HUGETLB_PAGE guard is moved inside vmap_set_ptes(),
so callers no longer need to handle the conditional compilation.

Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 49 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2c2f74a07f396..53fd4ee460ea4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -91,6 +91,35 @@ struct vfree_deferred {
 static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
 
 /*** Page table manipulation functions ***/
+
+/*
+ * Set PTE mappings for the given PFN. Try CONT_PTE mappings first when
+ * supported, otherwise fall back to PAGE_SIZE mappings.
+ *
+ * Return: mapping size.
+ */
+static __always_inline unsigned long vmap_set_ptes(pte_t *pte,
+		unsigned long addr, unsigned long end, u64 pfn,
+		pgprot_t prot, unsigned int max_page_shift)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	if (max_page_shift > PAGE_SHIFT) {
+		unsigned long size;
+
+		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
+		if (size != PAGE_SIZE) {
+			pte_t entry = pfn_pte(pfn, prot);
+
+			entry = arch_make_huge_pte(entry, ilog2(size), 0);
+			set_huge_pte_at(&init_mm, addr, pte, entry, size);
+			return size;
+		}
+	}
+#endif
+	set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+	return PAGE_SIZE;
+}
+
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			phys_addr_t phys_addr, pgprot_t prot,
 			unsigned int max_page_shift, pgtbl_mod_mask *mask)
@@ -98,7 +127,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	u64 pfn;
 	struct page *page;
-	unsigned long size = PAGE_SIZE;
+	unsigned long size;
+	unsigned int steps;
 
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(end - addr)))
 		return -EINVAL;
@@ -119,20 +149,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			BUG();
 		}
 
-#ifdef CONFIG_HUGETLB_PAGE
-		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
-		if (size != PAGE_SIZE) {
-			pte_t entry = pfn_pte(pfn, prot);
-
-			entry = arch_make_huge_pte(entry, ilog2(size), 0);
-			set_huge_pte_at(&init_mm, addr, pte, entry, size);
-			pfn += PFN_DOWN(size);
-			continue;
-		}
-#endif
-		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-		pfn++;
-	} while (pte += PFN_DOWN(size), addr += size, addr != end);
+		size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
+		steps = PFN_DOWN(size);
+	} while (pte += steps, pfn += steps, addr += size, addr != end);
 
 	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
-- 
2.34.1
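The loop stepping that vmap_set_ptes() enables can be modelled as below. Each call reports how many bytes it mapped, and the caller advances pte, pfn and addr by that amount. The two size policies are hypothetical simplifications of arm64 before and after patch 2, assuming 4 KiB pages and a suitably aligned 1 MiB range.

```c
#include <assert.h>

#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define CONT_PTE_SIZE (1UL << 16)
#define PFN_DOWN(x)   ((x) >> PAGE_SHIFT)

/* Size policy: bytes covered by one vmap_set_ptes() call at addr. */
typedef unsigned long (*map_size_fn)(unsigned long addr, unsigned long end);

/* Pre-patch-2 policy (simplified): at most one CONT_PTE block per call. */
static unsigned long one_cont_block(unsigned long addr, unsigned long end)
{
	return end - addr >= CONT_PTE_SIZE ? CONT_PTE_SIZE : PAGE_SIZE;
}

/* Post-patch-2 policy (simplified): the largest power of two up to
 * 1 MiB (PMD_SIZE >> 1) that fits; alignment is assumed. */
static unsigned long batched_blocks(unsigned long addr, unsigned long end)
{
	unsigned long left = end - addr, size = 1UL << 20;

	if (left < CONT_PTE_SIZE)
		return PAGE_SIZE;
	while (size > left)
		size >>= 1;
	return size;
}

struct loop_stats {
	unsigned int calls;      /* vmap_set_ptes() invocations */
	unsigned long ptes;      /* PTE slots advanced over */
	unsigned long next_pfn;  /* pfn after the last mapping */
};

/* Same stepping as: } while (pte += steps, pfn += steps, addr += size, addr != end); */
static struct loop_stats model_pte_loop(unsigned long addr, unsigned long end,
		unsigned long pfn, map_size_fn map_size)
{
	struct loop_stats st = { 0, 0, 0 };

	assert(end > addr && (end - addr) % PAGE_SIZE == 0);
	do {
		unsigned long size = map_size(addr, end);
		unsigned long steps = PFN_DOWN(size);

		st.calls++;
		st.ptes += steps;
		pfn += steps;
		addr += size;
	} while (addr != end);
	st.next_pfn = pfn;
	return st;
}
```

In this model both policies still fill 256 PTE slots for 1 MiB. What shrinks is the number of loop iterations and set_huge_pte_at() calls, from 16 to 1.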




* [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (2 preceding siblings ...)
  2026-05-22  5:31 ` [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
@ 2026-05-22  5:31 ` Wen Jiang
  2026-05-27  5:58   ` Dev Jain
  2026-05-22  5:31 ` [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

vmap_small_pages_range_noflush() provides a clean interface: it takes
struct page **pages and maps them by iterating over PTEs directly. This
avoids the page table rewalk seen when using vmap_range_noflush() for
page_shift values other than PAGE_SHIFT.

Extend it to support larger page_shift values by adding PMD and
contiguous-PTE mappings, and rename it to
vmap_pages_range_noflush_walk() since it now handles more than just
small pages.

For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need
to call vmap_range_noflush() once per (1 << page_shift) block, which
walks the page table from the top level again for every block. The
huge case is now unified with the PAGE_SHIFT case and simply calls
vmap_pages_range_noflush_walk().

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 53fd4ee460ea4..deb764abc0571 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
+	unsigned long pfn, size;
+	unsigned int steps;
 	int err = 0;
 	pte_t *pte;
 
@@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			break;
 		}
 
-		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
-		(*nr)++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+		pfn = page_to_pfn(page);
+		size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
+		steps = PFN_DOWN(size);
+	} while (pte += steps, *nr += steps, addr += size, addr != end);
 
 	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
@@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 
 static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
+
+		if (shift == PMD_SHIFT) {
+			struct page *page = pages[*nr];
+			phys_addr_t phys_addr;
+
+			if (WARN_ON(!page))
+				return -ENOMEM;
+			if (WARN_ON(!pfn_valid(page_to_pfn(page))))
+				return -EINVAL;
+
+			phys_addr = page_to_phys(page);
+
+			if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
+						shift)) {
+				*mask |= PGTBL_PMD_MODIFIED;
+				*nr += 1 << (shift - PAGE_SHIFT);
+				continue;
+			}
+		}
+
+		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
@@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
 
 static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
@@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
 
 static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+		pgtbl_mod_mask *mask, unsigned int shift)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
 }
 
-static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
-		pgprot_t prot, struct page **pages)
+static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int shift)
 {
 	unsigned long start = addr;
 	pgd_t *pgd;
@@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 		next = pgd_addr_end(addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
-		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
@@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 		pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
-	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
 	WARN_ON(page_shift < PAGE_SHIFT);
 
-	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
-			page_shift == PAGE_SHIFT)
-		return vmap_small_pages_range_noflush(addr, end, prot, pages);
+	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
+		page_shift = PAGE_SHIFT;
 
-	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
-		int err;
-
-		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
-					page_to_phys(pages[i]), prot,
-					page_shift);
-		if (err)
-			return err;
-
-		addr += 1UL << page_shift;
-	}
-
-	return 0;
+	return vmap_pages_range_noflush_walk(addr, end, prot, pages,
+			min(page_shift, PMD_SHIFT));
 }
 
 int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
-- 
2.34.1
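The rewalk removed by patch 4 comes down to how many times the mapping code descends from the top-level table. This minimal model assumes arm64 with 4 KiB pages. It also assumes that a 1 MB VM_ALLOW_HUGE_VMAP allocation uses page_shift = 16 (CONT_PTE_SHIFT). The function names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21

/*
 * Old __vmap_pages_range_noflush() with CONFIG_HAVE_ARCH_HUGE_VMALLOC:
 * page_shift == PAGE_SHIFT used a single small-page walk; otherwise each
 * (1 << page_shift) block got its own vmap_range_noflush() call, and each
 * call started again from the PGD.
 */
static unsigned long old_top_level_walks(unsigned long size, unsigned int page_shift)
{
	assert(page_shift >= PAGE_SHIFT);
	if (page_shift == PAGE_SHIFT)
		return 1;
	return size >> page_shift;
}

/* New code: a single vmap_pages_range_noflush_walk() call whose shift
 * is clamped with min(page_shift, PMD_SHIFT). */
static unsigned int new_walk_shift(unsigned int page_shift)
{
	return page_shift < PMD_SHIFT ? page_shift : PMD_SHIFT;
}
```

Because the new code always makes exactly one walker call, the 1 MB case goes from 16 top-down walks to one.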




* [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (3 preceding siblings ...)
  2026-05-22  5:31 ` [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
@ 2026-05-22  5:31 ` Wen Jiang
  2026-05-27  8:27   ` Dev Jain
  2026-05-22  5:31 ` [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
  2026-05-22 18:07 ` [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
  6 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

In many cases, the pages passed to vmap() include high-order pages.
For example, the system heap often allocates pages in descending
order: order 8, then 4, then 0. Currently, vmap() iterates over every
page individually, so even the pages inside a high-order block are
mapped one by one.

This patch detects physically contiguous pages (regardless of whether
they are compound or non-compound) by scanning with
num_pages_contiguous(), and maps them as a single contiguous block
whenever possible. The first page's pfn must be aligned to the
mapping order for the batched mapping to be used.

Pages with the same page_shift are coalesced and mapped via
vmap_pages_range_noflush_walk() to avoid page table rewalk.

As users typically allocate memory in descending orders (e.g.
8 → 4 → 0), once an order-0 page is encountered, we stop scanning
for contiguous pages, since subsequent pages are likely order-0 as
well, and map the remainder with PAGE_SHIFT.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index deb764abc0571..50642246f4d40 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3542,6 +3542,84 @@ void vunmap(const void *addr)
 }
 EXPORT_SYMBOL(vunmap);
 
+static inline int get_vmap_batch_order(struct page **pages,
+		unsigned int max_steps, unsigned int idx)
+{
+	unsigned int nr_contig;
+	int order;
+
+	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
+			ioremap_max_page_shift == PAGE_SHIFT)
+		return 0;
+
+	nr_contig = num_pages_contiguous(&pages[idx], max_steps);
+	if (nr_contig < 2)
+		return 0;
+
+	order = fls(nr_contig) - 1;
+
+	if (arch_vmap_pte_supported_shift(PAGE_SIZE << order) == PAGE_SHIFT)
+		return 0;
+
+	/* Ensure the first page's pfn is aligned to the order */
+	if (!IS_ALIGNED(page_to_pfn(pages[idx]), 1 << order))
+		return 0;
+
+	return order;
+}
+
+static int vmap_batched(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages)
+{
+	unsigned int count = (end - addr) >> PAGE_SHIFT;
+	unsigned int prev_shift = 0, idx = 0;
+	unsigned long start = addr, map_addr = addr;
+	int err;
+
+	err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
+						PAGE_SHIFT, GFP_KERNEL);
+	if (err)
+		goto out;
+
+	for (unsigned int i = 0; i < count; ) {
+		unsigned int shift = PAGE_SHIFT +
+			get_vmap_batch_order(pages, count - i, i);
+
+		if (!i)
+			prev_shift = shift;
+
+		if (shift != prev_shift) {
+			err = vmap_pages_range_noflush_walk(map_addr, addr,
+					prot, pages + idx,
+					min(prev_shift, PMD_SHIFT));
+			if (err)
+				goto out;
+			prev_shift = shift;
+			map_addr = addr;
+			idx = i;
+		}
+
+		/*
+		 * Once small pages are encountered, the remaining pages
+		 * are likely small as well.
+		 */
+		if (shift == PAGE_SHIFT)
+			break;
+
+		addr += 1UL << shift;
+		i += 1U << (shift - PAGE_SHIFT);
+	}
+
+	/* Remaining */
+	if (map_addr < end)
+		err = vmap_pages_range_noflush_walk(map_addr, end,
+				prot, pages + idx, min(prev_shift, PMD_SHIFT));
+
+out:
+	flush_cache_vmap(start, end);
+	return err;
+}
+
 /**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
@@ -3585,8 +3663,8 @@ void *vmap(struct page **pages, unsigned int count,
 		return NULL;
 
 	addr = (unsigned long)area->addr;
-	if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
-				pages, PAGE_SHIFT) < 0) {
+	if (vmap_batched(addr, addr + size, pgprot_nx(prot),
+				pages) < 0) {
 		vunmap(area->addr);
 		return NULL;
 	}
-- 
2.34.1
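The grouping done by vmap_batched() can be traced with the model below. order_at[] is a hypothetical precomputed input that stands in for get_vmap_batch_order() at each index, and PMD_SHIFT = 21 assumes 4 KiB pages. Take the system-heap pattern from the commit message: one order-8 block, one order-4 block, then three order-0 pages. For that input, the model produces three walker calls with shifts 20, 16 and 12.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21

/* One vmap_pages_range_noflush_walk() call over pages[idx .. idx + nr_pages). */
struct seg {
	unsigned int idx;
	unsigned int nr_pages;
	unsigned int shift;     /* shift passed to the walker */
};

static unsigned int clamp_shift(unsigned int shift)
{
	return shift < PMD_SHIFT ? shift : PMD_SHIFT;
}

/*
 * order_at[i] stands in for get_vmap_batch_order(pages, count - i, i).
 * Fills segs[] (the caller provides at least count entries) and returns
 * the number of walker calls.
 */
static unsigned int model_vmap_batched(const int *order_at, unsigned int count,
		struct seg *segs)
{
	unsigned int prev_shift = 0, idx = 0, nsegs = 0;

	assert(count > 0);
	for (unsigned int i = 0; i < count; ) {
		unsigned int shift = PAGE_SHIFT + order_at[i];

		if (!i)
			prev_shift = shift;
		if (shift != prev_shift) {
			/* Flush the run of equal-shift pages seen so far. */
			segs[nsegs++] = (struct seg){ idx, i - idx, clamp_shift(prev_shift) };
			prev_shift = shift;
			idx = i;
		}
		/* First order-0 page: assume the rest are order-0 as well. */
		if (shift == PAGE_SHIFT)
			break;
		i += 1U << (shift - PAGE_SHIFT);
	}
	/* "Remaining": map_addr < end is equivalent to idx < count. */
	if (idx < count)
		segs[nsegs++] = (struct seg){ idx, count - idx, clamp_shift(prev_shift) };
	return nsegs;
}
```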




* [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (4 preceding siblings ...)
  2026-05-22  5:31 ` [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
@ 2026-05-22  5:31 ` Wen Jiang
  2026-05-23  7:53   ` Uladzislau Rezki
  2026-05-27  6:25   ` Dev Jain
  2026-05-22 18:07 ` [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
  6 siblings, 2 replies; 26+ messages in thread
From: Wen Jiang @ 2026-05-22  5:31 UTC
  To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
  Cc: baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

From: "Barry Song (Xiaomi)" <baohua@kernel.org>

Try to align the vmap virtual address to PMD_SIZE or, failing that,
to a larger PTE mapping size hinted by the architecture, so that
contiguous pages can be batch-mapped when setting PMD or PTE entries.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
---
 mm/vmalloc.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 50642246f4d40..040d400928aab 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3620,6 +3620,37 @@ static int vmap_batched(unsigned long addr, unsigned long end,
 	return err;
 }
 
+static struct vm_struct *get_aligned_vm_area(unsigned long size,
+		unsigned long flags, const void *caller)
+{
+	struct vm_struct *vm_area;
+	unsigned int shift;
+
+	/* Try PMD alignment for large sizes */
+	if (size >= PMD_SIZE) {
+		vm_area = __get_vm_area_node(size, PMD_SIZE, PAGE_SHIFT, flags,
+				VMALLOC_START, VMALLOC_END,
+				NUMA_NO_NODE, GFP_KERNEL, caller);
+		if (vm_area)
+			return vm_area;
+	}
+
+	/* Try CONT_PTE alignment */
+	shift = arch_vmap_pte_supported_shift(size);
+	if (shift > PAGE_SHIFT) {
+		vm_area = __get_vm_area_node(size, 1UL << shift, PAGE_SHIFT, flags,
+				VMALLOC_START, VMALLOC_END,
+				NUMA_NO_NODE, GFP_KERNEL, caller);
+		if (vm_area)
+			return vm_area;
+	}
+
+	/* Fall back to page alignment */
+	return __get_vm_area_node(size, PAGE_SIZE, PAGE_SHIFT, flags,
+			VMALLOC_START, VMALLOC_END,
+			NUMA_NO_NODE, GFP_KERNEL, caller);
+}
+
 /**
  * vmap - map an array of pages into virtually contiguous space
  * @pages: array of page pointers
@@ -3658,7 +3689,7 @@ void *vmap(struct page **pages, unsigned int count,
 		return NULL;
 
 	size = (unsigned long)count << PAGE_SHIFT;
-	area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+	area = get_aligned_vm_area(size, flags, __builtin_return_address(0));
 	if (!area)
 		return NULL;
 
-- 
2.34.1
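The alignment fallback in get_aligned_vm_area() can be summarised as the ordered list of alignments it passes to __get_vm_area_node(). Each later entry is tried only if the earlier allocation fails. This sketch assumes arm64 with 4 KiB pages and models arch_vmap_pte_supported_shift() as CONT_PTE_SHIFT for sizes of at least 64 KiB. The model_* names are hypothetical.

```c
#include <assert.h>

/* Assumed arm64 geometry with 4 KiB base pages. */
#define PAGE_SHIFT     12
#define PAGE_SIZE      (1UL << PAGE_SHIFT)
#define CONT_PTE_SHIFT 16
#define PMD_SIZE       (1UL << 21)

/* Stand-in for arch_vmap_pte_supported_shift() on arm64 with 4 KiB pages. */
static unsigned int model_pte_supported_shift(unsigned long size)
{
	return size >= (1UL << CONT_PTE_SHIFT) ? CONT_PTE_SHIFT : PAGE_SHIFT;
}

/* Fills align[] with the alignments tried, in order; returns how many. */
static unsigned int model_align_candidates(unsigned long size, unsigned long align[3])
{
	unsigned int n = 0, shift;

	assert(size != 0 && size % PAGE_SIZE == 0);
	if (size >= PMD_SIZE)
		align[n++] = PMD_SIZE;          /* try PMD alignment */
	shift = model_pte_supported_shift(size);
	if (shift > PAGE_SHIFT)
		align[n++] = 1UL << shift;      /* try CONT_PTE alignment */
	align[n++] = PAGE_SIZE;                 /* fall back to page alignment */
	return n;
}
```

Under these assumptions, a 4 MiB vmap() first asks for a 2 MiB-aligned area, and a 128 KiB one starts at 64 KiB alignment. Anything below 64 KiB keeps plain page alignment.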




* Re: [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
                   ` (5 preceding siblings ...)
  2026-05-22  5:31 ` [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
@ 2026-05-22 18:07 ` Andrew Morton
  2026-05-23  8:26   ` Wen Jiang
  6 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2026-05-22 18:07 UTC
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, urezki, baohua,
	Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Fri, 22 May 2026 13:31:40 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:

> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous.

Thanks.  AI review asked a few things and might have found an existing
32-bit bug in vmap():

	https://sashiko.dev/#/patchset/20260522053146.83209-1-jiangwenxiaomi@gmail.com



* Re: [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
  2026-05-22  5:31 ` [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
@ 2026-05-23  7:53   ` Uladzislau Rezki
  2026-05-27  6:25   ` Dev Jain
  1 sibling, 0 replies; 26+ messages in thread
From: Uladzislau Rezki @ 2026-05-23  7:53 UTC
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Fri, May 22, 2026 at 01:31:46PM +0800, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> Try to align the vmap virtual address to PMD_SHIFT or a
> larger PTE mapping size hinted by the architecture, so
> contiguous pages can be batch-mapped when setting PMD or
> PTE entries.
>
> [...]
>
This one LGTM:

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki



* Re: [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-05-22 18:07 ` [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
@ 2026-05-23  8:26   ` Wen Jiang
  2026-05-23 21:40     ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-23  8:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, urezki, baohua,
	Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Sat, 23 May 2026 at 02:07, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 22 May 2026 13:31:40 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:
>
> > This patchset accelerates ioremap, vmalloc, and vmap when the memory
> > is physically fully or partially contiguous.
>
> Thanks.  AI review asked a few things and might have found an existing
> 32-bit bug in vmap():
>
>         https://sashiko.dev/#/patchset/20260522053146.83209-1-jiangwenxiaomi@gmail.com

Hi Andrew,

I've gone through the Sashiko findings:

- Patch 5 (arch_vmap_pte_supported_shift on x86): Over-interpretation.
  This targets ARM64 CONT_PTE; x86 falls through with PAGE_SHIFT,
  the same as before.

- Patch 5 (1 << order overflow at order=31): Over-interpretation.
  Reaching order=31 requires 8TB of physically contiguous memory in a
  single vmap(), which is not a realistic usage pattern.

- Patch 6 (GFP_KERNEL triggering purge): The purge only triggers
  when vmalloc space is already under pressure, and it also benefits
  the subsequent PAGE_SIZE fallback, so it is not wasted work.

- Patch 6 (32-bit count << PAGE_SHIFT overflow): Pre-existing.
  Will send a separate fix.

- Patch 6 (unconditional alignment without checking contiguity):
  The main vmap() users typically pass contiguous pages
  (e.g. system_heap order 8 -> 4 -> 0).

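The arithmetic behind the two overflow items above can be checked in userspace. This is a minimal sketch assuming 4 KiB pages; model_block_bytes() and model_vmap_size_32bit() are hypothetical helpers, and uint32_t stands in for unsigned long on a 32-bit kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, as on the arm64 test platform. */
#define MODEL_PAGE_SHIFT 12

/*
 * Bytes covered by one block of 2^order base pages. Shifting a 64-bit
 * unsigned constant keeps order = 31 well defined; "1 << 31" on a 32-bit
 * int shifts into the sign bit, which is undefined behaviour in C.
 */
static uint64_t model_block_bytes(unsigned int order)
{
	assert(order + MODEL_PAGE_SHIFT < 64);
	return (UINT64_C(1) << order) << MODEL_PAGE_SHIFT;
}

/*
 * vmap() computes size = (unsigned long)count << PAGE_SHIFT. Here uint32_t
 * stands in for a 32-bit unsigned long: once count reaches 2^20 the
 * shifted value no longer fits and wraps modulo 2^32.
 */
static uint32_t model_vmap_size_32bit(uint32_t count)
{
	return (uint32_t)(count << MODEL_PAGE_SHIFT);
}
```

model_block_bytes(31) comes out at 2^43 bytes (8 TiB), matching the estimate above, while model_vmap_size_32bit() returns 0 for count = 2^20, which is the kind of wrap the pre-existing 32-bit item refers to.
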
Thanks,
Wen



* Re: [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
  2026-05-23  8:26   ` Wen Jiang
@ 2026-05-23 21:40     ` Andrew Morton
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2026-05-23 21:40 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, urezki, baohua,
	Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Sat, 23 May 2026 16:26:36 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:

> On Sat, 23 May 2026 at 02:07, Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 22 May 2026 13:31:40 +0800 Wen Jiang <jiangwenxiaomi@gmail.com> wrote:
> >
> > > This patchset accelerates ioremap, vmalloc, and vmap when the memory
> > > is physically fully or partially contiguous.
> >
> > Thanks.  AI review asked a few things and might have found an existing
> > 32-bit bug in vmap():
> >
> >         https://sashiko.dev/#/patchset/20260522053146.83209-1-jiangwenxiaomi@gmail.com
> 
> Hi Andrew,
> 
> I've gone through the Sashiko findings:

Great, thanks.  I won't take any action at this time - let's see what
reviewers have to say.




* Re: [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
  2026-05-22  5:31 ` [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
@ 2026-05-26  7:56   ` Dev Jain
  0 siblings, 0 replies; 26+ messages in thread
From: Dev Jain @ 2026-05-26  7:56 UTC (permalink / raw)
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 22/05/26 11:01 am, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> we can batch CONT_PTE settings instead of handling them individually.

Better wording: "we can handle CONT_PTE_SIZE groups together"

> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index a42c05cf56408..c4d8b226126cb 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>  		contig_ptes = CONT_PTES;
>  		break;
>  	default:
> +		if (size > 0 && size < PMD_SIZE &&
> +				IS_ALIGNED(size, CONT_PTE_SIZE)) {
> +			contig_ptes = size >> PAGE_SHIFT;
> +			*pgsize = PAGE_SIZE;
> +			break;
> +		}
>  		WARN_ON(!__hugetlb_valid_size(size));
>  	}
>  
> @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
>  	case CONT_PTE_SIZE:
>  		return pte_mkcont(entry);
>  	default:
> +		if (pagesize > 0 && pagesize < PMD_SIZE &&
> +				IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> +			return pte_mkcont(entry);
> +
>  		break;
>  	}
>  	pr_warn("%s: unrecognized huge page size 0x%lx\n",

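To see which sizes the new default branch in num_contig_ptes() accepts, here is a userspace model assuming arm64 with a 4 KiB granule (CONT_PTES = 16, so CONT_PTE_SIZE = 64 KiB, and PMD_SIZE = 2 MiB). model_num_contig_ptes() and the M_* macros are hypothetical stand-ins that cover only the added branch, not the existing switch cases.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed arm64 values for a 4 KiB granule. */
#define M_PAGE_SHIFT    12
#define M_PAGE_SIZE     (1UL << M_PAGE_SHIFT)
#define M_CONT_PTES     16UL
#define M_CONT_PTE_SIZE (M_CONT_PTES * M_PAGE_SIZE)	/* 64 KiB */
#define M_PMD_SIZE      (1UL << 21)			/* 2 MiB */
#define M_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/*
 * Models only the branch patch 1 adds to the default case: a non-zero
 * multiple of CONT_PTE_SIZE below PMD_SIZE is treated as
 * size >> PAGE_SHIFT base pages of PAGE_SIZE each. Returns 0 where the
 * real function would go on to WARN_ON(!__hugetlb_valid_size(size)).
 */
static int model_num_contig_ptes(unsigned long size, size_t *pgsize)
{
	if (size > 0 && size < M_PMD_SIZE &&
	    M_IS_ALIGNED(size, M_CONT_PTE_SIZE)) {
		*pgsize = M_PAGE_SIZE;
		return (int)(size >> M_PAGE_SHIFT);
	}
	return 0;
}
```

Under these assumptions a 256 KiB region becomes 64 PAGE_SIZE entries, i.e. four CONT_PTE blocks handled together, while 96 KiB (not a multiple of 64 KiB) and 2 MiB (not below PMD_SIZE) fall outside the new branch.
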



* Re: [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE
  2026-05-22  5:31 ` [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
@ 2026-05-27  5:43   ` Dev Jain
  0 siblings, 0 replies; 26+ messages in thread
From: Dev Jain @ 2026-05-27  5:43 UTC (permalink / raw)
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 22/05/26 11:01 am, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE hugepages,

..."to batch across multiple CONT_PTE blocks"

> reducing both PTE setup and TLB flush iterations.
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  arch/arm64/include/asm/vmalloc.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 4ec1acd3c1b34..787fd17b48e2c 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -23,6 +23,8 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
>  						unsigned long end, u64 pfn,
>  						unsigned int max_page_shift)
>  {
> +	unsigned long size;
> +
>  	/*
>  	 * If the block is at least CONT_PTE_SIZE in size, and is naturally
>  	 * aligned in both virtual and physical space, then we can pte-map the
> @@ -40,7 +42,9 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
>  	if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
>  		return PAGE_SIZE;
>  
> -	return CONT_PTE_SIZE;
> +	size = min3(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
> +	size = 1UL << __fls(size);
> +	return size;
>  }
>  
>  #define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size

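The new return value of arch_vmap_pte_range_map_size() can be traced with a small userspace model, assuming 4 KiB pages (CONT_PTE_SIZE = 64 KiB, PMD_SIZE = 2 MiB). model_pte_map_size() is a hypothetical stand-in covering only the lines after the existing size and alignment checks, and model_fls_index() plays the role of the kernel's __fls().

```c
#include <assert.h>

#define M_PAGE_SHIFT    12
#define M_CONT_PTE_SIZE (1UL << 16)	/* 64 KiB */
#define M_PMD_SIZE      (1UL << 21)	/* 2 MiB */

/* Index of the most significant set bit; x must be non-zero, like __fls(). */
static unsigned int model_fls_index(unsigned long x)
{
	unsigned int i = 0;

	assert(x != 0);
	while (x >>= 1)
		i++;
	return i;
}

static unsigned long model_min3(unsigned long a, unsigned long b,
				unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/*
 * The new tail of the function: cap the batch at the remaining range, the
 * caller's max_page_shift and half a PMD, then round down to a power of
 * two.
 */
static unsigned long model_pte_map_size(unsigned long addr, unsigned long end,
					unsigned int max_page_shift)
{
	unsigned long size = model_min3(end - addr, 1UL << max_page_shift,
					M_PMD_SIZE >> 1);

	return 1UL << model_fls_index(size);
}
```

Because the earlier checks (per the function's own comment) only let through blocks of at least CONT_PTE_SIZE, rounding down to a power of two yields a whole number of CONT_PTE blocks, and the PMD_SIZE >> 1 cap appears to keep the result below PMD_SIZE, matching the size < PMD_SIZE condition added in patch 1. For example, a 192 KiB tail is mapped as one 128 KiB batch, leaving 64 KiB for the next call.
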



* Re: [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-05-22  5:31 ` [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
@ 2026-05-27  5:58   ` Dev Jain
  2026-05-28  3:39     ` Wen Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-05-27  5:58 UTC (permalink / raw)
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 22/05/26 11:01 am, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
> provides a clean interface by taking struct page **pages and mapping them
> via direct PTE iteration. This avoids the page table rewalk seen when
> using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
> 
> Extend it to support larger page_shift values, and add PMD- and
> contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
> since it now handles more than just small pages.
> 
> For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
> iterate over pages one by one via vmap_range_noflush(), which would
> otherwise lead to page table rewalk. The code is now unified with the
> PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 40 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 53fd4ee460ea4..deb764abc0571 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
>  
>  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
> +	unsigned long pfn, size;
> +	unsigned int steps;
>  	int err = 0;
>  	pte_t *pte;
>  
> @@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  			break;
>  		}
>  
> -		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> -		(*nr)++;
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +		pfn = page_to_pfn(page);
> +		size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
> +		steps = PFN_DOWN(size);
> +	} while (pte += steps, *nr += steps, addr += size, addr != end);
>  
>  	lazy_mmu_mode_disable();
>  	*mask |= PGTBL_PTE_MODIFIED;
> @@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = pmd_addr_end(addr, end);
> -		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> +
> +		if (shift == PMD_SHIFT) {
> +			struct page *page = pages[*nr];
> +			phys_addr_t phys_addr;
> +
> +			if (WARN_ON(!page))
> +				return -ENOMEM;
> +			if (WARN_ON(!pfn_valid(page_to_pfn(page))))
> +				return -EINVAL;


So I know these !page and !pfn_valid checks have been copied from vmap_pages_pte_range,
but do they mean anything?

I think pfn_valid() makes sense, in that someone may take a random VA/PA, convert it into a struct
page and pass it to the vmap layer. But I don't see how anyone would pass page == NULL. At the
very least, returning -ENOMEM does not make sense, because the pages are not being
allocated by vmap(); they have already been allocated.

> +
> +			phys_addr = page_to_phys(page);
> +
> +			if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> +						shift)) {
> +				*mask |= PGTBL_PMD_MODIFIED;
> +				*nr += 1 << (shift - PAGE_SHIFT);
> +				continue;
> +			}
> +		}
> +
> +		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
>  			return -ENOMEM;
>  	} while (pmd++, addr = next, addr != end);
>  	return 0;
> @@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>  
>  static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = pud_addr_end(addr, end);
> -		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> +		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
>  			return -ENOMEM;
>  	} while (pud++, addr = next, addr != end);
>  	return 0;
> @@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>  
>  static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +		pgtbl_mod_mask *mask, unsigned int shift)
>  {
>  	p4d_t *p4d;
>  	unsigned long next;
> @@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = p4d_addr_end(addr, end);
> -		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> +		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
>  			return -ENOMEM;
>  	} while (p4d++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> -		pgprot_t prot, struct page **pages)
> +static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
> +		pgprot_t prot, struct page **pages, unsigned int shift)
>  {
>  	unsigned long start = addr;
>  	pgd_t *pgd;
> @@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_bad(*pgd))
>  			mask |= PGTBL_PGD_MODIFIED;
> -		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> +		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
>  		if (err)
>  			break;
>  	} while (pgd++, addr = next, addr != end);
> @@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>  		pgprot_t prot, struct page **pages, unsigned int page_shift)
>  {
> -	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> -
>  	WARN_ON(page_shift < PAGE_SHIFT);
>  
> -	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> -			page_shift == PAGE_SHIFT)
> -		return vmap_small_pages_range_noflush(addr, end, prot, pages);
> +	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
> +		page_shift = PAGE_SHIFT;
>  
> -	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> -		int err;
> -
> -		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> -					page_to_phys(pages[i]), prot,
> -					page_shift);
> -		if (err)
> -			return err;
> -
> -		addr += 1UL << page_shift;
> -	}
> -
> -	return 0;
> +	return vmap_pages_range_noflush_walk(addr, end, prot, pages,
> +			min(page_shift, PMD_SHIFT));


We can easily extend this to PUD huge mappings, right? I'm not sure whether we
should keep everything symmetric with how vmap_range_noflush() operates
right now, since P4D mappings don't exist, but PUD looks worthwhile.

>  }
>  
>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,

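The bookkeeping in the reworked vmap_pages_pte_range() loop can be modelled in userspace. This is a sketch under stated assumptions: 4 KiB pages, pages[] reduced to an array of pfns, and model_set_ptes() as a hypothetical stand-in for vmap_set_ptes(), whose real batching decision comes from the architecture.

```c
#include <assert.h>

#define M_PAGE_SHIFT 12
#define M_PAGE_SIZE  (1UL << M_PAGE_SHIFT)

/*
 * Hypothetical stand-in for vmap_set_ptes(): one call covers 1 << shift
 * bytes when the virtual address, the pfn and the remaining length all
 * allow it, otherwise a single page. Returns the number of bytes mapped.
 */
static unsigned long model_set_ptes(unsigned long addr, unsigned long end,
				    unsigned long pfn, unsigned int shift)
{
	unsigned long batch = 1UL << shift;
	unsigned long batch_pages = batch >> M_PAGE_SHIFT;

	if (shift > M_PAGE_SHIFT && (addr & (batch - 1)) == 0 &&
	    (pfn & (batch_pages - 1)) == 0 && end - addr >= batch)
		return batch;
	return M_PAGE_SIZE;
}

/*
 * Mirrors the reworked do/while: the PTE index, the pages[] index (*nr)
 * and addr all advance by the size each setter call reports.
 * Returns how many setter calls were made.
 */
static int model_pte_walk(unsigned long addr, unsigned long end,
			  const unsigned long *pfns, unsigned int shift,
			  int *nr, unsigned long *pte_idx)
{
	unsigned long size, steps;
	int calls = 0;

	do {
		size = model_set_ptes(addr, end, pfns[*nr], shift);
		steps = size >> M_PAGE_SHIFT;	/* PFN_DOWN(size) */
		calls++;
	} while (*pte_idx += steps, *nr += (int)steps, addr += size,
		 addr != end);

	return calls;
}
```

With 32 pfns that start on a 16-page boundary and a 64 KiB-aligned address, the loop makes 2 setter calls instead of 32; shifting the pfns by one page makes every call fall back to a single page.
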



* Re: [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
  2026-05-22  5:31 ` [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
  2026-05-23  7:53   ` Uladzislau Rezki
@ 2026-05-27  6:25   ` Dev Jain
  2026-06-02  8:57     ` Wen Jiang
  1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-05-27  6:25 UTC (permalink / raw)
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 22/05/26 11:01 am, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> Try to align the vmap virtual address to PMD_SHIFT or a
> larger PTE mapping size hinted by the architecture, so
> contiguous pages can be batch-mapped when setting PMD or
> PTE entries.
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>


Hmm, okay. I would have preferred to squash this into the previous patch, but
the correctness of the previous patch does not rely on this, so it's fine.

> ---
>  mm/vmalloc.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 50642246f4d40..040d400928aab 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3620,6 +3620,37 @@ static int vmap_batched(unsigned long addr, unsigned long end,
>  	return err;
>  }
>  

This is screaming for a helper :)

> +static struct vm_struct *get_aligned_vm_area(unsigned long size,
> +		unsigned long flags, const void *caller)


Call this vmap_get_aligned_vm_area, then ...


> +{
> +	struct vm_struct *vm_area;
> +	unsigned int shift;
> +
> +	/* Try PMD alignment for large sizes */
> +	if (size >= PMD_SIZE) {
> +		vm_area = __get_vm_area_node(size, PMD_SIZE, PAGE_SHIFT, flags,
> +				VMALLOC_START, VMALLOC_END,
> +				NUMA_NO_NODE, GFP_KERNEL, caller);

Add a wrapper over this called __get_vm_area_node_aligned_caller, which can
call __get_vm_area_node() with all other arguments fixed, except "align".

> +		if (vm_area)
> +			return vm_area;
> +	}
> +
> +	/* Try CONT_PTE alignment */
> +	shift = arch_vmap_pte_supported_shift(size);
> +	if (shift > PAGE_SHIFT) {
> +		vm_area = __get_vm_area_node(size, 1UL << shift, PAGE_SHIFT, flags,
> +				VMALLOC_START, VMALLOC_END,
> +				NUMA_NO_NODE, GFP_KERNEL, caller);
> +		if (vm_area)
> +			return vm_area;
> +	}
> +
> +	/* Fall back to page alignment */
> +	return __get_vm_area_node(size, PAGE_SIZE, PAGE_SHIFT, flags,
> +			VMALLOC_START, VMALLOC_END,
> +			NUMA_NO_NODE, GFP_KERNEL, caller);
> +}
> +
>  /**
>   * vmap - map an array of pages into virtually contiguous space
>   * @pages: array of page pointers
> @@ -3658,7 +3689,7 @@ void *vmap(struct page **pages, unsigned int count,
>  		return NULL;
>  
>  	size = (unsigned long)count << PAGE_SHIFT;
> -	area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> +	area = get_aligned_vm_area(size, flags, __builtin_return_address(0));
>  	if (!area)
>  		return NULL;
>  

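Dev's wrapper suggestion can be illustrated with a userspace model of the same fallback chain. This is a sketch only: model_get_vm_area_node() is a hypothetical stand-in for __get_vm_area_node() that fails above a configurable alignment to mimic a fragmented vmalloc space, the constant arguments (0, ~0UL, -1) are placeholders for VMALLOC_START, VMALLOC_END and NUMA_NO_NODE (GFP_KERNEL is omitted), and the wrapper name is illustrative rather than what will be merged.

```c
#include <assert.h>
#include <stddef.h>

#define M_PAGE_SHIFT 12
#define M_PAGE_SIZE  (1UL << M_PAGE_SHIFT)
#define M_PMD_SIZE   (1UL << 21)

struct model_area {
	unsigned long size;
	unsigned long align;
};

/* Largest alignment the fake allocator can satisfy (simulates fragmentation). */
static unsigned long model_max_align = M_PMD_SIZE;
static struct model_area model_slot;

/*
 * Hypothetical stand-in for __get_vm_area_node(). Only size and align
 * matter here; the other parameters mimic the arguments that every call
 * in get_aligned_vm_area() passes identically.
 */
static struct model_area *model_get_vm_area_node(unsigned long size,
		unsigned long align, unsigned int shift, unsigned long flags,
		unsigned long start, unsigned long end, int node,
		const void *caller)
{
	(void)shift; (void)flags; (void)start; (void)end; (void)node;
	(void)caller;
	if (align > model_max_align)
		return NULL;
	model_slot.size = size;
	model_slot.align = align;
	return &model_slot;
}

/* The suggested wrapper: every argument fixed except the alignment. */
static struct model_area *model_get_vm_area_aligned(unsigned long size,
		unsigned long align, unsigned long flags, const void *caller)
{
	return model_get_vm_area_node(size, align, M_PAGE_SHIFT, flags,
				      0, ~0UL, -1, caller);
}

/*
 * Same fallback order as get_aligned_vm_area(): PMD, then the arch PTE
 * hint (pte_shift stands in for arch_vmap_pte_supported_shift(size)),
 * then PAGE_SIZE.
 */
static struct model_area *model_vmap_get_aligned_vm_area(unsigned long size,
		unsigned int pte_shift, unsigned long flags, const void *caller)
{
	struct model_area *area;

	if (size >= M_PMD_SIZE) {
		area = model_get_vm_area_aligned(size, M_PMD_SIZE, flags, caller);
		if (area)
			return area;
	}
	if (pte_shift > M_PAGE_SHIFT) {
		area = model_get_vm_area_aligned(size, 1UL << pte_shift,
						 flags, caller);
		if (area)
			return area;
	}
	return model_get_vm_area_aligned(size, M_PAGE_SIZE, flags, caller);
}
```

With the constant arguments folded into one wrapper, the three attempts differ only in the alignment they request, which is what the review comment is after.
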



* Re: [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
  2026-05-22  5:31 ` [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
@ 2026-05-27  8:27   ` Dev Jain
  2026-05-28  3:42     ` Wen Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-05-27  8:27 UTC (permalink / raw)
  To: Wen Jiang, linux-mm, linux-arm-kernel, catalin.marinas, will,
	akpm, urezki
  Cc: baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 22/05/26 11:01 am, Wen Jiang wrote:
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> 
> In many cases, the pages passed to vmap() may include high-order
> pages. For example, the systemheap often allocates pages in descending
> order: order 8, then 4, then 0. Currently, vmap() iterates over every
> page individually—even pages inside a high-order block are handled
> one by one.
> 
> This patch detects physically contiguous pages (regardless of whether
> they are compound or non-compound) by scanning with
> num_pages_contiguous(), and maps them as a single contiguous block
> whenever possible. The first page's pfn must be aligned to the
> mapping order for the batched mapping to be used.
> 
> Pages with the same page_shift are coalesced and mapped via
> vmap_pages_range_noflush_walk() to avoid page table rewalk.
> 
> As users typically allocate memory in descending orders (e.g.
> 8 → 4 → 0), once an order-0 page is encountered, we stop scanning
> for contiguous pages since subsequent pages are likely order-0 as well.
> 
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  mm/vmalloc.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 80 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index deb764abc0571..50642246f4d40 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3542,6 +3542,84 @@ void vunmap(const void *addr)
>  }
>  EXPORT_SYMBOL(vunmap);
>  
> +static inline int get_vmap_batch_order(struct page **pages,
> +		unsigned int max_steps, unsigned int idx)
> +{
> +	unsigned int nr_contig;
> +	int order;
> +
> +	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
> +			ioremap_max_page_shift == PAGE_SHIFT)


Why bail out on ioremap_max_page_shift == PAGE_SHIFT? The code
path for ioremap is different from vmap, right?


> +		return 0;
> +
> +	nr_contig = num_pages_contiguous(&pages[idx], max_steps);
> +	if (nr_contig < 2)
> +		return 0;
> +
> +	order = fls(nr_contig) - 1;
> +
> +	if (arch_vmap_pte_supported_shift(PAGE_SIZE << order) == PAGE_SHIFT)
> +		return 0;
> +
> +	/* Ensure the first page's pfn is aligned to the order */
> +	if (!IS_ALIGNED(page_to_pfn(pages[idx]), 1 << order))
> +		return 0;
> +
> +	return order;
> +}
> +
> +static int vmap_batched(unsigned long addr, unsigned long end,
> +		pgprot_t prot, struct page **pages)
> +{
> +	unsigned int count = (end - addr) >> PAGE_SHIFT;
> +	unsigned int prev_shift = 0, idx = 0;
> +	unsigned long start = addr, map_addr = addr;
> +	int err;
> +
> +	err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> +						PAGE_SHIFT, GFP_KERNEL);
> +	if (err)
> +		goto out;
> +
> +	for (unsigned int i = 0; i < count; ) {
> +		unsigned int shift = PAGE_SHIFT +
> +			get_vmap_batch_order(pages, count - i, i);
> +
> +		if (!i)
> +			prev_shift = shift;
> +
> +		if (shift != prev_shift) {
> +			err = vmap_pages_range_noflush_walk(map_addr, addr,

It would be worth documenting in vmap_pages_range_noflush_walk() that
it can take an array of pages that are not all contiguous but may
contain contiguous chunks, as hinted by page_shift.

Otherwise this looks good.

> +					prot, pages + idx,
> +					min(prev_shift, PMD_SHIFT));
> +			if (err)
> +				goto out;
> +			prev_shift = shift;
> +			map_addr = addr;
> +			idx = i;
> +		}
> +
> +		/*
> +		 * Once small pages are encountered, the remaining pages
> +		 * are likely small as well.
> +		 */
> +		if (shift == PAGE_SHIFT)
> +			break;
> +
> +		addr += 1UL << shift;
> +		i += 1U << (shift - PAGE_SHIFT);
> +	}
> +
> +	/* Remaining */
> +	if (map_addr < end)
> +		err = vmap_pages_range_noflush_walk(map_addr, end,
> +				prot, pages + idx, min(prev_shift, PMD_SHIFT));
> +
> +out:
> +	flush_cache_vmap(start, end);
> +	return err;
> +}
> +
>  /**
>   * vmap - map an array of pages into virtually contiguous space
>   * @pages: array of page pointers
> @@ -3585,8 +3663,8 @@ void *vmap(struct page **pages, unsigned int count,
>  		return NULL;
>  
>  	addr = (unsigned long)area->addr;
> -	if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> -				pages, PAGE_SHIFT) < 0) {
> +	if (vmap_batched(addr, addr + size, pgprot_nx(prot),
> +				pages) < 0) {
>  		vunmap(area->addr);
>  		return NULL;
>  	}

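How vmap_batched() splits a system_heap-style array can be traced with a userspace model. This is a simplified sketch: pages[] is reduced to an array of pfns, model_num_contig() stands in for num_pages_contiguous(), the CONFIG_HAVE_ARCH_HUGE_VMAP and ioremap_max_page_shift checks are assumed to pass, and arm64 with 4 KiB pages is assumed, so runs shorter than a 16-page CONT_PTE block are not batched. All model_* names are hypothetical.

```c
#include <assert.h>

/* Assumptions: 4 KiB pages and arm64's 16-entry CONT_PTE blocks. */
#define M_PAGE_SHIFT     12
#define M_PMD_SHIFT      21
#define M_CONT_PTE_ORDER 4

/* Length of the physically consecutive run at the start of pfns[0..max). */
static unsigned int model_num_contig(const unsigned long *pfns,
				     unsigned int max)
{
	unsigned int n = 1;

	while (n < max && pfns[n] == pfns[0] + n)
		n++;
	return n;
}

/*
 * Simplified get_vmap_batch_order(): order = fls(nr_contig) - 1, rejected
 * if smaller than a CONT_PTE block (the assumed arm64 answer from
 * arch_vmap_pte_supported_shift()) or if the first pfn is not aligned.
 */
static unsigned int model_batch_order(const unsigned long *pfns,
				      unsigned int max, unsigned int idx)
{
	unsigned int nr = model_num_contig(&pfns[idx], max);
	unsigned int order = 0;

	if (nr < 2)
		return 0;
	while ((2UL << order) <= nr)
		order++;
	if (order < M_CONT_PTE_ORDER)
		return 0;
	if (pfns[idx] & ((1UL << order) - 1))
		return 0;
	return order;
}

struct model_seg {
	unsigned int first_page, nr_pages, shift;
};

static unsigned int model_clamp_shift(unsigned int shift)
{
	return shift < M_PMD_SHIFT ? shift : M_PMD_SHIFT;
}

/*
 * Mirrors the segmentation in vmap_batched(): consecutive blocks sharing a
 * shift become one walk call (recorded in segs[]); the first block that
 * cannot be batched stops the scan and the rest is mapped with
 * PAGE_SHIFT. Returns the number of walk calls.
 */
static unsigned int model_vmap_batched(const unsigned long *pfns,
				       unsigned int count,
				       struct model_seg *segs)
{
	unsigned int prev_shift = 0, idx = 0, nseg = 0, i = 0;

	while (i < count) {
		unsigned int shift = M_PAGE_SHIFT +
			model_batch_order(pfns, count - i, i);

		if (!i)
			prev_shift = shift;
		if (shift != prev_shift) {
			segs[nseg++] = (struct model_seg){ idx, i - idx,
					model_clamp_shift(prev_shift) };
			prev_shift = shift;
			idx = i;
		}
		if (shift == M_PAGE_SHIFT)
			break;
		i += 1U << (shift - M_PAGE_SHIFT);
	}
	if (idx < count)
		segs[nseg++] = (struct model_seg){ idx, count - idx,
				model_clamp_shift(prev_shift) };
	return nseg;
}
```

For an order-8 block, then an order-4 block, then three scattered order-0 pages (the 8 -> 4 -> 0 pattern from the commit message), this yields three walk calls: 256 pages with shift 20, 16 pages with shift 16, and the last 3 pages with PAGE_SHIFT. It also shows the kind of array Dev asks to document: if several separate order-4 blocks followed each other, they would share one walk call while being contiguous only within each 16-page chunk.
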



* Re: [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-05-27  5:58   ` Dev Jain
@ 2026-05-28  3:39     ` Wen Jiang
  2026-05-29  5:28       ` Dev Jain
  2026-06-05  6:02       ` Dev Jain
  0 siblings, 2 replies; 26+ messages in thread
From: Wen Jiang @ 2026-05-28  3:39 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Wed, 27 May 2026 at 13:59, Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 22/05/26 11:01 am, Wen Jiang wrote:
> > From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> >
> > vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
> > provides a clean interface by taking struct page **pages and mapping them
> > via direct PTE iteration. This avoids the page table rewalk seen when
> > using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
> >
> > Extend it to support larger page_shift values, and add PMD- and
> > contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
> > since it now handles more than just small pages.
> >
> > For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
> > iterate over pages one by one via vmap_range_noflush(), which would
> > otherwise lead to page table rewalk. The code is now unified with the
> > PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> > Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> > ---
> >  mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 40 insertions(+), 31 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 53fd4ee460ea4..deb764abc0571 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
> >
> >  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> > -             pgtbl_mod_mask *mask)
> > +             pgtbl_mod_mask *mask, unsigned int shift)
> >  {
> > +     unsigned long pfn, size;
> > +     unsigned int steps;
> >       int err = 0;
> >       pte_t *pte;
> >
> > @@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >                       break;
> >               }
> >
> > -             set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> > -             (*nr)++;
> > -     } while (pte++, addr += PAGE_SIZE, addr != end);
> > +             pfn = page_to_pfn(page);
> > +             size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
> > +             steps = PFN_DOWN(size);
> > +     } while (pte += steps, *nr += steps, addr += size, addr != end);
> >
> >       lazy_mmu_mode_disable();
> >       *mask |= PGTBL_PTE_MODIFIED;
> > @@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >
> >  static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> >               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> > -             pgtbl_mod_mask *mask)
> > +             pgtbl_mod_mask *mask, unsigned int shift)
> >  {
> >       pmd_t *pmd;
> >       unsigned long next;
> > @@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> >               return -ENOMEM;
> >       do {
> >               next = pmd_addr_end(addr, end);
> > -             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> > +
> > +             if (shift == PMD_SHIFT) {
> > +                     struct page *page = pages[*nr];
> > +                     phys_addr_t phys_addr;
> > +
> > +                     if (WARN_ON(!page))
> > +                             return -ENOMEM;
> > +                     if (WARN_ON(!pfn_valid(page_to_pfn(page))))
> > +                             return -EINVAL;
>
>
> So I know these !page and !pfn_valid checks have been copied from vmap_pages_pte_range,
> but do they mean anything?
>
> I think pfn_valid() makes sense, in that someone may take a random VA/PA, convert it into a struct
> page and pass it to the vmap layer. But I don't see how anyone would pass page == NULL. At the
> very least, returning -ENOMEM does not make sense, because the pages are not being
> allocated by vmap(); they have already been allocated.

Hi Dev,

vmap() is exported via EXPORT_SYMBOL and has many callers across
drivers, each constructing the pages array differently. The !page
check guards against malformed arrays at this API boundary.

The same -ENOMEM issue also exists in vmap_pages_pte_range().
Should I fix both in this patchset or leave it as a separate cleanup?

>
> > +
> > +                     phys_addr = page_to_phys(page);
> > +
> > +                     if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> > +                                             shift)) {
> > +                             *mask |= PGTBL_PMD_MODIFIED;
> > +                             *nr += 1 << (shift - PAGE_SHIFT);
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
> >                       return -ENOMEM;
> >       } while (pmd++, addr = next, addr != end);
> >       return 0;
> > @@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> >
> >  static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> >               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> > -             pgtbl_mod_mask *mask)
> > +             pgtbl_mod_mask *mask, unsigned int shift)
> >  {
> >       pud_t *pud;
> >       unsigned long next;
> > @@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> >               return -ENOMEM;
> >       do {
> >               next = pud_addr_end(addr, end);
> > -             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> > +             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
> >                       return -ENOMEM;
> >       } while (pud++, addr = next, addr != end);
> >       return 0;
> > @@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> >
> >  static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> >               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> > -             pgtbl_mod_mask *mask)
> > +             pgtbl_mod_mask *mask, unsigned int shift)
> >  {
> >       p4d_t *p4d;
> >       unsigned long next;
> > @@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> >               return -ENOMEM;
> >       do {
> >               next = p4d_addr_end(addr, end);
> > -             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> > +             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
> >                       return -ENOMEM;
> >       } while (p4d++, addr = next, addr != end);
> >       return 0;
> >  }
> >
> > -static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> > -             pgprot_t prot, struct page **pages)
> > +static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
> > +             pgprot_t prot, struct page **pages, unsigned int shift)
> >  {
> >       unsigned long start = addr;
> >       pgd_t *pgd;
> > @@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> >               next = pgd_addr_end(addr, end);
> >               if (pgd_bad(*pgd))
> >                       mask |= PGTBL_PGD_MODIFIED;
> > -             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> > +             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
> >               if (err)
> >                       break;
> >       } while (pgd++, addr = next, addr != end);
> > @@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> >  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> >               pgprot_t prot, struct page **pages, unsigned int page_shift)
> >  {
> > -     unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> > -
> >       WARN_ON(page_shift < PAGE_SHIFT);
> >
> > -     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> > -                     page_shift == PAGE_SHIFT)
> > -             return vmap_small_pages_range_noflush(addr, end, prot, pages);
> > +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
> > +             page_shift = PAGE_SHIFT;
> >
> > -     for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> > -             int err;
> > -
> > -             err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> > -                                     page_to_phys(pages[i]), prot,
> > -                                     page_shift);
> > -             if (err)
> > -                     return err;
> > -
> > -             addr += 1UL << page_shift;
> > -     }
> > -
> > -     return 0;
> > +     return vmap_pages_range_noflush_walk(addr, end, prot, pages,
> > +                     min(page_shift, PMD_SHIFT));
>
>
> We can easily extend to PUD huge mappings right? Not sure whether we
> should keep everything symmetric to how vmap_range_noflush() operates
> right now, since P4D mappings don't exist, but PUD looks worthwhile.
>

PUD mapping requires 1GB of contiguous physical memory, but the buddy
allocator's MAX_PAGE_ORDER is 10 (4MB on 4K pages). So page_shift
passed to vmap_pages_range_noflush_walk() never exceeds PMD_SHIFT.

Thanks,
Wen
> >  }
> >
> >  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>
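The size argument above comes down to simple arithmetic. The sketch below is a minimal check that assumes arm64 with a 4K granule and MAX_PAGE_ORDER of 10. The macros are local stand-ins rather than the kernel headers, and clamp_to_pmd() is a hypothetical helper that mirrors the min(page_shift, PMD_SHIFT) in the patch. Other granules or a different ARCH_FORCE_MAX_ORDER setting would change the numbers.

```c
#include <assert.h>

/* Assumed arm64 4K-granule values; not taken from kernel headers. */
#define PAGE_SHIFT      12
#define PMD_SHIFT       21   /* 2MB */
#define PUD_SHIFT       30   /* 1GB */
#define MAX_PAGE_ORDER  10   /* assumed default for 4K pages */

/* Largest block the buddy allocator can return, expressed as a shift. */
static unsigned int max_buddy_shift(void)
{
	return PAGE_SHIFT + MAX_PAGE_ORDER;
}

/* Hypothetical mirror of min(page_shift, PMD_SHIFT) from the patch. */
static unsigned int clamp_to_pmd(unsigned int page_shift)
{
	return page_shift < PMD_SHIFT ? page_shift : PMD_SHIFT;
}
```

Under these assumptions the largest buddy block (4MB) is big enough for a 2MB PMD but 256 times smaller than a 1GB PUD. Clamping to PMD_SHIFT therefore gives up nothing for buddy-allocated pages.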



* Re: [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
  2026-05-27  8:27   ` Dev Jain
@ 2026-05-28  3:42     ` Wen Jiang
  2026-05-29  5:57       ` Dev Jain
  0 siblings, 1 reply; 26+ messages in thread
From: Wen Jiang @ 2026-05-28  3:42 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Wed, 27 May 2026 at 16:28, Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 22/05/26 11:01 am, Wen Jiang wrote:
> > From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> >
> > In many cases, the pages passed to vmap() may include high-order
> > pages. For example, the systemheap often allocates pages in descending
> > order: order 8, then 4, then 0. Currently, vmap() iterates over every
> > page individually—even pages inside a high-order block are handled
> > one by one.
> >
> > This patch detects physically contiguous pages (regardless of whether
> > they are compound or non-compound) by scanning with
> > num_pages_contiguous(), and maps them as a single contiguous block
> > whenever possible. The first page's pfn must be aligned to the
> > mapping order for the batched mapping to be used.
> >
> > Pages with the same page_shift are coalesced and mapped via
> > vmap_pages_range_noflush_walk() to avoid page table rewalk.
> >
> > As users typically allocate memory in descending orders (e.g.
> > 8 → 4 → 0), once an order-0 page is encountered, we stop scanning
> > for contiguous pages since subsequent pages are likely order-0 as well.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > Co-developed-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> > Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> > ---
> >  mm/vmalloc.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 80 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index deb764abc0571..50642246f4d40 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3542,6 +3542,84 @@ void vunmap(const void *addr)
> >  }
> >  EXPORT_SYMBOL(vunmap);
> >
> > +static inline int get_vmap_batch_order(struct page **pages,
> > +             unsigned int max_steps, unsigned int idx)
> > +{
> > +     unsigned int nr_contig;
> > +     int order;
> > +
> > +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
> > +                     ioremap_max_page_shift == PAGE_SHIFT)
>
>
> Why bail out on ioremap_max_page_shift == PAGE_SHIFT? The code
> path for ioremap is different from vmap right?
>
>

ioremap_max_page_shift is under CONFIG_HAVE_ARCH_HUGE_VMAP which
controls both ioremap and vmap huge mappings.

> > +             return 0;
> > +
> > +     nr_contig = num_pages_contiguous(&pages[idx], max_steps);
> > +     if (nr_contig < 2)
> > +             return 0;
> > +
> > +     order = fls(nr_contig) - 1;
> > +
> > +     if (arch_vmap_pte_supported_shift(PAGE_SIZE << order) == PAGE_SHIFT)
> > +             return 0;
> > +
> > +     /* Ensure the first page's pfn is aligned to the order */
> > +     if (!IS_ALIGNED(page_to_pfn(pages[idx]), 1 << order))
> > +             return 0;
> > +
> > +     return order;
> > +}
> > +
> > +static int vmap_batched(unsigned long addr, unsigned long end,
> > +             pgprot_t prot, struct page **pages)
> > +{
> > +     unsigned int count = (end - addr) >> PAGE_SHIFT;
> > +     unsigned int prev_shift = 0, idx = 0;
> > +     unsigned long start = addr, map_addr = addr;
> > +     int err;
> > +
> > +     err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> > +                                             PAGE_SHIFT, GFP_KERNEL);
> > +     if (err)
> > +             goto out;
> > +
> > +     for (unsigned int i = 0; i < count; ) {
> > +             unsigned int shift = PAGE_SHIFT +
> > +                     get_vmap_batch_order(pages, count - i, i);
> > +
> > +             if (!i)
> > +                     prev_shift = shift;
> > +
> > +             if (shift != prev_shift) {
> > +                     err = vmap_pages_range_noflush_walk(map_addr, addr,
>
> It would be worth documenting vmap_pages_range_noflush_walk() that
> it can take an array of pages which are not all contiguous, but it
> may have contiguous chunks, as hinted by page_shift.
>
> Otherwise this looks good.
>
> > +                                     prot, pages + idx,
> > +                                     min(prev_shift, PMD_SHIFT));
> > +                     if (err)
> > +                             goto out;
> > +                     prev_shift = shift;
> > +                     map_addr = addr;
> > +                     idx = i;
> > +             }
> > +
> > +             /*
> > +              * Once small pages are encountered, the remaining pages
> > +              * are likely small as well.
> > +              */
> > +             if (shift == PAGE_SHIFT)
> > +                     break;
> > +
> > +             addr += 1UL << shift;
> > +             i += 1U << (shift - PAGE_SHIFT);
> > +     }
> > +
> > +     /* Remaining */
> > +     if (map_addr < end)
> > +             err = vmap_pages_range_noflush_walk(map_addr, end,
> > +                             prot, pages + idx, min(prev_shift, PMD_SHIFT));
> > +
> > +out:
> > +     flush_cache_vmap(start, end);
> > +     return err;
> > +}
> > +
> >  /**
> >   * vmap - map an array of pages into virtually contiguous space
> >   * @pages: array of page pointers
> > @@ -3585,8 +3663,8 @@ void *vmap(struct page **pages, unsigned int count,
> >               return NULL;
> >
> >       addr = (unsigned long)area->addr;
> > -     if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> > -                             pages, PAGE_SHIFT) < 0) {
> > +     if (vmap_batched(addr, addr + size, pgprot_nx(prot),
> > +                             pages) < 0) {
> >               vunmap(area->addr);
> >               return NULL;
> >       }
>
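The batching policy in the commit message can be modelled in user space. The policy consists of a contiguity scan, an fls()-based order, a PFN alignment check, coalescing of runs with equal shift, and stopping at the first order-0 chunk. The sketch below is only an approximation, and all of its names are hypothetical:

- It operates on an array of PFNs instead of struct page pointers.
- nr_contig() stands in for num_pages_contiguous().
- It assumes arm64's 4K granule, where arch_vmap_pte_supported_shift() reports a larger shift only for runs of 16 or more pages.
- Instead of mapping anything, plan() records the calls it would make to vmap_pages_range_noflush_walk().

```c
#include <assert.h>

#define PAGE_SHIFT     12
#define PMD_SHIFT      21
#define CONT_PTE_ORDER 4    /* assumed: arm64 4K granule, 16 x 4K = 64K */
#define MAX_SEGS       16
#define DEMO_PAGES     276  /* 256 + 16 + 4 */

/* One would-be call to vmap_pages_range_noflush_walk(). */
struct seg { unsigned int idx, nr_pages, shift; };

static struct seg demo_segs[MAX_SEGS];

/* Stand-in for num_pages_contiguous() on an array of PFNs. */
static unsigned int nr_contig(const unsigned long *pfn, unsigned int max)
{
	unsigned int n = 1;

	while (n < max && pfn[n] == pfn[0] + n)
		n++;
	return n;
}

/* Models get_vmap_batch_order() from v3 under the assumed arch rule. */
static unsigned int batch_order(const unsigned long *pfn, unsigned int max)
{
	unsigned int n = nr_contig(pfn, max), order = 0;

	if (n < 2)
		return 0;
	while ((2u << order) <= n)            /* order = fls(n) - 1 */
		order++;
	if (order < CONT_PTE_ORDER)           /* arch reports PAGE_SHIFT */
		return 0;
	if (pfn[0] & ((1ul << order) - 1))    /* first PFN not order-aligned */
		return 0;
	return order;
}

static unsigned int clamp_pmd(unsigned int shift)
{
	return shift < PMD_SHIFT ? shift : PMD_SHIFT;
}

/* Models the vmap_batched() loop; segs must have room for every segment. */
static unsigned int plan(const unsigned long *pfn, unsigned int count,
			 struct seg *segs)
{
	unsigned int i = 0, idx = 0, prev = 0, nsegs = 0;

	while (i < count) {
		unsigned int shift = PAGE_SHIFT + batch_order(pfn + i, count - i);

		if (!i)
			prev = shift;
		if (shift != prev) {          /* shift changed: flush previous run */
			segs[nsegs++] = (struct seg){ idx, i - idx, clamp_pmd(prev) };
			prev = shift;
			idx = i;
		}
		if (shift == PAGE_SHIFT)      /* assume the rest is order-0 too */
			break;
		i += 1u << (shift - PAGE_SHIFT);
	}
	if (idx < count)                      /* remaining pages */
		segs[nsegs++] = (struct seg){ idx, count - idx, clamp_pmd(prev) };
	return nsegs;
}

/* Demo layout: an order-8 block, an order-4 block, then four order-0 pages. */
static unsigned int plan_demo(void)
{
	static unsigned long pfn[DEMO_PAGES];
	unsigned int n = 0;

	for (unsigned int k = 0; k < 256; k++)
		pfn[n++] = 0x1000 + k;
	for (unsigned int k = 0; k < 16; k++)
		pfn[n++] = 0x2000 + k;
	for (unsigned int k = 0; k < 4; k++)
		pfn[n++] = 0x3001 + 4 * k;
	return plan(pfn, n, demo_segs);
}
```

For the order 8, 4, then 0 layout from the commit message, this model issues three walks:

- 256 pages with shift 20.
- 16 pages with shift 16.
- The four order-0 pages with PAGE_SHIFT.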



* Re: [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-05-28  3:39     ` Wen Jiang
@ 2026-05-29  5:28       ` Dev Jain
  2026-06-05  6:02       ` Dev Jain
  1 sibling, 0 replies; 26+ messages in thread
From: Dev Jain @ 2026-05-29  5:28 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 28/05/26 9:09 am, Wen Jiang wrote:
> On Wed, 27 May 2026 at 13:59, Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 22/05/26 11:01 am, Wen Jiang wrote:
>> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
>>
>> vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
>> provides a clean interface by taking struct page **pages and mapping them
>> via direct PTE iteration. This avoids the page table rewalk seen when
>> using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
>>
>> Extend it to support larger page_shift values, and add PMD- and
>> contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
>> since it now handles more than just small pages.
>>
>> For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
>> iterate over pages one by one via vmap_range_noflush(), which would
>> otherwise lead to page table rewalk. The code is now unified with the
>> PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
>>
>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
>> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
>> ---
>>  mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
>>  1 file changed, 40 insertions(+), 31 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 53fd4ee460ea4..deb764abc0571 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
>>
>>  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>> -             pgtbl_mod_mask *mask)
>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>  {
>> +     unsigned long pfn, size;
>> +     unsigned int steps;
>>       int err = 0;
>>       pte_t *pte;
>>
>> @@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>>                       break;
>>               }
>>
>> -             set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
>> -             (*nr)++;
>> -     } while (pte++, addr += PAGE_SIZE, addr != end);
>> +             pfn = page_to_pfn(page);
>> +             size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
>> +             steps = PFN_DOWN(size);
>> +     } while (pte += steps, *nr += steps, addr += size, addr != end);
>>
>>       lazy_mmu_mode_disable();
>>       *mask |= PGTBL_PTE_MODIFIED;
>> @@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>>
>>  static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>> -             pgtbl_mod_mask *mask)
>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>  {
>>       pmd_t *pmd;
>>       unsigned long next;
>>> @@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>>>               return -ENOMEM;
>>>       do {
>>>               next = pmd_addr_end(addr, end);
>>> -             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
>>> +
>>> +             if (shift == PMD_SHIFT) {
>>> +                     struct page *page = pages[*nr];
>>> +                     phys_addr_t phys_addr;
>>> +
>>> +                     if (WARN_ON(!page))
>>> +                             return -ENOMEM;
>>> +                     if (WARN_ON(!pfn_valid(page_to_pfn(page))))
>>> +                             return -EINVAL;
>>
>>
>> So I know these !page and !pfn_valid checks have been copied from vmap_pages_pte_range,
>> but do they mean anything?
>>
>> I think pfn_valid() makes sense in that someone may take a random VA/PA, convert it into a struct
>> page and pass to vmap layer. But I don't see how anyone would pass page == NULL? At the
>> very least, returning ENOMEM does not make sense because the pages are not being
>> allocated by vmap() but have already been allocated.
> 
> Hi Dev,
> 
> vmap() is EXPORT_SYMBOL with many callers across drivers, each
> constructing the pages array differently. The !page check guards
> against malformed arrays at this API boundary.
> 
> The same -ENOMEM issue also exists in vmap_pages_pte_range().
> Should I fix both in this patchset or leave it as a separate cleanup?

Hmm, I think this is a question of who validates what. The change
should be done, but it is not urgent, so let us leave it for now.

> 
>>
>>> +
>>> +                     phys_addr = page_to_phys(page);
>>> +
>>> +                     if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
>>> +                                             shift)) {
>>> +                             *mask |= PGTBL_PMD_MODIFIED;
>>> +                             *nr += 1 << (shift - PAGE_SHIFT);
>>> +                             continue;
>>> +                     }
>>> +             }
>>> +
>>> +             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
>>>                       return -ENOMEM;
>>>       } while (pmd++, addr = next, addr != end);
>>>       return 0;
>>> @@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>>>
>>>  static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>>> -             pgtbl_mod_mask *mask)
>>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>>  {
>>>       pud_t *pud;
>>>       unsigned long next;
>>> @@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>>>               return -ENOMEM;
>>>       do {
>>>               next = pud_addr_end(addr, end);
>>> -             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
>>> +             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
>>>                       return -ENOMEM;
>>>       } while (pud++, addr = next, addr != end);
>>>       return 0;
>>> @@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>>>
>>>  static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>>> -             pgtbl_mod_mask *mask)
>>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>>  {
>>>       p4d_t *p4d;
>>>       unsigned long next;
>>> @@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>>>               return -ENOMEM;
>>>       do {
>>>               next = p4d_addr_end(addr, end);
>>> -             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
>>> +             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
>>>                       return -ENOMEM;
>>>       } while (p4d++, addr = next, addr != end);
>>>       return 0;
>>>  }
>>>
>>> -static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>> -             pgprot_t prot, struct page **pages)
>>> +static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
>>> +             pgprot_t prot, struct page **pages, unsigned int shift)
>>>  {
>>>       unsigned long start = addr;
>>>       pgd_t *pgd;
>>> @@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>>               next = pgd_addr_end(addr, end);
>>>               if (pgd_bad(*pgd))
>>>                       mask |= PGTBL_PGD_MODIFIED;
>>> -             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
>>> +             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
>>>               if (err)
>>>                       break;
>>>       } while (pgd++, addr = next, addr != end);
>>> @@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>>  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>>               pgprot_t prot, struct page **pages, unsigned int page_shift)
>>>  {
>>> -     unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>>> -
>>>       WARN_ON(page_shift < PAGE_SHIFT);
>>>
>>> -     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>>> -                     page_shift == PAGE_SHIFT)
>>> -             return vmap_small_pages_range_noflush(addr, end, prot, pages);
>>> +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
>>> +             page_shift = PAGE_SHIFT;
>>>
>>> -     for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>>> -             int err;
>>> -
>>> -             err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>>> -                                     page_to_phys(pages[i]), prot,
>>> -                                     page_shift);
>>> -             if (err)
>>> -                     return err;
>>> -
>>> -             addr += 1UL << page_shift;
>>> -     }
>>> -
>>> -     return 0;
>>> +     return vmap_pages_range_noflush_walk(addr, end, prot, pages,
>>> +                     min(page_shift, PMD_SHIFT));
>>
>>
>> We can easily extend to PUD huge mappings right? Not sure whether we
>> should keep everything symmetric to how vmap_range_noflush() operates
>> right now, since P4D mappings don't exist, but PUD looks worthwhile.
>>
> 
> PUD mapping requires 1GB of contiguous physical memory, but the buddy
> allocator's MAX_PAGE_ORDER is 10 (4MB on 4K pages). So page_shift
> passed to vmap_pages_range_noflush_walk() never exceeds PMD_SHIFT.

Ah okay. It is really confusing to see PUD helpers everywhere when
they are basically dead code ...

> 
> Thanks,
> Wen
>>>  }
>>>
>>>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>
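The PMD branch added to vmap_pages_pmd_range() succeeds only when a whole, PMD-aligned virtual window is backed by a PMD-aligned physical start. Every other window falls through to vmap_pages_pte_range(), where cont-PTE may still apply. Either way, *nr advances by the number of pages consumed. The sketch below models just that decision for a 4K granule, with pages represented by PFNs. try_huge_pmd() is a simplified, hypothetical stand-in for vmap_try_huge_pmd(); it omits the arch_vmap_pmd_supported() check and the freeing of an existing PTE table.

```c
#include <assert.h>

#define PAGE_SHIFT   12
#define PMD_SHIFT    21
#define PMD_SIZE     (1ul << PMD_SHIFT)
#define PMD_NR_PAGES (1ul << (PMD_SHIFT - PAGE_SHIFT))   /* 512 */
#define DEMO_PAGES   1040                                /* 2 x 512 + 16 */

struct walk_stats { unsigned int pmd_maps; unsigned long pte_pages; };

/* Simplified, hypothetical stand-in for vmap_try_huge_pmd(). */
static int try_huge_pmd(unsigned long addr, unsigned long next,
			unsigned long pfn, unsigned int shift)
{
	return shift >= PMD_SHIFT && next - addr == PMD_SIZE &&
	       (addr & (PMD_SIZE - 1)) == 0 && (pfn & (PMD_NR_PAGES - 1)) == 0;
}

/* Models the patched vmap_pages_pmd_range() loop; nr indexes pfn[]. */
static struct walk_stats walk_pmds(unsigned long addr, unsigned long end,
				   const unsigned long *pfn, unsigned int shift)
{
	struct walk_stats st = { 0, 0 };
	unsigned long nr = 0;

	while (addr != end) {
		/* Like pmd_addr_end(): next PMD boundary, capped at end. */
		unsigned long next = (addr + PMD_SIZE) & ~(PMD_SIZE - 1);
		unsigned long npages;

		if (next > end)
			next = end;
		npages = (next - addr) >> PAGE_SHIFT;

		if (shift == PMD_SHIFT && try_huge_pmd(addr, next, pfn[nr], shift))
			st.pmd_maps++;            /* one PMD entry covers 512 pages */
		else
			st.pte_pages += npages;   /* left to vmap_pages_pte_range() */
		nr += npages;
		addr = next;
	}
	return st;
}

/* Demo: 1024 contiguous pages at a 512-aligned PFN, then 16 unrelated pages. */
static struct walk_stats demo_walk(unsigned long va, unsigned int shift)
{
	static unsigned long pfn[DEMO_PAGES];

	for (unsigned long k = 0; k < DEMO_PAGES; k++)
		pfn[k] = k < 1024 ? 0x80000 + k : 0x90000 + (k - 1024);
	return walk_pmds(va, va + ((unsigned long)DEMO_PAGES << PAGE_SHIFT),
			 pfn, shift);
}
```

With a PMD-aligned virtual start, the 1040 pages become two PMD entries plus a 16-page PTE tail. Moving the virtual start by 64K means no full window begins at a 2MB-aligned physical address, so the same pages end up entirely at PTE level.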




* Re: [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
  2026-05-28  3:42     ` Wen Jiang
@ 2026-05-29  5:57       ` Dev Jain
  2026-06-02  7:34         ` Wen Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-05-29  5:57 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 28/05/26 9:12 am, Wen Jiang wrote:
> On Wed, 27 May 2026 at 16:28, Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 22/05/26 11:01 am, Wen Jiang wrote:
>>> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
>>>
>>> In many cases, the pages passed to vmap() may include high-order
>>> pages. For example, the systemheap often allocates pages in descending
>>> order: order 8, then 4, then 0. Currently, vmap() iterates over every
>>> page individually—even pages inside a high-order block are handled
>>> one by one.
>>>
>>> This patch detects physically contiguous pages (regardless of whether
>>> they are compound or non-compound) by scanning with
>>> num_pages_contiguous(), and maps them as a single contiguous block
>>> whenever possible. The first page's pfn must be aligned to the
>>> mapping order for the batched mapping to be used.
>>>
>>> Pages with the same page_shift are coalesced and mapped via
>>> vmap_pages_range_noflush_walk() to avoid page table rewalk.
>>>
>>> As users typically allocate memory in descending orders (e.g.
>>> 8 → 4 → 0), once an order-0 page is encountered, we stop scanning
>>> for contiguous pages since subsequent pages are likely order-0 as well.
>>>
>>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
>>> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
>>> ---
>>>  mm/vmalloc.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>  1 file changed, 80 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index deb764abc0571..50642246f4d40 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -3542,6 +3542,84 @@ void vunmap(const void *addr)
>>>  }
>>>  EXPORT_SYMBOL(vunmap);
>>>
>>> +static inline int get_vmap_batch_order(struct page **pages,
>>> +             unsigned int max_steps, unsigned int idx)
>>> +{
>>> +     unsigned int nr_contig;
>>> +     int order;
>>> +
>>> +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
>>> +                     ioremap_max_page_shift == PAGE_SHIFT)
>>
>>
>> Why bail out on ioremap_max_page_shift == PAGE_SHIFT? The code
>> path for ioremap is different from vmap right?
>>
>>
> 
> ioremap_max_page_shift is under CONFIG_HAVE_ARCH_HUGE_VMAP which
> controls both ioremap and vmap huge mappings.

I don't get it. So with this patch if nohugeiomap is passed on kernel
cmdline, then vmap-huge is also disabled. That does not sound correct.
Currently ioremap_max_page_shift does not play at all with the normal
vmap code path. It is only involved in ioremap_page_range().


> 
>>> +             return 0;
>>> +
>>> +     nr_contig = num_pages_contiguous(&pages[idx], max_steps);
>>> +     if (nr_contig < 2)
>>> +             return 0;
>>> +
>>> +     order = fls(nr_contig) - 1;
>>> +
>>> +     if (arch_vmap_pte_supported_shift(PAGE_SIZE << order) == PAGE_SHIFT)
>>> +             return 0;

Also, for arches where this function does not do anything special
(i.e. it always returns PAGE_SHIFT), we will not do any huge mappings
for them at all, not even at the PMD level.


>>> +
>>> +     /* Ensure the first page's pfn is aligned to the order */
>>> +     if (!IS_ALIGNED(page_to_pfn(pages[idx]), 1 << order))
>>> +             return 0;

This condition is a bit fragile. It may happen that we have, say, 2^8
contiguous pages that are aligned to only 2^4; this check then rejects
batching altogether, even though smaller (e.g. 2^4-page) batches could
still be used. We are operating on a page array and have no idea if the
caller has passed some random subrange of the array.

I think the purpose of these checks is this - to do an early bailout
if arch does not support huge mappings, or the alignment is not correct,
instead of finding this out very deep into vmap_pages_range_noflush_walk.

So you could do something like (completely untested and may miss some edge cases):

order = ilog2(nr_contig);

order = min(order, __ffs(page_to_pfn(pages[idx])));

order = vm_shift(PAGE_SIZE << order) - PAGE_SHIFT;

Where vm_shift() is the helper I had used in my patch.
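A user-space model of the suggestion above follows. It assumes a 4K granule with a 64K cont-PTE size. vm_shift_sketch() is a hypothetical stand-in for the vm_shift() helper from Dev's patch, whose definition is not shown in this thread. ilog2_u() and ctz_ul() replace ilog2() and __ffs(). Because __ffs(0) is undefined, the sketch treats a zero PFN as aligned to every order.

```c
#include <assert.h>

#define PAGE_SHIFT     12
#define CONT_PTE_SHIFT 16   /* assumed: 16 x 4K */
#define PMD_SHIFT      21

/* floor(log2(n)) for n >= 1, like ilog2(). */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Index of the lowest set bit, like __ffs(); x must be non-zero. */
static unsigned int ctz_ul(unsigned long x)
{
	unsigned int r = 0;

	while (!(x & 1)) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Hypothetical vm_shift(): largest mapping level that fits in size. */
static unsigned int vm_shift_sketch(unsigned long long size)
{
	if (size >= (1ull << PMD_SHIFT))
		return PMD_SHIFT;
	if (size >= (1ull << CONT_PTE_SHIFT))
		return CONT_PTE_SHIFT;
	return PAGE_SHIFT;
}

/* Batch order limited by run length, PFN alignment and mapping levels. */
static unsigned int batch_order_aligned(unsigned int nr_contig,
					unsigned long pfn)
{
	unsigned int order = ilog2_u(nr_contig);

	if (pfn && ctz_ul(pfn) < order)   /* clamp to the PFN's alignment */
		order = ctz_ul(pfn);
	return vm_shift_sketch((1ull << PAGE_SHIFT) << order) - PAGE_SHIFT;
}
```

Consider 2^8 contiguous pages starting at PFN 0x1010, which is aligned to only 2^4. An IS_ALIGNED(pfn, 1 << order) check rejects this run and falls back to order 0. Clamping by the PFN's trailing zeros instead keeps a 64K cont-PTE batch. In this model the PMD level is checked separately from cont-PTE, so a PMD batch would remain possible even where arch_vmap_pte_supported_shift() always returns PAGE_SHIFT.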

>>> +
>>> +     return order;
>>> +}
>>> +
>>> +static int vmap_batched(unsigned long addr, unsigned long end,
>>> +             pgprot_t prot, struct page **pages)
>>> +{
>>> +     unsigned int count = (end - addr) >> PAGE_SHIFT;
>>> +     unsigned int prev_shift = 0, idx = 0;
>>> +     unsigned long start = addr, map_addr = addr;
>>> +     int err;
>>> +
>>> +     err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
>>> +                                             PAGE_SHIFT, GFP_KERNEL);
>>> +     if (err)
>>> +             goto out;
>>> +
>>> +     for (unsigned int i = 0; i < count; ) {
>>> +             unsigned int shift = PAGE_SHIFT +
>>> +                     get_vmap_batch_order(pages, count - i, i);
>>> +
>>> +             if (!i)
>>> +                     prev_shift = shift;
>>> +
>>> +             if (shift != prev_shift) {
>>> +                     err = vmap_pages_range_noflush_walk(map_addr, addr,
>>
>> It would be worth documenting vmap_pages_range_noflush_walk() that
>> it can take an array of pages which are not all contiguous, but it
>> may have contiguous chunks, as hinted by page_shift.
>>
>> Otherwise this looks good.
>>
>>> +                                     prot, pages + idx,
>>> +                                     min(prev_shift, PMD_SHIFT));
>>> +                     if (err)
>>> +                             goto out;
>>> +                     prev_shift = shift;
>>> +                     map_addr = addr;
>>> +                     idx = i;
>>> +             }
>>> +
>>> +             /*
>>> +              * Once small pages are encountered, the remaining pages
>>> +              * are likely small as well.
>>> +              */
>>> +             if (shift == PAGE_SHIFT)
>>> +                     break;
>>> +
>>> +             addr += 1UL << shift;
>>> +             i += 1U << (shift - PAGE_SHIFT);
>>> +     }
>>> +
>>> +     /* Remaining */
>>> +     if (map_addr < end)
>>> +             err = vmap_pages_range_noflush_walk(map_addr, end,
>>> +                             prot, pages + idx, min(prev_shift, PMD_SHIFT));
>>> +
>>> +out:
>>> +     flush_cache_vmap(start, end);
>>> +     return err;
>>> +}
>>> +
>>>  /**
>>>   * vmap - map an array of pages into virtually contiguous space
>>>   * @pages: array of page pointers
>>> @@ -3585,8 +3663,8 @@ void *vmap(struct page **pages, unsigned int count,
>>>               return NULL;
>>>
>>>       addr = (unsigned long)area->addr;
>>> -     if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>>> -                             pages, PAGE_SHIFT) < 0) {
>>> +     if (vmap_batched(addr, addr + size, pgprot_nx(prot),
>>> +                             pages) < 0) {
>>>               vunmap(area->addr);
>>>               return NULL;
>>>       }
>>




* Re: [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
  2026-05-22  5:31 ` [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
@ 2026-06-01 17:34   ` Uladzislau Rezki
  2026-06-02  7:45     ` Wen Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Uladzislau Rezki @ 2026-06-01 17:34 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Fri, May 22, 2026 at 01:31:43PM +0800, Wen Jiang wrote:
> Extract the common PTE mapping logic from vmap_pte_range() into a
> shared helper vmap_set_ptes(). This handles both CONT_PTE and regular
> PTE mappings in a single function, preparing for the next patch which
> will extend vmap_pages_pte_range() to also use this helper.
> 
> The #ifdef CONFIG_HUGETLB_PAGE guard is moved inside vmap_set_ptes(),
> so callers no longer need to handle the conditional compilation.
> 
> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> ---
>  mm/vmalloc.c | 49 ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 34 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2c2f74a07f396..53fd4ee460ea4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -91,6 +91,35 @@ struct vfree_deferred {
>  static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
>  
>  /*** Page table manipulation functions ***/
> +
> +/*
> + * Set PTE mappings for the given PFN. Try CONT_PTE mappings first when
> + * supported, otherwise fall back to PAGE_SIZE mappings.
> + *
> + * Return: mapping size.
> + */
> +static __always_inline unsigned long vmap_set_ptes(pte_t *pte,
> +		unsigned long addr, unsigned long end, u64 pfn,
> +		pgprot_t prot, unsigned int max_page_shift)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	if (max_page_shift > PAGE_SHIFT) {
> +		unsigned long size;
> +
> +		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
> +		if (size != PAGE_SIZE) {
> +			pte_t entry = pfn_pte(pfn, prot);
> +
> +			entry = arch_make_huge_pte(entry, ilog2(size), 0);
> +			set_huge_pte_at(&init_mm, addr, pte, entry, size);
> +			return size;
> +		}
> +	}
> +#endif
> +	set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> +	return PAGE_SIZE;
> +}
> +
>  static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			phys_addr_t phys_addr, pgprot_t prot,
>  			unsigned int max_page_shift, pgtbl_mod_mask *mask)
> @@ -98,7 +127,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *pte;
>  	u64 pfn;
>  	struct page *page;
> -	unsigned long size = PAGE_SIZE;
> +	unsigned long size;
> +	unsigned int steps;
>  
>  	if (WARN_ON_ONCE(!PAGE_ALIGNED(end - addr)))
>  		return -EINVAL;
> @@ -119,20 +149,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			BUG();
>  		}
>  
> -#ifdef CONFIG_HUGETLB_PAGE
> -		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
> -		if (size != PAGE_SIZE) {
> -			pte_t entry = pfn_pte(pfn, prot);
> -
> -			entry = arch_make_huge_pte(entry, ilog2(size), 0);
> -			set_huge_pte_at(&init_mm, addr, pte, entry, size);
> -			pfn += PFN_DOWN(size);
> -			continue;
> -		}
> -#endif
> -		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> -		pfn++;
> -	} while (pte += PFN_DOWN(size), addr += size, addr != end);
> +		size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
> +		steps = PFN_DOWN(size);
> +	} while (pte += steps, pfn += steps, addr += size, addr != end);
>  
>  	lazy_mmu_mode_disable();
>  	*mask |= PGTBL_PTE_MODIFIED;
> -- 
> 2.34.1
> 
IMO, this patch should only add the helper, with no functional change,
and a second patch should then extend it.

Otherwise you are adding a helper that has already been slightly modified.

--
Uladzislau Rezki
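The stepping logic shared through vmap_set_ptes() can be modelled outside the kernel. The sketch assumes arm64 with a 4K granule and 64K cont-PTE blocks. map_size() is a simplified version of the conditions arm64's arch_vmap_pte_range_map_size() checks: the shift, the remaining length, and virtual and physical alignment. It returns at most one CONT_PTE_SIZE block, so the multi-block batching from patches 1-2 is not modelled. walk_ptes() follows the rewritten vmap_pte_range() loop, in which pte and pfn both advance by PFN_DOWN(size). All names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT     12
#define PAGE_SIZE      (1ul << PAGE_SHIFT)
#define CONT_PTE_SHIFT 16                      /* assumed: 16 x 4K = 64K */
#define CONT_PTE_SIZE  (1ul << CONT_PTE_SHIFT)

struct pte_stats { unsigned int calls, cont_calls, ptes; };

/* Simplified model of arm64's arch_vmap_pte_range_map_size(). */
static unsigned long map_size(unsigned long addr, unsigned long end,
			      unsigned long pfn, unsigned int max_page_shift)
{
	if (max_page_shift < CONT_PTE_SHIFT)
		return PAGE_SIZE;
	if (end - addr < CONT_PTE_SIZE)
		return PAGE_SIZE;
	if (addr & (CONT_PTE_SIZE - 1))
		return PAGE_SIZE;
	if ((pfn << PAGE_SHIFT) & (CONT_PTE_SIZE - 1))
		return PAGE_SIZE;
	return CONT_PTE_SIZE;
}

/* Models the vmap_pte_range() loop built on vmap_set_ptes(). */
static struct pte_stats walk_ptes(unsigned long addr, unsigned long end,
				  unsigned long pfn, unsigned int max_page_shift)
{
	struct pte_stats st = { 0, 0, 0 };

	do {
		unsigned long size = map_size(addr, end, pfn, max_page_shift);
		unsigned long steps = size >> PAGE_SHIFT;   /* PFN_DOWN(size) */

		st.calls++;                 /* one vmap_set_ptes() call */
		if (size != PAGE_SIZE)
			st.cont_calls++;
		st.ptes += steps;           /* PTEs written by this call */
		pfn += steps;
		addr += size;
	} while (addr != end);
	return st;
}
```

Take 34 pages at a 64K-aligned address and PFN with a cont-PTE shift. The model needs four calls: two 64K blocks, then two single pages for the 8K tail. With PAGE_SHIFT, or with a misaligned PFN, every page takes its own call. In this model the max_page_shift > PAGE_SHIFT guard in vmap_set_ptes() is only a shortcut, because map_size() already returns PAGE_SIZE for small shifts.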



* Re: [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible
  2026-05-29  5:57       ` Dev Jain
@ 2026-06-02  7:34         ` Wen Jiang
  0 siblings, 0 replies; 26+ messages in thread
From: Wen Jiang @ 2026-06-02  7:34 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Fri, 29 May 2026 at 13:57, Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 28/05/26 9:12 am, Wen Jiang wrote:
> > On Wed, 27 May 2026 at 16:28, Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 22/05/26 11:01 am, Wen Jiang wrote:
> >>> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> >>>
> >>> In many cases, the pages passed to vmap() may include high-order
> >>> pages. For example, the systemheap often allocates pages in descending
> >>> order: order 8, then 4, then 0. Currently, vmap() iterates over every
> >>> page individually—even pages inside a high-order block are handled
> >>> one by one.
> >>>
> >>> This patch detects physically contiguous pages (regardless of whether
> >>> they are compound or non-compound) by scanning with
> >>> num_pages_contiguous(), and maps them as a single contiguous block
> >>> whenever possible. The first page's pfn must be aligned to the
> >>> mapping order for the batched mapping to be used.
> >>>
> >>> Pages with the same page_shift are coalesced and mapped via
> >>> vmap_pages_range_noflush_walk() to avoid page table rewalk.
> >>>
> >>> As users typically allocate memory in descending orders (e.g.
> >>> 8 → 4 → 0), once an order-0 page is encountered, we stop scanning
> >>> for contiguous pages since subsequent pages are likely order-0 as well.
> >>>
> >>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> >>> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> >>> ---
> >>>  mm/vmalloc.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >>>  1 file changed, 80 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >>> index deb764abc0571..50642246f4d40 100644
> >>> --- a/mm/vmalloc.c
> >>> +++ b/mm/vmalloc.c
> >>> @@ -3542,6 +3542,84 @@ void vunmap(const void *addr)
> >>>  }
> >>>  EXPORT_SYMBOL(vunmap);
> >>>
> >>> +static inline int get_vmap_batch_order(struct page **pages,
> >>> +             unsigned int max_steps, unsigned int idx)
> >>> +{
> >>> +     unsigned int nr_contig;
> >>> +     int order;
> >>> +
> >>> +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
> >>> +                     ioremap_max_page_shift == PAGE_SHIFT)
> >>
> >>
> >> Why bail out on ioremap_max_page_shift == PAGE_SHIFT? The code
> >> path for ioremap is different from vmap, right?
> >>
> >>
> >
> > ioremap_max_page_shift is under CONFIG_HAVE_ARCH_HUGE_VMAP which
> > controls both ioremap and vmap huge mappings.
>
> I don't get it. So with this patch, if nohugeiomap is passed on the kernel
> cmdline, huge vmap is also disabled. That does not sound correct.
> Currently ioremap_max_page_shift plays no role in the normal
> vmap code path. It is only involved in ioremap_page_range().

You're right, the vmap path should not be affected by ioremap settings. I'll remove the check.

> >>> +             return 0;
> >>> +
> >>> +     nr_contig = num_pages_contiguous(&pages[idx], max_steps);
> >>> +     if (nr_contig < 2)
> >>> +             return 0;
> >>> +
> >>> +     order = fls(nr_contig) - 1;
> >>> +
> >>> +     if (arch_vmap_pte_supported_shift(PAGE_SIZE << order) == PAGE_SHIFT)
> >>> +             return 0;
>
> Also, for arches where this function does not do anything special
> (i.e return PAGE_SHIFT), we will effectively not do any huge mappings
> for them.

Agreed. I'll add PMD-level handling here, based on your vm_shift() helper.

> >>> +
> >>> +     /* Ensure the first page's pfn is aligned to the order */
> >>> +     if (!IS_ALIGNED(page_to_pfn(pages[idx]), 1 << order))
> >>> +             return 0;
>
> This condition is a bit fragile. It may happen that we have, say, 2^8
> contiguous pages, but they are aligned to only 2^4. We are operating
> random subrange of the array.
>
> I think the purpose of these checks is to bail out early if the arch
> does not support huge mappings or the alignment is not correct,
> instead of finding this out deep inside vmap_pages_range_noflush_walk().
>
> So you could do something like (completely untested and may miss some edge cases):
>
> order = ilog2(nr_contig);
>
> order = min(order, __ffs(page_to_pfn(pages[idx])));
>
> order = vm_shift(PAGE_SIZE << order) - PAGE_SHIFT;
>
> Where vm_shift() is the helper I had used in my patch.
>

Will adopt this approach in v4.

Thanks for the thorough review!
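For reference, the order computation Dev suggests can be exercised in user space. This is a model, not kernel code, and every name in it is hypothetical: model_nr_contig() stands in for num_pages_contiguous(), and model_vm_shift() stands in for the vm_shift() helper from Dev's patch, whose exact definition is not quoted in this thread. It assumes 4K pages, 64K CONT_PTE blocks and 2M PMDs. Because the kernel's __ffs() is undefined for 0, the sketch special-cases pfn 0.

```c
#include <assert.h>

/* Assumed arm64-like geometry with 4K pages; all names are hypothetical. */
#define MODEL_PAGE_SHIFT	12u
#define MODEL_CONT_PTE_SHIFT	16u	/* 16 contiguous 4K PTEs */
#define MODEL_PMD_SHIFT		21u

/* Stand-in for num_pages_contiguous(): length of the run pfn, pfn + 1, ... */
static unsigned int model_nr_contig(const unsigned long *pfns, unsigned int max)
{
	unsigned int n = 1;

	assert(max > 0);
	while (n < max && pfns[n] == pfns[0] + n)
		n++;
	return n;
}

/* floor(log2(x)) for x > 0, like the kernel's ilog2(). */
static unsigned int model_ilog2(unsigned long x)
{
	unsigned int r = 0;

	assert(x > 0);
	while (x >>= 1)
		r++;
	return r;
}

/* Hypothetical vm_shift() stand-in: largest supported shift that fits in size. */
static unsigned int model_vm_shift(unsigned long size)
{
	if (size >= (1UL << MODEL_PMD_SHIFT))
		return MODEL_PMD_SHIFT;
	if (size >= (1UL << MODEL_CONT_PTE_SHIFT))
		return MODEL_CONT_PTE_SHIFT;
	return MODEL_PAGE_SHIFT;
}

/*
 * order = ilog2(nr_contig), capped by the first pfn's natural alignment,
 * then rounded down to a mapping size the (modelled) arch supports.
 */
static unsigned int model_batch_order(const unsigned long *pfns, unsigned int max)
{
	unsigned int order = model_ilog2(model_nr_contig(pfns, max));
	unsigned long pfn = pfns[0];
	unsigned int align = 0;

	/* Trailing zero bits, i.e. what __ffs() returns for a non-zero pfn. */
	if (pfn) {
		while (!(pfn & 1)) {
			pfn >>= 1;
			align++;
		}
		if (align < order)
			order = align;
	}
	return model_vm_shift((1UL << MODEL_PAGE_SHIFT) << order) - MODEL_PAGE_SHIFT;
}
```

With 256 contiguous pages starting at pfn 0x10 (aligned to only 2^4), the model returns order 4, i.e. one 64K CONT_PTE batch, whereas the v3 IS_ALIGNED() check returns 0 and the pages end up mapped one by one.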

> >>> +
> >>> +     return order;
> >>> +}
> >>> +
> >>> +static int vmap_batched(unsigned long addr, unsigned long end,
> >>> +             pgprot_t prot, struct page **pages)
> >>> +{
> >>> +     unsigned int count = (end - addr) >> PAGE_SHIFT;
> >>> +     unsigned int prev_shift = 0, idx = 0;
> >>> +     unsigned long start = addr, map_addr = addr;
> >>> +     int err;
> >>> +
> >>> +     err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> >>> +                                             PAGE_SHIFT, GFP_KERNEL);
> >>> +     if (err)
> >>> +             goto out;
> >>> +
> >>> +     for (unsigned int i = 0; i < count; ) {
> >>> +             unsigned int shift = PAGE_SHIFT +
> >>> +                     get_vmap_batch_order(pages, count - i, i);
> >>> +
> >>> +             if (!i)
> >>> +                     prev_shift = shift;
> >>> +
> >>> +             if (shift != prev_shift) {
> >>> +                     err = vmap_pages_range_noflush_walk(map_addr, addr,
> >>
> >> It would be worth documenting in vmap_pages_range_noflush_walk() that
> >> it can take an array of pages which are not all contiguous but may
> >> contain contiguous chunks, as hinted by page_shift.
> >>
> >> Otherwise this looks good.
> >>
> >>> +                                     prot, pages + idx,
> >>> +                                     min(prev_shift, PMD_SHIFT));
> >>> +                     if (err)
> >>> +                             goto out;
> >>> +                     prev_shift = shift;
> >>> +                     map_addr = addr;
> >>> +                     idx = i;
> >>> +             }
> >>> +
> >>> +             /*
> >>> +              * Once small pages are encountered, the remaining pages
> >>> +              * are likely small as well.
> >>> +              */
> >>> +             if (shift == PAGE_SHIFT)
> >>> +                     break;
> >>> +
> >>> +             addr += 1UL << shift;
> >>> +             i += 1U << (shift - PAGE_SHIFT);
> >>> +     }
> >>> +
> >>> +     /* Remaining */
> >>> +     if (map_addr < end)
> >>> +             err = vmap_pages_range_noflush_walk(map_addr, end,
> >>> +                             prot, pages + idx, min(prev_shift, PMD_SHIFT));
> >>> +
> >>> +out:
> >>> +     flush_cache_vmap(start, end);
> >>> +     return err;
> >>> +}
> >>> +
> >>>  /**
> >>>   * vmap - map an array of pages into virtually contiguous space
> >>>   * @pages: array of page pointers
> >>> @@ -3585,8 +3663,8 @@ void *vmap(struct page **pages, unsigned int count,
> >>>               return NULL;
> >>>
> >>>       addr = (unsigned long)area->addr;
> >>> -     if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> >>> -                             pages, PAGE_SHIFT) < 0) {
> >>> +     if (vmap_batched(addr, addr + size, pgprot_nx(prot),
> >>> +                             pages) < 0) {
> >>>               vunmap(area->addr);
> >>>               return NULL;
> >>>       }
> >>
>
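The batching loop in vmap_batched() groups consecutive batches that share a shift and hands each group to vmap_pages_range_noflush_walk(), stopping the scan at the first order-0 batch. The user-space model below replays that control flow on page indices rather than addresses so the resulting walker calls are easy to inspect. It is a sketch under stated assumptions: order_at[] is a hypothetical precomputed stand-in for get_vmap_batch_order(), and model_seg / model_vmap_batched are made-up names.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12u
#define MODEL_PMD_SHIFT		21u

/* One call to the page-table walker: pages [first, first + nr) at shift. */
struct model_seg {
	unsigned int first;
	unsigned int nr;
	unsigned int shift;
};

/* Mirrors min(prev_shift, PMD_SHIFT) in the patch. */
static unsigned int model_clamp_shift(unsigned int shift)
{
	return shift < MODEL_PMD_SHIFT ? shift : MODEL_PMD_SHIFT;
}

/*
 * order_at[i] is the batch order the scan would compute at index i.
 * segs must have room for every segment; returns how many were written.
 */
static unsigned int model_vmap_batched(const unsigned int *order_at,
		unsigned int count, struct model_seg *segs)
{
	unsigned int prev_shift = 0, idx = 0, nsegs = 0, i = 0;

	assert(count > 0);
	while (i < count) {
		unsigned int shift = MODEL_PAGE_SHIFT + order_at[i];

		if (i == 0)
			prev_shift = shift;
		if (shift != prev_shift) {
			/* Flush the run of equal-shift batches seen so far. */
			segs[nsegs++] = (struct model_seg){ idx, i - idx,
					model_clamp_shift(prev_shift) };
			prev_shift = shift;
			idx = i;
		}
		/* First order-0 batch: assume the rest are order-0 as well. */
		if (shift == MODEL_PAGE_SHIFT)
			break;
		i += 1u << (shift - MODEL_PAGE_SHIFT);
	}
	/* "Remaining": everything from idx to the end uses prev_shift. */
	if (idx < count)
		segs[nsegs++] = (struct model_seg){ idx, count - idx,
				model_clamp_shift(prev_shift) };
	return nsegs;
}
```

For a descending 8 → 4 → 0 layout where the arch limits each batch to a 64K CONT_PTE block (order 4 at every 16th index for the first 272 pages, then three order-0 pages), the model produces two walker calls: 272 pages at shift 16, then the trailing 3 pages at PAGE_SHIFT.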


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic
  2026-06-01 17:34   ` Uladzislau Rezki
@ 2026-06-02  7:45     ` Wen Jiang
  0 siblings, 0 replies; 26+ messages in thread
From: Wen Jiang @ 2026-06-02  7:45 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, baohua,
	Xueyuan.chen21, dev.jain, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Tue, 2 Jun 2026 at 01:34, Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, May 22, 2026 at 01:31:43PM +0800, Wen Jiang wrote:
> > Extract the common PTE mapping logic from vmap_pte_range() into a
> > shared helper vmap_set_ptes(). This handles both CONT_PTE and regular
> > PTE mappings in a single function, preparing for the next patch which
> > will extend vmap_pages_pte_range() to also use this helper.
> >
> > The #ifdef CONFIG_HUGETLB_PAGE guard is moved inside vmap_set_ptes(),
> > so callers no longer need to handle the conditional compilation.
> >
> > Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> > Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> > ---
> >  mm/vmalloc.c | 49 ++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 34 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 2c2f74a07f396..53fd4ee460ea4 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -91,6 +91,35 @@ struct vfree_deferred {
> >  static DEFINE_PER_CPU(struct vfree_deferred, vfree_deferred);
> >
> >  /*** Page table manipulation functions ***/
> > +
> > +/*
> > + * Set PTE mappings for the given PFN. Try CONT_PTE mappings first when
> > + * supported, otherwise fall back to PAGE_SIZE mappings.
> > + *
> > + * Return: mapping size.
> > + */
> > +static __always_inline unsigned long vmap_set_ptes(pte_t *pte,
> > +             unsigned long addr, unsigned long end, u64 pfn,
> > +             pgprot_t prot, unsigned int max_page_shift)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +     if (max_page_shift > PAGE_SHIFT) {
> > +             unsigned long size;
> > +
> > +             size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
> > +             if (size != PAGE_SIZE) {
> > +                     pte_t entry = pfn_pte(pfn, prot);
> > +
> > +                     entry = arch_make_huge_pte(entry, ilog2(size), 0);
> > +                     set_huge_pte_at(&init_mm, addr, pte, entry, size);
> > +                     return size;
> > +             }
> > +     }
> > +#endif
> > +     set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> > +     return PAGE_SIZE;
> > +}
> > +
> >  static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >                       phys_addr_t phys_addr, pgprot_t prot,
> >                       unsigned int max_page_shift, pgtbl_mod_mask *mask)
> > @@ -98,7 +127,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >       pte_t *pte;
> >       u64 pfn;
> >       struct page *page;
> > -     unsigned long size = PAGE_SIZE;
> > +     unsigned long size;
> > +     unsigned int steps;
> >
> >       if (WARN_ON_ONCE(!PAGE_ALIGNED(end - addr)))
> >               return -EINVAL;
> > @@ -119,20 +149,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >                       BUG();
> >               }
> >
> > -#ifdef CONFIG_HUGETLB_PAGE
> > -             size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
> > -             if (size != PAGE_SIZE) {
> > -                     pte_t entry = pfn_pte(pfn, prot);
> > -
> > -                     entry = arch_make_huge_pte(entry, ilog2(size), 0);
> > -                     set_huge_pte_at(&init_mm, addr, pte, entry, size);
> > -                     pfn += PFN_DOWN(size);
> > -                     continue;
> > -             }
> > -#endif
> > -             set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> > -             pfn++;
> > -     } while (pte += PFN_DOWN(size), addr += size, addr != end);
> > +             size = vmap_set_ptes(pte, addr, end, pfn, prot, max_page_shift);
> > +             steps = PFN_DOWN(size);
> > +     } while (pte += steps, pfn += steps, addr += size, addr != end);
> >
> >       lazy_mmu_mode_disable();
> >       *mask |= PGTBL_PTE_MODIFIED;
> > --
> > 2.34.1
> >
> IMO, we should first add just the helper, with "no functional change", and a
> second patch should extend it.
>
> Otherwise you are adding a helper which has already been slightly modified.
>
> --
> Uladzislau Rezki

Makes sense, I will split it; the steps/loop refactoring will move into patch 4.

Thanks,
Wen
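One way to sanity-check that a pure extraction of vmap_set_ptes() keeps the loop's behaviour is to model the per-step size decision and the step accounting in user space. The sketch assumes 4K pages and 64K CONT_PTE blocks; model_map_size() paraphrases the checks in arm64's arch_vmap_pte_range_map_size() as I understand them (enough room left, plus natural alignment of both VA and PA), and all model_* names are hypothetical.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12u
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_CONT_PTE_SHIFT	16u
#define MODEL_CONT_PTE_SIZE	(1UL << MODEL_CONT_PTE_SHIFT)

/* Pick CONT_PTE_SIZE only if the block fits and VA and PA are both aligned. */
static unsigned long model_map_size(unsigned long addr, unsigned long end,
		unsigned long pfn, unsigned int max_page_shift)
{
	if (max_page_shift < MODEL_CONT_PTE_SHIFT)
		return MODEL_PAGE_SIZE;
	if (end - addr < MODEL_CONT_PTE_SIZE)
		return MODEL_PAGE_SIZE;
	if (addr & (MODEL_CONT_PTE_SIZE - 1))
		return MODEL_PAGE_SIZE;
	if ((pfn << MODEL_PAGE_SHIFT) & (MODEL_CONT_PTE_SIZE - 1))
		return MODEL_PAGE_SIZE;
	return MODEL_CONT_PTE_SIZE;
}

/*
 * Mirror of the refactored do/while: each step advances pfn and addr
 * together by the size just mapped. Returns the number of "set" calls
 * and reports how many of them were CONT_PTE blocks.
 */
static unsigned int model_count_set_calls(unsigned long addr, unsigned long end,
		unsigned long pfn, unsigned int max_page_shift, unsigned int *nr_cont)
{
	unsigned int calls = 0;

	assert(addr < end && !((end - addr) & (MODEL_PAGE_SIZE - 1)));
	*nr_cont = 0;
	do {
		unsigned long size = model_map_size(addr, end, pfn, max_page_shift);
		unsigned long steps = size >> MODEL_PAGE_SHIFT;	/* PFN_DOWN(size) */

		if (size != MODEL_PAGE_SIZE)
			(*nr_cont)++;
		calls++;
		pfn += steps;
		addr += size;
	} while (addr != end);
	return calls;
}
```

A 128K range with 64K-aligned VA and PA needs two CONT_PTE steps; the same range with max_page_shift == PAGE_SHIFT, or with the PA one page off, needs 32 single-page steps.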


* Re: [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings
  2026-05-27  6:25   ` Dev Jain
@ 2026-06-02  8:57     ` Wen Jiang
  0 siblings, 0 replies; 26+ messages in thread
From: Wen Jiang @ 2026-06-02  8:57 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Wed, 27 May 2026 at 14:25, Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 22/05/26 11:01 am, Wen Jiang wrote:
> > From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> >
> > Try to align the vmap virtual address to PMD_SHIFT or a
> > larger PTE mapping size hinted by the architecture, so
> > contiguous pages can be batch-mapped when setting PMD or
> > PTE entries.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> > Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
>
>
> Hmm, okay. I would have preferred to squash this into the previous patch, but
> the correctness of the previous patch does not rely on this, so it's fine.
>
> > ---
> >  mm/vmalloc.c | 33 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 50642246f4d40..040d400928aab 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3620,6 +3620,37 @@ static int vmap_batched(unsigned long addr, unsigned long end,
> >       return err;
> >  }
> >
>
> This is screaming for a helper :)
>
> > +static struct vm_struct *get_aligned_vm_area(unsigned long size,
> > +             unsigned long flags, const void *caller)
>
>
> Call this vmap_get_aligned_vm_area, then ...
>
>

Will rename get_aligned_vm_area() to vmap_get_aligned_vm_area().

> > +{
> > +     struct vm_struct *vm_area;
> > +     unsigned int shift;
> > +
> > +     /* Try PMD alignment for large sizes */
> > +     if (size >= PMD_SIZE) {
> > +             vm_area = __get_vm_area_node(size, PMD_SIZE, PAGE_SHIFT, flags,
> > +                             VMALLOC_START, VMALLOC_END,
> > +                             NUMA_NO_NODE, GFP_KERNEL, caller);
>
> Add a wrapper over this called __get_vm_area_node_aligned_caller, which can
> call __get_vm_area_node() with all other arguments fixed, except "align".
>

Will add __get_vm_area_node_aligned_caller() and send it in v4.

Thanks,
Wen
> > +             if (vm_area)
> > +                     return vm_area;
> > +     }
> > +
> > +     /* Try CONT_PTE alignment */
> > +     shift = arch_vmap_pte_supported_shift(size);
> > +     if (shift > PAGE_SHIFT) {
> > +             vm_area = __get_vm_area_node(size, 1UL << shift, PAGE_SHIFT, flags,
> > +                             VMALLOC_START, VMALLOC_END,
> > +                             NUMA_NO_NODE, GFP_KERNEL, caller);
> > +             if (vm_area)
> > +                     return vm_area;
> > +     }
> > +
> > +     /* Fall back to page alignment */
> > +     return __get_vm_area_node(size, PAGE_SIZE, PAGE_SHIFT, flags,
> > +                     VMALLOC_START, VMALLOC_END,
> > +                     NUMA_NO_NODE, GFP_KERNEL, caller);
> > +}
> > +
> >  /**
> >   * vmap - map an array of pages into virtually contiguous space
> >   * @pages: array of page pointers
> > @@ -3658,7 +3689,7 @@ void *vmap(struct page **pages, unsigned int count,
> >               return NULL;
> >
> >       size = (unsigned long)count << PAGE_SHIFT;
> > -     area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> > +     area = get_aligned_vm_area(size, flags, __builtin_return_address(0));
> >       if (!area)
> >               return NULL;
> >
>
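The fallback order in get_aligned_vm_area() (PMD alignment for large sizes, then the arch's CONT_PTE hint, then PAGE_SIZE) can be illustrated with a toy allocator over a single free VA window. This is a hypothetical model: model_get_area_aligned() plays the role of the suggested __get_vm_area_node_aligned_caller(), where only the alignment varies, but its real signature is not given in this thread. model_pte_supported_shift() assumes the arm64 behaviour of returning CONT_PTE_SHIFT once the size reaches 64K.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12u
#define MODEL_CONT_PTE_SHIFT	16u
#define MODEL_PMD_SHIFT		21u

/* Toy VA space: one free window [lo, hi). */
struct model_va_space {
	unsigned long lo, hi;
};

/* Everything fixed except the alignment; returns 0 on failure. */
static unsigned long model_get_area_aligned(const struct model_va_space *vs,
		unsigned long size, unsigned long align)
{
	unsigned long start;

	assert(align && !(align & (align - 1)));	/* power of two */
	start = (vs->lo + align - 1) & ~(align - 1);
	if (start < vs->lo || start > vs->hi || vs->hi - start < size)
		return 0;
	return start;
}

/* Assumed arm64-like arch_vmap_pte_supported_shift(). */
static unsigned int model_pte_supported_shift(unsigned long size)
{
	return size >= (1UL << MODEL_CONT_PTE_SHIFT) ?
		MODEL_CONT_PTE_SHIFT : MODEL_PAGE_SHIFT;
}

/* Same fallback order as the patch: PMD, then CONT_PTE, then PAGE_SIZE. */
static unsigned long model_vmap_get_aligned_area(const struct model_va_space *vs,
		unsigned long size)
{
	unsigned int shift;
	unsigned long va;

	if (size >= (1UL << MODEL_PMD_SHIFT)) {
		va = model_get_area_aligned(vs, size, 1UL << MODEL_PMD_SHIFT);
		if (va)
			return va;
	}
	shift = model_pte_supported_shift(size);
	if (shift > MODEL_PAGE_SHIFT) {
		va = model_get_area_aligned(vs, size, 1UL << shift);
		if (va)
			return va;
	}
	return model_get_area_aligned(vs, size, 1UL << MODEL_PAGE_SHIFT);
}
```

In a 4M window starting at 0x11000, a 4M request can be neither PMD- nor 64K-aligned, so it lands on the final page-aligned attempt, which is the case the last __get_vm_area_node() call in the patch covers.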


* Re: [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-05-28  3:39     ` Wen Jiang
  2026-05-29  5:28       ` Dev Jain
@ 2026-06-05  6:02       ` Dev Jain
  2026-06-08  6:25         ` Wen Jiang
  1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-06-05  6:02 UTC (permalink / raw)
  To: Wen Jiang
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6



On 28/05/26 9:09 am, Wen Jiang wrote:
> On Wed, 27 May 2026 at 13:59, Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 22/05/26 11:01 am, Wen Jiang wrote:
>> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
>>
>> vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
>> provides a clean interface by taking struct page **pages and mapping them
>> via direct PTE iteration. This avoids the page table rewalk seen when
>> using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
>>
>> Extend it to support larger page_shift values, and add PMD- and
>> contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
>> since it now handles more than just small pages.
>>
>> For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
>> iterate over pages one by one via vmap_range_noflush(), which would
>> otherwise lead to page table rewalk. The code is now unified with the
>> PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
>>
>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
>> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
>> ---
>>  mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
>>  1 file changed, 40 insertions(+), 31 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 53fd4ee460ea4..deb764abc0571 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
>>
>>  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>> -             pgtbl_mod_mask *mask)
>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>  {
>> +     unsigned long pfn, size;
>> +     unsigned int steps;
>>       int err = 0;
>>       pte_t *pte;
>>
>> @@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>>                       break;
>>               }
>>
>> -             set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
>> -             (*nr)++;
>> -     } while (pte++, addr += PAGE_SIZE, addr != end);
>> +             pfn = page_to_pfn(page);
>> +             size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
>> +             steps = PFN_DOWN(size);
>> +     } while (pte += steps, *nr += steps, addr += size, addr != end);
>>
>>       lazy_mmu_mode_disable();
>>       *mask |= PGTBL_PTE_MODIFIED;
>> @@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>>
>>  static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>> -             pgtbl_mod_mask *mask)
>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>  {
>>       pmd_t *pmd;
>>       unsigned long next;
>>> @@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>>>               return -ENOMEM;
>>>       do {
>>>               next = pmd_addr_end(addr, end);
>>> -             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
>>> +
>>> +             if (shift == PMD_SHIFT) {
>>> +                     struct page *page = pages[*nr];
>>> +                     phys_addr_t phys_addr;
>>> +
>>> +                     if (WARN_ON(!page))
>>> +                             return -ENOMEM;
>>> +                     if (WARN_ON(!pfn_valid(page_to_pfn(page))))
>>> +                             return -EINVAL;
>>
>>
>> So I know these !page and !pfn_valid checks have been copied from vmap_pages_pte_range,
>> but do they mean anything?
>>
>> I think pfn_valid() makes sense in that someone may take a random VA/PA, convert it into a struct
>> page and pass to vmap layer. But I don't see how anyone would pass page == NULL? At the
>> very least, returning ENOMEM does not make sense because the pages are not being
>> allocated by vmap() but have already been allocated.
> 
> Hi Dev,
> 
> vmap() is EXPORT_SYMBOL with many callers across drivers, each
> constructing the pages array differently. The !page check guards
> against malformed arrays at this API boundary.
> 
> The same -ENOMEM issue also exists in vmap_pages_pte_range().
> Should I fix both in this patchset or leave it as a separate cleanup?
> 
>>
>>> +
>>> +                     phys_addr = page_to_phys(page);
>>> +
>>> +                     if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
>>> +                                             shift)) {
>>> +                             *mask |= PGTBL_PMD_MODIFIED;
>>> +                             *nr += 1 << (shift - PAGE_SHIFT);
>>> +                             continue;
>>> +                     }
>>> +             }
>>> +
>>> +             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
>>>                       return -ENOMEM;
>>>       } while (pmd++, addr = next, addr != end);
>>>       return 0;
>>> @@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>>>
>>>  static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>>> -             pgtbl_mod_mask *mask)
>>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>>  {
>>>       pud_t *pud;
>>>       unsigned long next;
>>> @@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>>>               return -ENOMEM;
>>>       do {
>>>               next = pud_addr_end(addr, end);
>>> -             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
>>> +             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
>>>                       return -ENOMEM;
>>>       } while (pud++, addr = next, addr != end);
>>>       return 0;
>>> @@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>>>
>>>  static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
>>> -             pgtbl_mod_mask *mask)
>>> +             pgtbl_mod_mask *mask, unsigned int shift)
>>>  {
>>>       p4d_t *p4d;
>>>       unsigned long next;
>>> @@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
>>>               return -ENOMEM;
>>>       do {
>>>               next = p4d_addr_end(addr, end);
>>> -             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
>>> +             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
>>>                       return -ENOMEM;
>>>       } while (p4d++, addr = next, addr != end);
>>>       return 0;
>>>  }
>>>
>>> -static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>> -             pgprot_t prot, struct page **pages)
>>> +static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
>>> +             pgprot_t prot, struct page **pages, unsigned int shift)
>>>  {
>>>       unsigned long start = addr;
>>>       pgd_t *pgd;
>>> @@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>>               next = pgd_addr_end(addr, end);
>>>               if (pgd_bad(*pgd))
>>>                       mask |= PGTBL_PGD_MODIFIED;
>>> -             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
>>> +             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
>>>               if (err)
>>>                       break;
>>>       } while (pgd++, addr = next, addr != end);
>>> @@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>>  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>>               pgprot_t prot, struct page **pages, unsigned int page_shift)
>>>  {
>>> -     unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>>> -
>>>       WARN_ON(page_shift < PAGE_SHIFT);
>>>
>>> -     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>>> -                     page_shift == PAGE_SHIFT)
>>> -             return vmap_small_pages_range_noflush(addr, end, prot, pages);
>>> +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
>>> +             page_shift = PAGE_SHIFT;
>>>
>>> -     for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>>> -             int err;
>>> -
>>> -             err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>>> -                                     page_to_phys(pages[i]), prot,
>>> -                                     page_shift);
>>> -             if (err)
>>> -                     return err;
>>> -
>>> -             addr += 1UL << page_shift;
>>> -     }
>>> -
>>> -     return 0;
>>> +     return vmap_pages_range_noflush_walk(addr, end, prot, pages,
>>> +                     min(page_shift, PMD_SHIFT));
>>
>>
>> We can easily extend to PUD huge mappings, right? I'm not sure whether we
>> should keep everything symmetric with how vmap_range_noflush() operates
>> right now, since P4D mappings don't exist, but PUD looks worthwhile.
>>
> 
> PUD mapping requires 1GB of contiguous physical memory, but the buddy
> allocator's MAX_PAGE_ORDER is 10 (4MB on 4K pages). So page_shift
> passed to vmap_pages_range_noflush_walk() never exceeds PMD_SHIFT.
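The arithmetic behind this, assuming 4K pages and the default MAX_PAGE_ORDER of 10 cited above, is sketched below with hypothetical model_* names. Note that it only bounds what a single buddy allocation can return; it says nothing about callers that pass pages obtained some other way.

```c
#include <assert.h>

/* Assumed 4K-page geometry: a PMD maps 2M, a PUD maps 1G. */
#define MODEL_PAGE_SHIFT	12u
#define MODEL_PMD_SHIFT		21u
#define MODEL_PUD_SHIFT		30u
#define MODEL_MAX_PAGE_ORDER	10u	/* default value cited in the thread */

/* Largest physically contiguous chunk one buddy allocation can return. */
static unsigned long long model_max_buddy_bytes(void)
{
	return 1ULL << (MODEL_PAGE_SHIFT + MODEL_MAX_PAGE_ORDER);	/* 4 MiB */
}

/* Can a single buddy allocation back one block mapping of this shift? */
static int model_buddy_can_back(unsigned int map_shift)
{
	return (1ULL << map_shift) <= model_max_buddy_bytes();
}
```

A 2M PMD block (order 9) fits within the 4M limit, while a 1G PUD block (order 18) does not.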

Can we then just drop the min()? You can guard the try_huge_pmd with
shift >= PMD_SHIFT; the walker has the necessary ingredients to work
with a shift > PMD_SHIFT, so let us not add confusion with this min() truncation.
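The point about the guard can be illustrated with a small user-space model of the PMD level of the walk. model_try_huge_pmd() keeps only the shift, size and alignment conditions modelled on vmap_try_huge_pmd(); the arch-support and existing-PMD checks are omitted, and all names are hypothetical. With a shift >= PMD_SHIFT guard, a shift above PMD_SHIFT still yields PMD mappings for full, aligned slots and falls back to the PTE level elsewhere, so no clamp is needed at this level.

```c
#include <assert.h>

#define MODEL_PMD_SHIFT	21u
#define MODEL_PMD_SIZE	(1UL << MODEL_PMD_SHIFT)

/* Like pmd_addr_end(): next PMD boundary, clamped to end. */
static unsigned long model_pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + MODEL_PMD_SIZE) & ~(MODEL_PMD_SIZE - 1);

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Shift/size/alignment conditions only; arch and pmd_present checks omitted. */
static int model_try_huge_pmd(unsigned long addr, unsigned long end,
		unsigned long phys, unsigned int shift)
{
	if (shift < MODEL_PMD_SHIFT)
		return 0;
	if (end - addr != MODEL_PMD_SIZE)
		return 0;
	if ((addr | phys) & (MODEL_PMD_SIZE - 1))
		return 0;
	return 1;
}

/*
 * Walk [addr, end) one PMD slot at a time, with physically contiguous
 * backing: the PA of VA a is phys_base + (a - addr).
 */
static void model_pmd_walk(unsigned long addr, unsigned long end,
		unsigned long phys_base, unsigned int shift,
		unsigned int *nr_pmd, unsigned int *nr_pte_ranges)
{
	unsigned long start = addr, next;

	*nr_pmd = *nr_pte_ranges = 0;
	do {
		next = model_pmd_addr_end(addr, end);
		if (model_try_huge_pmd(addr, next, phys_base + (addr - start), shift))
			(*nr_pmd)++;
		else
			(*nr_pte_ranges)++;	/* would go to the PTE-level walker */
		addr = next;
	} while (addr != end);
}
```

Mapping 4M at a 1M offset into a PMD with shift 21 gives one PMD mapping in the middle and two PTE-level ranges at the edges; passing shift 30 instead gives the same PMD mappings for an aligned range.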

> 
> Thanks,
> Wen
>>>  }
>>>
>>>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>



* Re: [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk
  2026-06-05  6:02       ` Dev Jain
@ 2026-06-08  6:25         ` Wen Jiang
  0 siblings, 0 replies; 26+ messages in thread
From: Wen Jiang @ 2026-06-08  6:25 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
	baohua, Xueyuan.chen21, rppt, david, ryan.roberts,
	anshuman.khandual, ajd, linux-kernel, jiangwen6

On Fri, 5 Jun 2026 at 14:02, Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 28/05/26 9:09 am, Wen Jiang wrote:
> > On Wed, 27 May 2026 at 13:59, Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 22/05/26 11:01 am, Wen Jiang wrote:
> >> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> >>
> >> vmap_pages_range_noflush_walk() (formerly vmap_small_pages_range_noflush())
> >> provides a clean interface by taking struct page **pages and mapping them
> >> via direct PTE iteration. This avoids the page table rewalk seen when
> >> using vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
> >>
> >> Extend it to support larger page_shift values, and add PMD- and
> >> contiguous-PTE mappings as well. Rename it to vmap_pages_range_noflush_walk()
> >> since it now handles more than just small pages.
> >>
> >> For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer need to
> >> iterate over pages one by one via vmap_range_noflush(), which would
> >> otherwise lead to page table rewalk. The code is now unified with the
> >> PAGE_SHIFT case by simply calling vmap_pages_range_noflush_walk().
> >>
> >> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> >> Signed-off-by: Wen Jiang <jiangwen6@xiaomi.com>
> >> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> >> ---
> >>  mm/vmalloc.c | 71 +++++++++++++++++++++++++++++-----------------------
> >>  1 file changed, 40 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >> index 53fd4ee460ea4..deb764abc0571 100644
> >> --- a/mm/vmalloc.c
> >> +++ b/mm/vmalloc.c
> >> @@ -543,8 +543,10 @@ void vunmap_range(unsigned long addr, unsigned long end)
> >>
> >>  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> >> -             pgtbl_mod_mask *mask)
> >> +             pgtbl_mod_mask *mask, unsigned int shift)
> >>  {
> >> +     unsigned long pfn, size;
> >> +     unsigned int steps;
> >>       int err = 0;
> >>       pte_t *pte;
> >>
> >> @@ -575,9 +577,10 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >>                       break;
> >>               }
> >>
> >> -             set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> >> -             (*nr)++;
> >> -     } while (pte++, addr += PAGE_SIZE, addr != end);
> >> +             pfn = page_to_pfn(page);
> >> +             size = vmap_set_ptes(pte, addr, end, pfn, prot, shift);
> >> +             steps = PFN_DOWN(size);
> >> +     } while (pte += steps, *nr += steps, addr += size, addr != end);
> >>
> >>       lazy_mmu_mode_disable();
> >>       *mask |= PGTBL_PTE_MODIFIED;
> >> @@ -587,7 +590,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >>
> >>  static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> >>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> >> -             pgtbl_mod_mask *mask)
> >> +             pgtbl_mod_mask *mask, unsigned int shift)
> >>  {
> >>       pmd_t *pmd;
> >>       unsigned long next;
> >>> @@ -597,7 +600,27 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> >>>               return -ENOMEM;
> >>>       do {
> >>>               next = pmd_addr_end(addr, end);
> >>> -             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> >>> +
> >>> +             if (shift == PMD_SHIFT) {
> >>> +                     struct page *page = pages[*nr];
> >>> +                     phys_addr_t phys_addr;
> >>> +
> >>> +                     if (WARN_ON(!page))
> >>> +                             return -ENOMEM;
> >>> +                     if (WARN_ON(!pfn_valid(page_to_pfn(page))))
> >>> +                             return -EINVAL;
> >>
> >>
> >> So I know these !page and !pfn_valid checks have been copied from vmap_pages_pte_range,
> >> but do they mean anything?
> >>
> >> I think pfn_valid() makes sense in that someone may take a random VA/PA, convert it into a struct
> >> page and pass to vmap layer. But I don't see how anyone would pass page == NULL? At the
> >> very least, returning ENOMEM does not make sense because the pages are not being
> >> allocated by vmap() but have already been allocated.
> >
> > Hi Dev,
> >
> > vmap() is EXPORT_SYMBOL with many callers across drivers, each
> > constructing the pages array differently. The !page check guards
> > against malformed arrays at this API boundary.
> >
> > The same -ENOMEM issue also exists in vmap_pages_pte_range().
> > Should I fix both in this patchset or leave it as a separate cleanup?
> >
> >>
> >>> +
> >>> +                     phys_addr = page_to_phys(page);
> >>> +
> >>> +                     if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> >>> +                                             shift)) {
> >>> +                             *mask |= PGTBL_PMD_MODIFIED;
> >>> +                             *nr += 1 << (shift - PAGE_SHIFT);
> >>> +                             continue;
> >>> +                     }
> >>> +             }
> >>> +
> >>> +             if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
> >>>                       return -ENOMEM;
> >>>       } while (pmd++, addr = next, addr != end);
> >>>       return 0;
> >>> @@ -605,7 +628,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> >>>
> >>>  static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> >>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> >>> -             pgtbl_mod_mask *mask)
> >>> +             pgtbl_mod_mask *mask, unsigned int shift)
> >>>  {
> >>>       pud_t *pud;
> >>>       unsigned long next;
> >>> @@ -615,7 +638,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> >>>               return -ENOMEM;
> >>>       do {
> >>>               next = pud_addr_end(addr, end);
> >>> -             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> >>> +             if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
> >>>                       return -ENOMEM;
> >>>       } while (pud++, addr = next, addr != end);
> >>>       return 0;
> >>> @@ -623,7 +646,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> >>>
> >>>  static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> >>>               unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> >>> -             pgtbl_mod_mask *mask)
> >>> +             pgtbl_mod_mask *mask, unsigned int shift)
> >>>  {
> >>>       p4d_t *p4d;
> >>>       unsigned long next;
> >>> @@ -633,14 +656,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> >>>               return -ENOMEM;
> >>>       do {
> >>>               next = p4d_addr_end(addr, end);
> >>> -             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> >>> +             if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
> >>>                       return -ENOMEM;
> >>>       } while (p4d++, addr = next, addr != end);
> >>>       return 0;
> >>>  }
> >>>
> >>> -static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> >>> -             pgprot_t prot, struct page **pages)
> >>> +static int vmap_pages_range_noflush_walk(unsigned long addr, unsigned long end,
> >>> +             pgprot_t prot, struct page **pages, unsigned int shift)
> >>>  {
> >>>       unsigned long start = addr;
> >>>       pgd_t *pgd;
> >>> @@ -655,7 +678,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> >>>               next = pgd_addr_end(addr, end);
> >>>               if (pgd_bad(*pgd))
> >>>                       mask |= PGTBL_PGD_MODIFIED;
> >>> -             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> >>> +             err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
> >>>               if (err)
> >>>                       break;
> >>>       } while (pgd++, addr = next, addr != end);
> >>> @@ -678,27 +701,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> >>>  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> >>>               pgprot_t prot, struct page **pages, unsigned int page_shift)
> >>>  {
> >>> -     unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> >>> -
> >>>       WARN_ON(page_shift < PAGE_SHIFT);
> >>>
> >>> -     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> >>> -                     page_shift == PAGE_SHIFT)
> >>> -             return vmap_small_pages_range_noflush(addr, end, prot, pages);
> >>> +     if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
> >>> +             page_shift = PAGE_SHIFT;
> >>>
> >>> -     for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> >>> -             int err;
> >>> -
> >>> -             err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> >>> -                                     page_to_phys(pages[i]), prot,
> >>> -                                     page_shift);
> >>> -             if (err)
> >>> -                     return err;
> >>> -
> >>> -             addr += 1UL << page_shift;
> >>> -     }
> >>> -
> >>> -     return 0;
> >>> +     return vmap_pages_range_noflush_walk(addr, end, prot, pages,
> >>> +                     min(page_shift, PMD_SHIFT));
> >>
> >>
> >> We can easily extend this to PUD huge mappings, right? I'm not sure whether
> >> we should keep everything symmetric with how vmap_range_noflush() operates
> >> right now, since P4D mappings don't exist, but PUD looks worthwhile.
> >>
> >
> > A PUD mapping requires 1GB of physically contiguous memory, but the buddy
> > allocator's MAX_PAGE_ORDER is 10 (4MB with 4K pages), so the page_shift
> > passed to vmap_pages_range_noflush_walk() never exceeds PMD_SHIFT.
>
> Can we then just drop the min()? You can guard try_huge_pmd with
> shift >= PMD_SHIFT; the walker has everything it needs to work with
> a shift > PMD_SHIFT, so let's not confuse readers with this min() truncation.
>

Will drop the min() here.
Thanks.
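A minimal userspace sketch of the guard suggested above, assuming the PMD-level decision uses the same size and alignment tests as the existing vmap_try_huge_pmd() in mm/vmalloc.c (which already returns early when max_page_shift < PMD_SHIFT), and assuming an arm64 4K-granule layout (PAGE_SHIFT 12, PMD_SHIFT 21, PUD_SHIFT 30). pmd_leaf_ok() is a hypothetical name, and the sketch leaves out the arch hook and the existing-PTE-table check. It only illustrates why the min() clamp is redundant: any shift at or above PMD_SHIFT passes the same test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed arm64 4K-granule layout; values are not taken from the patch. */
#define PAGE_SHIFT 12u
#define PMD_SHIFT  21u
#define PUD_SHIFT  30u
#define PMD_SIZE   (UINT64_C(1) << PMD_SHIFT)

static_assert(PMD_SHIFT > PAGE_SHIFT && PUD_SHIFT > PMD_SHIFT,
	      "each table level maps a larger block than the one below");

static bool is_aligned(uint64_t x, uint64_t size)
{
	return (x & (size - 1)) == 0;
}

/*
 * Hypothetical model of the PMD-leaf decision: install a PMD block only
 * if the caller's shift allows it and [addr, end) is exactly one PMD with
 * both the virtual and physical addresses PMD-aligned. Any shift at or
 * above PMD_SHIFT qualifies, so clamping it with min() changes nothing.
 */
static bool pmd_leaf_ok(uint64_t addr, uint64_t end, uint64_t phys,
			unsigned int shift)
{
	if (shift < PMD_SHIFT)
		return false;	/* caller only allows PTE or cont-PTE leaves */
	if (end - addr != PMD_SIZE)
		return false;
	return is_aligned(addr, PMD_SIZE) && is_aligned(phys, PMD_SIZE);
}
```

For example, pmd_leaf_ok(0x200000, 0x400000, 0x40000000, PUD_SHIFT) is true, exactly as with PMD_SHIFT, while the same chunk with shift 16 (64K cont-PTE on 4K pages) is false and falls through to the PTE level.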

> >
> > Thanks,
> > Wen
> >>>  }
> >>>
> >>>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> >>
>

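The size argument in the reply above (a PUD needs 1GB, while MAX_PAGE_ORDER 10 caps a buddy block at 4MB) can be checked with a short calculation. This sketch assumes 4K pages and arm64 4-level tables (PMD_SHIFT 21, PUD_SHIFT 30) with MAX_PAGE_ORDER 10 as stated in the reply; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed configuration: 4K pages, arm64 4-level tables, MAX_PAGE_ORDER 10. */
#define PAGE_SHIFT     12u
#define PMD_SHIFT      21u
#define PUD_SHIFT      30u
#define MAX_PAGE_ORDER 10u

static_assert(PUD_SHIFT > PAGE_SHIFT + MAX_PAGE_ORDER,
	      "a max-order buddy block is smaller than a PUD");

/* Largest physically contiguous block a single buddy allocation returns. */
static uint64_t max_buddy_block_bytes(void)
{
	return UINT64_C(1) << (PAGE_SHIFT + MAX_PAGE_ORDER);	/* 4MB */
}

/* How many PMD leaves one max-order block can cover. */
static uint64_t pmds_per_max_block(void)
{
	return max_buddy_block_bytes() >> PMD_SHIFT;	/* 4MB / 2MB = 2 */
}

/* How many max-order blocks would have to be contiguous to fill one PUD. */
static uint64_t buddy_blocks_per_pud(void)
{
	return (UINT64_C(1) << PUD_SHIFT) / max_buddy_block_bytes();	/* 256 */
}
```

With these values a max-order block spans two PMDs but only 1/256 of a PUD, consistent with the reply's point that a single buddy allocation cannot back a PUD-sized leaf.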


End of thread (latest message: 2026-06-08  6:25 UTC).

Thread overview: 26+ messages
2026-05-22  5:31 [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Wen Jiang
2026-05-22  5:31 ` [PATCH v3 1/6] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Wen Jiang
2026-05-26  7:56   ` Dev Jain
2026-05-22  5:31 ` [PATCH v3 2/6] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Wen Jiang
2026-05-27  5:43   ` Dev Jain
2026-05-22  5:31 ` [PATCH v3 3/6] mm/vmalloc: Extract vmap_set_ptes() to consolidate PTE mapping logic Wen Jiang
2026-06-01 17:34   ` Uladzislau Rezki
2026-06-02  7:45     ` Wen Jiang
2026-05-22  5:31 ` [PATCH v3 4/6] mm/vmalloc: Extend page table walk to support larger page_shift sizes and eliminate page table rewalk Wen Jiang
2026-05-27  5:58   ` Dev Jain
2026-05-28  3:39     ` Wen Jiang
2026-05-29  5:28       ` Dev Jain
2026-06-05  6:02       ` Dev Jain
2026-06-08  6:25         ` Wen Jiang
2026-05-22  5:31 ` [PATCH v3 5/6] mm/vmalloc: map contiguous pages in batches for vmap() if possible Wen Jiang
2026-05-27  8:27   ` Dev Jain
2026-05-28  3:42     ` Wen Jiang
2026-05-29  5:57       ` Dev Jain
2026-06-02  7:34         ` Wen Jiang
2026-05-22  5:31 ` [PATCH v3 6/6] mm/vmalloc: align vm_area so vmap() can batch mappings Wen Jiang
2026-05-23  7:53   ` Uladzislau Rezki
2026-05-27  6:25   ` Dev Jain
2026-06-02  8:57     ` Wen Jiang
2026-05-22 18:07 ` [PATCH v3 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Andrew Morton
2026-05-23  8:26   ` Wen Jiang
2026-05-23 21:40     ` Andrew Morton
