* [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 10:32 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 2/8] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Barry Song (Xiaomi)
` (7 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
we can batch CONT_PTE settings instead of handling them individually.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf5640..bf31c11ebd3b 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
contig_ptes = CONT_PTES;
break;
default:
+ if (size < CONT_PMD_SIZE && size > 0 &&
+ IS_ALIGNED(size, CONT_PTE_SIZE)) {
+ contig_ptes = size >> PAGE_SHIFT;
+ *pgsize = PAGE_SIZE;
+ break;
+ }
WARN_ON(!__hugetlb_valid_size(size));
}
@@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
case CONT_PTE_SIZE:
return pte_mkcont(entry);
default:
+ if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
+ IS_ALIGNED(pagesize, CONT_PTE_SIZE))
+ return pte_mkcont(entry);
+
break;
}
pr_warn("%s: unrecognized huge page size 0x%lx\n",
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
2026-04-08 2:51 ` [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Barry Song (Xiaomi)
@ 2026-04-08 10:32 ` Dev Jain
2026-04-08 11:00 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2026-04-08 10:32 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> we can batch CONT_PTE settings instead of handling them individually.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index a42c05cf5640..bf31c11ebd3b 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> contig_ptes = CONT_PTES;
> break;
> default:
> + if (size < CONT_PMD_SIZE && size > 0 &&
> + IS_ALIGNED(size, CONT_PTE_SIZE)) {
Nit: Having the lower bound check before the upper bound is more natural
to read, so this should be size > 0 && size < CONT_PMD_SIZE (i.e. written
the other way around).
Also, the IS_ALIGNED check needs to go below the size checks.
> + contig_ptes = size >> PAGE_SHIFT;
> + *pgsize = PAGE_SIZE;
> + break;
> + }
> WARN_ON(!__hugetlb_valid_size(size));
> }
>
> @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
> case CONT_PTE_SIZE:
> return pte_mkcont(entry);
> default:
> + if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> + IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> + return pte_mkcont(entry);
> +
> break;
> }
> pr_warn("%s: unrecognized huge page size 0x%lx\n",
* Re: [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
2026-04-08 10:32 ` Dev Jain
@ 2026-04-08 11:00 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2026-04-08 11:00 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On Wed, Apr 8, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> > For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> > we can batch CONT_PTE settings instead of handling them individually.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> > arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> > index a42c05cf5640..bf31c11ebd3b 100644
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> > contig_ptes = CONT_PTES;
> > break;
> > default:
> > + if (size < CONT_PMD_SIZE && size > 0 &&
> > + IS_ALIGNED(size, CONT_PTE_SIZE)) {
>
> Nit: Having the lower bound check before the upper bound is more natural
> to read, so this should be size > 0 && size < CONT_PMD_SIZE (i.e. written
> the other way around).
Thanks very much for reviewing, Dev. As we discussed in patch 0/8, this
should be PMD_SIZE, not CONT_PMD_SIZE. I will use size > 0 && size < PMD_SIZE
in the next version.
>
> Also, the IS_ALIGNED check needs to go below the size checks.
Sure, thanks!
>
>
> > + contig_ptes = size >> PAGE_SHIFT;
> > + *pgsize = PAGE_SIZE;
> > + break;
> > + }
> > WARN_ON(!__hugetlb_valid_size(size));
> > }
> >
> > @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
> > case CONT_PTE_SIZE:
> > return pte_mkcont(entry);
> > default:
> > + if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> > + IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> > + return pte_mkcont(entry);
Here it should be pagesize > 0 && pagesize < PMD_SIZE as well :-)
> > +
> > break;
> > }
> > pr_warn("%s: unrecognized huge page size 0x%lx\n",
>
Best Regards
Barry
* [RFC PATCH 2/8] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes Barry Song (Xiaomi)
` (6 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE hugepages,
reducing both PTE setup and TLB flush iterations.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
arch/arm64/include/asm/vmalloc.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 4ec1acd3c1b3..9eea06d0f75d 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,8 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
unsigned long end, u64 pfn,
unsigned int max_page_shift)
{
+ unsigned long size;
+
/*
* If the block is at least CONT_PTE_SIZE in size, and is naturally
* aligned in both virtual and physical space, then we can pte-map the
@@ -40,7 +42,9 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
return PAGE_SIZE;
- return CONT_PTE_SIZE;
+ size = min3(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
+ size = 1UL << (fls(size) - 1);
+ return size;
}
#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
--
2.39.3 (Apple Git-146)
* [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 2/8] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 11:08 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 4/8] mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings Barry Song (Xiaomi)
` (5 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
vmap_small_pages_range_noflush() provides a clean interface by taking
struct page **pages and mapping them via direct PTE iteration. This
avoids the page table zigzag seen when using
vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
Extend it to support larger page_shift values, and add PMD- and
contiguous-PTE mappings as well.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 54 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 42 insertions(+), 12 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 57eae99d9909..5bf072297536 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -524,8 +524,9 @@ void vunmap_range(unsigned long addr, unsigned long end)
static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
+ unsigned int steps = 1;
int err = 0;
pte_t *pte;
@@ -543,6 +544,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
do {
struct page *page = pages[*nr];
+ steps = 1;
if (WARN_ON(!pte_none(ptep_get(pte)))) {
err = -EBUSY;
break;
@@ -556,9 +558,24 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
break;
}
+#ifdef CONFIG_HUGETLB_PAGE
+ if (shift != PAGE_SHIFT) {
+ unsigned long pfn = page_to_pfn(page), size;
+
+ size = arch_vmap_pte_range_map_size(addr, end, pfn, shift);
+ if (size != PAGE_SIZE) {
+ steps = size >> PAGE_SHIFT;
+ pte_t entry = pfn_pte(pfn, prot);
+
+ entry = arch_make_huge_pte(entry, ilog2(size), 0);
+ set_huge_pte_at(&init_mm, addr, pte, entry, size);
+ continue;
+ }
+ }
+#endif
+
set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
- (*nr)++;
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += steps, *nr += steps, addr += PAGE_SIZE * steps, addr != end);
lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
@@ -568,7 +585,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
pmd_t *pmd;
unsigned long next;
@@ -578,7 +595,20 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
+
+ if (shift == PMD_SHIFT) {
+ struct page *page = pages[*nr];
+ phys_addr_t phys_addr = page_to_phys(page);
+
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
+ shift)) {
+ *mask |= PGTBL_PMD_MODIFIED;
+ *nr += 1 << (shift - PAGE_SHIFT);
+ continue;
+ }
+ }
+
+ if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
@@ -586,7 +616,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
pud_t *pud;
unsigned long next;
@@ -596,7 +626,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
@@ -604,7 +634,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
p4d_t *p4d;
unsigned long next;
@@ -614,14 +644,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
}
static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages)
+ pgprot_t prot, struct page **pages, unsigned int shift)
{
unsigned long start = addr;
pgd_t *pgd;
@@ -636,7 +666,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
- err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+ err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
if (err)
break;
} while (pgd++, addr = next, addr != end);
@@ -665,7 +695,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
page_shift == PAGE_SHIFT)
- return vmap_small_pages_range_noflush(addr, end, prot, pages);
+ return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
int err;
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes
2026-04-08 2:51 ` [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes Barry Song (Xiaomi)
@ 2026-04-08 11:08 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 11:08 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> vmap_small_pages_range_noflush() provides a clean interface by taking
> struct page **pages and mapping them via direct PTE iteration. This
> avoids the page table zigzag seen when using
"Zigzag" is ambiguous. Just say "page table rewalk". Also please
elaborate on why the rewalk is happening currently.
> vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
>
> Extend it to support larger page_shift values, and add PMD- and
> contiguous-PTE mappings as well.
So we can drop the "small" here since now it supports larger chunks
as well.
Also at this point the code you add is a no-op since you pass PAGE_SHIFT.
Let us just squash patch 4 into this. This patch looks weird retaining
the pagetable-rewalk algorithm when it literally adds functionality
to avoid that.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> mm/vmalloc.c | 54 ++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 42 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 57eae99d9909..5bf072297536 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -524,8 +524,9 @@ void vunmap_range(unsigned long addr, unsigned long end)
>
> static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> + unsigned int steps = 1;
> int err = 0;
> pte_t *pte;
>
> @@ -543,6 +544,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> do {
> struct page *page = pages[*nr];
>
> + steps = 1;
> if (WARN_ON(!pte_none(ptep_get(pte)))) {
> err = -EBUSY;
> break;
> @@ -556,9 +558,24 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> break;
> }
>
> +#ifdef CONFIG_HUGETLB_PAGE
> + if (shift != PAGE_SHIFT) {
> + unsigned long pfn = page_to_pfn(page), size;
> +
> + size = arch_vmap_pte_range_map_size(addr, end, pfn, shift);
> + if (size != PAGE_SIZE) {
> + steps = size >> PAGE_SHIFT;
> + pte_t entry = pfn_pte(pfn, prot);
> +
> + entry = arch_make_huge_pte(entry, ilog2(size), 0);
> + set_huge_pte_at(&init_mm, addr, pte, entry, size);
> + continue;
> + }
> + }
> +#endif
> +
> set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> - (*nr)++;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + } while (pte += steps, *nr += steps, addr += PAGE_SIZE * steps, addr != end);
>
> lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> @@ -568,7 +585,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>
> static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> pmd_t *pmd;
> unsigned long next;
> @@ -578,7 +595,20 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> return -ENOMEM;
> do {
> next = pmd_addr_end(addr, end);
> - if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> +
> + if (shift == PMD_SHIFT) {
> + struct page *page = pages[*nr];
> + phys_addr_t phys_addr = page_to_phys(page);
> +
> + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> + shift)) {
> + *mask |= PGTBL_PMD_MODIFIED;
> + *nr += 1 << (shift - PAGE_SHIFT);
> + continue;
> + }
> + }
> +
> + if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (pmd++, addr = next, addr != end);
> return 0;
> @@ -586,7 +616,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>
> static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> pud_t *pud;
> unsigned long next;
> @@ -596,7 +626,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> return -ENOMEM;
> do {
> next = pud_addr_end(addr, end);
> - if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> + if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (pud++, addr = next, addr != end);
> return 0;
> @@ -604,7 +634,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>
> static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> p4d_t *p4d;
> unsigned long next;
> @@ -614,14 +644,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> return -ENOMEM;
> do {
> next = p4d_addr_end(addr, end);
> - if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> + if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (p4d++, addr = next, addr != end);
> return 0;
> }
>
> static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> - pgprot_t prot, struct page **pages)
> + pgprot_t prot, struct page **pages, unsigned int shift)
> {
> unsigned long start = addr;
> pgd_t *pgd;
> @@ -636,7 +666,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> next = pgd_addr_end(addr, end);
> if (pgd_bad(*pgd))
> mask |= PGTBL_PGD_MODIFIED;
> - err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
> if (err)
> break;
> } while (pgd++, addr = next, addr != end);
> @@ -665,7 +695,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> page_shift == PAGE_SHIFT)
> - return vmap_small_pages_range_noflush(addr, end, prot, pages);
> + return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
>
> for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> int err;
* [RFC PATCH 4/8] mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (2 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible Barry Song (Xiaomi)
` (4 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer
need to iterate over pages one by one, which would otherwise lead to
zigzag page table mappings.
The code is now unified with the PAGE_SHIFT case by simply
calling vmap_small_pages_range_noflush().
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5bf072297536..eba436386929 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -689,27 +689,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
- unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
WARN_ON(page_shift < PAGE_SHIFT);
- if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
- return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
-
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
- int err;
-
- err = vmap_range_noflush(addr, addr + (1UL << page_shift),
- page_to_phys(pages[i]), prot,
- page_shift);
- if (err)
- return err;
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
+ page_shift = PAGE_SHIFT;
- addr += 1UL << page_shift;
- }
-
- return 0;
+ return vmap_small_pages_range_noflush(addr, end, prot, pages,
+ min(page_shift, PMD_SHIFT));
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
--
2.39.3 (Apple Git-146)
* [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (3 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 4/8] mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 4:19 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 6/8] mm/vmalloc: align vm_area so vmap() can batch mappings Barry Song (Xiaomi)
` (3 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
In many cases, the pages passed to vmap() may include high-order
pages allocated with the __GFP_COMP flag. For example, the system heap
often allocates pages in descending order: order 8, then 4, then 0.
Currently, vmap() iterates over every page individually; even pages
inside a high-order block are handled one by one.
This patch detects high-order pages and maps each one as a single
contiguous block whenever possible.
An alternative would be to implement a new API, vmap_sg(), but that
change would be larger in scope.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index eba436386929..e8dbfada42bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3529,6 +3529,53 @@ void vunmap(const void *addr)
}
EXPORT_SYMBOL(vunmap);
+static inline int get_vmap_batch_order(struct page **pages,
+ unsigned int max_steps, unsigned int idx)
+{
+ unsigned int nr_pages;
+
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
+ ioremap_max_page_shift == PAGE_SHIFT)
+ return 0;
+
+ nr_pages = compound_nr(pages[idx]);
+ if (nr_pages == 1 || max_steps < nr_pages)
+ return 0;
+
+ if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+ return compound_order(pages[idx]);
+ return 0;
+}
+
+static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages)
+{
+ unsigned int count = (end - addr) >> PAGE_SHIFT;
+ int err;
+
+ err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
+ PAGE_SHIFT, GFP_KERNEL);
+ if (err)
+ goto out;
+
+ for (unsigned int i = 0; i < count; ) {
+ unsigned int shift = PAGE_SHIFT +
+ get_vmap_batch_order(pages, count - i, i);
+
+ err = vmap_range_noflush(addr, addr + (1UL << shift),
+ page_to_phys(pages[i]), prot, shift);
+ if (err)
+ goto out;
+
+ addr += 1UL << shift;
+ i += 1U << (shift - PAGE_SHIFT);
+ }
+
+out:
+ flush_cache_vmap(addr, end);
+ return err;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3572,8 +3619,8 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
addr = (unsigned long)area->addr;
- if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
- pages, PAGE_SHIFT) < 0) {
+ if (vmap_contig_pages_range(addr, addr + size, pgprot_nx(prot),
+ pages) < 0) {
vunmap(area->addr);
return NULL;
}
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 2:51 ` [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible Barry Song (Xiaomi)
@ 2026-04-08 4:19 ` Dev Jain
2026-04-08 5:12 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2026-04-08 4:19 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> In many cases, the pages passed to vmap() may include high-order
> pages allocated with the __GFP_COMP flag. For example, the system heap
> often allocates pages in descending order: order 8, then 4, then 0.
> Currently, vmap() iterates over every page individually; even pages
> inside a high-order block are handled one by one.
>
> This patch detects high-order pages and maps each one as a single
> contiguous block whenever possible.
>
> An alternative would be to implement a new API, vmap_sg(), but that
> change would be larger in scope.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
Coincidentally, I was working on the same thing :)
We have a use case regarding Arm TRBE and SPE aux buffers.
I'll take a look at your patches later, but my implementation is the
following, if you have any comments. I have squashed the patches into
a single diff.
From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
From: Dev Jain <dev.jain@arm.com>
Date: Thu, 26 Feb 2026 16:21:29 +0530
Subject: [PATCH] arm64/perf: map AUX buffer with large pages
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
.../hwtracing/coresight/coresight-etm-perf.c | 3 +-
drivers/hwtracing/coresight/coresight-trbe.c | 3 +-
drivers/perf/arm_spe_pmu.c | 5 +-
mm/vmalloc.c | 86 ++++++++++++++++---
4 files changed, 79 insertions(+), 18 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
index 72017dcc3b7f1..e90a430af86bb 100644
--- a/drivers/hwtracing/coresight/coresight-etm-perf.c
+++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
@@ -984,7 +984,8 @@ int __init etm_perf_init(void)
etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
PERF_PMU_CAP_ITRACE |
- PERF_PMU_CAP_AUX_PAUSE);
+ PERF_PMU_CAP_AUX_PAUSE |
+ PERF_PMU_CAP_AUX_PREFER_LARGE);
etm_pmu.attr_groups = etm_pmu_attr_groups;
etm_pmu.task_ctx_nr = perf_sw_context;
diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
index 1511f8eb95afb..74e6ad891e236 100644
--- a/drivers/hwtracing/coresight/coresight-trbe.c
+++ b/drivers/hwtracing/coresight/coresight-trbe.c
@@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
for (i = 0; i < nr_pages; i++)
pglist[i] = virt_to_page(pages[i]);
- buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
+ buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
+ VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
if (!buf->trbe_base) {
kfree(pglist);
kfree(buf);
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index dbd0da1116390..90c349fd66b2c 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
for (i = 0; i < nr_pages; ++i)
pglist[i] = virt_to_page(pages[i]);
- buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
+ buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
if (!buf->base)
goto out_free_pglist;
@@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
spe_pmu->pmu = (struct pmu) {
.module = THIS_MODULE,
.parent = &spe_pmu->pdev->dev,
- .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
+ .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
+ PERF_PMU_CAP_AUX_PREFER_LARGE,
.attr_groups = arm_spe_pmu_attr_groups,
/*
* We hitch a ride on the software context here, so that
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a44027..8482463d41203 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
+ unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
WARN_ON(page_shift < PAGE_SHIFT);
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
page_shift == PAGE_SHIFT)
return vmap_small_pages_range_noflush(addr, end, prot, pages);
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+ for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
int err;
err = vmap_range_noflush(addr, addr + (1UL << page_shift),
@@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
addr += 1UL << page_shift;
}
-
- return 0;
+ if (IS_ALIGNED(nr, step))
+ return 0;
+ return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
@@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
}
EXPORT_SYMBOL(vunmap);
+static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
+{
+ if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
+ return PMD_SHIFT;
+
+ return arch_vmap_pte_supported_shift(size);
+}
+
+static inline int __vmap_huge(struct page **pages, pgprot_t prot,
+ unsigned long addr, unsigned int count)
+{
+ unsigned int i = 0;
+ unsigned int shift;
+ unsigned long nr;
+
+ while (i < count) {
+ nr = num_pages_contiguous(pages + i, count - i);
+ shift = vm_shift(prot, nr << PAGE_SHIFT);
+ if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
+ pgprot_nx(prot), pages + i, shift) < 0) {
+ return 1;
+ }
+ i += nr;
+ addr += (nr << PAGE_SHIFT);
+ }
+ return 0;
+}
+
+static unsigned long max_contiguous_stride_order(struct page **pages,
+ pgprot_t prot, unsigned int count)
+{
+ unsigned long max_shift = PAGE_SHIFT;
+ unsigned int i = 0;
+
+ while (i < count) {
+ unsigned long nr = num_pages_contiguous(pages + i, count - i);
+ unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
+
+ max_shift = max(max_shift, shift);
+ i += nr;
+ }
+ return max_shift;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
size = (unsigned long)count << PAGE_SHIFT;
- area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ if (flags & VM_ALLOW_HUGE_VMAP) {
+ /* determine from page array, the max alignment */
+ unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
+
+ area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
+ VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
+ GFP_KERNEL, __builtin_return_address(0));
+ } else {
+ area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ }
if (!area)
return NULL;
addr = (unsigned long)area->addr;
- if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
- pages, PAGE_SHIFT) < 0) {
- vunmap(area->addr);
- return NULL;
+
+ if (flags & VM_ALLOW_HUGE_VMAP) {
+ if (__vmap_huge(pages, prot, addr, count)) {
+ vunmap(area->addr);
+ return NULL;
+ }
+ } else {
+ if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
+ pages, PAGE_SHIFT) < 0) {
+ vunmap(area->addr);
+ return NULL;
+ }
}
if (flags & VM_MAP_PUT_PAGES) {
@@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
* their allocations due to apply_to_page_range not
* supporting them.
*/
-
- if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
- shift = PMD_SHIFT;
- else
- shift = arch_vmap_pte_supported_shift(size);
+ shift = vm_shift(prot, size);
align = max(original_align, 1UL << shift);
}
--
2.34.1
> mm/vmalloc.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index eba436386929..e8dbfada42bc 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3529,6 +3529,53 @@ void vunmap(const void *addr)
> }
> EXPORT_SYMBOL(vunmap);
>
> +static inline int get_vmap_batch_order(struct page **pages,
> + unsigned int max_steps, unsigned int idx)
> +{
> + unsigned int nr_pages;
> +
> + if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
> + ioremap_max_page_shift == PAGE_SHIFT)
> + return 0;
> +
> + nr_pages = compound_nr(pages[idx]);
> + if (nr_pages == 1 || max_steps < nr_pages)
> + return 0;
> +
> + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
> + return compound_order(pages[idx]);
> + return 0;
> +}
> +
> +static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
> + pgprot_t prot, struct page **pages)
> +{
> + unsigned int count = (end - addr) >> PAGE_SHIFT;
> + int err;
> +
> + err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> + PAGE_SHIFT, GFP_KERNEL);
> + if (err)
> + goto out;
> +
> + for (unsigned int i = 0; i < count; ) {
> + unsigned int shift = PAGE_SHIFT +
> + get_vmap_batch_order(pages, count - i, i);
> +
> + err = vmap_range_noflush(addr, addr + (1UL << shift),
> + page_to_phys(pages[i]), prot, shift);
> + if (err)
> + goto out;
> +
> + addr += 1UL << shift;
> + i += 1U << (shift - PAGE_SHIFT);
> + }
> +
> +out:
> + flush_cache_vmap(addr, end);
> + return err;
> +}
> +
> /**
> * vmap - map an array of pages into virtually contiguous space
> * @pages: array of page pointers
> @@ -3572,8 +3619,8 @@ void *vmap(struct page **pages, unsigned int count,
> return NULL;
>
> addr = (unsigned long)area->addr;
> - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> - pages, PAGE_SHIFT) < 0) {
> + if (vmap_contig_pages_range(addr, addr + size, pgprot_nx(prot),
> + pages) < 0) {
> vunmap(area->addr);
> return NULL;
> }
* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 4:19 ` Dev Jain
@ 2026-04-08 5:12 ` Barry Song
2026-04-08 11:22 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-04-08 5:12 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On Wed, Apr 8, 2026 at 12:20 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> > In many cases, the pages passed to vmap() may include high-order
> > pages allocated with __GFP_COMP flags. For example, the system heap
> > often allocates pages in descending order: order 8, then 4, then 0.
> > Currently, vmap() iterates over every page individually—even pages
> > inside a high-order block are handled one by one.
> >
> > This patch detects high-order pages and maps them as a single
> > contiguous block whenever possible.
> >
> > An alternative would be to implement a new API, vmap_sg(), but that
> > change seems to be large in scope.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
>
> Coincidentally, I was working on the same thing :)
Interesting, thanks — at least I’ve got one good reviewer :-)
>
> We have a usecase regarding Arm TRBE and SPE aux buffers.
>
> I'll take a look at your patches later, but my implementation is the
Yes. Please.
> following, if you have any comments. I have squashed the patches into
> a single diff.
Thanks very much, Dev. What you’ve done is quite similar to
patches 5/8 and 6/8, although the code differs somewhat.
>
>
>
> From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
> From: Dev Jain <dev.jain@arm.com>
> Date: Thu, 26 Feb 2026 16:21:29 +0530
> Subject: [PATCH] arm64/perf: map AUX buffer with large pages
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> .../hwtracing/coresight/coresight-etm-perf.c | 3 +-
> drivers/hwtracing/coresight/coresight-trbe.c | 3 +-
> drivers/perf/arm_spe_pmu.c | 5 +-
> mm/vmalloc.c | 86 ++++++++++++++++---
> 4 files changed, 79 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
> index 72017dcc3b7f1..e90a430af86bb 100644
> --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
> +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
> @@ -984,7 +984,8 @@ int __init etm_perf_init(void)
>
> etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
> PERF_PMU_CAP_ITRACE |
> - PERF_PMU_CAP_AUX_PAUSE);
> + PERF_PMU_CAP_AUX_PAUSE |
> + PERF_PMU_CAP_AUX_PREFER_LARGE);
>
> etm_pmu.attr_groups = etm_pmu_attr_groups;
> etm_pmu.task_ctx_nr = perf_sw_context;
> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
> index 1511f8eb95afb..74e6ad891e236 100644
> --- a/drivers/hwtracing/coresight/coresight-trbe.c
> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
> @@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
> for (i = 0; i < nr_pages; i++)
> pglist[i] = virt_to_page(pages[i]);
>
> - buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
> + buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
> + VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
> if (!buf->trbe_base) {
> kfree(pglist);
> kfree(buf);
> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
> index dbd0da1116390..90c349fd66b2c 100644
> --- a/drivers/perf/arm_spe_pmu.c
> +++ b/drivers/perf/arm_spe_pmu.c
> @@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
> for (i = 0; i < nr_pages; ++i)
> pglist[i] = virt_to_page(pages[i]);
>
> - buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
> + buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
> if (!buf->base)
> goto out_free_pglist;
>
> @@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
> spe_pmu->pmu = (struct pmu) {
> .module = THIS_MODULE,
> .parent = &spe_pmu->pdev->dev,
> - .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
> + .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
> + PERF_PMU_CAP_AUX_PREFER_LARGE,
> .attr_groups = arm_spe_pmu_attr_groups,
> /*
> * We hitch a ride on the software context here, so that
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 61caa55a44027..8482463d41203 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> -
> + unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
> WARN_ON(page_shift < PAGE_SHIFT);
>
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> page_shift == PAGE_SHIFT)
> return vmap_small_pages_range_noflush(addr, end, prot, pages);
>
> - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> + for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
> int err;
>
> err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> @@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>
> addr += 1UL << page_shift;
> }
> -
> - return 0;
> + if (IS_ALIGNED(nr, step))
> + return 0;
> + return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
> }
>
> int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> @@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
> }
> EXPORT_SYMBOL(vunmap);
>
> +static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
> +{
> + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
> + return PMD_SHIFT;
> +
> + return arch_vmap_pte_supported_shift(size);
> +}
> +
> +static inline int __vmap_huge(struct page **pages, pgprot_t prot,
> + unsigned long addr, unsigned int count)
> +{
> + unsigned int i = 0;
> + unsigned int shift;
> + unsigned long nr;
> +
> + while (i < count) {
> + nr = num_pages_contiguous(pages + i, count - i);
> + shift = vm_shift(prot, nr << PAGE_SHIFT);
> + if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
> + pgprot_nx(prot), pages + i, shift) < 0) {
> + return 1;
> + }
One observation on my side is that the performance gain is somewhat
offset by the page-table zigzagging caused by what you are doing here:
iterating over each contiguous segment with a separate vmap_pages_range() call.
In patch 3/8, I enhanced vmap_small_pages_range_noflush() to
avoid repeated pgd → p4d → pud → pmd → pte traversals for page
shifts other than PAGE_SHIFT. This improves performance for
vmalloc as well as vmap(). Then, in patch 7/8, I adopt the new
vmap_small_pages_range_noflush() and eliminate the iteration.
> + i += nr;
> + addr += (nr << PAGE_SHIFT);
> + }
> + return 0;
> +}
> +
> +static unsigned long max_contiguous_stride_order(struct page **pages,
> + pgprot_t prot, unsigned int count)
> +{
> + unsigned long max_shift = PAGE_SHIFT;
> + unsigned int i = 0;
> +
> + while (i < count) {
> + unsigned long nr = num_pages_contiguous(pages + i, count - i);
> + unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
> +
> + max_shift = max(max_shift, shift);
> + i += nr;
> + }
> + return max_shift;
> +}
> +
> /**
> * vmap - map an array of pages into virtually contiguous space
> * @pages: array of page pointers
> @@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
> return NULL;
>
> size = (unsigned long)count << PAGE_SHIFT;
> - area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> + if (flags & VM_ALLOW_HUGE_VMAP) {
> + /* determine from page array, the max alignment */
> + unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
> +
> + area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
> + VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
> + GFP_KERNEL, __builtin_return_address(0));
> + } else {
> + area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> + }
> if (!area)
> return NULL;
>
> addr = (unsigned long)area->addr;
> - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> - pages, PAGE_SHIFT) < 0) {
> - vunmap(area->addr);
> - return NULL;
> +
> + if (flags & VM_ALLOW_HUGE_VMAP) {
> + if (__vmap_huge(pages, prot, addr, count)) {
> + vunmap(area->addr);
> + return NULL;
> + }
> + } else {
> + if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> + pages, PAGE_SHIFT) < 0) {
> + vunmap(area->addr);
> + return NULL;
> + }
> }
>
> if (flags & VM_MAP_PUT_PAGES) {
> @@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
> * their allocations due to apply_to_page_range not
> * supporting them.
> */
> -
> - if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
> - shift = PMD_SHIFT;
> - else
> - shift = arch_vmap_pte_supported_shift(size);
> + shift = vm_shift(prot, size);
What I actually did is different. In patches 1/8 and 2/8, I
extended the arm64 levels to support N * CONT_PTE, and let the
final PTE mapping use the maximum possible batch after avoiding
zigzag. This further improves all orders greater than CONT_PTE.
Thanks
Barry
* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 5:12 ` Barry Song
@ 2026-04-08 11:22 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 11:22 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 10:42 am, Barry Song wrote:
> On Wed, Apr 8, 2026 at 12:20 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
>>> In many cases, the pages passed to vmap() may include high-order
>>> pages allocated with __GFP_COMP flags. For example, the system heap
>>> often allocates pages in descending order: order 8, then 4, then 0.
>>> Currently, vmap() iterates over every page individually—even pages
>>> inside a high-order block are handled one by one.
>>>
>>> This patch detects high-order pages and maps them as a single
>>> contiguous block whenever possible.
>>>
>>> An alternative would be to implement a new API, vmap_sg(), but that
>>> change seems to be large in scope.
>>>
>>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>>> ---
>>
>> Coincidentally, I was working on the same thing :)
>
> Interesting, thanks — at least I’ve got one good reviewer :-)
>
>>
>> We have a usecase regarding Arm TRBE and SPE aux buffers.
>>
>> I'll take a look at your patches later, but my implementation is the
>
> Yes. Please.
>
>
>> following, if you have any comments. I have squashed the patches into
>> a single diff.
>
> Thanks very much, Dev. What you’ve done is quite similar to
> patches 5/8 and 6/8, although the code differs somewhat.
>
>>
>>
>>
>> From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
>> From: Dev Jain <dev.jain@arm.com>
>> Date: Thu, 26 Feb 2026 16:21:29 +0530
>> Subject: [PATCH] arm64/perf: map AUX buffer with large pages
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> .../hwtracing/coresight/coresight-etm-perf.c | 3 +-
>> drivers/hwtracing/coresight/coresight-trbe.c | 3 +-
>> drivers/perf/arm_spe_pmu.c | 5 +-
>> mm/vmalloc.c | 86 ++++++++++++++++---
>> 4 files changed, 79 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
>> index 72017dcc3b7f1..e90a430af86bb 100644
>> --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
>> +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
>> @@ -984,7 +984,8 @@ int __init etm_perf_init(void)
>>
>> etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
>> PERF_PMU_CAP_ITRACE |
>> - PERF_PMU_CAP_AUX_PAUSE);
>> + PERF_PMU_CAP_AUX_PAUSE |
>> + PERF_PMU_CAP_AUX_PREFER_LARGE);
>>
>> etm_pmu.attr_groups = etm_pmu_attr_groups;
>> etm_pmu.task_ctx_nr = perf_sw_context;
>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
>> index 1511f8eb95afb..74e6ad891e236 100644
>> --- a/drivers/hwtracing/coresight/coresight-trbe.c
>> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
>> @@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
>> for (i = 0; i < nr_pages; i++)
>> pglist[i] = virt_to_page(pages[i]);
>>
>> - buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
>> + buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
>> + VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
>> if (!buf->trbe_base) {
>> kfree(pglist);
>> kfree(buf);
>> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
>> index dbd0da1116390..90c349fd66b2c 100644
>> --- a/drivers/perf/arm_spe_pmu.c
>> +++ b/drivers/perf/arm_spe_pmu.c
>> @@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
>> for (i = 0; i < nr_pages; ++i)
>> pglist[i] = virt_to_page(pages[i]);
>>
>> - buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
>> + buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
>> if (!buf->base)
>> goto out_free_pglist;
>>
>> @@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
>> spe_pmu->pmu = (struct pmu) {
>> .module = THIS_MODULE,
>> .parent = &spe_pmu->pdev->dev,
>> - .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
>> + .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
>> + PERF_PMU_CAP_AUX_PREFER_LARGE,
>> .attr_groups = arm_spe_pmu_attr_groups,
>> /*
>> * We hitch a ride on the software context here, so that
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 61caa55a44027..8482463d41203 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> pgprot_t prot, struct page **pages, unsigned int page_shift)
>> {
>> unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>> -
>> + unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
>> WARN_ON(page_shift < PAGE_SHIFT);
>>
>> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> page_shift == PAGE_SHIFT)
>> return vmap_small_pages_range_noflush(addr, end, prot, pages);
>>
>> - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>> + for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
>> int err;
>>
>> err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>> @@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>
>> addr += 1UL << page_shift;
>> }
>> -
>> - return 0;
>> + if (IS_ALIGNED(nr, step))
>> + return 0;
>> + return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
>> }
>>
>> int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> @@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
>> }
>> EXPORT_SYMBOL(vunmap);
>>
>> +static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
>> +{
>> + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>> + return PMD_SHIFT;
>> +
>> + return arch_vmap_pte_supported_shift(size);
>> +}
>> +
>> +static inline int __vmap_huge(struct page **pages, pgprot_t prot,
>> + unsigned long addr, unsigned int count)
>> +{
>> + unsigned int i = 0;
>> + unsigned int shift;
>> + unsigned long nr;
>> +
>> + while (i < count) {
>> + nr = num_pages_contiguous(pages + i, count - i);
>> + shift = vm_shift(prot, nr << PAGE_SHIFT);
>> + if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
>> + pgprot_nx(prot), pages + i, shift) < 0) {
>> + return 1;
>> + }
>
> One observation on my side is that the performance gain is somewhat
> offset by the page-table zigzagging caused by what you are doing here:
> iterating over each contiguous segment with a separate vmap_pages_range() call.
I recall having observed this problem half a year back, and I wrote
code similar to what you did in patch 3, but I didn't observe any
performance improvement. I think that was because I was testing
vmalloc, where most of the cost lies in page allocation.
So it looks like this is indeed a benefit for vmap.
>
> In patch 3/8, I enhanced vmap_small_pages_range_noflush() to
> avoid repeated pgd → p4d → pud → pmd → pte traversals for page
> shifts other than PAGE_SHIFT. This improves performance for
> vmalloc as well as vmap(). Then, in patch 7/8, I adopt the new
> vmap_small_pages_range_noflush() and eliminate the iteration.
>
>> + i += nr;
>> + addr += (nr << PAGE_SHIFT);
>> + }
>> + return 0;
>> +}
>> +
>> +static unsigned long max_contiguous_stride_order(struct page **pages,
>> + pgprot_t prot, unsigned int count)
>> +{
>> + unsigned long max_shift = PAGE_SHIFT;
>> + unsigned int i = 0;
>> +
>> + while (i < count) {
>> + unsigned long nr = num_pages_contiguous(pages + i, count - i);
>> + unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
>> +
>> + max_shift = max(max_shift, shift);
>> + i += nr;
>> + }
>> + return max_shift;
>> +}
>> +
>> /**
>> * vmap - map an array of pages into virtually contiguous space
>> * @pages: array of page pointers
>> @@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
>> return NULL;
>>
>> size = (unsigned long)count << PAGE_SHIFT;
>> - area = get_vm_area_caller(size, flags, __builtin_return_address(0));
>> + if (flags & VM_ALLOW_HUGE_VMAP) {
>> + /* determine from page array, the max alignment */
>> + unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
>> +
>> + area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
>> + VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
>> + GFP_KERNEL, __builtin_return_address(0));
>> + } else {
>> + area = get_vm_area_caller(size, flags, __builtin_return_address(0));
>> + }
>> if (!area)
>> return NULL;
>>
>> addr = (unsigned long)area->addr;
>> - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>> - pages, PAGE_SHIFT) < 0) {
>> - vunmap(area->addr);
>> - return NULL;
>> +
>> + if (flags & VM_ALLOW_HUGE_VMAP) {
>> + if (__vmap_huge(pages, prot, addr, count)) {
>> + vunmap(area->addr);
>> + return NULL;
>> + }
>> + } else {
>> + if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>> + pages, PAGE_SHIFT) < 0) {
>> + vunmap(area->addr);
>> + return NULL;
>> + }
>> }
>>
>> if (flags & VM_MAP_PUT_PAGES) {
>> @@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
>> * their allocations due to apply_to_page_range not
>> * supporting them.
>> */
>> -
>> - if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>> - shift = PMD_SHIFT;
>> - else
>> - shift = arch_vmap_pte_supported_shift(size);
>> + shift = vm_shift(prot, size);
>
> What I actually did is different. In patches 1/8 and 2/8, I
> extended the arm64 levels to support N * CONT_PTE, and let the
> final PTE mapping use the maximum possible batch after avoiding
> zigzag. This further improves all orders greater than CONT_PTE.
>
> Thanks
> Barry
* [RFC PATCH 6/8] mm/vmalloc: align vm_area so vmap() can batch mappings
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (4 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag Barry Song (Xiaomi)
` (2 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
Try to align the vmap virtual address to PMD_SIZE, or to a
large PTE mapping size hinted by the architecture, so that
contiguous pages can be batch-mapped when setting up PMD or
PTE entries.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e8dbfada42bc..6643ec0288cd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3576,6 +3576,35 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
return err;
}
+static struct vm_struct *get_aligned_vm_area(unsigned long size, unsigned long flags)
+{
+ unsigned int shift = (size >= PMD_SIZE) ? PMD_SHIFT :
+ arch_vmap_pte_supported_shift(size);
+ struct vm_struct *vm_area = NULL;
+
+ /*
+ * Try to allocate an aligned vm_area so contiguous pages can be
+ * mapped in batches.
+ */
+ while (1) {
+ unsigned long align = 1UL << shift;
+
+ vm_area = __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ NUMA_NO_NODE, GFP_KERNEL,
+ __builtin_return_address(0));
+ if (vm_area || shift <= PAGE_SHIFT)
+ goto out;
+ if (shift == PMD_SHIFT)
+ shift = arch_vmap_pte_supported_shift(size);
+ else if (shift > PAGE_SHIFT)
+ shift = PAGE_SHIFT;
+ }
+
+out:
+ return vm_area;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3614,7 +3643,7 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
size = (unsigned long)count << PAGE_SHIFT;
- area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ area = get_aligned_vm_area(size, flags);
if (!area)
return NULL;
--
2.39.3 (Apple Git-146)
* [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (5 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 6/8] mm/vmalloc: align vm_area so vmap() can batch mappings Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 11:36 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 8/8] mm/vmalloc: Stop scanning for compound pages after encountering small pages in vmap Barry Song (Xiaomi)
2026-04-08 9:14 ` [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Dev Jain
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
For vmap(), detect pages with the same page_shift and map them in
batches, avoiding the pgtable zigzag caused by per-page mapping.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6643ec0288cd..3c3b7217693a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3551,6 +3551,8 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages)
{
unsigned int count = (end - addr) >> PAGE_SHIFT;
+ unsigned int prev_shift = 0, idx = 0;
+ unsigned long map_addr = addr;
int err;
err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
@@ -3562,15 +3564,29 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
unsigned int shift = PAGE_SHIFT +
get_vmap_batch_order(pages, count - i, i);
- err = vmap_range_noflush(addr, addr + (1UL << shift),
- page_to_phys(pages[i]), prot, shift);
- if (err)
- goto out;
+ if (!i)
+ prev_shift = shift;
+
+ if (shift != prev_shift) {
+ err = vmap_small_pages_range_noflush(map_addr, addr,
+ prot, pages + idx,
+ min(prev_shift, PMD_SHIFT));
+ if (err)
+ goto out;
+ prev_shift = shift;
+ map_addr = addr;
+ idx = i;
+ }
addr += 1UL << shift;
i += 1U << (shift - PAGE_SHIFT);
}
+ /* Remaining */
+ if (map_addr < end)
+ err = vmap_small_pages_range_noflush(map_addr, end,
+ prot, pages + idx, min(prev_shift, PMD_SHIFT));
+
out:
flush_cache_vmap(addr, end);
return err;
--
2.39.3 (Apple Git-146)
* Re: [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag
2026-04-08 2:51 ` [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag Barry Song (Xiaomi)
@ 2026-04-08 11:36 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 11:36 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> For vmap(), detect pages with the same page_shift and map them in
> batches, avoiding the pgtable zigzag caused by per-page mapping.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
In patch 4, you eliminate the pagetable rewalk, and in patch 5,
you re-introduce it, then in this patch you eliminate it again.
So please just squash this into #5.
> mm/vmalloc.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6643ec0288cd..3c3b7217693a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3551,6 +3551,8 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages)
> {
> unsigned int count = (end - addr) >> PAGE_SHIFT;
> + unsigned int prev_shift = 0, idx = 0;
> + unsigned long map_addr = addr;
> int err;
>
> err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> @@ -3562,15 +3564,29 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
> unsigned int shift = PAGE_SHIFT +
> get_vmap_batch_order(pages, count - i, i);
>
> - err = vmap_range_noflush(addr, addr + (1UL << shift),
> - page_to_phys(pages[i]), prot, shift);
> - if (err)
> - goto out;
> + if (!i)
> + prev_shift = shift;
> +
> + if (shift != prev_shift) {
> + err = vmap_small_pages_range_noflush(map_addr, addr,
> + prot, pages + idx,
> + min(prev_shift, PMD_SHIFT));
> + if (err)
> + goto out;
> + prev_shift = shift;
> + map_addr = addr;
> + idx = i;
> + }
>
> addr += 1UL << shift;
> i += 1U << (shift - PAGE_SHIFT);
> }
>
> + /* Remaining */
> + if (map_addr < end)
> + err = vmap_small_pages_range_noflush(map_addr, end,
> + prot, pages + idx, min(prev_shift, PMD_SHIFT));
> +
> out:
> flush_cache_vmap(addr, end);
> return err;
* [RFC PATCH 8/8] mm/vmalloc: Stop scanning for compound pages after encountering small pages in vmap
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (6 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 9:14 ` [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Dev Jain
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
Users typically allocate memory in descending page orders, e.g.
8 → 4 → 0. Once an order-0 page is encountered, the subsequent
pages are likely to be order-0 as well, so we stop scanning
for compound pages at that point.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3c3b7217693a..242f4bc1379c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3577,6 +3577,12 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
map_addr = addr;
idx = i;
}
+ /*
+ * Once small pages are encountered, the remaining pages
+ * are likely small as well
+ */
+ if (shift == PAGE_SHIFT)
+ break;
addr += 1UL << shift;
i += 1U << (shift - PAGE_SHIFT);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (7 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 8/8] mm/vmalloc: Stop scanning for compound pages after encountering small pages in vmap Barry Song (Xiaomi)
@ 2026-04-08 9:14 ` Dev Jain
2026-04-08 10:51 ` Barry Song
8 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2026-04-08 9:14 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:
>
> 1. Avoid page table zigzag when setting PTEs/PMDs for multiple memory
> segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
> layers
>
> Patches 1–2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
>
> Patches 3–4 extend vmap_small_pages_range_noflush() to support page
> shifts other than PAGE_SHIFT. This allows mapping multiple memory
> segments for vmalloc() without zigzagging page tables.
>
> Patches 5–8 add huge vmap support for contiguous pages. This not only
> improves performance but also enables PMD or CONT-PTE mapping for the
> vmapped area, reducing TLB pressure.
>
> Many thanks to Xueyuan Chen for his substantial testing efforts
> on RK3588 boards.
>
> On the RK3588 8-core ARM64 SoC, with tasks pinned to CPU2 and
> the performance CPUfreq policy enabled, Xueyuan’s tests report:
>
> * ioremap(1 MB): 1.2× faster
> * vmalloc(1 MB) mapping time (excluding allocation) with
> VM_ALLOW_HUGE_VMAP: 1.5× faster
> * vmap(): 5.6× faster when memory includes some order-8 pages,
> with no regression observed for order-0 pages
>
> Barry Song (Xiaomi) (8):
> arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
> setup
> arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
> CONT_PTE
> mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger
> page_shift sizes
> mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings
> mm/vmalloc: map contiguous pages in batches for vmap() if possible
> mm/vmalloc: align vm_area so vmap() can batch mappings
> mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable
> zigzag
> mm/vmalloc: Stop scanning for compound pages after encountering small
> pages in vmap
>
> arch/arm64/include/asm/vmalloc.h | 6 +-
> arch/arm64/mm/hugetlbpage.c | 10 ++
> mm/vmalloc.c | 178 +++++++++++++++++++++++++------
> 3 files changed, 161 insertions(+), 33 deletions(-)
>
On Linux VM on Apple M3, running mm-selftests:
./run_vmtests.sh -t "hugetlb"
TAP version 13
# -----------------------
# running ./hugepage-mmap
# -----------------------
# TAP version 13
# 1..1
# # Returned address is 0xffffe7c00000
[ 30.884630] kernel BUG at mm/page_table_check.c:86!
[ 30.884701] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 30.886803] Modules linked in:
[ 30.887217] CPU: 3 UID: 0 PID: 1869 Comm: hugepage-mmap Not tainted 7.0.0-rc5+ #86 PREEMPT
[ 30.888218] Hardware name: linux,dummy-virt (DT)
[ 30.889413] pstate: a1400005 (NzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 30.889901] pc : page_table_check_clear.part.0+0x128/0x1a0
[ 30.890337] lr : page_table_check_clear.part.0+0x7c/0x1a0
[ 30.890714] sp : ffff800084da3ad0
[ 30.890946] x29: ffff800084da3ad0 x28: 0000000000000001 x27: 0010000000000001
[ 30.891434] x26: 0040000000000040 x25: ffffa06bb8fb9000 x24: 00000000ffffffff
[ 30.891932] x23: 0000000000000001 x22: 0000000000000000 x21: ffffa06bb8997810
[ 30.892514] x20: 0000000000113e39 x19: 0000000000113e38 x18: 0000000000000000
[ 30.893007] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 30.893500] x14: ffffa06bb7013780 x13: 0000fffff7f90fff x12: 0000000000000000
[ 30.893990] x11: 1fffe0001a1282c1 x10: ffff0000d094160c x9 : ffffa06bb568a858
[ 30.894479] x8 : ffff5f95c8474000 x7 : 0000000000000000 x6 : ffff00017fffc500
[ 30.894973] x5 : ffff000191208fc0 x4 : 0000000000000000 x3 : 0000000000004000
[ 30.895449] x2 : 0000000000000000 x1 : 00000000ffffffff x0 : ffff0000c071f1b8
[ 30.895875] Call trace:
[ 30.896027] page_table_check_clear.part.0+0x128/0x1a0 (P)
[ 30.896369] page_table_check_clear+0xc8/0x138
[ 30.896776] __page_table_check_ptes_set+0xe4/0x1e8
[ 30.897073] __set_ptes_anysz+0x2e4/0x308
[ 30.897327] set_huge_pte_at+0xec/0x210
[ 30.897561] hugetlb_no_page+0x1ec/0x8e0
[ 30.897807] hugetlb_fault+0x188/0x740
[ 30.898036] handle_mm_fault+0x294/0x2c0
[ 30.898283] do_page_fault+0x120/0x748
[ 30.898539] do_translation_fault+0x68/0x90
[ 30.898796] do_mem_abort+0x4c/0xa8
[ 30.899011] el0_da+0x2c/0x90
[ 30.899205] el0t_64_sync_handler+0xd0/0xe8
[ 30.899461] el0t_64_sync+0x198/0x1a0
[ 30.899688] Code: 91001021 b8f80022 51000441 36fffd41 (d4210000)
[ 30.900053] ---[ end trace 0000000000000000 ]---
The bug is at
BUG_ON(atomic_dec_return(&ptc->file_map_count) < 0);
My tree is mm-unstable, commit 3fa44141e0bb.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
2026-04-08 9:14 ` [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Dev Jain
@ 2026-04-08 10:51 ` Barry Song
2026-04-08 10:55 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-04-08 10:51 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On Wed, Apr 8, 2026 at 5:14 PM Dev Jain <dev.jain@arm.com> wrote:
> On Linux VM on Apple M3, running mm-selftests:
Dev, thanks for your report. Sorry for the silly typo;
Xueyuan’s vmalloc/vmap tests don’t trigger that case yet.
It should be fixed by:
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index bf31c11ebd3b..25b9fce1ec6a 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,7 +110,7 @@ static inline int num_contig_ptes(unsigned long
size, size_t *pgsize)
contig_ptes = CONT_PTES;
break;
default:
- if (size < CONT_PMD_SIZE && size > 0 &&
+ if (size < PMD_SIZE && size > 0 &&
IS_ALIGNED(size, CONT_PTE_SIZE)) {
contig_ptes = size >> PAGE_SHIFT;
*pgsize = PAGE_SIZE;
@@ -365,7 +365,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int
shift, vm_flags_t flags)
case CONT_PTE_SIZE:
return pte_mkcont(entry);
default:
- if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
+ if (pagesize < PMD_SIZE && pagesize > 0 &&
IS_ALIGNED(pagesize, CONT_PTE_SIZE))
return pte_mkcont(entry);
Thanks
Barry
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
2026-04-08 10:51 ` Barry Song
@ 2026-04-08 10:55 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 10:55 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 4:21 pm, Barry Song wrote:
> On Wed, Apr 8, 2026 at 5:14 PM Dev Jain <dev.jain@arm.com> wrote:
>> On Linux VM on Apple M3, running mm-selftests:
>
> Dev, thanks for your report. Sorry for the silly typo—
> Xueyuan’s vmalloc/vmap tests don’t trigger that case yet.
>
> it should be fixed by:
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index bf31c11ebd3b..25b9fce1ec6a 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,7 +110,7 @@ static inline int num_contig_ptes(unsigned long
> size, size_t *pgsize)
> contig_ptes = CONT_PTES;
> break;
> default:
> - if (size < CONT_PMD_SIZE && size > 0 &&
> + if (size < PMD_SIZE && size > 0 &&
> IS_ALIGNED(size, CONT_PTE_SIZE)) {
> contig_ptes = size >> PAGE_SHIFT;
> *pgsize = PAGE_SIZE;
> @@ -365,7 +365,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int
> shift, vm_flags_t flags)
> case CONT_PTE_SIZE:
> return pte_mkcont(entry);
> default:
> - if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> + if (pagesize < PMD_SIZE && pagesize > 0 &&
> IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> return pte_mkcont(entry);
Yeah, indeed the problem was that a PMD-sized chunk was being treated
as 512 PTEs rather than a single PMD. This fixes it.
^ permalink raw reply [flat|nested] 19+ messages in thread