* [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 10:32 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 2/8] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Barry Song (Xiaomi)
` (7 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
we can batch CONT_PTE settings instead of handling them individually.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf5640..bf31c11ebd3b 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
contig_ptes = CONT_PTES;
break;
default:
+ if (size < CONT_PMD_SIZE && size > 0 &&
+ IS_ALIGNED(size, CONT_PTE_SIZE)) {
+ contig_ptes = size >> PAGE_SHIFT;
+ *pgsize = PAGE_SIZE;
+ break;
+ }
WARN_ON(!__hugetlb_valid_size(size));
}
@@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
case CONT_PTE_SIZE:
return pte_mkcont(entry);
default:
+ if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
+ IS_ALIGNED(pagesize, CONT_PTE_SIZE))
+ return pte_mkcont(entry);
+
break;
}
pr_warn("%s: unrecognized huge page size 0x%lx\n",
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
2026-04-08 2:51 ` [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Barry Song (Xiaomi)
@ 2026-04-08 10:32 ` Dev Jain
2026-04-08 11:00 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2026-04-08 10:32 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> we can batch CONT_PTE settings instead of handling them individually.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index a42c05cf5640..bf31c11ebd3b 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> contig_ptes = CONT_PTES;
> break;
> default:
> + if (size < CONT_PMD_SIZE && size > 0 &&
> + IS_ALIGNED(size, CONT_PTE_SIZE)) {
Nit: Having the lower bound check before the upper bound is more natural
to read, so this should be size > 0 && size < CONT_PMD_SIZE (i.e. written
the other way around).
Also, the IS_ALIGNED check needs to go below the size checks.
> + contig_ptes = size >> PAGE_SHIFT;
> + *pgsize = PAGE_SIZE;
> + break;
> + }
> WARN_ON(!__hugetlb_valid_size(size));
> }
>
> @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
> case CONT_PTE_SIZE:
> return pte_mkcont(entry);
> default:
> + if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> + IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> + return pte_mkcont(entry);
> +
> break;
> }
> pr_warn("%s: unrecognized huge page size 0x%lx\n",
* Re: [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup
2026-04-08 10:32 ` Dev Jain
@ 2026-04-08 11:00 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2026-04-08 11:00 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On Wed, Apr 8, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> > For sizes aligned to CONT_PTE_SIZE and smaller than PMD_SIZE,
> > we can batch CONT_PTE settings instead of handling them individually.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> > arch/arm64/mm/hugetlbpage.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> > index a42c05cf5640..bf31c11ebd3b 100644
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -110,6 +110,12 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> > contig_ptes = CONT_PTES;
> > break;
> > default:
> > + if (size < CONT_PMD_SIZE && size > 0 &&
> > + IS_ALIGNED(size, CONT_PTE_SIZE)) {
>
> Nit: Having the lower bound check before the upper bound is more natural
> to read, so this should be size > 0 && size < CONT_PMD_SIZE (i.e. written
> the other way around).
Thanks very much for reviewing, Dev. As we discussed in patch 0/8, this
should be PMD_SIZE, not CONT_PMD_SIZE. I will use size > 0 && size < PMD_SIZE
in the next version.
>
> Also, the IS_ALIGNED check needs to go below the size checks.
Sure, thanks!
>
>
> > + contig_ptes = size >> PAGE_SHIFT;
> > + *pgsize = PAGE_SIZE;
> > + break;
> > + }
> > WARN_ON(!__hugetlb_valid_size(size));
> > }
> >
> > @@ -359,6 +365,10 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
> > case CONT_PTE_SIZE:
> > return pte_mkcont(entry);
> > default:
> > + if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> > + IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> > + return pte_mkcont(entry);
Here it should be pagesize > 0 && pagesize < PMD_SIZE as well :-)
> > +
> > break;
> > }
> > pr_warn("%s: unrecognized huge page size 0x%lx\n",
>
Best Regards
Barry
* [RFC PATCH 2/8] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes Barry Song (Xiaomi)
` (6 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE hugepages,
reducing both PTE setup and TLB flush iterations.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
arch/arm64/include/asm/vmalloc.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 4ec1acd3c1b3..9eea06d0f75d 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,8 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
unsigned long end, u64 pfn,
unsigned int max_page_shift)
{
+ unsigned long size;
+
/*
* If the block is at least CONT_PTE_SIZE in size, and is naturally
* aligned in both virtual and physical space, then we can pte-map the
@@ -40,7 +42,9 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
return PAGE_SIZE;
- return CONT_PTE_SIZE;
+ size = min3(end - addr, 1UL << max_page_shift, PMD_SIZE >> 1);
+ size = 1UL << (fls(size) - 1);
+ return size;
}
#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
--
2.39.3 (Apple Git-146)
* [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 1/8] arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE setup Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 2/8] arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple CONT_PTE Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 11:08 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 4/8] mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings Barry Song (Xiaomi)
` (5 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
vmap_small_pages_range_noflush() provides a clean interface by taking
struct page **pages and mapping them via direct PTE iteration. This
avoids the page table zigzag seen when using
vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
Extend it to support larger page_shift values, and add PMD- and
contiguous-PTE mappings as well.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 54 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 42 insertions(+), 12 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 57eae99d9909..5bf072297536 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -524,8 +524,9 @@ void vunmap_range(unsigned long addr, unsigned long end)
static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
+ unsigned int steps = 1;
int err = 0;
pte_t *pte;
@@ -543,6 +544,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
do {
struct page *page = pages[*nr];
+ steps = 1;
if (WARN_ON(!pte_none(ptep_get(pte)))) {
err = -EBUSY;
break;
@@ -556,9 +558,24 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
break;
}
+#ifdef CONFIG_HUGETLB_PAGE
+ if (shift != PAGE_SHIFT) {
+ unsigned long pfn = page_to_pfn(page), size;
+
+ size = arch_vmap_pte_range_map_size(addr, end, pfn, shift);
+ if (size != PAGE_SIZE) {
+ steps = size >> PAGE_SHIFT;
+ pte_t entry = pfn_pte(pfn, prot);
+
+ entry = arch_make_huge_pte(entry, ilog2(size), 0);
+ set_huge_pte_at(&init_mm, addr, pte, entry, size);
+ continue;
+ }
+ }
+#endif
+
set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
- (*nr)++;
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += steps, *nr += steps, addr += PAGE_SIZE * steps, addr != end);
lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
@@ -568,7 +585,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
pmd_t *pmd;
unsigned long next;
@@ -578,7 +595,20 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
- if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
+
+ if (shift == PMD_SHIFT) {
+ struct page *page = pages[*nr];
+ phys_addr_t phys_addr = page_to_phys(page);
+
+ if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
+ shift)) {
+ *mask |= PGTBL_PMD_MODIFIED;
+ *nr += 1 << (shift - PAGE_SHIFT);
+ continue;
+ }
+ }
+
+ if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
@@ -586,7 +616,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
pud_t *pud;
unsigned long next;
@@ -596,7 +626,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
- if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
@@ -604,7 +634,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
- pgtbl_mod_mask *mask)
+ pgtbl_mod_mask *mask, unsigned int shift)
{
p4d_t *p4d;
unsigned long next;
@@ -614,14 +644,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
- if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
+ if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
}
static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages)
+ pgprot_t prot, struct page **pages, unsigned int shift)
{
unsigned long start = addr;
pgd_t *pgd;
@@ -636,7 +666,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
- err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+ err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
if (err)
break;
} while (pgd++, addr = next, addr != end);
@@ -665,7 +695,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
page_shift == PAGE_SHIFT)
- return vmap_small_pages_range_noflush(addr, end, prot, pages);
+ return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
int err;
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes
2026-04-08 2:51 ` [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes Barry Song (Xiaomi)
@ 2026-04-08 11:08 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 11:08 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> vmap_small_pages_range_noflush() provides a clean interface by taking
> struct page **pages and mapping them via direct PTE iteration. This
> avoids the page table zigzag seen when using
"Zigzag" is ambiguous. Just say "page table rewalk". Also please
elaborate on why the rewalk is happening currently.
> vmap_range_noflush() for page_shift values other than PAGE_SHIFT.
>
> Extend it to support larger page_shift values, and add PMD- and
> contiguous-PTE mappings as well.
So we can drop the "small" here since now it supports larger chunks
as well.
Also at this point the code you add is a no-op since you pass PAGE_SHIFT.
Let us just squash patch 4 into this. This patch looks weird retaining
the pagetable-rewalk algorithm when it literally adds functionality
to avoid that.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> mm/vmalloc.c | 54 ++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 42 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 57eae99d9909..5bf072297536 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -524,8 +524,9 @@ void vunmap_range(unsigned long addr, unsigned long end)
>
> static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> + unsigned int steps = 1;
> int err = 0;
> pte_t *pte;
>
> @@ -543,6 +544,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> do {
> struct page *page = pages[*nr];
>
> + steps = 1;
> if (WARN_ON(!pte_none(ptep_get(pte)))) {
> err = -EBUSY;
> break;
> @@ -556,9 +558,24 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> break;
> }
>
> +#ifdef CONFIG_HUGETLB_PAGE
> + if (shift != PAGE_SHIFT) {
> + unsigned long pfn = page_to_pfn(page), size;
> +
> + size = arch_vmap_pte_range_map_size(addr, end, pfn, shift);
> + if (size != PAGE_SIZE) {
> + steps = size >> PAGE_SHIFT;
> + pte_t entry = pfn_pte(pfn, prot);
> +
> + entry = arch_make_huge_pte(entry, ilog2(size), 0);
> + set_huge_pte_at(&init_mm, addr, pte, entry, size);
> + continue;
> + }
> + }
> +#endif
> +
> set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> - (*nr)++;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + } while (pte += steps, *nr += steps, addr += PAGE_SIZE * steps, addr != end);
>
> lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> @@ -568,7 +585,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>
> static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> pmd_t *pmd;
> unsigned long next;
> @@ -578,7 +595,20 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
> return -ENOMEM;
> do {
> next = pmd_addr_end(addr, end);
> - if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
> +
> + if (shift == PMD_SHIFT) {
> + struct page *page = pages[*nr];
> + phys_addr_t phys_addr = page_to_phys(page);
> +
> + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot,
> + shift)) {
> + *mask |= PGTBL_PMD_MODIFIED;
> + *nr += 1 << (shift - PAGE_SHIFT);
> + continue;
> + }
> + }
> +
> + if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (pmd++, addr = next, addr != end);
> return 0;
> @@ -586,7 +616,7 @@ static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
>
> static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> pud_t *pud;
> unsigned long next;
> @@ -596,7 +626,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
> return -ENOMEM;
> do {
> next = pud_addr_end(addr, end);
> - if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
> + if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (pud++, addr = next, addr != end);
> return 0;
> @@ -604,7 +634,7 @@ static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
>
> static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> - pgtbl_mod_mask *mask)
> + pgtbl_mod_mask *mask, unsigned int shift)
> {
> p4d_t *p4d;
> unsigned long next;
> @@ -614,14 +644,14 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> return -ENOMEM;
> do {
> next = p4d_addr_end(addr, end);
> - if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
> + if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask, shift))
> return -ENOMEM;
> } while (p4d++, addr = next, addr != end);
> return 0;
> }
>
> static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> - pgprot_t prot, struct page **pages)
> + pgprot_t prot, struct page **pages, unsigned int shift)
> {
> unsigned long start = addr;
> pgd_t *pgd;
> @@ -636,7 +666,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> next = pgd_addr_end(addr, end);
> if (pgd_bad(*pgd))
> mask |= PGTBL_PGD_MODIFIED;
> - err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask, shift);
> if (err)
> break;
> } while (pgd++, addr = next, addr != end);
> @@ -665,7 +695,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> page_shift == PAGE_SHIFT)
> - return vmap_small_pages_range_noflush(addr, end, prot, pages);
> + return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
>
> for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> int err;
* [RFC PATCH 4/8] mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (2 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 3/8] mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger page_shift sizes Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible Barry Song (Xiaomi)
` (4 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
For vmalloc() allocations with VM_ALLOW_HUGE_VMAP, we no longer
need to iterate over pages one by one, which would otherwise lead to
zigzag page table mappings.
The code is now unified with the PAGE_SHIFT case by simply
calling vmap_small_pages_range_noflush().
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5bf072297536..eba436386929 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -689,27 +689,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
- unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
WARN_ON(page_shift < PAGE_SHIFT);
- if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
- return vmap_small_pages_range_noflush(addr, end, prot, pages, PAGE_SHIFT);
-
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
- int err;
-
- err = vmap_range_noflush(addr, addr + (1UL << page_shift),
- page_to_phys(pages[i]), prot,
- page_shift);
- if (err)
- return err;
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
+ page_shift = PAGE_SHIFT;
- addr += 1UL << page_shift;
- }
-
- return 0;
+ return vmap_small_pages_range_noflush(addr, end, prot, pages,
+ min(page_shift, PMD_SHIFT));
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
--
2.39.3 (Apple Git-146)
* [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (3 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 4/8] mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 4:19 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 6/8] mm/vmalloc: align vm_area so vmap() can batch mappings Barry Song (Xiaomi)
` (3 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
In many cases, the pages passed to vmap() may include high-order
pages allocated with the __GFP_COMP flag. For example, the system heap
often allocates pages in descending order: order 8, then 4, then 0.
Currently, vmap() iterates over every page individually; even pages
inside a high-order block are handled one by one.
This patch detects high-order pages and maps each one as a single
contiguous block whenever possible.
An alternative would be to implement a new API, vmap_sg(), but that
change would be larger in scope.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index eba436386929..e8dbfada42bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3529,6 +3529,53 @@ void vunmap(const void *addr)
}
EXPORT_SYMBOL(vunmap);
+static inline int get_vmap_batch_order(struct page **pages,
+ unsigned int max_steps, unsigned int idx)
+{
+ unsigned int nr_pages;
+
+ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
+ ioremap_max_page_shift == PAGE_SHIFT)
+ return 0;
+
+ nr_pages = compound_nr(pages[idx]);
+ if (nr_pages == 1 || max_steps < nr_pages)
+ return 0;
+
+ if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+ return compound_order(pages[idx]);
+ return 0;
+}
+
+static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages)
+{
+ unsigned int count = (end - addr) >> PAGE_SHIFT;
+ int err;
+
+ err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
+ PAGE_SHIFT, GFP_KERNEL);
+ if (err)
+ goto out;
+
+ for (unsigned int i = 0; i < count; ) {
+ unsigned int shift = PAGE_SHIFT +
+ get_vmap_batch_order(pages, count - i, i);
+
+ err = vmap_range_noflush(addr, addr + (1UL << shift),
+ page_to_phys(pages[i]), prot, shift);
+ if (err)
+ goto out;
+
+ addr += 1UL << shift;
+ i += 1U << (shift - PAGE_SHIFT);
+ }
+
+out:
+ flush_cache_vmap(addr, end);
+ return err;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3572,8 +3619,8 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
addr = (unsigned long)area->addr;
- if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
- pages, PAGE_SHIFT) < 0) {
+ if (vmap_contig_pages_range(addr, addr + size, pgprot_nx(prot),
+ pages) < 0) {
vunmap(area->addr);
return NULL;
}
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 2:51 ` [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible Barry Song (Xiaomi)
@ 2026-04-08 4:19 ` Dev Jain
2026-04-08 5:12 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2026-04-08 4:19 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> In many cases, the pages passed to vmap() may include high-order
> pages allocated with the __GFP_COMP flag. For example, the system heap
> often allocates pages in descending order: order 8, then 4, then 0.
> Currently, vmap() iterates over every page individually; even pages
> inside a high-order block are handled one by one.
>
> This patch detects high-order pages and maps each one as a single
> contiguous block whenever possible.
>
> An alternative would be to implement a new API, vmap_sg(), but that
> change would be larger in scope.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
Coincidentally, I was working on the same thing :)
We have a use case regarding Arm TRBE and SPE aux buffers.
I'll take a look at your patches later, but my implementation is the
following, if you have any comments. I have squashed the patches into
a single diff.
From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
From: Dev Jain <dev.jain@arm.com>
Date: Thu, 26 Feb 2026 16:21:29 +0530
Subject: [PATCH] arm64/perf: map AUX buffer with large pages
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
.../hwtracing/coresight/coresight-etm-perf.c | 3 +-
drivers/hwtracing/coresight/coresight-trbe.c | 3 +-
drivers/perf/arm_spe_pmu.c | 5 +-
mm/vmalloc.c | 86 ++++++++++++++++---
4 files changed, 79 insertions(+), 18 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
index 72017dcc3b7f1..e90a430af86bb 100644
--- a/drivers/hwtracing/coresight/coresight-etm-perf.c
+++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
@@ -984,7 +984,8 @@ int __init etm_perf_init(void)
etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
PERF_PMU_CAP_ITRACE |
- PERF_PMU_CAP_AUX_PAUSE);
+ PERF_PMU_CAP_AUX_PAUSE |
+ PERF_PMU_CAP_AUX_PREFER_LARGE);
etm_pmu.attr_groups = etm_pmu_attr_groups;
etm_pmu.task_ctx_nr = perf_sw_context;
diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
index 1511f8eb95afb..74e6ad891e236 100644
--- a/drivers/hwtracing/coresight/coresight-trbe.c
+++ b/drivers/hwtracing/coresight/coresight-trbe.c
@@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
for (i = 0; i < nr_pages; i++)
pglist[i] = virt_to_page(pages[i]);
- buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
+ buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
+ VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
if (!buf->trbe_base) {
kfree(pglist);
kfree(buf);
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index dbd0da1116390..90c349fd66b2c 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
for (i = 0; i < nr_pages; ++i)
pglist[i] = virt_to_page(pages[i]);
- buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
+ buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
if (!buf->base)
goto out_free_pglist;
@@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
spe_pmu->pmu = (struct pmu) {
.module = THIS_MODULE,
.parent = &spe_pmu->pdev->dev,
- .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
+ .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
+ PERF_PMU_CAP_AUX_PREFER_LARGE,
.attr_groups = arm_spe_pmu_attr_groups,
/*
* We hitch a ride on the software context here, so that
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a44027..8482463d41203 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
-
+ unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
WARN_ON(page_shift < PAGE_SHIFT);
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
page_shift == PAGE_SHIFT)
return vmap_small_pages_range_noflush(addr, end, prot, pages);
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+ for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
int err;
err = vmap_range_noflush(addr, addr + (1UL << page_shift),
@@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
addr += 1UL << page_shift;
}
-
- return 0;
+ if (IS_ALIGNED(nr, step))
+ return 0;
+ return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
@@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
}
EXPORT_SYMBOL(vunmap);
+static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
+{
+ if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
+ return PMD_SHIFT;
+
+ return arch_vmap_pte_supported_shift(size);
+}
+
+static inline int __vmap_huge(struct page **pages, pgprot_t prot,
+ unsigned long addr, unsigned int count)
+{
+ unsigned int i = 0;
+ unsigned int shift;
+ unsigned long nr;
+
+ while (i < count) {
+ nr = num_pages_contiguous(pages + i, count - i);
+ shift = vm_shift(prot, nr << PAGE_SHIFT);
+ if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
+ pgprot_nx(prot), pages + i, shift) < 0) {
+ return 1;
+ }
+ i += nr;
+ addr += (nr << PAGE_SHIFT);
+ }
+ return 0;
+}
+
+static unsigned long max_contiguous_stride_order(struct page **pages,
+ pgprot_t prot, unsigned int count)
+{
+ unsigned long max_shift = PAGE_SHIFT;
+ unsigned int i = 0;
+
+ while (i < count) {
+ unsigned long nr = num_pages_contiguous(pages + i, count - i);
+ unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
+
+ max_shift = max(max_shift, shift);
+ i += nr;
+ }
+ return max_shift;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
size = (unsigned long)count << PAGE_SHIFT;
- area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ if (flags & VM_ALLOW_HUGE_VMAP) {
+ /* determine from page array, the max alignment */
+ unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
+
+ area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
+ VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
+ GFP_KERNEL, __builtin_return_address(0));
+ } else {
+ area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ }
if (!area)
return NULL;
addr = (unsigned long)area->addr;
- if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
- pages, PAGE_SHIFT) < 0) {
- vunmap(area->addr);
- return NULL;
+
+ if (flags & VM_ALLOW_HUGE_VMAP) {
+ if (__vmap_huge(pages, prot, addr, count)) {
+ vunmap(area->addr);
+ return NULL;
+ }
+ } else {
+ if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
+ pages, PAGE_SHIFT) < 0) {
+ vunmap(area->addr);
+ return NULL;
+ }
}
if (flags & VM_MAP_PUT_PAGES) {
@@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
* their allocations due to apply_to_page_range not
* supporting them.
*/
-
- if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
- shift = PMD_SHIFT;
- else
- shift = arch_vmap_pte_supported_shift(size);
+ shift = vm_shift(prot, size);
align = max(original_align, 1UL << shift);
}
--
2.34.1
> mm/vmalloc.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index eba436386929..e8dbfada42bc 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3529,6 +3529,53 @@ void vunmap(const void *addr)
> }
> EXPORT_SYMBOL(vunmap);
>
> +static inline int get_vmap_batch_order(struct page **pages,
> + unsigned int max_steps, unsigned int idx)
> +{
> + unsigned int nr_pages;
> +
> + if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) ||
> + ioremap_max_page_shift == PAGE_SHIFT)
> + return 0;
> +
> + nr_pages = compound_nr(pages[idx]);
> + if (nr_pages == 1 || max_steps < nr_pages)
> + return 0;
> +
> + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
> + return compound_order(pages[idx]);
> + return 0;
> +}
> +
> +static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
> + pgprot_t prot, struct page **pages)
> +{
> + unsigned int count = (end - addr) >> PAGE_SHIFT;
> + int err;
> +
> + err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> + PAGE_SHIFT, GFP_KERNEL);
> + if (err)
> + goto out;
> +
> + for (unsigned int i = 0; i < count; ) {
> + unsigned int shift = PAGE_SHIFT +
> + get_vmap_batch_order(pages, count - i, i);
> +
> + err = vmap_range_noflush(addr, addr + (1UL << shift),
> + page_to_phys(pages[i]), prot, shift);
> + if (err)
> + goto out;
> +
> + addr += 1UL << shift;
> + i += 1U << (shift - PAGE_SHIFT);
> + }
> +
> +out:
> + flush_cache_vmap(addr, end);
> + return err;
> +}
> +
> /**
> * vmap - map an array of pages into virtually contiguous space
> * @pages: array of page pointers
> @@ -3572,8 +3619,8 @@ void *vmap(struct page **pages, unsigned int count,
> return NULL;
>
> addr = (unsigned long)area->addr;
> - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> - pages, PAGE_SHIFT) < 0) {
> + if (vmap_contig_pages_range(addr, addr + size, pgprot_nx(prot),
> + pages) < 0) {
> vunmap(area->addr);
> return NULL;
> }
* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 4:19 ` Dev Jain
@ 2026-04-08 5:12 ` Barry Song
2026-04-08 11:22 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-04-08 5:12 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On Wed, Apr 8, 2026 at 12:20 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> > In many cases, the pages passed to vmap() may include high-order
> > pages allocated with __GFP_COMP flags. For example, the system heap
> > often allocates pages in descending order: order 8, then 4, then 0.
> > Currently, vmap() iterates over every page individually—even pages
> > inside a high-order block are handled one by one.
> >
> > This patch detects high-order pages and maps them as a single
> > contiguous block whenever possible.
> >
> > An alternative would be to implement a new API, vmap_sg(), but that
> > change seems to be large in scope.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
>
> Coincidentally, I was working on the same thing :)
Interesting, thanks — at least I’ve got one good reviewer :-)
>
> We have a usecase regarding Arm TRBE and SPE aux buffers.
>
> I'll take a look at your patches later, but my implementation is the
Yes. Please.
> following, if you have any comments. I have squashed the patches into
> a single diff.
Thanks very much, Dev. What you’ve done is quite similar to
patches 5/8 and 6/8, although the code differs somewhat.
>
>
>
> From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
> From: Dev Jain <dev.jain@arm.com>
> Date: Thu, 26 Feb 2026 16:21:29 +0530
> Subject: [PATCH] arm64/perf: map AUX buffer with large pages
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> .../hwtracing/coresight/coresight-etm-perf.c | 3 +-
> drivers/hwtracing/coresight/coresight-trbe.c | 3 +-
> drivers/perf/arm_spe_pmu.c | 5 +-
> mm/vmalloc.c | 86 ++++++++++++++++---
> 4 files changed, 79 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
> index 72017dcc3b7f1..e90a430af86bb 100644
> --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
> +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
> @@ -984,7 +984,8 @@ int __init etm_perf_init(void)
>
> etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
> PERF_PMU_CAP_ITRACE |
> - PERF_PMU_CAP_AUX_PAUSE);
> + PERF_PMU_CAP_AUX_PAUSE |
> + PERF_PMU_CAP_AUX_PREFER_LARGE);
>
> etm_pmu.attr_groups = etm_pmu_attr_groups;
> etm_pmu.task_ctx_nr = perf_sw_context;
> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
> index 1511f8eb95afb..74e6ad891e236 100644
> --- a/drivers/hwtracing/coresight/coresight-trbe.c
> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
> @@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
> for (i = 0; i < nr_pages; i++)
> pglist[i] = virt_to_page(pages[i]);
>
> - buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
> + buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
> + VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
> if (!buf->trbe_base) {
> kfree(pglist);
> kfree(buf);
> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
> index dbd0da1116390..90c349fd66b2c 100644
> --- a/drivers/perf/arm_spe_pmu.c
> +++ b/drivers/perf/arm_spe_pmu.c
> @@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
> for (i = 0; i < nr_pages; ++i)
> pglist[i] = virt_to_page(pages[i]);
>
> - buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
> + buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
> if (!buf->base)
> goto out_free_pglist;
>
> @@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
> spe_pmu->pmu = (struct pmu) {
> .module = THIS_MODULE,
> .parent = &spe_pmu->pdev->dev,
> - .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
> + .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
> + PERF_PMU_CAP_AUX_PREFER_LARGE,
> .attr_groups = arm_spe_pmu_attr_groups,
> /*
> * We hitch a ride on the software context here, so that
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 61caa55a44027..8482463d41203 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> -
> + unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
> WARN_ON(page_shift < PAGE_SHIFT);
>
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> page_shift == PAGE_SHIFT)
> return vmap_small_pages_range_noflush(addr, end, prot, pages);
>
> - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> + for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
> int err;
>
> err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> @@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>
> addr += 1UL << page_shift;
> }
> -
> - return 0;
> + if (IS_ALIGNED(nr, step))
> + return 0;
> + return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
> }
>
> int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> @@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
> }
> EXPORT_SYMBOL(vunmap);
>
> +static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
> +{
> + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
> + return PMD_SHIFT;
> +
> + return arch_vmap_pte_supported_shift(size);
> +}
> +
> +static inline int __vmap_huge(struct page **pages, pgprot_t prot,
> + unsigned long addr, unsigned int count)
> +{
> + unsigned int i = 0;
> + unsigned int shift;
> + unsigned long nr;
> +
> + while (i < count) {
> + nr = num_pages_contiguous(pages + i, count - i);
> + shift = vm_shift(prot, nr << PAGE_SHIFT);
> + if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
> + pgprot_nx(prot), pages + i, shift) < 0) {
> + return 1;
> + }
One observation on my side is that the performance gain is somewhat
offset by the page-table zigzagging caused by what you are doing here:
iterating over each contiguous segment with a separate vmap_pages_range() call.
In patch 3/8, I enhanced vmap_small_pages_range_noflush() to
avoid repeated pgd → p4d → pud → pmd → pte traversals for page
shifts other than PAGE_SHIFT. This improves performance for
vmalloc as well as vmap(). Then, in patch 7/8, I adopt the new
vmap_small_pages_range_noflush() and eliminate the iteration.
> + i += nr;
> + addr += (nr << PAGE_SHIFT);
> + }
> + return 0;
> +}
> +
> +static unsigned long max_contiguous_stride_order(struct page **pages,
> + pgprot_t prot, unsigned int count)
> +{
> + unsigned long max_shift = PAGE_SHIFT;
> + unsigned int i = 0;
> +
> + while (i < count) {
> + unsigned long nr = num_pages_contiguous(pages + i, count - i);
> + unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
> +
> + max_shift = max(max_shift, shift);
> + i += nr;
> + }
> + return max_shift;
> +}
> +
> /**
> * vmap - map an array of pages into virtually contiguous space
> * @pages: array of page pointers
> @@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
> return NULL;
>
> size = (unsigned long)count << PAGE_SHIFT;
> - area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> + if (flags & VM_ALLOW_HUGE_VMAP) {
> + /* determine from page array, the max alignment */
> + unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
> +
> + area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
> + VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
> + GFP_KERNEL, __builtin_return_address(0));
> + } else {
> + area = get_vm_area_caller(size, flags, __builtin_return_address(0));
> + }
> if (!area)
> return NULL;
>
> addr = (unsigned long)area->addr;
> - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> - pages, PAGE_SHIFT) < 0) {
> - vunmap(area->addr);
> - return NULL;
> +
> + if (flags & VM_ALLOW_HUGE_VMAP) {
> + if (__vmap_huge(pages, prot, addr, count)) {
> + vunmap(area->addr);
> + return NULL;
> + }
> + } else {
> + if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
> + pages, PAGE_SHIFT) < 0) {
> + vunmap(area->addr);
> + return NULL;
> + }
> }
>
> if (flags & VM_MAP_PUT_PAGES) {
> @@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
> * their allocations due to apply_to_page_range not
> * supporting them.
> */
> -
> - if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
> - shift = PMD_SHIFT;
> - else
> - shift = arch_vmap_pte_supported_shift(size);
> + shift = vm_shift(prot, size);
What I actually did is different. In patches 1/8 and 2/8, I
extended the arm64 levels to support N * CONT_PTE, and let the
final PTE mapping use the maximum possible batch after avoiding
zigzag. This further improves all orders greater than CONT_PTE.
Thanks
Barry
* Re: [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible
2026-04-08 5:12 ` Barry Song
@ 2026-04-08 11:22 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 11:22 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 10:42 am, Barry Song wrote:
> On Wed, Apr 8, 2026 at 12:20 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
>>> In many cases, the pages passed to vmap() may include high-order
>>> pages allocated with __GFP_COMP flags. For example, the system heap
>>> often allocates pages in descending order: order 8, then 4, then 0.
>>> Currently, vmap() iterates over every page individually—even pages
>>> inside a high-order block are handled one by one.
>>>
>>> This patch detects high-order pages and maps them as a single
>>> contiguous block whenever possible.
>>>
>>> An alternative would be to implement a new API, vmap_sg(), but that
>>> change seems to be large in scope.
>>>
>>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>>> ---
>>
>> Coincidentally, I was working on the same thing :)
>
> Interesting, thanks — at least I’ve got one good reviewer :-)
>
>>
>> We have a usecase regarding Arm TRBE and SPE aux buffers.
>>
>> I'll take a look at your patches later, but my implementation is the
>
> Yes. Please.
>
>
>> following, if you have any comments. I have squashed the patches into
>> a single diff.
>
> Thanks very much, Dev. What you’ve done is quite similar to
> patches 5/8 and 6/8, although the code differs somewhat.
>
>>
>>
>>
>> From ccb9670a52b7f50b1f1e07b579a1316f76b84811 Mon Sep 17 00:00:00 2001
>> From: Dev Jain <dev.jain@arm.com>
>> Date: Thu, 26 Feb 2026 16:21:29 +0530
>> Subject: [PATCH] arm64/perf: map AUX buffer with large pages
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> .../hwtracing/coresight/coresight-etm-perf.c | 3 +-
>> drivers/hwtracing/coresight/coresight-trbe.c | 3 +-
>> drivers/perf/arm_spe_pmu.c | 5 +-
>> mm/vmalloc.c | 86 ++++++++++++++++---
>> 4 files changed, 79 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
>> index 72017dcc3b7f1..e90a430af86bb 100644
>> --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
>> +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
>> @@ -984,7 +984,8 @@ int __init etm_perf_init(void)
>>
>> etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
>> PERF_PMU_CAP_ITRACE |
>> - PERF_PMU_CAP_AUX_PAUSE);
>> + PERF_PMU_CAP_AUX_PAUSE |
>> + PERF_PMU_CAP_AUX_PREFER_LARGE);
>>
>> etm_pmu.attr_groups = etm_pmu_attr_groups;
>> etm_pmu.task_ctx_nr = perf_sw_context;
>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
>> index 1511f8eb95afb..74e6ad891e236 100644
>> --- a/drivers/hwtracing/coresight/coresight-trbe.c
>> +++ b/drivers/hwtracing/coresight/coresight-trbe.c
>> @@ -760,7 +760,8 @@ static void *arm_trbe_alloc_buffer(struct coresight_device *csdev,
>> for (i = 0; i < nr_pages; i++)
>> pglist[i] = virt_to_page(pages[i]);
>>
>> - buf->trbe_base = (unsigned long)vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
>> + buf->trbe_base = (unsigned long)vmap(pglist, nr_pages,
>> + VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
>> if (!buf->trbe_base) {
>> kfree(pglist);
>> kfree(buf);
>> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
>> index dbd0da1116390..90c349fd66b2c 100644
>> --- a/drivers/perf/arm_spe_pmu.c
>> +++ b/drivers/perf/arm_spe_pmu.c
>> @@ -1027,7 +1027,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
>> for (i = 0; i < nr_pages; ++i)
>> pglist[i] = virt_to_page(pages[i]);
>>
>> - buf->base = vmap(pglist, nr_pages, VM_MAP, PAGE_KERNEL);
>> + buf->base = vmap(pglist, nr_pages, VM_MAP | VM_ALLOW_HUGE_VMAP, PAGE_KERNEL);
>> if (!buf->base)
>> goto out_free_pglist;
>>
>> @@ -1064,7 +1064,8 @@ static int arm_spe_pmu_perf_init(struct arm_spe_pmu *spe_pmu)
>> spe_pmu->pmu = (struct pmu) {
>> .module = THIS_MODULE,
>> .parent = &spe_pmu->pdev->dev,
>> - .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE,
>> + .capabilities = PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE |
>> + PERF_PMU_CAP_AUX_PREFER_LARGE,
>> .attr_groups = arm_spe_pmu_attr_groups,
>> /*
>> * We hitch a ride on the software context here, so that
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 61caa55a44027..8482463d41203 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -660,14 +660,14 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> pgprot_t prot, struct page **pages, unsigned int page_shift)
>> {
>> unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>> -
>> + unsigned long step = 1UL << (page_shift - PAGE_SHIFT);
>> WARN_ON(page_shift < PAGE_SHIFT);
>>
>> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> page_shift == PAGE_SHIFT)
>> return vmap_small_pages_range_noflush(addr, end, prot, pages);
>>
>> - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>> + for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
>> int err;
>>
>> err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>> @@ -678,8 +678,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>
>> addr += 1UL << page_shift;
>> }
>> -
>> - return 0;
>> + if (IS_ALIGNED(nr, step))
>> + return 0;
>> + return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
>> }
>>
>> int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> @@ -3514,6 +3515,50 @@ void vunmap(const void *addr)
>> }
>> EXPORT_SYMBOL(vunmap);
>>
>> +static inline unsigned int vm_shift(pgprot_t prot, unsigned long size)
>> +{
>> + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>> + return PMD_SHIFT;
>> +
>> + return arch_vmap_pte_supported_shift(size);
>> +}
>> +
>> +static inline int __vmap_huge(struct page **pages, pgprot_t prot,
>> + unsigned long addr, unsigned int count)
>> +{
>> + unsigned int i = 0;
>> + unsigned int shift;
>> + unsigned long nr;
>> +
>> + while (i < count) {
>> + nr = num_pages_contiguous(pages + i, count - i);
>> + shift = vm_shift(prot, nr << PAGE_SHIFT);
>> + if (vmap_pages_range(addr, addr + (nr << PAGE_SHIFT),
>> + pgprot_nx(prot), pages + i, shift) < 0) {
>> + return 1;
>> + }
>
> One observation on my side is that the performance gain is somewhat
> offset by the page-table zigzagging caused by what you are doing here:
> iterating over each contiguous segment with a separate vmap_pages_range() call.
I recall having observed this problem half a year back, and I wrote
code similar to what you did in patch 3, but I didn't observe any
performance improvement. I think that was because I was testing
vmalloc, where most of the cost lies in page allocation.
So it looks like this is indeed a benefit for vmap.
>
> In patch 3/8, I enhanced vmap_small_pages_range_noflush() to
> avoid repeated pgd → p4d → pud → pmd → pte traversals for page
> shifts other than PAGE_SHIFT. This improves performance for
> vmalloc as well as vmap(). Then, in patch 7/8, I adopt the new
> vmap_small_pages_range_noflush() and eliminate the iteration.
>
>> + i += nr;
>> + addr += (nr << PAGE_SHIFT);
>> + }
>> + return 0;
>> +}
>> +
>> +static unsigned long max_contiguous_stride_order(struct page **pages,
>> + pgprot_t prot, unsigned int count)
>> +{
>> + unsigned long max_shift = PAGE_SHIFT;
>> + unsigned int i = 0;
>> +
>> + while (i < count) {
>> + unsigned long nr = num_pages_contiguous(pages + i, count - i);
>> + unsigned long shift = vm_shift(prot, nr << PAGE_SHIFT);
>> +
>> + max_shift = max(max_shift, shift);
>> + i += nr;
>> + }
>> + return max_shift;
>> +}
>> +
>> /**
>> * vmap - map an array of pages into virtually contiguous space
>> * @pages: array of page pointers
>> @@ -3552,15 +3597,32 @@ void *vmap(struct page **pages, unsigned int count,
>> return NULL;
>>
>> size = (unsigned long)count << PAGE_SHIFT;
>> - area = get_vm_area_caller(size, flags, __builtin_return_address(0));
>> + if (flags & VM_ALLOW_HUGE_VMAP) {
>> + /* determine from page array, the max alignment */
>> + unsigned long max_shift = max_contiguous_stride_order(pages, prot, count);
>> +
>> + area = __get_vm_area_node(size, 1 << max_shift, max_shift, flags,
>> + VMALLOC_START, VMALLOC_END, NUMA_NO_NODE,
>> + GFP_KERNEL, __builtin_return_address(0));
>> + } else {
>> + area = get_vm_area_caller(size, flags, __builtin_return_address(0));
>> + }
>> if (!area)
>> return NULL;
>>
>> addr = (unsigned long)area->addr;
>> - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>> - pages, PAGE_SHIFT) < 0) {
>> - vunmap(area->addr);
>> - return NULL;
>> +
>> + if (flags & VM_ALLOW_HUGE_VMAP) {
>> + if (__vmap_huge(pages, prot, addr, count)) {
>> + vunmap(area->addr);
>> + return NULL;
>> + }
>> + } else {
>> + if (vmap_pages_range(addr, addr + size, pgprot_nx(prot),
>> + pages, PAGE_SHIFT) < 0) {
>> + vunmap(area->addr);
>> + return NULL;
>> + }
>> }
>>
>> if (flags & VM_MAP_PUT_PAGES) {
>> @@ -4011,11 +4073,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
>> * their allocations due to apply_to_page_range not
>> * supporting them.
>> */
>> -
>> - if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>> - shift = PMD_SHIFT;
>> - else
>> - shift = arch_vmap_pte_supported_shift(size);
>> + shift = vm_shift(prot, size);
>
> What I actually did is different. In patches 1/8 and 2/8, I
> extended the arm64 levels to support N * CONT_PTE, and let the
> final PTE mapping use the maximum possible batch after avoiding
> zigzag. This further improves all orders greater than CONT_PTE.
>
> Thanks
> Barry
* [RFC PATCH 6/8] mm/vmalloc: align vm_area so vmap() can batch mappings
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (4 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 5/8] mm/vmalloc: map contiguous pages in batches for vmap() if possible Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 2:51 ` [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag Barry Song (Xiaomi)
` (2 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
Try to align the vmap virtual address to PMD_SIZE, or to a
large PTE mapping size hinted by the architecture, so that
contiguous pages can be batch-mapped when setting up PMD or
PTE entries.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e8dbfada42bc..6643ec0288cd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3576,6 +3576,35 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
return err;
}
+static struct vm_struct *get_aligned_vm_area(unsigned long size, unsigned long flags)
+{
+ unsigned int shift = (size >= PMD_SIZE) ? PMD_SHIFT :
+ arch_vmap_pte_supported_shift(size);
+ struct vm_struct *vm_area = NULL;
+
+ /*
+ * Try to allocate an aligned vm_area so contiguous pages can be
+ * mapped in batches.
+ */
+ while (1) {
+ unsigned long align = 1UL << shift;
+
+ vm_area = __get_vm_area_node(size, align, PAGE_SHIFT, flags,
+ VMALLOC_START, VMALLOC_END,
+ NUMA_NO_NODE, GFP_KERNEL,
+ __builtin_return_address(0));
+ if (vm_area || shift <= PAGE_SHIFT)
+ goto out;
+ if (shift == PMD_SHIFT)
+ shift = arch_vmap_pte_supported_shift(size);
+ else if (shift > PAGE_SHIFT)
+ shift = PAGE_SHIFT;
+ }
+
+out:
+ return vm_area;
+}
+
/**
* vmap - map an array of pages into virtually contiguous space
* @pages: array of page pointers
@@ -3614,7 +3643,7 @@ void *vmap(struct page **pages, unsigned int count,
return NULL;
size = (unsigned long)count << PAGE_SHIFT;
- area = get_vm_area_caller(size, flags, __builtin_return_address(0));
+ area = get_aligned_vm_area(size, flags);
if (!area)
return NULL;
--
2.39.3 (Apple Git-146)
* [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (5 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 6/8] mm/vmalloc: align vm_area so vmap() can batch mappings Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 11:36 ` Dev Jain
2026-04-08 2:51 ` [RFC PATCH 8/8] mm/vmalloc: Stop scanning for compound pages after encountering small pages in vmap Barry Song (Xiaomi)
2026-04-08 9:14 ` [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Dev Jain
8 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
For vmap(), detect pages with the same page_shift and map them in
batches, avoiding the pgtable zigzag caused by per-page mapping.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6643ec0288cd..3c3b7217693a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3551,6 +3551,8 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages)
{
unsigned int count = (end - addr) >> PAGE_SHIFT;
+ unsigned int prev_shift = 0, idx = 0;
+ unsigned long map_addr = addr;
int err;
err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
@@ -3562,15 +3564,29 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
unsigned int shift = PAGE_SHIFT +
get_vmap_batch_order(pages, count - i, i);
- err = vmap_range_noflush(addr, addr + (1UL << shift),
- page_to_phys(pages[i]), prot, shift);
- if (err)
- goto out;
+ if (!i)
+ prev_shift = shift;
+
+ if (shift != prev_shift) {
+ err = vmap_small_pages_range_noflush(map_addr, addr,
+ prot, pages + idx,
+ min(prev_shift, PMD_SHIFT));
+ if (err)
+ goto out;
+ prev_shift = shift;
+ map_addr = addr;
+ idx = i;
+ }
addr += 1UL << shift;
i += 1U << (shift - PAGE_SHIFT);
}
+ /* Remaining */
+ if (map_addr < end)
+ err = vmap_small_pages_range_noflush(map_addr, end,
+ prot, pages + idx, min(prev_shift, PMD_SHIFT));
+
out:
flush_cache_vmap(addr, end);
return err;
--
2.39.3 (Apple Git-146)
* Re: [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag
2026-04-08 2:51 ` [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag Barry Song (Xiaomi)
@ 2026-04-08 11:36 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 11:36 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> For vmap(), detect pages with the same page_shift and map them in
> batches, avoiding the pgtable zigzag caused by per-page mapping.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
In patch 4, you eliminate the pagetable rewalk, and in patch 5,
you re-introduce it, then in this patch you eliminate it again.
So please just squash this into #5.
> mm/vmalloc.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6643ec0288cd..3c3b7217693a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3551,6 +3551,8 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages)
> {
> unsigned int count = (end - addr) >> PAGE_SHIFT;
> + unsigned int prev_shift = 0, idx = 0;
> + unsigned long map_addr = addr;
> int err;
>
> err = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
> @@ -3562,15 +3564,29 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
> unsigned int shift = PAGE_SHIFT +
> get_vmap_batch_order(pages, count - i, i);
>
> - err = vmap_range_noflush(addr, addr + (1UL << shift),
> - page_to_phys(pages[i]), prot, shift);
> - if (err)
> - goto out;
> + if (!i)
> + prev_shift = shift;
> +
> + if (shift != prev_shift) {
> + err = vmap_small_pages_range_noflush(map_addr, addr,
> + prot, pages + idx,
> + min(prev_shift, PMD_SHIFT));
> + if (err)
> + goto out;
> + prev_shift = shift;
> + map_addr = addr;
> + idx = i;
> + }
>
> addr += 1UL << shift;
> i += 1U << (shift - PAGE_SHIFT);
> }
>
> + /* Remaining */
> + if (map_addr < end)
> + err = vmap_small_pages_range_noflush(map_addr, end,
> + prot, pages + idx, min(prev_shift, PMD_SHIFT));
> +
> out:
> flush_cache_vmap(addr, end);
> return err;
* [RFC PATCH 8/8] mm/vmalloc: Stop scanning for compound pages after encountering small pages in vmap
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (6 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 7/8] mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable zigzag Barry Song (Xiaomi)
@ 2026-04-08 2:51 ` Barry Song (Xiaomi)
2026-04-08 9:14 ` [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Dev Jain
8 siblings, 0 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-08 2:51 UTC (permalink / raw)
To: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21, Barry Song (Xiaomi)
Users typically allocate memory in descending page orders, e.g.
8 → 4 → 0. Once an order-0 page is encountered, the subsequent
pages are likely to be order-0 as well, so we stop scanning
for compound pages at that point.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmalloc.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3c3b7217693a..242f4bc1379c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3577,6 +3577,12 @@ static int vmap_contig_pages_range(unsigned long addr, unsigned long end,
map_addr = addr;
idx = i;
}
+ /*
+ * Once small pages are encountered, the remaining pages
+ * are likely small as well
+ */
+ if (shift == PAGE_SHIFT)
+ break;
addr += 1UL << shift;
i += 1U << (shift - PAGE_SHIFT);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
2026-04-08 2:51 [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Barry Song (Xiaomi)
` (7 preceding siblings ...)
2026-04-08 2:51 ` [RFC PATCH 8/8] mm/vmalloc: Stop scanning for compound pages after encountering small pages in vmap Barry Song (Xiaomi)
@ 2026-04-08 9:14 ` Dev Jain
2026-04-08 10:51 ` Barry Song
8 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2026-04-08 9:14 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-mm, linux-arm-kernel, catalin.marinas,
will, akpm, urezki
Cc: linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 8:21 am, Barry Song (Xiaomi) wrote:
> This patchset accelerates ioremap, vmalloc, and vmap when the memory
> is physically fully or partially contiguous. Two techniques are used:
>
> 1. Avoid page table zigzag when setting PTEs/PMDs for multiple memory
> segments
> 2. Use batched mappings wherever possible in both vmalloc and ARM64
> layers
>
> Patches 1–2 extend ARM64 vmalloc CONT-PTE mapping to support multiple
> CONT-PTE regions instead of just one.
>
> Patches 3–4 extend vmap_small_pages_range_noflush() to support page
> shifts other than PAGE_SHIFT. This allows mapping multiple memory
> segments for vmalloc() without zigzagging page tables.
>
> Patches 5–8 add huge vmap support for contiguous pages. This not only
> improves performance but also enables PMD or CONT-PTE mapping for the
> vmapped area, reducing TLB pressure.
>
> Many thanks to Xueyuan Chen for his substantial testing efforts
> on RK3588 boards.
>
> On the RK3588 8-core ARM64 SoC, with tasks pinned to CPU2 and
> the performance CPUfreq policy enabled, Xueyuan’s tests report:
>
> * ioremap(1 MB): 1.2× faster
> * vmalloc(1 MB) mapping time (excluding allocation) with
> VM_ALLOW_HUGE_VMAP: 1.5× faster
> * vmap(): 5.6× faster when memory includes some order-8 pages,
> with no regression observed for order-0 pages
>
> Barry Song (Xiaomi) (8):
> arm64/hugetlb: Extend batching of multiple CONT_PTE in a single PTE
> setup
> arm64/vmalloc: Allow arch_vmap_pte_range_map_size to batch multiple
> CONT_PTE
> mm/vmalloc: Extend vmap_small_pages_range_noflush() to support larger
> page_shift sizes
> mm/vmalloc: Eliminate page table zigzag for huge vmalloc mappings
> mm/vmalloc: map contiguous pages in batches for vmap() if possible
> mm/vmalloc: align vm_area so vmap() can batch mappings
> mm/vmalloc: Coalesce same page_shift mappings in vmap to avoid pgtable
> zigzag
> mm/vmalloc: Stop scanning for compound pages after encountering small
> pages in vmap
>
> arch/arm64/include/asm/vmalloc.h | 6 +-
> arch/arm64/mm/hugetlbpage.c | 10 ++
> mm/vmalloc.c | 178 +++++++++++++++++++++++++------
> 3 files changed, 161 insertions(+), 33 deletions(-)
>
On Linux VM on Apple M3, running mm-selftests:
./run_vmtests.sh -t "hugetlb"
TAP version 13
# -----------------------
# running ./hugepage-mmap
# -----------------------
# TAP version 13
# 1..1
# # Returned address is 0xffffe7c00000
[ 30.884630] kernel BUG at mm/page_table_check.c:86!
[ 30.884701] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 30.886803] Modules linked in:
[ 30.887217] CPU: 3 UID: 0 PID: 1869 Comm: hugepage-mmap Not tainted 7.0.0-rc5+ #86 PREEMPT
[ 30.888218] Hardware name: linux,dummy-virt (DT)
[ 30.889413] pstate: a1400005 (NzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 30.889901] pc : page_table_check_clear.part.0+0x128/0x1a0
[ 30.890337] lr : page_table_check_clear.part.0+0x7c/0x1a0
[ 30.890714] sp : ffff800084da3ad0
[ 30.890946] x29: ffff800084da3ad0 x28: 0000000000000001 x27: 0010000000000001
[ 30.891434] x26: 0040000000000040 x25: ffffa06bb8fb9000 x24: 00000000ffffffff
[ 30.891932] x23: 0000000000000001 x22: 0000000000000000 x21: ffffa06bb8997810
[ 30.892514] x20: 0000000000113e39 x19: 0000000000113e38 x18: 0000000000000000
[ 30.893007] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 30.893500] x14: ffffa06bb7013780 x13: 0000fffff7f90fff x12: 0000000000000000
[ 30.893990] x11: 1fffe0001a1282c1 x10: ffff0000d094160c x9 : ffffa06bb568a858
[ 30.894479] x8 : ffff5f95c8474000 x7 : 0000000000000000 x6 : ffff00017fffc500
[ 30.894973] x5 : ffff000191208fc0 x4 : 0000000000000000 x3 : 0000000000004000
[ 30.895449] x2 : 0000000000000000 x1 : 00000000ffffffff x0 : ffff0000c071f1b8
[ 30.895875] Call trace:
[ 30.896027] page_table_check_clear.part.0+0x128/0x1a0 (P)
[ 30.896369] page_table_check_clear+0xc8/0x138
[ 30.896776] __page_table_check_ptes_set+0xe4/0x1e8
[ 30.897073] __set_ptes_anysz+0x2e4/0x308
[ 30.897327] set_huge_pte_at+0xec/0x210
[ 30.897561] hugetlb_no_page+0x1ec/0x8e0
[ 30.897807] hugetlb_fault+0x188/0x740
[ 30.898036] handle_mm_fault+0x294/0x2c0
[ 30.898283] do_page_fault+0x120/0x748
[ 30.898539] do_translation_fault+0x68/0x90
[ 30.898796] do_mem_abort+0x4c/0xa8
[ 30.899011] el0_da+0x2c/0x90
[ 30.899205] el0t_64_sync_handler+0xd0/0xe8
[ 30.899461] el0t_64_sync+0x198/0x1a0
[ 30.899688] Code: 91001021 b8f80022 51000441 36fffd41 (d4210000)
[ 30.900053] ---[ end trace 0000000000000000 ]---
The bug is at
BUG_ON(atomic_dec_return(&ptc->file_map_count) < 0);
My tree is mm-unstable, commit 3fa44141e0bb.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
2026-04-08 9:14 ` [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory Dev Jain
@ 2026-04-08 10:51 ` Barry Song
2026-04-08 10:55 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-04-08 10:51 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On Wed, Apr 8, 2026 at 5:14 PM Dev Jain <dev.jain@arm.com> wrote:
> On Linux VM on Apple M3, running mm-selftests:
Dev, thanks for your report. Sorry for the silly typo;
Xueyuan’s vmalloc/vmap tests don’t trigger that case yet.
It should be fixed by:
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index bf31c11ebd3b..25b9fce1ec6a 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -110,7 +110,7 @@ static inline int num_contig_ptes(unsigned long
size, size_t *pgsize)
contig_ptes = CONT_PTES;
break;
default:
- if (size < CONT_PMD_SIZE && size > 0 &&
+ if (size < PMD_SIZE && size > 0 &&
IS_ALIGNED(size, CONT_PTE_SIZE)) {
contig_ptes = size >> PAGE_SHIFT;
*pgsize = PAGE_SIZE;
@@ -365,7 +365,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int
shift, vm_flags_t flags)
case CONT_PTE_SIZE:
return pte_mkcont(entry);
default:
- if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
+ if (pagesize < PMD_SIZE && pagesize > 0 &&
IS_ALIGNED(pagesize, CONT_PTE_SIZE))
return pte_mkcont(entry);
Thanks
Barry
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/8] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
2026-04-08 10:51 ` Barry Song
@ 2026-04-08 10:55 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2026-04-08 10:55 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, linux-arm-kernel, catalin.marinas, will, akpm, urezki,
linux-kernel, anshuman.khandual, ryan.roberts, ajd, rppt, david,
Xueyuan.chen21
On 08/04/26 4:21 pm, Barry Song wrote:
> On Wed, Apr 8, 2026 at 5:14 PM Dev Jain <dev.jain@arm.com> wrote:
>> On Linux VM on Apple M3, running mm-selftests:
>
> Dev, thanks for your report. Sorry for the silly typo—
> Xueyuan’s vmalloc/vmap tests don’t trigger that case yet.
>
> it should be fixed by:
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index bf31c11ebd3b..25b9fce1ec6a 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -110,7 +110,7 @@ static inline int num_contig_ptes(unsigned long
> size, size_t *pgsize)
> contig_ptes = CONT_PTES;
> break;
> default:
> - if (size < CONT_PMD_SIZE && size > 0 &&
> + if (size < PMD_SIZE && size > 0 &&
> IS_ALIGNED(size, CONT_PTE_SIZE)) {
> contig_ptes = size >> PAGE_SHIFT;
> *pgsize = PAGE_SIZE;
> @@ -365,7 +365,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int
> shift, vm_flags_t flags)
> case CONT_PTE_SIZE:
> return pte_mkcont(entry);
> default:
> - if (pagesize < CONT_PMD_SIZE && pagesize > 0 &&
> + if (pagesize < PMD_SIZE && pagesize > 0 &&
> IS_ALIGNED(pagesize, CONT_PTE_SIZE))
> return pte_mkcont(entry);
Yeah, indeed the problem was that a PMD-sized chunk was being treated
as 512 PTEs rather than a single PMD. This fixes it.
^ permalink raw reply [flat|nested] 19+ messages in thread