* [PATCH v3 0/3] mm: Free contiguous order-0 pages efficiently
@ 2026-03-24 13:35 Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-24 13:35 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Muhammad Usama Anjum
Hi All,
A recent change to vmalloc caused some performance benchmark regressions (see
[1]). I'm attempting to fix that (and at the same time significantly improve
beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
At the same time I observed that free_contig_range() was essentially doing the
same thing as vfree(), so I've fixed it there too. While at it,
__free_contig_frozen_range() is optimized as well.
[1] https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com
v6.18 - before the regression-causing patch was added
mm-new - the current latest mm code
this series - mm-new with v2 of these patches applied
(>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement)
v6.18 vs mm-new
+-----------------+----------------------------------------------------------+-------------------+-------------+
| Benchmark | Result Class | v6.18 (base) | mm-new |
+=================+==========================================================+===================+=============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 653643.33 | (R) -50.92% |
| | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 366167.33 | (R) -11.96% |
| | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 489484.00 | (R) -35.21% |
| | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1011250.33 | (R) -36.45% |
| | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1086812.33 | (R) -31.83% |
| | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 657940.00 | (R) -38.62% |
| | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 765422.00 | (R) -24.84% |
| | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 2468585.00 | (R) -37.83% |
| | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 2815758.33 | (R) -26.32% |
| | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 4851969.00 | (R) -37.76% |
| | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 4496257.33 | (R) -31.15% |
| | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 570605.00 | -8.97% |
| | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 500866.00 | -5.88% |
| | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 499733.00 | -6.95% |
| | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 5266237.67 | (R) -40.19% |
| | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 490284.00 | -2.10% |
| | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 850986.33 | (R) -48.03% |
| | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 2712106.00 | (R) -40.48% |
| | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 111151.33 | 3.52% |
+-----------------+----------------------------------------------------------+-------------------+-------------+
v6.18 vs mm-new with patches
+-----------------+----------------------------------------------------------+-------------------+--------------+
| Benchmark | Result Class | v6.18 (base) | this series |
+=================+==========================================================+===================+==============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 653643.33 | -14.02% |
| | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 366167.33 | -7.23% |
| | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 489484.00 | -1.57% |
| | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1011250.33 | 1.57% |
| | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1086812.33 | (I) 15.75% |
| | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 657940.00 | (I) 9.05% |
| | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 765422.00 | (I) 38.45% |
| | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 2468585.00 | (I) 12.56% |
| | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 2815758.33 | (I) 38.61% |
| | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 4851969.00 | (I) 13.43% |
| | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 4496257.33 | (I) 49.21% |
| | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 570605.00 | -8.47% |
| | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 500866.00 | -8.17% |
| | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 499733.00 | -5.54% |
| | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 5266237.67 | (I) 4.63% |
| | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 490284.00 | 1.53% |
| | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 850986.33 | -0.00% |
| | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 2712106.00 | 1.22% |
| | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 111151.33 | (I) 4.98% |
+-----------------+----------------------------------------------------------+-------------------+--------------+
mm-new vs vmalloc_2 results are in patch 2/3.
So this series mitigates the regression on average; versus the v6.18 baseline
the results range from -14% to +49%.
Thanks,
Muhammad Usama Anjum
---
Changes since v2: (summary)
- Patch 1 and 3: Rework the loop to check for memory sections
- Patch 2: Rework by removing the BUG_ON() and adding a free_pages_bulk() helper
Changes since v1:
- Update description
- Rebase on mm-new and rerun benchmarks/tests
- Patch 1: move FPI_PREPARED check and add todo
- Patch 2: Rework to cater for newer changes in vfree()
- New Patch 3: optimizes __free_contig_frozen_range()
Muhammad Usama Anjum (1):
mm/page_alloc: Optimize __free_contig_frozen_range()
Ryan Roberts (2):
mm/page_alloc: Optimize free_contig_range()
vmalloc: Optimize vfree
include/linux/gfp.h | 4 ++
mm/page_alloc.c | 146 ++++++++++++++++++++++++++++++++++++++++++--
mm/vmalloc.c | 16 ++---
3 files changed, 151 insertions(+), 15 deletions(-)
--
2.47.3
* [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 13:35 [PATCH v3 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
@ 2026-03-24 13:35 ` Muhammad Usama Anjum
2026-03-24 14:46 ` Zi Yan
2026-03-24 20:56 ` David Hildenbrand (Arm)
2026-03-24 13:35 ` [PATCH v3 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2 siblings, 2 replies; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-24 13:35 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Ryan Roberts, usama.anjum
From: Ryan Roberts <ryan.roberts@arm.com>
Decompose the range of order-0 pages to be freed into the set of largest
possible power-of-2 size and aligned chunks and free them to the pcp or
buddy. This improves on the previous approach which freed each order-0
page individually in a loop. Testing shows performance to be improved by
more than 10x in some cases.
Since each page is order-0, we must decrement each page's reference
count individually and only consider the page for freeing as part of a
high order chunk if the reference count goes to zero. Additionally
free_pages_prepare() must be called for each individual order-0 page
too, so that the struct page state and global accounting state can be
appropriately managed. But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.
This significantly speeds up the free operation but also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.
vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with
contiguous order-0 pages. These can now be freed much more efficiently.
The execution time of the following function was measured on a server-class
arm64 machine:
static int page_alloc_high_order_test(void)
{
unsigned int order = HPAGE_PMD_ORDER;
struct page *page;
int i;
for (i = 0; i < 100000; i++) {
page = alloc_pages(GFP_KERNEL, order);
if (!page)
return -1;
split_page(page, order);
free_contig_range(page_to_pfn(page), 1UL << order);
}
return 0;
}
Execution time before: 4097358 usec
Execution time after: 729831 usec
Perf trace before:
99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
|
---kthread
0xffffb33c12a26af8
|
|--98.13%--0xffffb33c12a26060
| |
| |--97.37%--free_contig_range
| | |
| | |--94.93%--___free_pages
| | | |
| | | |--55.42%--__free_frozen_pages
| | | | |
| | | | --43.20%--free_frozen_page_commit
| | | | |
| | | | --35.37%--_raw_spin_unlock_irqrestore
| | | |
| | | |--11.53%--_raw_spin_trylock
| | | |
| | | |--8.19%--__preempt_count_dec_and_test
| | | |
| | | |--5.64%--_raw_spin_unlock
| | | |
| | | |--2.37%--__get_pfnblock_flags_mask.isra.0
| | | |
| | | --1.07%--free_frozen_page_commit
| | |
| | --1.54%--__free_frozen_pages
| |
| --0.77%--___free_pages
|
--0.98%--0xffffb33c12a26078
alloc_pages_noprof
Perf trace after:
8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
|
|--5.52%--__free_contig_range
| |
| |--5.00%--free_prepared_contig_range
| | |
| | |--1.43%--__free_frozen_pages
| | | |
| | | --0.51%--free_frozen_page_commit
| | |
| | |--1.08%--_raw_spin_trylock
| | |
| | --0.89%--_raw_spin_unlock
| |
| --0.52%--free_pages_prepare
|
--2.90%--ret_from_fork
kthread
0xffffae1c12abeaf8
0xffffae1c12abe7a0
|
--2.69%--vfree
__free_contig_range
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v2:
- Handle different possible section boundaries in __free_contig_range()
- Drop the TODO
- Remove return value from __free_contig_range()
- Remove non-functional change from __free_pages_ok()
Changes since v1:
- Rebase on mm-new
- Move FPI_PREPARED check inside __free_pages_prepare() now that
fpi_flags are already being passed.
- Add todo (Zi Yan)
- Rerun benchmarks
- Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
- Rework order calculation in free_prepared_contig_range() and use
MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
be up to internal __free_frozen_pages() how it frees them
Made-with: Cursor
---
include/linux/gfp.h | 2 +
mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 97 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f82d74a77cad8..7c1f9da7c8e56 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
#endif
+void __free_contig_range(unsigned long pfn, unsigned long nr_pages);
+
DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
#endif /* __LINUX_GFP_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75ee81445640b..eedce9a30eb7e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
/* Free the page without taking locks. Rely on trylock only. */
#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
+/* free_pages_prepare() has already been called for page(s) being freed. */
+#define FPI_PREPARED ((__force fpi_t)BIT(3))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1310,6 +1313,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
bool compound = PageCompound(page);
struct folio *folio = page_folio(page);
+ if (fpi_flags & FPI_PREPARED)
+ return true;
+
VM_BUG_ON_PAGE(PageTail(page), page);
trace_mm_page_free(page, order);
@@ -6784,6 +6790,94 @@ void __init page_alloc_sysctl_init(void)
register_sysctl_init("vm", page_alloc_sysctl_table);
}
+static void free_prepared_contig_range(struct page *page,
+ unsigned long nr_pages)
+{
+ while (nr_pages) {
+ unsigned int order;
+ unsigned long pfn;
+
+ pfn = page_to_pfn(page);
+ /* We are limited by the largest buddy order. */
+ order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
+ /* Don't exceed the number of pages to free. */
+ order = min_t(unsigned int, order, ilog2(nr_pages));
+ order = min_t(unsigned int, order, MAX_PAGE_ORDER);
+
+ /*
+ * Free the chunk as a single block. Our caller has already
+ * called free_pages_prepare() for each order-0 page.
+ */
+ __free_frozen_pages(page, order, FPI_PREPARED);
+
+ page += 1UL << order;
+ nr_pages -= 1UL << order;
+ }
+}
+
+/**
+ * __free_contig_range - Free contiguous range of order-0 pages.
+ * @pfn: Page frame number of the first page in the range.
+ * @nr_pages: Number of pages to free.
+ *
+ * For each order-0 struct page in the physically contiguous range, put a
+ * reference. Free any page who's reference count falls to zero. The
+ * implementation is functionally equivalent to, but significantly faster than
+ * calling __free_page() for each struct page in a loop.
+ *
+ * Memory allocated with alloc_pages(order>=1) then subsequently split to
+ * order-0 with split_page() is an example of appropriate contiguous pages that
+ * can be freed with this API.
+ *
+ * Context: May be called in interrupt context or while holding a normal
+ * spinlock, but not in NMI context or while holding a raw spinlock.
+ */
+void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *start = NULL;
+ unsigned long start_sec;
+ unsigned long i;
+ bool can_free;
+
+ /*
+ * Chunk the range into contiguous runs of pages for which the refcount
+ * went to zero and for which free_pages_prepare() succeeded. If
+ * free_pages_prepare() fails we consider the page to have been freed;
+ * deliberately leak it.
+ *
+ * Code assumes contiguous PFNs have contiguous struct pages, but not
+ * vice versa. Break batches at section boundaries since pages from
+ * different sections must not be coalesced into a single high-order
+ * block.
+ */
+ for (i = 0; i < nr_pages; i++, page++) {
+ VM_WARN_ON_ONCE(PageHead(page));
+ VM_WARN_ON_ONCE(PageTail(page));
+
+ can_free = put_page_testzero(page);
+ if (can_free && !free_pages_prepare(page, 0))
+ can_free = false;
+
+ if (can_free && start &&
+ memdesc_section(page->flags) != start_sec) {
+ free_prepared_contig_range(start, page - start);
+ start = page;
+ start_sec = memdesc_section(page->flags);
+ } else if (!can_free && start) {
+ free_prepared_contig_range(start, page - start);
+ start = NULL;
+ } else if (can_free && !start) {
+ start = page;
+ start_sec = memdesc_section(page->flags);
+ }
+ }
+
+ if (start)
+ free_prepared_contig_range(start, page - start);
+}
+EXPORT_SYMBOL(__free_contig_range);
+
#ifdef CONFIG_CONTIG_ALLOC
/* Usage: See admin-guide/dynamic-debug-howto.rst */
static void alloc_contig_dump_pages(struct list_head *page_list)
@@ -7330,8 +7424,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
return;
- for (; nr_pages--; pfn++)
- __free_page(pfn_to_page(pfn));
+ __free_contig_range(pfn, nr_pages);
}
EXPORT_SYMBOL(free_contig_range);
#endif /* CONFIG_CONTIG_ALLOC */
--
2.47.3
* [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-24 13:35 [PATCH v3 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
@ 2026-03-24 13:35 ` Muhammad Usama Anjum
2026-03-24 14:55 ` Zi Yan
2026-03-25 10:05 ` David Hildenbrand (Arm)
2026-03-24 13:35 ` [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2 siblings, 2 replies; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-24 13:35 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Ryan Roberts, usama.anjum
From: Ryan Roberts <ryan.roberts@arm.com>
Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
must immediately split_page() to order-0 so that it remains compatible
with users that want to access the underlying struct page.
Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") recently made it much more likely for vmalloc to allocate
high order pages which are subsequently split to order-0.
Unfortunately this had the side effect of causing performance
regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
benchmarks). See the Closes: tag. This happens because the high order
pages must come from the buddy allocator, but since they are split to
order-0, they are freed back to the order-0 pcp list. Previously the
allocations were order-0, so pages were recycled from the pcp.
It would be preferable if, when vmalloc allocates an (e.g.) order-3 page,
it also freed that order-3 block to the order-3 pcp; then the regression
would be removed.
So let's do exactly that; use the new __free_contig_range() API to
batch-free contiguous ranges of pfns. This not only removes the
regression, but significantly improves performance of vfree beyond the
baseline.
A selection of test_vmalloc benchmark results from an arm64 server-class
system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
large order pages from buddy allocator") landed in v6.19-rc1, where we
see the regressions; with this change performance is much better. (>0
is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):
+-----------------+----------------------------------------------------------+-------------------+--------------------+
| Benchmark | Result Class | mm-new | this series |
+=================+==========================================================+===================+====================+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
| | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
| | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
| | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
| | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
| | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
| | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
| | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
| | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
| | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
| | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
| | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
| | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
| | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
| | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
| | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
| | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
| | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
| | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
+-----------------+----------------------------------------------------------+-------------------+--------------------+
Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v2:
- Remove the BUG_ON() in favour of a simpler implementation, as it has
never been observed to trigger
- Move the free loop to separate function, free_pages_bulk()
- Update stats, lruvec_stat in separate loop
Changes since v1:
- Rebase on mm-new
- Rerun benchmarks
Made-with: Cursor
---
include/linux/gfp.h | 2 ++
mm/page_alloc.c | 23 +++++++++++++++++++++++
mm/vmalloc.c | 16 +++++-----------
3 files changed, 30 insertions(+), 11 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 7c1f9da7c8e56..71f9097ab99a0 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
struct page **page_array);
#define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
+void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
+
unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eedce9a30eb7e..250cc07e547b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
}
EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
+void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
+{
+ unsigned long start_pfn = 0, pfn;
+ unsigned long i, nr_contig = 0;
+
+ for (i = 0; i < nr_pages; i++) {
+ pfn = page_to_pfn(page_array[i]);
+ if (!nr_contig) {
+ start_pfn = pfn;
+ nr_contig = 1;
+ } else if (start_pfn + nr_contig != pfn) {
+ __free_contig_range(start_pfn, nr_contig);
+ start_pfn = pfn;
+ nr_contig = 1;
+ cond_resched();
+ } else {
+ nr_contig++;
+ }
+ }
+ if (nr_contig)
+ __free_contig_range(start_pfn, nr_contig);
+}
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c607307c657a6..e9b3d6451e48b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3459,19 +3459,13 @@ void vfree(const void *addr)
if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
vm_reset_perms(vm);
- for (i = 0; i < vm->nr_pages; i++) {
- struct page *page = vm->pages[i];
- BUG_ON(!page);
- /*
- * High-order allocs for huge vmallocs are split, so
- * can be freed as an array of order-0 allocations
- */
- if (!(vm->flags & VM_MAP_PUT_PAGES))
- mod_lruvec_page_state(page, NR_VMALLOC, -1);
- __free_page(page);
- cond_resched();
+ if (!(vm->flags & VM_MAP_PUT_PAGES)) {
+ for (i = 0; i < vm->nr_pages; i++)
+ mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
}
+ free_pages_bulk(vm->pages, vm->nr_pages);
+
kvfree(vm->pages);
kfree(vm);
}
--
2.47.3
* [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-24 13:35 [PATCH v3 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
@ 2026-03-24 13:35 ` Muhammad Usama Anjum
2026-03-24 15:06 ` Zi Yan
2 siblings, 1 reply; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-24 13:35 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Muhammad Usama Anjum
Apply the same batch-freeing optimization from free_contig_range() to the
frozen page path. The previous __free_contig_frozen_range() freed each
order-0 page individually via free_frozen_pages(), which is slow for the
same reason the old free_contig_range() was: each page goes to the
order-0 pcp list rather than being coalesced into higher-order blocks.
Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
each order-0 page, then batch the prepared pages into the largest
possible power-of-2 aligned chunks via free_prepared_contig_range().
If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
deliberately not freed; it should not be returned to the allocator.
I've tested CMA through debugfs. The test allocates 16384 pages per
allocation, over several iterations. There is a 3.5x improvement.
Before: 1406 usec per iteration
After: 402 usec per iteration
Before:
70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
|
|--70.20%--free_contig_frozen_range
| |
| |--46.41%--__free_frozen_pages
| | |
| | --36.18%--free_frozen_page_commit
| | |
| | --29.63%--_raw_spin_unlock_irqrestore
| |
| |--8.76%--_raw_spin_trylock
| |
| |--7.03%--__preempt_count_dec_and_test
| |
| |--4.57%--_raw_spin_unlock
| |
| |--1.96%--__get_pfnblock_flags_mask.isra.0
| |
| --1.15%--free_frozen_page_commit
|
--0.69%--el0t_64_sync
After:
23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
|
---free_contig_frozen_range
|
|--20.45%--__free_contig_frozen_range
| |
| |--17.77%--free_pages_prepare
| |
| --0.72%--free_prepared_contig_range
| |
| --0.55%--__free_frozen_pages
|
--3.12%--free_pages_prepare
Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v2:
- Rework the loop to check for memory sections just like __free_contig_range()
- Didn't add reviewed-by tags because of rework
---
mm/page_alloc.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 250cc07e547b8..26eac35ef73bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7038,8 +7038,30 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
{
- for (; nr_pages--; pfn++)
- free_frozen_pages(pfn_to_page(pfn), 0);
+ struct page *page = pfn_to_page(pfn);
+ struct page *start = NULL;
+ unsigned long start_sec;
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++, page++) {
+ if (!free_pages_prepare(page, 0)) {
+ if (start) {
+ free_prepared_contig_range(start, page - start);
+ start = NULL;
+ }
+ } else if (start &&
+ memdesc_section(page->flags) != start_sec) {
+ free_prepared_contig_range(start, page - start);
+ start = page;
+ start_sec = memdesc_section(page->flags);
+ } else if (!start) {
+ start = page;
+ start_sec = memdesc_section(page->flags);
+ }
+ }
+
+ if (start)
+ free_prepared_contig_range(start, page - start);
}
/**
--
2.47.3
* Re: [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 13:35 ` [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
@ 2026-03-24 14:46 ` Zi Yan
2026-03-24 15:22 ` David Hildenbrand
2026-03-24 20:56 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 25+ messages in thread
From: Zi Yan @ 2026-03-24 14:46 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
>
> Decompose the range of order-0 pages to be freed into the set of largest
> possible power-of-2 size and aligned chunks and free them to the pcp or
> buddy. This improves on the previous approach which freed each order-0
> page individually in a loop. Testing shows performance to be improved by
> more than 10x in some cases.
>
> Since each page is order-0, we must decrement each page's reference
> count individually and only consider the page for freeing as part of a
> high order chunk if the reference count goes to zero. Additionally
> free_pages_prepare() must be called for each individual order-0 page
> too, so that the struct page state and global accounting state can be
> appropriately managed. But once this is done, the resulting high order
> chunks can be freed as a unit to the pcp or buddy.
>
> This significantly speeds up the free operation but also has the side
> benefit that high order blocks are added to the pcp instead of each page
> ending up on the pcp order-0 list; memory remains more readily available
> in high orders.
>
> vmalloc will shortly become a user of this new optimized
> free_contig_range() since it aggressively allocates high order
> non-compound pages, but then calls split_page() to end up with
> contiguous order-0 pages. These can now be freed much more efficiently.
>
> The execution time of the following function was measured on a server-
> class arm64 machine:
>
> static int page_alloc_high_order_test(void)
> {
> unsigned int order = HPAGE_PMD_ORDER;
> struct page *page;
> int i;
>
> for (i = 0; i < 100000; i++) {
> page = alloc_pages(GFP_KERNEL, order);
> if (!page)
> return -1;
> split_page(page, order);
> free_contig_range(page_to_pfn(page), 1UL << order);
> }
>
> return 0;
> }
>
> Execution time before: 4097358 usec
> Execution time after: 729831 usec
>
> Perf trace before:
>
> 99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
> |
> ---kthread
> 0xffffb33c12a26af8
> |
> |--98.13%--0xffffb33c12a26060
> | |
> | |--97.37%--free_contig_range
> | | |
> | | |--94.93%--___free_pages
> | | | |
> | | | |--55.42%--__free_frozen_pages
> | | | | |
> | | | | --43.20%--free_frozen_page_commit
> | | | | |
> | | | | --35.37%--_raw_spin_unlock_irqrestore
> | | | |
> | | | |--11.53%--_raw_spin_trylock
> | | | |
> | | | |--8.19%--__preempt_count_dec_and_test
> | | | |
> | | | |--5.64%--_raw_spin_unlock
> | | | |
> | | | |--2.37%--__get_pfnblock_flags_mask.isra.0
> | | | |
> | | | --1.07%--free_frozen_page_commit
> | | |
> | | --1.54%--__free_frozen_pages
> | |
> | --0.77%--___free_pages
> |
> --0.98%--0xffffb33c12a26078
> alloc_pages_noprof
>
> Perf trace after:
>
> 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
> |
> |--5.52%--__free_contig_range
> | |
> | |--5.00%--free_prepared_contig_range
> | | |
> | | |--1.43%--__free_frozen_pages
> | | | |
> | | | --0.51%--free_frozen_page_commit
> | | |
> | | |--1.08%--_raw_spin_trylock
> | | |
> | | --0.89%--_raw_spin_unlock
> | |
> | --0.52%--free_pages_prepare
> |
> --2.90%--ret_from_fork
> kthread
> 0xffffae1c12abeaf8
> 0xffffae1c12abe7a0
> |
> --2.69%--vfree
> __free_contig_range
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v2:
> - Handle different possible section boundaries in __free_contig_range()
> - Drop the TODO
> - Remove return value from __free_contig_range()
> - Remove non-functional change from __free_pages_ok()
>
> Changes since v1:
> - Rebase on mm-new
> - Move FPI_PREPARED check inside __free_pages_prepare() now that
> fpi_flags are already being passed.
> - Add todo (Zi Yan)
> - Rerun benchmarks
> - Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
> - Rework order calculation in free_prepared_contig_range() and use
> MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
> be up to internal __free_frozen_pages() how it frees them
>
> Made-with: Cursor
> ---
> include/linux/gfp.h | 2 +
> mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 97 insertions(+), 2 deletions(-)
>
<snip>
> +
> +/**
> + * __free_contig_range - Free contiguous range of order-0 pages.
> + * @pfn: Page frame number of the first page in the range.
> + * @nr_pages: Number of pages to free.
> + *
> + * For each order-0 struct page in the physically contiguous range, put a
> + * reference. Free any page whose reference count falls to zero. The
> + * implementation is functionally equivalent to, but significantly faster than
> + * calling __free_page() for each struct page in a loop.
> + *
> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
> + * order-0 with split_page() is an example of appropriate contiguous pages that
> + * can be freed with this API.
> + *
> + * Context: May be called in interrupt context or while holding a normal
> + * spinlock, but not in NMI context or while holding a raw spinlock.
> + */
> +void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> +{
> + struct page *page = pfn_to_page(pfn);
> + struct page *start = NULL;
> + unsigned long start_sec;
> + unsigned long i;
> + bool can_free;
> +
> + /*
> + * Chunk the range into contiguous runs of pages for which the refcount
> + * went to zero and for which free_pages_prepare() succeeded. If
> + * free_pages_prepare() fails, we consider the page to have been freed
> + * and deliberately leak it.
> + *
> + * Code assumes contiguous PFNs have contiguous struct pages, but not
> + * vice versa. Break batches at section boundaries since pages from
> + * different sections must not be coalesced into a single high-order
> + * block.
> + */
> + for (i = 0; i < nr_pages; i++, page++) {
> + VM_WARN_ON_ONCE(PageHead(page));
> + VM_WARN_ON_ONCE(PageTail(page));
> +
> + can_free = put_page_testzero(page);
> + if (can_free && !free_pages_prepare(page, 0))
> + can_free = false;
> +
> + if (can_free && start &&
> + memdesc_section(page->flags) != start_sec) {
> + free_prepared_contig_range(start, page - start);
> + start = page;
> + start_sec = memdesc_section(page->flags);
> + } else if (!can_free && start) {
> + free_prepared_contig_range(start, page - start);
> + start = NULL;
> + } else if (can_free && !start) {
> + start = page;
> + start_sec = memdesc_section(page->flags);
> + }
> + }
It can be simplified to:
for (i = 0; i < nr_pages; i++, page++) {
VM_WARN_ON_ONCE(PageHead(page));
VM_WARN_ON_ONCE(PageTail(page));
can_free = put_page_testzero(page) && free_pages_prepare(page, 0);
if (!can_free) {
if (start) {
free_prepared_contig_range(start, page - start);
start = NULL;
}
continue;
}
if (start && memdesc_section(page->flags) != start_sec) {
free_prepared_contig_range(start, page - start);
start = page;
start_sec = memdesc_section(page->flags);
} else if (!start) {
start = page;
start_sec = memdesc_section(page->flags);
}
}
BTW, memdesc_section() returns 0 for !SECTION_IN_PAGE_FLAGS.
Is pfn_to_section_nr() more robust?
> +
> + if (start)
> + free_prepared_contig_range(start, page - start);
> +}
> +EXPORT_SYMBOL(__free_contig_range);
> +
> #ifdef CONFIG_CONTIG_ALLOC
> /* Usage: See admin-guide/dynamic-debug-howto.rst */
> static void alloc_contig_dump_pages(struct list_head *page_list)
> @@ -7330,8 +7424,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
> if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
> return;
>
> - for (; nr_pages--; pfn++)
> - __free_page(pfn_to_page(pfn));
> + __free_contig_range(pfn, nr_pages);
> }
> EXPORT_SYMBOL(free_contig_range);
> #endif /* CONFIG_CONTIG_ALLOC */
> --
> 2.47.3
Best Regards,
Yan, Zi
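[Editorial aside: the run-batching logic discussed above can be modelled in
plain userspace C. This is only a sketch with mock data — can_free[], the toy
4-page section() helper, and flush() (standing in for
free_prepared_contig_range()) are all inventions for illustration, not kernel
code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock state: page i is freeable iff can_free[i]; section() mimics
 * memdesc_section() with toy 4-page "sections". */
#define NPAGES 8
static bool can_free[NPAGES] = { true, true, false, true, true, true, true, true };
static unsigned long section(size_t i) { return i / 4; }

static size_t nruns;
static size_t run_start[NPAGES], run_len[NPAGES];

/* Stand-in for free_prepared_contig_range(): just record the run. */
static void flush(size_t start, size_t len)
{
	run_start[nruns] = start;
	run_len[nruns] = len;
	nruns++;
}

/* Mirrors the simplified loop: break runs at unfreeable pages and at
 * section boundaries, flushing the trailing run at the end. */
static void chunk_range(void)
{
	bool have_start = false;
	size_t start = 0;
	unsigned long start_sec = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		if (!can_free[i]) {
			if (have_start) {
				flush(start, i - start);
				have_start = false;
			}
			continue;
		}
		if (have_start && section(i) != start_sec) {
			flush(start, i - start);
			start = i;
			start_sec = section(i);
		} else if (!have_start) {
			have_start = true;
			start = i;
			start_sec = section(i);
		}
	}
	if (have_start)
		flush(start, NPAGES - start);
}
```

With the mock data above, pages 0-1 form one run, the unfreeable page 2 breaks
the batch, page 3 is a run of one, and the section boundary at page 4 starts a
final run of four.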
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-24 13:35 ` [PATCH v3 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
@ 2026-03-24 14:55 ` Zi Yan
2026-03-25 8:56 ` Uladzislau Rezki
2026-03-25 14:34 ` Usama Anjum
2026-03-25 10:05 ` David Hildenbrand (Arm)
1 sibling, 2 replies; 25+ messages in thread
From: Zi Yan @ 2026-03-24 14:55 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
>
> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
> must immediately split_page() to order-0 so that it remains compatible
> with users that want to access the underlying struct page.
> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
> allocator") recently made it much more likely for vmalloc to allocate
> high order pages which are subsequently split to order-0.
>
> Unfortunately this had the side effect of causing performance
> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
> benchmarks); see the Closes: tag. This happens because the high order
> pages must be allocated from the buddy, but since they are split to
> order-0, they are freed back to the order-0 pcp list. Previously the
> allocations were order-0, so the pages were recycled through the pcp.
>
> It would be preferable if, when vmalloc allocates an (e.g.) order-3
> page, it also freed that order-3 page to the order-3 pcp; then the
> regression would be removed.
>
> So let's do exactly that; use the new __free_contig_range() API to
> batch-free contiguous ranges of pfns. This not only removes the
> regression, but significantly improves performance of vfree beyond the
> baseline.
>
> A selection of test_vmalloc benchmarks running on arm64 server class
> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
> large order pages from buddy allocator") was added in v6.19-rc1 where we
> see regressions. Then with this change performance is much better. (>0
> is faster, <0 is slower, (R)/(I) = statistically significant
> Regression/Improvement):
>
> +-----------------+----------------------------------------------------------+-------------------+--------------------+
> | Benchmark | Result Class | mm-new | this series |
> +=================+==========================================================+===================+====================+
> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>
> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v2:
> - Remove BUG_ON in favour of a simpler implementation, as it has never
>   been observed to trigger
> - Move the free loop to separate function, free_pages_bulk()
> - Update stats, lruvec_stat in separate loop
>
> Changes since v1:
> - Rebase on mm-new
> - Rerun benchmarks
>
> Made-with: Cursor
> ---
> include/linux/gfp.h | 2 ++
> mm/page_alloc.c | 23 +++++++++++++++++++++++
> mm/vmalloc.c | 16 +++++-----------
> 3 files changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 7c1f9da7c8e56..71f9097ab99a0 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> struct page **page_array);
> #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
>
> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
> +
> unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
> unsigned long nr_pages,
> struct page **page_array);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eedce9a30eb7e..250cc07e547b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> }
> EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>
> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
> +{
> + unsigned long start_pfn = 0, pfn;
> + unsigned long i, nr_contig = 0;
> +
> + for (i = 0; i < nr_pages; i++) {
> + pfn = page_to_pfn(page_array[i]);
> + if (!nr_contig) {
> + start_pfn = pfn;
> + nr_contig = 1;
> + } else if (start_pfn + nr_contig != pfn) {
> + __free_contig_range(start_pfn, nr_contig);
> + start_pfn = pfn;
> + nr_contig = 1;
> + cond_resched();
> + } else {
> + nr_contig++;
> + }
> + }
> + if (nr_contig)
> + __free_contig_range(start_pfn, nr_contig);
> +}
free_pages_bulk() assumes the pages in page_array are sorted in ascending
PFN order. I think it is worth documenting this, since without that
ordering it degrades back to the original per-page implementation.
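[Editorial aside: the ordering point can be illustrated with a small userspace
model of the run-batching in free_pages_bulk(). This is mock code, not the
kernel implementation; emit() stands in for __free_contig_range():]

```c
#include <assert.h>
#include <stddef.h>

static size_t nruns;
static unsigned long run_pfn[16], run_len[16];

/* Stand-in for __free_contig_range(): just record the batch. */
static void emit(unsigned long start_pfn, unsigned long nr)
{
	run_pfn[nruns] = start_pfn;
	run_len[nruns] = nr;
	nruns++;
}

/* Walk a PFN array and batch maximal contiguous ascending runs,
 * mirroring the loop in free_pages_bulk(). */
static void bulk_free(const unsigned long *pfns, size_t nr)
{
	unsigned long start_pfn = 0, nr_contig = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!nr_contig) {
			start_pfn = pfns[i];
			nr_contig = 1;
		} else if (start_pfn + nr_contig != pfns[i]) {
			emit(start_pfn, nr_contig);
			start_pfn = pfns[i];
			nr_contig = 1;
		} else {
			nr_contig++;
		}
	}
	if (nr_contig)
		emit(start_pfn, nr_contig);
}
```

A sorted array {10,11,12,20,21,30} yields three batches of sizes 3, 2 and 1;
the same PFNs in descending order yield three batches of one page each, i.e.
the pre-optimization behaviour.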
> +
> /*
> * This is the 'heart' of the zoned buddy allocator.
> */
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c607307c657a6..e9b3d6451e48b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3459,19 +3459,13 @@ void vfree(const void *addr)
>
> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> vm_reset_perms(vm);
> - for (i = 0; i < vm->nr_pages; i++) {
> - struct page *page = vm->pages[i];
>
> - BUG_ON(!page);
> - /*
> - * High-order allocs for huge vmallocs are split, so
> - * can be freed as an array of order-0 allocations
> - */
> - if (!(vm->flags & VM_MAP_PUT_PAGES))
> - mod_lruvec_page_state(page, NR_VMALLOC, -1);
> - __free_page(page);
> - cond_resched();
> + if (!(vm->flags & VM_MAP_PUT_PAGES)) {
> + for (i = 0; i < vm->nr_pages; i++)
> + mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
> }
> + free_pages_bulk(vm->pages, vm->nr_pages);
> +
The stats are now updated before any page is freed. It is better to
mention this in the commit message.
> kvfree(vm->pages);
> kfree(vm);
> }
> --
> 2.47.3
Otherwise, LGTM.
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-24 13:35 ` [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
@ 2026-03-24 15:06 ` Zi Yan
2026-03-25 10:14 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 25+ messages in thread
From: Zi Yan @ 2026-03-24 15:06 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
>
> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
> each order-0 page, then batch the prepared pages into the largest
> possible power-of-2 aligned chunks via free_prepared_contig_range().
> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
> deliberately not freed; it should not be returned to the allocator.
>
> I've tested CMA through debugfs. The test allocates 16384 pages per
> allocation for several iterations. There is a 3.5x improvement.
>
> Before: 1406 usec per iteration
> After: 402 usec per iteration
>
> Before:
>
> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> |--70.20%--free_contig_frozen_range
> | |
> | |--46.41%--__free_frozen_pages
> | | |
> | | --36.18%--free_frozen_page_commit
> | | |
> | | --29.63%--_raw_spin_unlock_irqrestore
> | |
> | |--8.76%--_raw_spin_trylock
> | |
> | |--7.03%--__preempt_count_dec_and_test
> | |
> | |--4.57%--_raw_spin_unlock
> | |
> | |--1.96%--__get_pfnblock_flags_mask.isra.0
> | |
> | --1.15%--free_frozen_page_commit
> |
> --0.69%--el0t_64_sync
>
> After:
>
> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> ---free_contig_frozen_range
> |
> |--20.45%--__free_contig_frozen_range
> | |
> | |--17.77%--free_pages_prepare
> | |
> | --0.72%--free_prepared_contig_range
> | |
> | --0.55%--__free_frozen_pages
> |
> --3.12%--free_pages_prepare
>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v2:
> - Rework the loop to check for memory sections just like __free_contig_range()
> - Didn't add reviewed-by tags because of rework
> ---
> mm/page_alloc.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 250cc07e547b8..26eac35ef73bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7038,8 +7038,30 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>
> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
> {
> - for (; nr_pages--; pfn++)
> - free_frozen_pages(pfn_to_page(pfn), 0);
> + struct page *page = pfn_to_page(pfn);
> + struct page *start = NULL;
> + unsigned long start_sec;
> + unsigned long i;
> +
> + for (i = 0; i < nr_pages; i++, page++) {
> + if (!free_pages_prepare(page, 0)) {
> + if (start) {
> + free_prepared_contig_range(start, page - start);
> + start = NULL;
> + }
> + } else if (start &&
> + memdesc_section(page->flags) != start_sec) {
> + free_prepared_contig_range(start, page - start);
> + start = page;
> + start_sec = memdesc_section(page->flags);
> + } else if (!start) {
> + start = page;
> + start_sec = memdesc_section(page->flags);
> + }
> + }
> +
> + if (start)
> + free_prepared_contig_range(start, page - start);
> }
This looks almost the same as __free_contig_range().
Two approaches to deduplicate the code:
1. __free_contig_range() first does put_page_testzero()
on all pages and calls __free_contig_frozen_range()
on the range; __free_contig_frozen_range() would need
to skip non-frozen pages. It is not ideal.
2. add a helper function
__free_contig_range_common(unsigned long pfn,
unsigned long nr_pages, bool is_page_frozen),
and
a. call __free_contig_range_common(..., /*is_page_frozen=*/ false)
in __free_contig_range(),
b. __free_contig_range_common(..., /*is_page_frozen=*/ true)
in __free_contig_frozen_range().
But I would like to hear others’ opinions.
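[Editorial aside: option 2 can be sketched in userspace with mock state.
refcnt[], prepare_ok[] and freed[] are illustrative inventions; the run
batching into free_prepared_contig_range() is elided here:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock per-page state: page 1 holds an extra reference, page 2 fails
 * free_pages_prepare() (e.g. HWPoison) and is deliberately leaked. */
#define NPAGES 4
static int refcnt[NPAGES] = { 1, 2, 1, 1 };
static bool prepare_ok[NPAGES] = { true, true, false, true };
static int freed[NPAGES];

/* Common walker taking an is_page_frozen flag: frozen pages skip the
 * put_page_testzero() step entirely (no refcount decrement). */
static void free_contig_range_common(bool is_page_frozen)
{
	for (size_t i = 0; i < NPAGES; i++) {
		bool can_free = is_page_frozen || --refcnt[i] == 0;

		if (can_free && prepare_ok[i])
			freed[i] = 1; /* would be batched into runs here */
	}
}
```

__free_contig_range() would call this with is_page_frozen=false and
__free_contig_frozen_range() with true, as in the suggestion above.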
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 25+ messages in thread
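[Editorial aside: free_prepared_contig_range() itself is not shown in this
excerpt. Going by the commit messages (largest possible power-of-2 aligned
chunks, capped at MAX_PAGE_ORDER per the v2 changelog), the greedy
decomposition presumably looks roughly like this userspace sketch; the names,
the order-10 cap, and the 64-bit long assumption are illustrative:]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGE_ORDER 10 /* assumed cap, as on typical configs */

/* GCC/Clang builtins; assumes 64-bit unsigned long. */
static unsigned int ctz(unsigned long x) { return (unsigned int)__builtin_ctzl(x); }
static unsigned int ilog2(unsigned long x) { return 63u - (unsigned int)__builtin_clzl(x); }

/* Greedily emit the largest aligned power-of-2 chunk at each step:
 * the order is bounded by the PFN's alignment, the remaining size,
 * and MAX_PAGE_ORDER. Records (pfn, order) pairs, returns the count. */
static size_t decompose(unsigned long pfn, unsigned long nr,
			unsigned long *chunk_pfn, unsigned int *chunk_order)
{
	size_t n = 0;

	while (nr) {
		unsigned int order = pfn ? ctz(pfn) : MAX_PAGE_ORDER;

		if (order > ilog2(nr))
			order = ilog2(nr);
		if (order > MAX_PAGE_ORDER)
			order = MAX_PAGE_ORDER;

		chunk_pfn[n] = pfn;
		chunk_order[n] = order;
		n++;
		pfn += 1UL << order;
		nr -= 1UL << order;
	}
	return n;
}
```

For example, a range starting at pfn 3 with 13 pages decomposes into chunks of
orders 0, 2 and 3, while an aligned 4096-page range becomes four order-10
chunks under the assumed cap.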
* Re: [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 14:46 ` Zi Yan
@ 2026-03-24 15:22 ` David Hildenbrand
2026-03-24 17:14 ` Zi Yan
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2026-03-24 15:22 UTC (permalink / raw)
To: Zi Yan, Muhammad Usama Anjum
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts
On 3/24/26 15:46, Zi Yan wrote:
> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Decompose the range of order-0 pages to be freed into the set of largest
>> possible power-of-2 size and aligned chunks and free them to the pcp or
>> buddy. This improves on the previous approach which freed each order-0
>> page individually in a loop. Testing shows performance to be improved by
>> more than 10x in some cases.
>>
>> Since each page is order-0, we must decrement each page's reference
>> count individually and only consider the page for freeing as part of a
>> high order chunk if the reference count goes to zero. Additionally
>> free_pages_prepare() must be called for each individual order-0 page
>> too, so that the struct page state and global accounting state can be
>> appropriately managed. But once this is done, the resulting high order
>> chunks can be freed as a unit to the pcp or buddy.
>>
>> This significantly speeds up the free operation but also has the side
>> benefit that high order blocks are added to the pcp instead of each page
>> ending up on the pcp order-0 list; memory remains more readily available
>> in high orders.
>>
>> vmalloc will shortly become a user of this new optimized
>> free_contig_range() since it aggressively allocates high order
>> non-compound pages, but then calls split_page() to end up with
>> contiguous order-0 pages. These can now be freed much more efficiently.
>>
>> The execution time of the following function was measured in a server
>> class arm64 machine:
>>
>> static int page_alloc_high_order_test(void)
>> {
>> unsigned int order = HPAGE_PMD_ORDER;
>> struct page *page;
>> int i;
>>
>> for (i = 0; i < 100000; i++) {
>> page = alloc_pages(GFP_KERNEL, order);
>> if (!page)
>> return -1;
>> split_page(page, order);
>> free_contig_range(page_to_pfn(page), 1UL << order);
>> }
>>
>> return 0;
>> }
>>
>> Execution time before: 4097358 usec
>> Execution time after: 729831 usec
>>
>> <snip>
>
> It can be simplified to:
>
> for (i = 0; i < nr_pages; i++, page++) {
> VM_WARN_ON_ONCE(PageHead(page));
> VM_WARN_ON_ONCE(PageTail(page));
>
> can_free = put_page_testzero(page) && free_pages_prepare(page, 0);
>
> if (!can_free) {
> if (start) {
> free_prepared_contig_range(start, page - start);
> start = NULL;
> }
> continue;
> }
>
> if (start && memdesc_section(page->flags) != start_sec) {
> free_prepared_contig_range(start, page - start);
> start = page;
> start_sec = memdesc_section(page->flags);
> } else if (!start) {
> start = page;
> start_sec = memdesc_section(page->flags);
> }
> }
>
> BTW, memdesc_section() returns 0 for !SECTION_IN_PAGE_FLAGS.
> Is pfn_to_section_nr() more robust?
That's the whole trick: it's optimized out in that case. Linus proposed
that for num_pages_contiguous().
The cover letter should likely refer to num_pages_contiguous() :)
--
Cheers,
David
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 15:22 ` David Hildenbrand
@ 2026-03-24 17:14 ` Zi Yan
2026-03-25 14:06 ` Muhammad Usama Anjum
0 siblings, 1 reply; 25+ messages in thread
From: Zi Yan @ 2026-03-24 17:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: Muhammad Usama Anjum, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts
On 24 Mar 2026, at 11:22, David Hildenbrand wrote:
> On 3/24/26 15:46, Zi Yan wrote:
>> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>>
>>> <snip>
>>
>> <snip>
>>
>> BTW, memdesc_section() returns 0 for !SECTION_IN_PAGE_FLAGS.
>> Is pfn_to_section_nr() more robust?
>
> That's the whole trick: it's optimized out in that case. Linus proposed
> that for num_pages_contiguous().
>
> The cover letter should likely refer to num_pages_contiguous() :)
Oh, I needed to refresh my memory on SPARSEMEM to remember
!SECTION_IN_PAGE_FLAGS is for SPARSE_VMEMMAP and the contiguous PFNs vs
contiguous struct page thing.
Now memdesc_section() makes sense to me. Thanks.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 13:35 ` [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
2026-03-24 14:46 ` Zi Yan
@ 2026-03-24 20:56 ` David Hildenbrand (Arm)
2026-03-25 14:11 ` Muhammad Usama Anjum
1 sibling, 1 reply; 25+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-24 20:56 UTC (permalink / raw)
To: Muhammad Usama Anjum, Andrew Morton, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
> +void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> +{
> + struct page *page = pfn_to_page(pfn);
> + struct page *start = NULL;
> + unsigned long start_sec;
> + unsigned long i;
> + bool can_free;
> +
> + /*
> + * Chunk the range into contiguous runs of pages for which the refcount
> + * went to zero and for which free_pages_prepare() succeeded. If
> + * free_pages_prepare() fails we consider the page to have been freed;
> + * deliberately leak it.
> + *
> + * Code assumes contiguous PFNs have contiguous struct pages, but not
> + * vice versa. Break batches at section boundaries since pages from
> + * different sections must not be coalesced into a single high-order
> + * block.
The comment is not completely accurate: the section-boundary restriction
only applies to some kernel configs.
Maybe rewrite the whole paragraph into
"Contiguous PFNs might not have contiguous "struct page"s in some
kernel configs. Therefore, check memdesc_section() and stop batching
once it changes; see num_pages_contiguous()."
> + */
> + for (i = 0; i < nr_pages; i++, page++) {
> + VM_WARN_ON_ONCE(PageHead(page));
> + VM_WARN_ON_ONCE(PageTail(page));
> +
> + can_free = put_page_testzero(page);
> + if (can_free && !free_pages_prepare(page, 0))
> + can_free = false;
> +
> + if (can_free && start &&
> + memdesc_section(page->flags) != start_sec) {
> + free_prepared_contig_range(start, page - start);
> + start = page;
> + start_sec = memdesc_section(page->flags);
> + } else if (!can_free && start) {
> + free_prepared_contig_range(start, page - start);
> + start = NULL;
> + } else if (can_free && !start) {
> + start = page;
> + start_sec = memdesc_section(page->flags);
> + }
> + }
The simplification proposed by Zi makes sense to me!
--
Cheers,
David
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-24 14:55 ` Zi Yan
@ 2026-03-25 8:56 ` Uladzislau Rezki
2026-03-25 15:02 ` Muhammad Usama Anjum
2026-03-25 14:34 ` Usama Anjum
1 sibling, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2026-03-25 8:56 UTC (permalink / raw)
To: Zi Yan
Cc: Muhammad Usama Anjum, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On Tue, Mar 24, 2026 at 10:55:55AM -0400, Zi Yan wrote:
> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>
> > From: Ryan Roberts <ryan.roberts@arm.com>
> >
> > Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
> > must immediately split_page() to order-0 so that it remains compatible
> > with users that want to access the underlying struct page.
> > Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
> > allocator") recently made it much more likely for vmalloc to allocate
> > high order pages which are subsequently split to order-0.
> >
> > Unfortunately this had the side effect of causing performance
> > regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
> > benchmarks). See Closes: tag. This happens because the high order pages
> > must be obtained from the buddy, but because they are split to
> > order-0, they are freed back to the order-0 pcp. Previously
> > allocation was for order-0 pages, so they were recycled from
> > the pcp.
> >
> > It would be preferable if when vmalloc allocates an (e.g.) order-3 page
> > that it also frees that order-3 page to the order-3 pcp, then the
> > regression could be removed.
> >
> > So let's do exactly that; use the new __free_contig_range() API to
> > batch-free contiguous ranges of pfns. This not only removes the
> > regression, but significantly improves performance of vfree beyond the
> > baseline.
> >
> > A selection of test_vmalloc benchmarks running on arm64 server class
> > system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
> > large order pages from buddy allocator") was added in v6.19-rc1 where we
> > see regressions. Then with this change performance is much better. (>0
> > is faster, <0 is slower, (R)/(I) = statistically significant
> > Regression/Improvement):
> >
> > +-----------------+----------------------------------------------------------+-------------------+--------------------+
> > | Benchmark | Result Class | mm-new | this series |
> > +=================+==========================================================+===================+====================+
> > | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
> > | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
> > | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
> > | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
> > | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
> > | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
> > | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
> > | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
> > | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
> > | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
> > | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
> > | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
> > | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
> > | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
> > | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
> > | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
> > | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
> > | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
> > | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
> > +-----------------+----------------------------------------------------------+-------------------+--------------------+
> >
> > Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
> > Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> > Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> > ---
> > Changes since v2:
> > - Remove BUG_ON in favour of a simpler implementation, as it has
> > never been observed to trigger
> > - Move the free loop to separate function, free_pages_bulk()
> > - Update stats, lruvec_stat in separate loop
> >
> > Changes since v1:
> > - Rebase on mm-new
> > - Rerun benchmarks
> >
> > Made-with: Cursor
> > ---
> > include/linux/gfp.h | 2 ++
> > mm/page_alloc.c | 23 +++++++++++++++++++++++
> > mm/vmalloc.c | 16 +++++-----------
> > 3 files changed, 30 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 7c1f9da7c8e56..71f9097ab99a0 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> > struct page **page_array);
> > #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
> >
> > +void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
> > +
> > unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
> > unsigned long nr_pages,
> > struct page **page_array);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index eedce9a30eb7e..250cc07e547b8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> > }
> > EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
> >
> > +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
> > +{
> > + unsigned long start_pfn = 0, pfn;
> > + unsigned long i, nr_contig = 0;
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + pfn = page_to_pfn(page_array[i]);
> > + if (!nr_contig) {
> > + start_pfn = pfn;
> > + nr_contig = 1;
> > + } else if (start_pfn + nr_contig != pfn) {
> > + __free_contig_range(start_pfn, nr_contig);
> > + start_pfn = pfn;
> > + nr_contig = 1;
> > + cond_resched();
>
It will cause a schedule-while-atomic bug. Have you checked whether
__free_contig_range() can also sleep? If so then we are aligned; if not,
we should probably remove it.
--
Uladzislau Rezki
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-24 13:35 ` [PATCH v3 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
2026-03-24 14:55 ` Zi Yan
@ 2026-03-25 10:05 ` David Hildenbrand (Arm)
2026-03-25 14:26 ` Muhammad Usama Anjum
1 sibling, 1 reply; 25+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-25 10:05 UTC (permalink / raw)
To: Muhammad Usama Anjum, Andrew Morton, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/24/26 14:35, Muhammad Usama Anjum wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
>
> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
> must immediately split_page() to order-0 so that it remains compatible
> with users that want to access the underlying struct page.
> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
> allocator") recently made it much more likely for vmalloc to allocate
> high order pages which are subsequently split to order-0.
>
> Unfortunately this had the side effect of causing performance
> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
> benchmarks). See Closes: tag. This happens because the high order pages
> must be obtained from the buddy, but because they are split to
> order-0, they are freed back to the order-0 pcp. Previously
> allocation was for order-0 pages, so they were recycled from
> the pcp.
>
> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
> that it also frees that order-3 page to the order-3 pcp, then the
> regression could be removed.
>
> So let's do exactly that; use the new __free_contig_range() API to
> batch-free contiguous ranges of pfns. This not only removes the
> regression, but significantly improves performance of vfree beyond the
> baseline.
>
> A selection of test_vmalloc benchmarks running on arm64 server class
> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
> large order pages from buddy allocator") was added in v6.19-rc1 where we
> see regressions. Then with this change performance is much better. (>0
> is faster, <0 is slower, (R)/(I) = statistically significant
> Regression/Improvement):
>
> +-----------------+----------------------------------------------------------+-------------------+--------------------+
> | Benchmark | Result Class | mm-new | this series |
> +=================+==========================================================+===================+====================+
> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>
> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v2:
> - Remove BUG_ON in favour of a simpler implementation, as it has
> never been observed to trigger
> - Move the free loop to separate function, free_pages_bulk()
> - Update stats, lruvec_stat in separate loop
>
> Changes since v1:
> - Rebase on mm-new
> - Rerun benchmarks
>
> Made-with: Cursor
> ---
> include/linux/gfp.h | 2 ++
> mm/page_alloc.c | 23 +++++++++++++++++++++++
> mm/vmalloc.c | 16 +++++-----------
> 3 files changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 7c1f9da7c8e56..71f9097ab99a0 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> struct page **page_array);
> #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
>
> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
> +
> unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
> unsigned long nr_pages,
> struct page **page_array);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eedce9a30eb7e..250cc07e547b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> }
> EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>
Can we add some kerneldoc describing call context etc?
> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
> +{
> + unsigned long start_pfn = 0, pfn;
> + unsigned long i, nr_contig = 0;
> +
> + for (i = 0; i < nr_pages; i++) {
> + pfn = page_to_pfn(page_array[i]);
> + if (!nr_contig) {
> + start_pfn = pfn;
> + nr_contig = 1;
> + } else if (start_pfn + nr_contig != pfn) {
> + __free_contig_range(start_pfn, nr_contig);
> + start_pfn = pfn;
> + nr_contig = 1;
> + cond_resched();
> + } else {
> + nr_contig++;
> + }
> + }
Could we use num_pages_contiguous() here?
while (nr_pages) {
unsigned long nr_contig_pages = num_pages_contiguous(page_array, nr_pages);
__free_contig_range(page_to_pfn(*page_array), nr_contig_pages);
nr_pages -= nr_contig_pages;
page_array += nr_contig_pages;
cond_resched();
}
Something like that?
> + if (nr_contig)
> + __free_contig_range(start_pfn, nr_contig);
> +}
> +
> /*
> * This is the 'heart' of the zoned buddy allocator.
> */
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c607307c657a6..e9b3d6451e48b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3459,19 +3459,13 @@ void vfree(const void *addr)
>
> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> vm_reset_perms(vm);
> - for (i = 0; i < vm->nr_pages; i++) {
> - struct page *page = vm->pages[i];
>
> - BUG_ON(!page);
> - /*
> - * High-order allocs for huge vmallocs are split, so
> - * can be freed as an array of order-0 allocations
> - */
> - if (!(vm->flags & VM_MAP_PUT_PAGES))
> - mod_lruvec_page_state(page, NR_VMALLOC, -1);
> - __free_page(page);
> - cond_resched();
> + if (!(vm->flags & VM_MAP_PUT_PAGES)) {
> + for (i = 0; i < vm->nr_pages; i++)
> + mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
> }
> + free_pages_bulk(vm->pages, vm->nr_pages);
> +
> kvfree(vm->pages);
> kfree(vm);
> }
--
Cheers,
David
* Re: [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-24 15:06 ` Zi Yan
@ 2026-03-25 10:14 ` David Hildenbrand (Arm)
2026-03-25 16:03 ` Muhammad Usama Anjum
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-25 10:14 UTC (permalink / raw)
To: Zi Yan, Muhammad Usama Anjum
Cc: Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 3/24/26 16:06, Zi Yan wrote:
> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>
>> Apply the same batch-freeing optimization from free_contig_range() to the
>> frozen page path. The previous __free_contig_frozen_range() freed each
>> order-0 page individually via free_frozen_pages(), which is slow for the
>> same reason the old free_contig_range() was: each page goes to the
>> order-0 pcp list rather than being coalesced into higher-order blocks.
>>
>> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
>> each order-0 page, then batch the prepared pages into the largest
>> possible power-of-2 aligned chunks via free_prepared_contig_range().
>> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
>> deliberately not freed; it should not be returned to the allocator.
>>
>> I've tested CMA through debugfs. The test allocates 16384 pages per
>> allocation for several iterations. There is a 3.5x improvement.
>>
>> Before: 1406 usec per iteration
>> After: 402 usec per iteration
>>
>> Before:
>>
>> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
>> |
>> |--70.20%--free_contig_frozen_range
>> | |
>> | |--46.41%--__free_frozen_pages
>> | | |
>> | | --36.18%--free_frozen_page_commit
>> | | |
>> | | --29.63%--_raw_spin_unlock_irqrestore
>> | |
>> | |--8.76%--_raw_spin_trylock
>> | |
>> | |--7.03%--__preempt_count_dec_and_test
>> | |
>> | |--4.57%--_raw_spin_unlock
>> | |
>> | |--1.96%--__get_pfnblock_flags_mask.isra.0
>> | |
>> | --1.15%--free_frozen_page_commit
>> |
>> --0.69%--el0t_64_sync
>>
>> After:
>>
>> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
>> |
>> ---free_contig_frozen_range
>> |
>> |--20.45%--__free_contig_frozen_range
>> | |
>> | |--17.77%--free_pages_prepare
>> | |
>> | --0.72%--free_prepared_contig_range
>> | |
>> | --0.55%--__free_frozen_pages
>> |
>> --3.12%--free_pages_prepare
>>
>> Suggested-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v2:
>> - Rework the loop to check for memory sections just like __free_contig_range()
>> - Didn't add reviewed-by tags because of rework
>> ---
>> mm/page_alloc.c | 26 ++++++++++++++++++++++++--
>> 1 file changed, 24 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 250cc07e547b8..26eac35ef73bd 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7038,8 +7038,30 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>>
>> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
>> {
>> - for (; nr_pages--; pfn++)
>> - free_frozen_pages(pfn_to_page(pfn), 0);
>> + struct page *page = pfn_to_page(pfn);
>> + struct page *start = NULL;
>> + unsigned long start_sec;
>> + unsigned long i;
>> +
>> + for (i = 0; i < nr_pages; i++, page++) {
>> + if (!free_pages_prepare(page, 0)) {
>> + if (start) {
>> + free_prepared_contig_range(start, page - start);
>> + start = NULL;
>> + }
>> + } else if (start &&
>> + memdesc_section(page->flags) != start_sec) {
>> + free_prepared_contig_range(start, page - start);
>> + start = page;
>> + start_sec = memdesc_section(page->flags);
>> + } else if (!start) {
>> + start = page;
>> + start_sec = memdesc_section(page->flags);
>> + }
>> + }
>> +
>> + if (start)
>> + free_prepared_contig_range(start, page - start);
>> }
>
> This looks almost the same as __free_contig_range().
>
> Two approaches to deduplicate the code:
>
> 1. __free_contig_range() first does put_page_testzero()
> on all pages and call __free_contig_frozen_range()
> on the range, __free_contig_frozen_range() will need
> to skip not frozen pages. It is not ideal.
Right, let's not do that.
>
> 2. add a helper function
> __free_contig_range_common(unsigned long pfn,
> unsigned long nr_pages, bool is_page_frozen),
> and
> a. call __free_contig_range_common(..., /*is_page_frozen=*/ false)
> in __free_contig_range(),
> b. __free_contig_range_common(..., /*is_page_frozen=*/ true)
> in __free_contig_frozen_range().
>
As long as it's an internal helper, that makes sense. I wouldn't want to
expose the bool in the external interface.
Thanks!
--
Cheers,
David
* Re: [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 17:14 ` Zi Yan
@ 2026-03-25 14:06 ` Muhammad Usama Anjum
0 siblings, 0 replies; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-25 14:06 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand
Cc: usama.anjum, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts
On 24/03/2026 5:14 pm, Zi Yan wrote:
> On 24 Mar 2026, at 11:22, David Hildenbrand wrote:
>
>> On 3/24/26 15:46, Zi Yan wrote:
>>> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>>>
>>>> From: Ryan Roberts <ryan.roberts@arm.com>
>>>>
>>>> Decompose the range of order-0 pages to be freed into the set of largest
>>>> possible power-of-2 size and aligned chunks and free them to the pcp or
>>>> buddy. This improves on the previous approach which freed each order-0
>>>> page individually in a loop. Testing shows performance to be improved by
>>>> more than 10x in some cases.
>>>>
>>>> Since each page is order-0, we must decrement each page's reference
>>>> count individually and only consider the page for freeing as part of a
>>>> high order chunk if the reference count goes to zero. Additionally
>>>> free_pages_prepare() must be called for each individual order-0 page
>>>> too, so that the struct page state and global accounting state can be
>>>> appropriately managed. But once this is done, the resulting high order
>>>> chunks can be freed as a unit to the pcp or buddy.
>>>>
>>>> This significantly speeds up the free operation but also has the side
>>>> benefit that high order blocks are added to the pcp instead of each page
>>>> ending up on the pcp order-0 list; memory remains more readily available
>>>> in high orders.
>>>>
>>>> vmalloc will shortly become a user of this new optimized
>>>> free_contig_range() since it aggressively allocates high order
>>>> non-compound pages, but then calls split_page() to end up with
>>>> contiguous order-0 pages. These can now be freed much more efficiently.
>>>>
>>>> The execution time of the following function was measured in a server
>>>> class arm64 machine:
>>>>
>>>> static int page_alloc_high_order_test(void)
>>>> {
>>>> unsigned int order = HPAGE_PMD_ORDER;
>>>> struct page *page;
>>>> int i;
>>>>
>>>> for (i = 0; i < 100000; i++) {
>>>> page = alloc_pages(GFP_KERNEL, order);
>>>> if (!page)
>>>> return -1;
>>>> split_page(page, order);
>>>> free_contig_range(page_to_pfn(page), 1UL << order);
>>>> }
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> Execution time before: 4097358 usec
>>>> Execution time after: 729831 usec
>>>>
>>>> Perf trace before:
>>>>
>>>> 99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
>>>> |
>>>> ---kthread
>>>> 0xffffb33c12a26af8
>>>> |
>>>> |--98.13%--0xffffb33c12a26060
>>>> | |
>>>> | |--97.37%--free_contig_range
>>>> | | |
>>>> | | |--94.93%--___free_pages
>>>> | | | |
>>>> | | | |--55.42%--__free_frozen_pages
>>>> | | | | |
>>>> | | | | --43.20%--free_frozen_page_commit
>>>> | | | | |
>>>> | | | | --35.37%--_raw_spin_unlock_irqrestore
>>>> | | | |
>>>> | | | |--11.53%--_raw_spin_trylock
>>>> | | | |
>>>> | | | |--8.19%--__preempt_count_dec_and_test
>>>> | | | |
>>>> | | | |--5.64%--_raw_spin_unlock
>>>> | | | |
>>>> | | | |--2.37%--__get_pfnblock_flags_mask.isra.0
>>>> | | | |
>>>> | | | --1.07%--free_frozen_page_commit
>>>> | | |
>>>> | | --1.54%--__free_frozen_pages
>>>> | |
>>>> | --0.77%--___free_pages
>>>> |
>>>> --0.98%--0xffffb33c12a26078
>>>> alloc_pages_noprof
>>>>
>>>> Perf trace after:
>>>>
>>>> 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
>>>> |
>>>> |--5.52%--__free_contig_range
>>>> | |
>>>> | |--5.00%--free_prepared_contig_range
>>>> | | |
>>>> | | |--1.43%--__free_frozen_pages
>>>> | | | |
>>>> | | | --0.51%--free_frozen_page_commit
>>>> | | |
>>>> | | |--1.08%--_raw_spin_trylock
>>>> | | |
>>>> | | --0.89%--_raw_spin_unlock
>>>> | |
>>>> | --0.52%--free_pages_prepare
>>>> |
>>>> --2.90%--ret_from_fork
>>>> kthread
>>>> 0xffffae1c12abeaf8
>>>> 0xffffae1c12abe7a0
>>>> |
>>>> --2.69%--vfree
>>>> __free_contig_range
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>>>> ---
>>>> Changes since v2:
>>>> - Handle different possible section boundaries in __free_contig_range()
>>>> - Drop the TODO
>>>> - Remove return value from __free_contig_range()
>>>> - Remove non-functional change from __free_pages_ok()
>>>>
>>>> Changes since v1:
>>>> - Rebase on mm-new
>>>> - Move FPI_PREPARED check inside __free_pages_prepare() now that
>>>> fpi_flags are already being passed.
>>>> - Add todo (Zi Yan)
>>>> - Rerun benchmarks
>>>> - Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
>>>> - Rework order calculation in free_prepared_contig_range() and use
>>>> MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
>>>> be up to internal __free_frozen_pages() how it frees them
>>>>
>>>> Made-with: Cursor
>>>> ---
>>>> include/linux/gfp.h | 2 +
>>>> mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++++++++++-
>>>> 2 files changed, 97 insertions(+), 2 deletions(-)
>>>>
>>>
>>> <snip>
>>>
>>>> +
>>>> +/**
>>>> + * __free_contig_range - Free contiguous range of order-0 pages.
>>>> + * @pfn: Page frame number of the first page in the range.
>>>> + * @nr_pages: Number of pages to free.
>>>> + *
>>>> + * For each order-0 struct page in the physically contiguous range, put a
>>>> + * reference. Free any page whose reference count falls to zero. The
>>>> + * implementation is functionally equivalent to, but significantly faster
>>>> + * than, calling __free_page() for each struct page in a loop.
>>>> + *
>>>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>>>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>>>> + * can be freed with this API.
>>>> + *
>>>> + * Context: May be called in interrupt context or while holding a normal
>>>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>>>> + */
>>>> +void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>>> +{
>>>> + struct page *page = pfn_to_page(pfn);
>>>> + struct page *start = NULL;
>>>> + unsigned long start_sec;
>>>> + unsigned long i;
>>>> + bool can_free;
>>>> +
>>>> + /*
>>>> + * Chunk the range into contiguous runs of pages for which the refcount
>>>> + * went to zero and for which free_pages_prepare() succeeded. If
>>>> + * free_pages_prepare() fails we consider the page to have been freed;
>>>> + * deliberately leak it.
>>>> + *
>>>> + * Code assumes contiguous PFNs have contiguous struct pages, but not
>>>> + * vice versa. Break batches at section boundaries since pages from
>>>> + * different sections must not be coalesced into a single high-order
>>>> + * block.
>>>> + */
>>>> + for (i = 0; i < nr_pages; i++, page++) {
>>>> + VM_WARN_ON_ONCE(PageHead(page));
>>>> + VM_WARN_ON_ONCE(PageTail(page));
>>>> +
>>>> + can_free = put_page_testzero(page);
>>>> + if (can_free && !free_pages_prepare(page, 0))
>>>> + can_free = false;
>>>> +
>>>> + if (can_free && start &&
>>>> + memdesc_section(page->flags) != start_sec) {
>>>> + free_prepared_contig_range(start, page - start);
>>>> + start = page;
>>>> + start_sec = memdesc_section(page->flags);
>>>> + } else if (!can_free && start) {
>>>> + free_prepared_contig_range(start, page - start);
>>>> + start = NULL;
>>>> + } else if (can_free && !start) {
>>>> + start = page;
>>>> + start_sec = memdesc_section(page->flags);
>>>> + }
>>>> + }
>>>
>>> It can be simplified to:
>>>
>>> for (i = 0; i < nr_pages; i++, page++) {
>>> VM_WARN_ON_ONCE(PageHead(page));
>>> VM_WARN_ON_ONCE(PageTail(page));
>>>
>>> can_free = put_page_testzero(page) && free_pages_prepare(page, 0);
>>>
>>> if (!can_free) {
>>> if (start) {
>>> free_prepared_contig_range(start, page - start);
>>> start = NULL;
>>> }
>>> continue;
>>> }
>>>
>>> if (start && memdesc_section(page->flags) != start_sec) {
>>> free_prepared_contig_range(start, page - start);
>>> start = page;
>>> start_sec = memdesc_section(page->flags);
>>> } else if (!start) {
>>> start = page;
>>> start_sec = memdesc_section(page->flags);
>>> }
>>> }
I'll simplify in the next version. Thanks.
>>>
>>> BTW, memdesc_section() returns 0 for !SECTION_IN_PAGE_FLAGS.
>>> Is pfn_to_section_nr() more robust?
>>
>> That's the whole trick: it's optimized out in that case. Linus proposed
>> that for num_pages_contiguous().
>>
>> The cover letter should likely refer to num_pages_contiguous() :)
I'll refer to num_pages_contiguous() as well.
>
> Oh, I needed to refresh my memory on SPARSEMEM to remember
> !SECTION_IN_PAGE_FLAGS is for SPARSE_VMEMMAP and the contiguous PFNs vs
> contiguous struct page thing.
>
> Now memdesc_section() makes sense to me. Thanks.
>
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-24 20:56 ` David Hildenbrand (Arm)
@ 2026-03-25 14:11 ` Muhammad Usama Anjum
0 siblings, 0 replies; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-25 14:11 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: usama.anjum
On 24/03/2026 8:56 pm, David Hildenbrand (Arm) wrote:
>
>> +void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + struct page *start = NULL;
>> + unsigned long start_sec;
>> + unsigned long i;
>> + bool can_free;
>> +
>> + /*
>> + * Chunk the range into contiguous runs of pages for which the refcount
>> + * went to zero and for which free_pages_prepare() succeeded. If
>> + * free_pages_prepare() fails we consider the page to have been freed;
>> + * deliberately leak it.
>> + *
>> + * Code assumes contiguous PFNs have contiguous struct pages, but not
>> + * vice versa. Break batches at section boundaries since pages from
>> + * different sections must not be coalesced into a single high-order
>> + * block.
>
> The comment is not completely accurate: section boundary only applies to
> some kernel configs.
>
> Maybe rewrite the whole paragraph into
>
> "Contiguous PFNs might not have a contiguous "struct pages" in some
> kernel config. Therefore, check memdesc_section(), and stop batching
> once it changes, see num_pages_contiguous()."
Agreed, I'll update.
>
>> + */
>> + for (i = 0; i < nr_pages; i++, page++) {
>> + VM_WARN_ON_ONCE(PageHead(page));
>> + VM_WARN_ON_ONCE(PageTail(page));
>> +
>> + can_free = put_page_testzero(page);
>> + if (can_free && !free_pages_prepare(page, 0))
>> + can_free = false;
>> +
>> + if (can_free && start &&
>> + memdesc_section(page->flags) != start_sec) {
>> + free_prepared_contig_range(start, page - start);
>> + start = page;
>> + start_sec = memdesc_section(page->flags);
>> + } else if (!can_free && start) {
>> + free_prepared_contig_range(start, page - start);
>> + start = NULL;
>> + } else if (can_free && !start) {
>> + start = page;
>> + start_sec = memdesc_section(page->flags);
>> + }
>> + }
>
> The simplification proposed by Zi makes sense to me!
I've added it.
Thanks,
Usama
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 10:05 ` David Hildenbrand (Arm)
@ 2026-03-25 14:26 ` Muhammad Usama Anjum
2026-03-25 15:01 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-25 14:26 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: usama.anjum, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Uladzislau Rezki,
Nick Terrell, David Sterba, Vishal Moola, linux-mm, linux-kernel,
bpf, Ryan.Roberts, david.hildenbrand
On 25/03/2026 10:05 am, David Hildenbrand (Arm) wrote:
> On 3/24/26 14:35, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
>> must immediately split_page() to order-0 so that it remains compatible
>> with users that want to access the underlying struct page.
>> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
>> allocator") recently made it much more likely for vmalloc to allocate
>> high order pages which are subsequently split to order-0.
>>
>> Unfortunately this had the side effect of causing performance
>> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
>> benchmarks). See Closes: tag. This happens because the high order pages
>> must be gotten from the buddy but then because they are split to
>> order-0, when they are freed they are freed to the order-0 pcp.
>> Previously allocation was for order-0 pages so they were recycled from
>> the pcp.
>>
>> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
>> that it also frees that order-3 page to the order-3 pcp, then the
>> regression could be removed.
>>
>> So let's do exactly that; use the new __free_contig_range() API to
>> batch-free contiguous ranges of pfns. This not only removes the
>> regression, but significantly improves performance of vfree beyond the
>> baseline.
>>
>> A selection of test_vmalloc benchmarks running on arm64 server class
>> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
>> large order pages from buddy allocator") was added in v6.19-rc1 where we
>> see regressions. Then with this change performance is much better. (>0
>> is faster, <0 is slower, (R)/(I) = statistically significant
>> Regression/Improvement):
>>
>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>> | Benchmark | Result Class | mm-new | this series |
>> +=================+==========================================================+===================+====================+
>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
>> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
>> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
>> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
>> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
>> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
>> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
>> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
>> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
>> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
>> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
>> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
>> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
>> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
>> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
>> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
>> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
>> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
>> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>>
>> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
>> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v2:
>> - Remove BUG_ON in favour of simple implementation as this has never
>> been seen to output any bug in the past as well
>> - Move the free loop to separate function, free_pages_bulk()
>> - Update stats, lruvec_stat in separate loop
>>
>> Changes since v1:
>> - Rebase on mm-new
>> - Rerun benchmarks
>>
>> Made-with: Cursor
>> ---
>> include/linux/gfp.h | 2 ++
>> mm/page_alloc.c | 23 +++++++++++++++++++++++
>> mm/vmalloc.c | 16 +++++-----------
>> 3 files changed, 30 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 7c1f9da7c8e56..71f9097ab99a0 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>> struct page **page_array);
>> #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
>>
>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
>> +
>> unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
>> unsigned long nr_pages,
>> struct page **page_array);
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index eedce9a30eb7e..250cc07e547b8 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>> }
>> EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>>
>
> Can we add some kerneldoc describing call context etc?
Yes, I'll add short kerneldoc here.
>
>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>> +{
>> + unsigned long start_pfn = 0, pfn;
>> + unsigned long i, nr_contig = 0;
>> +
>> + for (i = 0; i < nr_pages; i++) {
>> + pfn = page_to_pfn(page_array[i]);
>> + if (!nr_contig) {
>> + start_pfn = pfn;
>> + nr_contig = 1;
>> + } else if (start_pfn + nr_contig != pfn) {
>> + __free_contig_range(start_pfn, nr_contig);
>> + start_pfn = pfn;
>> + nr_contig = 1;
>> + cond_resched();
>> + } else {
>> + nr_contig++;
>> + }
>> + }
>
> Could we use num_pages_contiguous() here?
>
> while (nr_pages) {
> unsigned long nr_contig_pages = num_pages_contiguous(page_array, nr_pages);
>
> __free_contig_range(pfn_to_page(*page_array), nr_contig_pages);
>
> nr_pages -= nr_contig;
> page_array += nr_contig;
> cond_resched();
> }
>
> Something like that?
__free_contig_range() already checks for section boundaries. If
num_pages_contiguous() is called here, the section check would be
duplicated.
>
>> + if (nr_contig)
>> + __free_contig_range(start_pfn, nr_contig);
>> +}
>> +
>> /*
>> * This is the 'heart' of the zoned buddy allocator.
>> */
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index c607307c657a6..e9b3d6451e48b 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3459,19 +3459,13 @@ void vfree(const void *addr)
>>
>> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
>> vm_reset_perms(vm);
>> - for (i = 0; i < vm->nr_pages; i++) {
>> - struct page *page = vm->pages[i];
>>
>> - BUG_ON(!page);
>> - /*
>> - * High-order allocs for huge vmallocs are split, so
>> - * can be freed as an array of order-0 allocations
>> - */
>> - if (!(vm->flags & VM_MAP_PUT_PAGES))
>> - mod_lruvec_page_state(page, NR_VMALLOC, -1);
>> - __free_page(page);
>> - cond_resched();
>> + if (!(vm->flags & VM_MAP_PUT_PAGES)) {
>> + for (i = 0; i < vm->nr_pages; i++)
>> + mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
>> }
>> + free_pages_bulk(vm->pages, vm->nr_pages);
>> +
>> kvfree(vm->pages);
>> kfree(vm);
>> }
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-24 14:55 ` Zi Yan
2026-03-25 8:56 ` Uladzislau Rezki
@ 2026-03-25 14:34 ` Usama Anjum
1 sibling, 0 replies; 25+ messages in thread
From: Usama Anjum @ 2026-03-25 14:34 UTC (permalink / raw)
To: Zi Yan
Cc: usama.anjum, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
<snip>
>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>> +{
>> + unsigned long start_pfn = 0, pfn;
>> + unsigned long i, nr_contig = 0;
>> +
>> + for (i = 0; i < nr_pages; i++) {
>> + pfn = page_to_pfn(page_array[i]);
>> + if (!nr_contig) {
>> + start_pfn = pfn;
>> + nr_contig = 1;
>> + } else if (start_pfn + nr_contig != pfn) {
>> + __free_contig_range(start_pfn, nr_contig);
>> + start_pfn = pfn;
>> + nr_contig = 1;
>> + cond_resched();
>> + } else {
>> + nr_contig++;
>> + }
>> + }
>> + if (nr_contig)
>> + __free_contig_range(start_pfn, nr_contig);
>> +}
>
> free_pages_bulk() assumes pages in page_array are sorted in ascending PFN order.
> I think it is worth documenting this, since without sorting it can degrade
> back to the original one-page-at-a-time implementation.
I'll add the kerneldoc comment.
>
>> +
>> /*
>> * This is the 'heart' of the zoned buddy allocator.
>> */
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index c607307c657a6..e9b3d6451e48b 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3459,19 +3459,13 @@ void vfree(const void *addr)
>>
>> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
>> vm_reset_perms(vm);
>> - for (i = 0; i < vm->nr_pages; i++) {
>> - struct page *page = vm->pages[i];
>>
>> - BUG_ON(!page);
>> - /*
>> - * High-order allocs for huge vmallocs are split, so
>> - * can be freed as an array of order-0 allocations
>> - */
>> - if (!(vm->flags & VM_MAP_PUT_PAGES))
>> - mod_lruvec_page_state(page, NR_VMALLOC, -1);
>> - __free_page(page);
>> - cond_resched();
>> + if (!(vm->flags & VM_MAP_PUT_PAGES)) {
>> + for (i = 0; i < vm->nr_pages; i++)
>> + mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
>> }
>> + free_pages_bulk(vm->pages, vm->nr_pages);
>> +
>
> stats is updated before any page is freed. It is better to mention
> it in the commit message.
I'll mention it.
>
>> kvfree(vm->pages);
>> kfree(vm);
>> }
>> --
>> 2.47.3
>
> Otherwise, LGTM.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
>
> Best Regards,
> Yan, Zi
>
Thanks,
Usama
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 14:26 ` Muhammad Usama Anjum
@ 2026-03-25 15:01 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-25 15:01 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/25/26 15:26, Muhammad Usama Anjum wrote:
> On 25/03/2026 10:05 am, David Hildenbrand (Arm) wrote:
>> On 3/24/26 14:35, Muhammad Usama Anjum wrote:
<snip>
>>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>>> +{
>>> + unsigned long start_pfn = 0, pfn;
>>> + unsigned long i, nr_contig = 0;
>>> +
>>> + for (i = 0; i < nr_pages; i++) {
>>> + pfn = page_to_pfn(page_array[i]);
>>> + if (!nr_contig) {
>>> + start_pfn = pfn;
>>> + nr_contig = 1;
>>> + } else if (start_pfn + nr_contig != pfn) {
>>> + __free_contig_range(start_pfn, nr_contig);
>>> + start_pfn = pfn;
>>> + nr_contig = 1;
>>> + cond_resched();
>>> + } else {
>>> + nr_contig++;
>>> + }
>>> + }
>>
>> Could we use num_pages_contiguous() here?
>>
>> while (nr_pages) {
>> unsigned long nr_contig_pages = num_pages_contiguous(page_array, nr_pages);
>>
>> __free_contig_range(pfn_to_page(*page_array), nr_contig_pages);
>>
>> nr_pages -= nr_contig;
>> page_array += nr_contig;
>> cond_resched();
>> }
>>
>> Something like that?
> __free_contig_range() is already checking for the sections. If
> num_pages_contiguous() is called here, it'll cause the duplication
> of the section check.
No problem. For the configs we care about, it's optimized out entirely
either way.
--
Cheers,
David
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 8:56 ` Uladzislau Rezki
@ 2026-03-25 15:02 ` Muhammad Usama Anjum
2026-03-25 16:16 ` Uladzislau Rezki
0 siblings, 1 reply; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-25 15:02 UTC (permalink / raw)
To: Uladzislau Rezki, Zi Yan
Cc: usama.anjum, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Nick Terrell, David Sterba, Vishal Moola,
linux-mm, linux-kernel, bpf, Ryan.Roberts, david.hildenbrand
On 25/03/2026 8:56 am, Uladzislau Rezki wrote:
> On Tue, Mar 24, 2026 at 10:55:55AM -0400, Zi Yan wrote:
>> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>>
<snip>
>>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>>> +{
>>> + unsigned long start_pfn = 0, pfn;
>>> + unsigned long i, nr_contig = 0;
>>> +
>>> + for (i = 0; i < nr_pages; i++) {
>>> + pfn = page_to_pfn(page_array[i]);
>>> + if (!nr_contig) {
>>> + start_pfn = pfn;
>>> + nr_contig = 1;
>>> + } else if (start_pfn + nr_contig != pfn) {
>>> + __free_contig_range(start_pfn, nr_contig);
>>> + start_pfn = pfn;
>>> + nr_contig = 1;
>>> + cond_resched();
>>
> It will cause a schedule-while-atomic splat if called from atomic
> context. Have you checked whether __free_contig_range() can also sleep?
> If so then we are aligned; if not, we should probably remove it.
Sorry, I didn't get it. How does having cond_resched() in this function
affect __free_contig_range()?
The current user of this function is only vfree() which is sleepable.
Thanks,
Usama
>
> --
> Uladzislau Rezki
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-25 10:14 ` David Hildenbrand (Arm)
@ 2026-03-25 16:03 ` Muhammad Usama Anjum
2026-03-25 19:52 ` Zi Yan
0 siblings, 1 reply; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-25 16:03 UTC (permalink / raw)
To: David Hildenbrand (Arm), Zi Yan
Cc: usama.anjum, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola, linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
<snip>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 250cc07e547b8..26eac35ef73bd 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -7038,8 +7038,30 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>>>
>>> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
>>> {
>>> - for (; nr_pages--; pfn++)
>>> - free_frozen_pages(pfn_to_page(pfn), 0);
>>> + struct page *page = pfn_to_page(pfn);
>>> + struct page *start = NULL;
>>> + unsigned long start_sec;
>>> + unsigned long i;
>>> +
>>> + for (i = 0; i < nr_pages; i++, page++) {
>>> + if (!free_pages_prepare(page, 0)) {
>>> + if (start) {
>>> + free_prepared_contig_range(start, page - start);
>>> + start = NULL;
>>> + }
>>> + } else if (start &&
>>> + memdesc_section(page->flags) != start_sec) {
>>> + free_prepared_contig_range(start, page - start);
>>> + start = page;
>>> + start_sec = memdesc_section(page->flags);
>>> + } else if (!start) {
>>> + start = page;
>>> + start_sec = memdesc_section(page->flags);
>>> + }
>>> + }
>>> +
>>> + if (start)
>>> + free_prepared_contig_range(start, page - start);
>>> }
>>
>> This looks almost the same as __free_contig_range().
>>
>> Two approaches to deduplicate the code:
>>
>> 1. __free_contig_range() first does put_page_testzero()
>> on all pages and call __free_contig_frozen_range()
>> on the range, __free_contig_frozen_range() will need
>> to skip not frozen pages. It is not ideal.
>
> Right, let's not do that.
>
>>
>> 2. add a helper function
>> __free_contig_range_common(unsigned long pfn,
>> unsigned long nr_pages, bool is_page_frozen),
>> and
>> a. call __free_contig_range_common(..., /*is_page_frozen=*/ false)
>> in __free_contig_range(),
>> b. __free_contig_range_common(..., /*is_page_frozen=*/ true)
>> in __free_contig_frozen_range().
>>
I'm adding the common version. After that change, I've been thinking
about the current functions and whether they can be simplified further:
free_contig_range() - only calls __free_contig_range()
- only visible with CONFIG_CONTIG_ALLOC
- Exported as well
__free_contig_range() - only calls __free_contig_range_common(is_frozen=false)
- visible even without CONFIG_CONTIG_ALLOC as vfree() uses it
- Exported as well (there is no user of this export at this time)
__free_contig_frozen_range() - only calls __free_contig_range_common(is_frozen=true)
__free_contig_range_common() - it does the actual work
vfree()->free_pages_bulk() - calls __free_contig_range()
Should we remove both __free_contig_range() and __free_contig_frozen_range() entirely and
just use __free_contig_range_common() everywhere?
>
> As long as it's an internal helper, that makes sense. I wouldn't want to
> expose the bool in the external interface.
>
> Thanks!
>
Thanks,
Usama
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 15:02 ` Muhammad Usama Anjum
@ 2026-03-25 16:16 ` Uladzislau Rezki
2026-03-25 16:25 ` Muhammad Usama Anjum
0 siblings, 1 reply; 25+ messages in thread
From: Uladzislau Rezki @ 2026-03-25 16:16 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Uladzislau Rezki, Zi Yan, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Nick Terrell, David Sterba, Vishal Moola,
linux-mm, linux-kernel, bpf, Ryan.Roberts, david.hildenbrand
On Wed, Mar 25, 2026 at 03:02:14PM +0000, Muhammad Usama Anjum wrote:
> On 25/03/2026 8:56 am, Uladzislau Rezki wrote:
> > On Tue, Mar 24, 2026 at 10:55:55AM -0400, Zi Yan wrote:
> >> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
> >>
> >>> From: Ryan Roberts <ryan.roberts@arm.com>
> >>>
> >>> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
> >>> must immediately split_page() to order-0 so that it remains compatible
> >>> with users that want to access the underlying struct page.
> >>> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
> >>> allocator") recently made it much more likely for vmalloc to allocate
> >>> high order pages which are subsequently split to order-0.
> >>>
> >>> Unfortunately this had the side effect of causing performance
> >>> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
> >>> benchmarks). See Closes: tag. This happens because the high order pages
> >>> must be gotten from the buddy but then because they are split to
> >>> order-0, when they are freed they are freed to the order-0 pcp.
> >>> Previously allocation was for order-0 pages so they were recycled from
> >>> the pcp.
> >>>
> >>> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
> >>> that it also frees that order-3 page to the order-3 pcp, then the
> >>> regression could be removed.
> >>>
> >>> So let's do exactly that; use the new __free_contig_range() API to
> >>> batch-free contiguous ranges of pfns. This not only removes the
> >>> regression, but significantly improves performance of vfree beyond the
> >>> baseline.
> >>>
> >>> A selection of test_vmalloc benchmarks running on arm64 server class
> >>> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
> >>> large order pages from buddy allocator") was added in v6.19-rc1 where we
> >>> see regressions. Then with this change performance is much better. (>0
> >>> is faster, <0 is slower, (R)/(I) = statistically significant
> >>> Regression/Improvement):
> >>>
> >>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
> >>> | Benchmark | Result Class | mm-new | this series |
> >>> +=================+==========================================================+===================+====================+
> >>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
> >>> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
> >>> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
> >>> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
> >>> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
> >>> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
> >>> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
> >>> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
> >>> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
> >>> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
> >>> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
> >>> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
> >>> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
> >>> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
> >>> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
> >>> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
> >>> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
> >>> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
> >>> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
> >>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
> >>>
> >>> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
> >>> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> >>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> >>> ---
> >>> Changes since v2:
> >>> - Remove BUG_ON in favour of simple implementation as this has never
> >>> been seen to output any bug in the past as well
> >>> - Move the free loop to separate function, free_pages_bulk()
> >>> - Update stats, lruvec_stat in separate loop
> >>>
> >>> Changes since v1:
> >>> - Rebase on mm-new
> >>> - Rerun benchmarks
> >>>
> >>> Made-with: Cursor
> >>> ---
> >>> include/linux/gfp.h | 2 ++
> >>> mm/page_alloc.c | 23 +++++++++++++++++++++++
> >>> mm/vmalloc.c | 16 +++++-----------
> >>> 3 files changed, 30 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >>> index 7c1f9da7c8e56..71f9097ab99a0 100644
> >>> --- a/include/linux/gfp.h
> >>> +++ b/include/linux/gfp.h
> >>> @@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >>> struct page **page_array);
> >>> #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
> >>>
> >>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
> >>> +
> >>> unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
> >>> unsigned long nr_pages,
> >>> struct page **page_array);
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index eedce9a30eb7e..250cc07e547b8 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >>> }
> >>> EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
> >>>
> >>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
> >>> +{
> >>> + unsigned long start_pfn = 0, pfn;
> >>> + unsigned long i, nr_contig = 0;
> >>> +
> >>> + for (i = 0; i < nr_pages; i++) {
> >>> + pfn = page_to_pfn(page_array[i]);
> >>> + if (!nr_contig) {
> >>> + start_pfn = pfn;
> >>> + nr_contig = 1;
> >>> + } else if (start_pfn + nr_contig != pfn) {
> >>> + __free_contig_range(start_pfn, nr_contig);
> >>> + start_pfn = pfn;
> >>> + nr_contig = 1;
> >>> + cond_resched();
> >>
> > It will cause a schedule-while-atomic. Have you checked that
> > __free_contig_range() can also sleep? If so then we are aligned; if not,
> > we should probably remove it.
> Sorry, I didn't get it. How does having cond_resched() in this function
> affect __free_contig_range()?
>
It is not. What I am asking about is this:
<snip>
spin_lock();
free_pages_bulk()
...
<snip>
so this is not allowed because of the cond_resched() call. We
can remove it and make it possible to invoke free_pages_bulk() under
a spin-lock, __but__ only if, for example, the other calls do not sleep:
__free_contig_range()
memdesc_section()
free_prepared_contig_range()
...
>
> The current user of this function is only vfree() which is sleepable.
>
I know. But this function may be used by others sooner or later.
Another option is to add a comment saying that it is only for sleepable
contexts.
--
Uladzislau Rezki
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 16:16 ` Uladzislau Rezki
@ 2026-03-25 16:25 ` Muhammad Usama Anjum
2026-03-25 16:34 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 25+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-25 16:25 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: usama.anjum, Zi Yan, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Nick Terrell, David Sterba, Vishal Moola,
linux-mm, linux-kernel, bpf, Ryan.Roberts, david.hildenbrand
On 25/03/2026 4:16 pm, Uladzislau Rezki wrote:
> On Wed, Mar 25, 2026 at 03:02:14PM +0000, Muhammad Usama Anjum wrote:
>> On 25/03/2026 8:56 am, Uladzislau Rezki wrote:
>>> On Tue, Mar 24, 2026 at 10:55:55AM -0400, Zi Yan wrote:
>>>> On 24 Mar 2026, at 9:35, Muhammad Usama Anjum wrote:
>>>>
>>>>> From: Ryan Roberts <ryan.roberts@arm.com>
>>>>>
>>>>> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
>>>>> must immediately split_page() to order-0 so that it remains compatible
>>>>> with users that want to access the underlying struct page.
>>>>> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
>>>>> allocator") recently made it much more likely for vmalloc to allocate
>>>>> high order pages which are subsequently split to order-0.
>>>>>
>>>>> Unfortunately this had the side effect of causing performance
>>>>> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
>>>>> benchmarks). See Closes: tag. This happens because the high order pages
>>>>> must be gotten from the buddy but then because they are split to
>>>>> order-0, when they are freed they are freed to the order-0 pcp.
>>>>> Previously allocation was for order-0 pages so they were recycled from
>>>>> the pcp.
>>>>>
>>>>> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
>>>>> that it also frees that order-3 page to the order-3 pcp, then the
>>>>> regression could be removed.
>>>>>
>>>>> So let's do exactly that; use the new __free_contig_range() API to
>>>>> batch-free contiguous ranges of pfns. This not only removes the
>>>>> regression, but significantly improves performance of vfree beyond the
>>>>> baseline.
>>>>>
>>>>> A selection of test_vmalloc benchmarks running on arm64 server class
>>>>> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
>>>>> large order pages from buddy allocator") was added in v6.19-rc1 where we
>>>>> see regressions. Then with this change performance is much better. (>0
>>>>> is faster, <0 is slower, (R)/(I) = statistically significant
>>>>> Regression/Improvement):
>>>>>
>>>>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>>>>> | Benchmark | Result Class | mm-new | this series |
>>>>> +=================+==========================================================+===================+====================+
>>>>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
>>>>> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
>>>>> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
>>>>> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
>>>>> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
>>>>> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
>>>>> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
>>>>> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
>>>>> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
>>>>> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
>>>>> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
>>>>> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
>>>>> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
>>>>> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
>>>>> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
>>>>> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
>>>>> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
>>>>> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
>>>>> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
>>>>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>>>>>
>>>>> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
>>>>> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>>>>> ---
>>>>> Changes since v2:
>>>>> - Remove BUG_ON in favour of simple implementation as this has never
>>>>> been seen to output any bug in the past as well
>>>>> - Move the free loop to separate function, free_pages_bulk()
>>>>> - Update stats, lruvec_stat in separate loop
>>>>>
>>>>> Changes since v1:
>>>>> - Rebase on mm-new
>>>>> - Rerun benchmarks
>>>>>
>>>>> Made-with: Cursor
>>>>> ---
>>>>> include/linux/gfp.h | 2 ++
>>>>> mm/page_alloc.c | 23 +++++++++++++++++++++++
>>>>> mm/vmalloc.c | 16 +++++-----------
>>>>> 3 files changed, 30 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>>> index 7c1f9da7c8e56..71f9097ab99a0 100644
>>>>> --- a/include/linux/gfp.h
>>>>> +++ b/include/linux/gfp.h
>>>>> @@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>>>>> struct page **page_array);
>>>>> #define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
>>>>>
>>>>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
>>>>> +
>>>>> unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
>>>>> unsigned long nr_pages,
>>>>> struct page **page_array);
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index eedce9a30eb7e..250cc07e547b8 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -5175,6 +5175,29 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>>>>>
>>>>> +void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>>>>> +{
>>>>> + unsigned long start_pfn = 0, pfn;
>>>>> + unsigned long i, nr_contig = 0;
>>>>> +
>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>> + pfn = page_to_pfn(page_array[i]);
>>>>> + if (!nr_contig) {
>>>>> + start_pfn = pfn;
>>>>> + nr_contig = 1;
>>>>> + } else if (start_pfn + nr_contig != pfn) {
>>>>> + __free_contig_range(start_pfn, nr_contig);
>>>>> + start_pfn = pfn;
>>>>> + nr_contig = 1;
>>>>> + cond_resched();
>>>>
>>> It will cause a schedule-while-atomic. Have you checked that
>>> __free_contig_range() can also sleep? If so then we are aligned; if not,
>>> we should probably remove it.
>> Sorry, I didn't get it. How does having cond_resched() in this function
>> affect __free_contig_range()?
>>
> It is not. What I am asking about is this:
>
> <snip>
> spin_lock();
> free_pages_bulk()
> ...
> <snip>
>
> so this is not allowed because of the cond_resched() call. We
> can remove it and make it possible to invoke free_pages_bulk() under
> a spin-lock, __but__ only if, for example, the other calls do not sleep:
>
> __free_contig_range()
> memdesc_section()
> free_prepared_contig_range()
> ...
>
>>
>> The current user of this function is only vfree() which is sleepable.
>>
> I know. But this function may be used by others sooner or later.
>
> Another option is to add a comment saying that it is only for sleepable
> contexts.
Thank you for the detailed response. I could move cond_resched() to vfree() and
allow free_pages_bulk() to be called from atomic context. But I feel the
current implementation is better for avoiding latency spikes. I'll put an
explicit comment that this function can only be called from sleepable contexts.
Thanks,
Usama
>
> --
> Uladzislau Rezki
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 16:25 ` Muhammad Usama Anjum
@ 2026-03-25 16:34 ` David Hildenbrand (Arm)
2026-03-25 16:49 ` Uladzislau Rezki
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-25 16:34 UTC (permalink / raw)
To: Muhammad Usama Anjum, Uladzislau Rezki
Cc: Zi Yan, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 3/25/26 17:25, Muhammad Usama Anjum wrote:
> On 25/03/2026 4:16 pm, Uladzislau Rezki wrote:
>> On Wed, Mar 25, 2026 at 03:02:14PM +0000, Muhammad Usama Anjum wrote:
>>> Sorry, I didn't get it. How does having cond_resched() in this function
>>> affect __free_contig_range()?
>>>
>> It is not. What I am asking about is this:
>>
>> <snip>
>> spin_lock();
>> free_pages_bulk()
>> ...
>> <snip>
>>
>> so this is not allowed because of the cond_resched() call. We
>> can remove it and make it possible to invoke free_pages_bulk() under
>> a spin-lock, __but__ only if, for example, the other calls do not sleep:
>>
>> __free_contig_range()
>> memdesc_section()
>> free_prepared_contig_range()
>> ...
>>
>>>
>>> The current user of this function is only vfree() which is sleepable.
>>>
>> I know. But this function may be used by others sooner or later.
>>
>> Another option is to add a comment saying that it is only for sleepable
>> contexts.
> Thank you for the detailed response. I could move cond_resched() to vfree() and
> allow free_pages_bulk() to be called from atomic context. But I feel the
> current implementation is better for avoiding latency spikes. I'll put an
> explicit comment that this function can only be called from sleepable contexts.
That's probably good enough for now. It can accept arbitrarily large
areas, so the cond_resched() in there is the right thing to do. :)
--
Cheers,
David
* Re: [PATCH v3 2/3] vmalloc: Optimize vfree
2026-03-25 16:34 ` David Hildenbrand (Arm)
@ 2026-03-25 16:49 ` Uladzislau Rezki
0 siblings, 0 replies; 25+ messages in thread
From: Uladzislau Rezki @ 2026-03-25 16:49 UTC (permalink / raw)
To: David Hildenbrand (Arm), Muhammad Usama Anjum
Cc: Muhammad Usama Anjum, Uladzislau Rezki, Zi Yan, Andrew Morton,
Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Nick Terrell, David Sterba, Vishal Moola,
linux-mm, linux-kernel, bpf, Ryan.Roberts, david.hildenbrand
On Wed, Mar 25, 2026 at 05:34:08PM +0100, David Hildenbrand (Arm) wrote:
> On 3/25/26 17:25, Muhammad Usama Anjum wrote:
> > On 25/03/2026 4:16 pm, Uladzislau Rezki wrote:
> >> On Wed, Mar 25, 2026 at 03:02:14PM +0000, Muhammad Usama Anjum wrote:
> >>> Sorry, I didn't get it. How does having cond_resched() in this function
> >>> affect __free_contig_range()?
> >>>
> >> It is not. What I am asking about is this:
> >>
> >> <snip>
> >> spin_lock();
> >> free_pages_bulk()
> >> ...
> >> <snip>
> >>
> >> so this is not allowed because of the cond_resched() call. We
> >> can remove it and make it possible to invoke free_pages_bulk() under
> >> a spin-lock, __but__ only if, for example, the other calls do not sleep:
> >>
> >> __free_contig_range()
> >> memdesc_section()
> >> free_prepared_contig_range()
> >> ...
> >>
> >>>
> >>> The current user of this function is only vfree() which is sleepable.
> >>>
> >> I know. But this function may be used by others sooner or later.
> >>
> >> Another option is to add a comment saying that it is only for sleepable
> >> contexts.
> > Thank you for the detailed response. I could move cond_resched() to vfree() and
> > allow free_pages_bulk() to be called from atomic context. But I feel the
> > current implementation is better for avoiding latency spikes. I'll put an
> > explicit comment that this function can only be called from sleepable contexts.
>
Sounds good!
> That's probably good enough for now. It can accept arbitrarily large
> areas, so the cond_resched() in there is the right thing to do. :)
>
I agree. Since it will be available for other callers, adding the
comment is the right way to go, so people know :)
--
Uladzislau Rezki
* Re: [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-25 16:03 ` Muhammad Usama Anjum
@ 2026-03-25 19:52 ` Zi Yan
0 siblings, 0 replies; 25+ messages in thread
From: Zi Yan @ 2026-03-25 19:52 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: David Hildenbrand (Arm), Andrew Morton, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola, linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 25 Mar 2026, at 12:03, Muhammad Usama Anjum wrote:
> <snip>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 250cc07e547b8..26eac35ef73bd 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -7038,8 +7038,30 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>>>>
>>>> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
>>>> {
>>>> - for (; nr_pages--; pfn++)
>>>> - free_frozen_pages(pfn_to_page(pfn), 0);
>>>> + struct page *page = pfn_to_page(pfn);
>>>> + struct page *start = NULL;
>>>> + unsigned long start_sec;
>>>> + unsigned long i;
>>>> +
>>>> + for (i = 0; i < nr_pages; i++, page++) {
>>>> + if (!free_pages_prepare(page, 0)) {
>>>> + if (start) {
>>>> + free_prepared_contig_range(start, page - start);
>>>> + start = NULL;
>>>> + }
>>>> + } else if (start &&
>>>> + memdesc_section(page->flags) != start_sec) {
>>>> + free_prepared_contig_range(start, page - start);
>>>> + start = page;
>>>> + start_sec = memdesc_section(page->flags);
>>>> + } else if (!start) {
>>>> + start = page;
>>>> + start_sec = memdesc_section(page->flags);
>>>> + }
>>>> + }
>>>> +
>>>> + if (start)
>>>> + free_prepared_contig_range(start, page - start);
>>>> }
>>>
>>> This looks almost the same as __free_contig_range().
>>>
>>> Two approaches to deduplicate the code:
>>>
>>> 1. __free_contig_range() first does put_page_testzero()
>>> on all pages and call __free_contig_frozen_range()
>>> on the range; __free_contig_frozen_range() will need
>>> to skip non-frozen pages. It is not ideal.
>>
>> Right, let's not do that.
>>
>>>
>>> 2. add a helper function
>>> __free_contig_range_common(unsigned long pfn,
>>> unsigned long nr_pages, bool is_page_frozen),
>>> and
>>> a. call __free_contig_range_common(..., /*is_page_frozen=*/ false)
>>> in __free_contig_range(),
>>> b. __free_contig_range_common(..., /*is_page_frozen=*/ true)
>>> in __free_contig_frozen_range().
>>>
> I'm adding the common version. With that change in place, I'm wondering whether
> the current functions can be simplified further:
>
> free_contig_range() - only calls __free_contig_range()
> - only visible with CONFIG_CONTIG_ALLOC
> - Exported as well
>
> __free_contig_range() - only calls __free_contig_range_common(is_frozen=false)
> - visible even without CONFIG_CONTIG_ALLOC as vfree() uses it
> - Exported as well (there is no user of this export at this time)
>
> __free_contig_frozen_range() - only calls __free_contig_range_common(is_frozen=true)
>
> __free_contig_range_common() - it does the actual work
>
> vfree()->free_pages_bulk() - calls __free_contig_range()
>
> Should we remove both __free_contig_range() and __free_contig_frozen_range() entirely and
> just use __free_contig_range_common() everywhere?
The idea of adding __free_contig_range_common() is to deduplicate code.
__free_contig_range() and __free_contig_frozen_range() improve readability,
since callers do not need to see the /* is_page_frozen= */ true/false
argument of __free_contig_range_common().
>>
>> As long as it's an internal helper, that makes sense. I wouldn't want to
>> expose the bool in the external interface.
>>
Like what David said above.
Best Regards,
Yan, Zi
end of thread, newest: 2026-03-25 19:53 UTC
Thread overview: 25+ messages
2026-03-24 13:35 [PATCH v3 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
2026-03-24 14:46 ` Zi Yan
2026-03-24 15:22 ` David Hildenbrand
2026-03-24 17:14 ` Zi Yan
2026-03-25 14:06 ` Muhammad Usama Anjum
2026-03-24 20:56 ` David Hildenbrand (Arm)
2026-03-25 14:11 ` Muhammad Usama Anjum
2026-03-24 13:35 ` [PATCH v3 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
2026-03-24 14:55 ` Zi Yan
2026-03-25 8:56 ` Uladzislau Rezki
2026-03-25 15:02 ` Muhammad Usama Anjum
2026-03-25 16:16 ` Uladzislau Rezki
2026-03-25 16:25 ` Muhammad Usama Anjum
2026-03-25 16:34 ` David Hildenbrand (Arm)
2026-03-25 16:49 ` Uladzislau Rezki
2026-03-25 14:34 ` Usama Anjum
2026-03-25 10:05 ` David Hildenbrand (Arm)
2026-03-25 14:26 ` Muhammad Usama Anjum
2026-03-25 15:01 ` David Hildenbrand (Arm)
2026-03-24 13:35 ` [PATCH v3 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2026-03-24 15:06 ` Zi Yan
2026-03-25 10:14 ` David Hildenbrand (Arm)
2026-03-25 16:03 ` Muhammad Usama Anjum
2026-03-25 19:52 ` Zi Yan