* [PATCH v2 0/3] mm: Free contiguous order-0 pages efficiently
@ 2026-03-16 11:31 Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-16 11:31 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Muhammad Usama Anjum
Hi All,
A recent change to vmalloc caused some performance benchmark regressions (see
[1]). I'm attempting to fix that (and at the same time improve significantly
beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
I also observed that free_contig_range() was essentially doing the same thing
as vfree(), so I've fixed it there too. While at it, optimize
__free_contig_frozen_range() as well.
[1] https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com
v6.18 - before the patch causing the regression was added
mm-new - the current latest mm code
this series - v2 of these patches
(>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement)
v6.18 vs mm-new
+-----------------+----------------------------------------------------------+-------------------+-------------+
| Benchmark | Result Class | v6.18 (base) | mm-new |
+=================+==========================================================+===================+=============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 653643.33 | (R) -50.92% |
| | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 366167.33 | (R) -11.96% |
| | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 489484.00 | (R) -35.21% |
| | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1011250.33 | (R) -36.45% |
| | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1086812.33 | (R) -31.83% |
| | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 657940.00 | (R) -38.62% |
| | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 765422.00 | (R) -24.84% |
| | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 2468585.00 | (R) -37.83% |
| | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 2815758.33 | (R) -26.32% |
| | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 4851969.00 | (R) -37.76% |
| | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 4496257.33 | (R) -31.15% |
| | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 570605.00 | -8.97% |
| | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 500866.00 | -5.88% |
| | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 499733.00 | -6.95% |
| | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 5266237.67 | (R) -40.19% |
| | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 490284.00 | -2.10% |
| | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 850986.33 | (R) -48.03% |
| | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 2712106.00 | (R) -40.48% |
| | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 111151.33 | 3.52% |
+-----------------+----------------------------------------------------------+-------------------+-------------+
v6.18 vs mm-new with patches
+-----------------+----------------------------------------------------------+-------------------+--------------+
| Benchmark | Result Class | v6.18 (base) | this series |
+=================+==========================================================+===================+==============+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 653643.33 | -14.02% |
| | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 366167.33 | -7.23% |
| | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 489484.00 | -1.57% |
| | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1011250.33 | 1.57% |
| | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1086812.33 | (I) 15.75% |
| | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 657940.00 | (I) 9.05% |
| | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 765422.00 | (I) 38.45% |
| | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 2468585.00 | (I) 12.56% |
| | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 2815758.33 | (I) 38.61% |
| | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 4851969.00 | (I) 13.43% |
| | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 4496257.33 | (I) 49.21% |
| | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 570605.00 | -8.47% |
| | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 500866.00 | -8.17% |
| | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 499733.00 | -5.54% |
| | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 5266237.67 | (I) 4.63% |
| | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 490284.00 | 1.53% |
| | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 850986.33 | -0.00% |
| | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 2712106.00 | 1.22% |
| | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 111151.33 | (I) 4.98% |
+-----------------+----------------------------------------------------------+-------------------+--------------+
mm-new vs vmalloc_2 results are in patch 2/3.
So this series mitigates the regression on average, with results ranging from
-14% to +49% relative to v6.18.
Thanks,
Muhammad Usama Anjum
---
Changes since v1:
- Update description
- Rebase on mm-new and rerun benchmarks/tests
- Patch 1: move FPI_PREPARED check and add todo
- Patch 2: Rework to cater for newer changes in vfree()
- New Patch 3: optimizes __free_contig_frozen_range()
Muhammad Usama Anjum (1):
mm/page_alloc: Optimize __free_contig_frozen_range()
Ryan Roberts (2):
mm/page_alloc: Optimize free_contig_range()
vmalloc: Optimize vfree
include/linux/gfp.h | 2 +
mm/page_alloc.c | 128 +++++++++++++++++++++++++++++++++++++++++---
mm/vmalloc.c | 34 ++++++++----
3 files changed, 149 insertions(+), 15 deletions(-)
--
2.47.3
^ permalink raw reply [flat|nested] 20+ messages in thread

* [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-16 11:31 [PATCH v2 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
@ 2026-03-16 11:31 ` Muhammad Usama Anjum
2026-03-16 15:21 ` Vlastimil Babka
2026-03-16 11:31 ` [PATCH v2 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-16 11:31 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Ryan Roberts, usama.anjum
From: Ryan Roberts <ryan.roberts@arm.com>
Decompose the range of order-0 pages to be freed into the set of largest
possible power-of-2 size and aligned chunks and free them to the pcp or
buddy. This improves on the previous approach which freed each order-0
page individually in a loop. Testing shows performance to be improved by
more than 10x in some cases.
Since each page is order-0, we must decrement each page's reference
count individually and only consider the page for freeing as part of a
high order chunk if the reference count goes to zero. Additionally
free_pages_prepare() must be called for each individual order-0 page
too, so that the struct page state and global accounting state can be
appropriately managed. But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.
This significantly speeds up the free operation but also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.
vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with
contiguous order-0 pages. These can now be freed much more efficiently.
The execution time of the following function was measured in a server
class arm64 machine:
static int page_alloc_high_order_test(void)
{
unsigned int order = HPAGE_PMD_ORDER;
struct page *page;
int i;
for (i = 0; i < 100000; i++) {
page = alloc_pages(GFP_KERNEL, order);
if (!page)
return -1;
split_page(page, order);
free_contig_range(page_to_pfn(page), 1UL << order);
}
return 0;
}
Execution time before: 4097358 usec
Execution time after: 729831 usec
Perf trace before:
99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
|
---kthread
0xffffb33c12a26af8
|
|--98.13%--0xffffb33c12a26060
| |
| |--97.37%--free_contig_range
| | |
| | |--94.93%--___free_pages
| | | |
| | | |--55.42%--__free_frozen_pages
| | | | |
| | | | --43.20%--free_frozen_page_commit
| | | | |
| | | | --35.37%--_raw_spin_unlock_irqrestore
| | | |
| | | |--11.53%--_raw_spin_trylock
| | | |
| | | |--8.19%--__preempt_count_dec_and_test
| | | |
| | | |--5.64%--_raw_spin_unlock
| | | |
| | | |--2.37%--__get_pfnblock_flags_mask.isra.0
| | | |
| | | --1.07%--free_frozen_page_commit
| | |
| | --1.54%--__free_frozen_pages
| |
| --0.77%--___free_pages
|
--0.98%--0xffffb33c12a26078
alloc_pages_noprof
Perf trace after:
8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
|
|--5.52%--__free_contig_range
| |
| |--5.00%--free_prepared_contig_range
| | |
| | |--1.43%--__free_frozen_pages
| | | |
| | | --0.51%--free_frozen_page_commit
| | |
| | |--1.08%--_raw_spin_trylock
| | |
| | --0.89%--_raw_spin_unlock
| |
| --0.52%--free_pages_prepare
|
--2.90%--ret_from_fork
kthread
0xffffae1c12abeaf8
0xffffae1c12abe7a0
|
--2.69%--vfree
__free_contig_range
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v1:
- Rebase on mm-new
- Move FPI_PREPARED check inside __free_pages_prepare() now that
fpi_flags are already being passed.
- Add todo (Zi Yan)
- Rerun benchmarks
- Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
- Rework order calculation in free_prepared_contig_range() and use
MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
be up to internal __free_frozen_pages() how it frees them
---
include/linux/gfp.h | 2 +
mm/page_alloc.c | 110 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 108 insertions(+), 4 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f82d74a77cad8..96ac7aae370c4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
#endif
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
+
DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
#endif /* __LINUX_GFP_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 75ee81445640b..6a9430f720579 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
/* Free the page without taking locks. Rely on trylock only. */
#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
+/*
+ * free_pages_prepare() has already been called for page(s) being freed.
+ * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
+ * (HWPoison, PageNetpp, bad free page).
+ */
+#define FPI_PREPARED ((__force fpi_t)BIT(3))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1310,6 +1317,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
bool compound = PageCompound(page);
struct folio *folio = page_folio(page);
+ if (fpi_flags & FPI_PREPARED)
+ return true;
+
VM_BUG_ON_PAGE(PageTail(page), page);
trace_mm_page_free(page, order);
@@ -1579,8 +1589,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
- if (__free_pages_prepare(page, order, fpi_flags))
- free_one_page(zone, page, pfn, order, fpi_flags);
+ if (!__free_pages_prepare(page, order, fpi_flags))
+ return;
+
+ free_one_page(zone, page, pfn, order, fpi_flags);
}
void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -6784,6 +6796,93 @@ void __init page_alloc_sysctl_init(void)
register_sysctl_init("vm", page_alloc_sysctl_table);
}
+static void free_prepared_contig_range(struct page *page,
+ unsigned long nr_pages)
+{
+ while (nr_pages) {
+ unsigned int order;
+ unsigned long pfn;
+
+ pfn = page_to_pfn(page);
+ /* We are limited by the largest buddy order. */
+ order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
+ /* Don't exceed the number of pages to free. */
+ order = min(order, ilog2(nr_pages));
+ order = min_t(unsigned int, order, MAX_PAGE_ORDER);
+
+ /*
+ * Free the chunk as a single block. Our caller has already
+ * called free_pages_prepare() for each order-0 page.
+ */
+ __free_frozen_pages(page, order, FPI_PREPARED);
+
+ page += 1UL << order;
+ nr_pages -= 1UL << order;
+ }
+}
+
+/**
+ * __free_contig_range - Free contiguous range of order-0 pages.
+ * @pfn: Page frame number of the first page in the range.
+ * @nr_pages: Number of pages to free.
+ *
+ * For each order-0 struct page in the physically contiguous range, put a
+ * reference. Free any page whose reference count falls to zero. The
+ * implementation is functionally equivalent to, but significantly faster than
+ * calling __free_page() for each struct page in a loop.
+ *
+ * Memory allocated with alloc_pages(order>=1) then subsequently split to
+ * order-0 with split_page() is an example of appropriate contiguous pages that
+ * can be freed with this API.
+ *
+ * Returns the number of pages which were not freed, because their reference
+ * count did not fall to zero.
+ *
+ * Context: May be called in interrupt context or while holding a normal
+ * spinlock, but not in NMI context or while holding a raw spinlock.
+ */
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
+{
+ struct page *page = pfn_to_page(pfn);
+ unsigned long not_freed = 0;
+ struct page *start = NULL;
+ unsigned long i;
+ bool can_free;
+
+ /*
+ * Chunk the range into contiguous runs of pages for which the refcount
+ * went to zero and for which free_pages_prepare() succeeded. If
+ * free_pages_prepare() fails we consider the page to have been freed;
+ * deliberately leak it.
+ *
+ * Code assumes contiguous PFNs have contiguous struct pages, but not
+ * vice versa.
+ */
+ for (i = 0; i < nr_pages; i++, page++) {
+ VM_WARN_ON_ONCE(PageHead(page));
+ VM_WARN_ON_ONCE(PageTail(page));
+
+ can_free = put_page_testzero(page);
+ if (!can_free)
+ not_freed++;
+ else if (!free_pages_prepare(page, 0))
+ can_free = false;
+
+ if (!can_free && start) {
+ free_prepared_contig_range(start, page - start);
+ start = NULL;
+ } else if (can_free && !start) {
+ start = page;
+ }
+ }
+
+ if (start)
+ free_prepared_contig_range(start, page - start);
+
+ return not_freed;
+}
+EXPORT_SYMBOL(__free_contig_range);
+
#ifdef CONFIG_CONTIG_ALLOC
/* Usage: See admin-guide/dynamic-debug-howto.rst */
static void alloc_contig_dump_pages(struct list_head *page_list)
@@ -7327,11 +7426,14 @@ EXPORT_SYMBOL(free_contig_frozen_range);
*/
void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
+ unsigned long count;
+
if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
return;
- for (; nr_pages--; pfn++)
- __free_page(pfn_to_page(pfn));
+ count = __free_contig_range(pfn, nr_pages);
+ WARN(count != 0, "%lu pages are still in use!\n", count);
+
}
EXPORT_SYMBOL(free_contig_range);
#endif /* CONFIG_CONTIG_ALLOC */
--
2.47.3
^ permalink raw reply related [flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-16 11:31 ` [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
@ 2026-03-16 15:21 ` Vlastimil Babka
2026-03-16 16:02 ` Zi Yan
2026-03-16 16:11 ` Muhammad Usama Anjum
0 siblings, 2 replies; 20+ messages in thread
From: Vlastimil Babka @ 2026-03-16 15:21 UTC (permalink / raw)
To: Muhammad Usama Anjum, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/16/26 12:31, Muhammad Usama Anjum wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
>
> Decompose the range of order-0 pages to be freed into the set of largest
> possible power-of-2 size and aligned chunks and free them to the pcp or
> buddy. This improves on the previous approach which freed each order-0
> page individually in a loop. Testing shows performance to be improved by
> more than 10x in some cases.
>
> Since each page is order-0, we must decrement each page's reference
> count individually and only consider the page for freeing as part of a
> high order chunk if the reference count goes to zero. Additionally
> free_pages_prepare() must be called for each individual order-0 page
> too, so that the struct page state and global accounting state can be
> appropriately managed. But once this is done, the resulting high order
> chunks can be freed as a unit to the pcp or buddy.
>
> This significantly speeds up the free operation but also has the side
> benefit that high order blocks are added to the pcp instead of each page
> ending up on the pcp order-0 list; memory remains more readily available
> in high orders.
>
> vmalloc will shortly become a user of this new optimized
> free_contig_range() since it aggressively allocates high order
> non-compound pages, but then calls split_page() to end up with
> contiguous order-0 pages. These can now be freed much more efficiently.
>
> The execution time of the following function was measured in a server
> class arm64 machine:
>
> static int page_alloc_high_order_test(void)
> {
> unsigned int order = HPAGE_PMD_ORDER;
> struct page *page;
> int i;
>
> for (i = 0; i < 100000; i++) {
> page = alloc_pages(GFP_KERNEL, order);
> if (!page)
> return -1;
> split_page(page, order);
> free_contig_range(page_to_pfn(page), 1UL << order);
> }
>
> return 0;
> }
>
> Execution time before: 4097358 usec
> Execution time after: 729831 usec
>
> Perf trace before:
>
> 99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
> |
> ---kthread
> 0xffffb33c12a26af8
> |
> |--98.13%--0xffffb33c12a26060
> | |
> | |--97.37%--free_contig_range
> | | |
> | | |--94.93%--___free_pages
> | | | |
> | | | |--55.42%--__free_frozen_pages
> | | | | |
> | | | | --43.20%--free_frozen_page_commit
> | | | | |
> | | | | --35.37%--_raw_spin_unlock_irqrestore
> | | | |
> | | | |--11.53%--_raw_spin_trylock
> | | | |
> | | | |--8.19%--__preempt_count_dec_and_test
> | | | |
> | | | |--5.64%--_raw_spin_unlock
> | | | |
> | | | |--2.37%--__get_pfnblock_flags_mask.isra.0
> | | | |
> | | | --1.07%--free_frozen_page_commit
> | | |
> | | --1.54%--__free_frozen_pages
> | |
> | --0.77%--___free_pages
> |
> --0.98%--0xffffb33c12a26078
> alloc_pages_noprof
>
> Perf trace after:
>
> 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
> |
> |--5.52%--__free_contig_range
> | |
> | |--5.00%--free_prepared_contig_range
> | | |
> | | |--1.43%--__free_frozen_pages
> | | | |
> | | | --0.51%--free_frozen_page_commit
> | | |
> | | |--1.08%--_raw_spin_trylock
> | | |
> | | --0.89%--_raw_spin_unlock
> | |
> | --0.52%--free_pages_prepare
> |
> --2.90%--ret_from_fork
> kthread
> 0xffffae1c12abeaf8
> 0xffffae1c12abe7a0
> |
> --2.69%--vfree
> __free_contig_range
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v1:
> - Rebase on mm-new
> - Move FPI_PREPARED check inside __free_pages_prepare() now that
> fpi_flags are already being passed.
> - Add todo (Zi Yan)
> - Rerun benchmarks
> - Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
> - Rework order calculation in free_prepared_contig_range() and use
> MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
> be up to internal __free_frozen_pages() how it frees them
> ---
> include/linux/gfp.h | 2 +
> mm/page_alloc.c | 110 ++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 108 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f82d74a77cad8..96ac7aae370c4 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
> #endif
>
> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
> +
> DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
>
> #endif /* __LINUX_GFP_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 75ee81445640b..6a9430f720579 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
> /* Free the page without taking locks. Rely on trylock only. */
> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>
> +/*
> + * free_pages_prepare() has already been called for page(s) being freed.
> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
> + * (HWPoison, PageNetpp, bad free page).
> + */
I'm confused, and reading the v1 thread didn't help either. Where would the
subpages to check come from? AFAICS we start from order-0 pages always.
__free_contig_range calls free_pages_prepare on every page with order 0
unconditionally, so we check every page as an order-0 page. If we then free
the bunch of individually checked pages as a high-order page, there's no
reason to check those subpages again, no? Am I missing something?
> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
> +
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1310,6 +1317,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
> bool compound = PageCompound(page);
> struct folio *folio = page_folio(page);
>
> + if (fpi_flags & FPI_PREPARED)
> + return true;
> +
> VM_BUG_ON_PAGE(PageTail(page), page);
>
> trace_mm_page_free(page, order);
> @@ -1579,8 +1589,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> unsigned long pfn = page_to_pfn(page);
> struct zone *zone = page_zone(page);
>
> - if (__free_pages_prepare(page, order, fpi_flags))
> - free_one_page(zone, page, pfn, order, fpi_flags);
> + if (!__free_pages_prepare(page, order, fpi_flags))
> + return;
> +
> + free_one_page(zone, page, pfn, order, fpi_flags);
This is not a functional change; can we drop it?
> }
>
> void __meminit __free_pages_core(struct page *page, unsigned int order,
> @@ -6784,6 +6796,93 @@ void __init page_alloc_sysctl_init(void)
> register_sysctl_init("vm", page_alloc_sysctl_table);
> }
>
> +static void free_prepared_contig_range(struct page *page,
> + unsigned long nr_pages)
> +{
> + while (nr_pages) {
> + unsigned int order;
> + unsigned long pfn;
> +
> + pfn = page_to_pfn(page);
> + /* We are limited by the largest buddy order. */
> + order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
> + /* Don't exceed the number of pages to free. */
> + order = min(order, ilog2(nr_pages));
> + order = min_t(unsigned int, order, MAX_PAGE_ORDER);
> +
> + /*
> + * Free the chunk as a single block. Our caller has already
> + * called free_pages_prepare() for each order-0 page.
> + */
> + __free_frozen_pages(page, order, FPI_PREPARED);
> +
> + page += 1UL << order;
> + nr_pages -= 1UL << order;
> + }
> +}
> +
> +/**
> + * __free_contig_range - Free contiguous range of order-0 pages.
> + * @pfn: Page frame number of the first page in the range.
> + * @nr_pages: Number of pages to free.
> + *
> + * For each order-0 struct page in the physically contiguous range, put a
> + * reference. Free any page whose reference count falls to zero. The
> + * implementation is functionally equivalent to, but significantly faster than
> + * calling __free_page() for each struct page in a loop.
> + *
> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
> + * order-0 with split_page() is an example of appropriate contiguous pages that
> + * can be freed with this API.
> + *
> + * Returns the number of pages which were not freed, because their reference
> + * count did not fall to zero.
We probably don't need this part.
> + *
> + * Context: May be called in interrupt context or while holding a normal
> + * spinlock, but not in NMI context or while holding a raw spinlock.
> + */
> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long not_freed = 0;
> + struct page *start = NULL;
> + unsigned long i;
> + bool can_free;
> +
> + /*
> + * Chunk the range into contiguous runs of pages for which the refcount
> + * went to zero and for which free_pages_prepare() succeeded. If
> + * free_pages_prepare() fails we consider the page to have been freed;
> + * deliberately leak it.
> + *
> + * Code assumes contiguous PFNs have contiguous struct pages, but not
> + * vice versa.
> + */
> + for (i = 0; i < nr_pages; i++, page++) {
> + VM_WARN_ON_ONCE(PageHead(page));
> + VM_WARN_ON_ONCE(PageTail(page));
> +
> + can_free = put_page_testzero(page);
> + if (!can_free)
> + not_freed++;
> + else if (!free_pages_prepare(page, 0))
> + can_free = false;
> +
> + if (!can_free && start) {
> + free_prepared_contig_range(start, page - start);
> + start = NULL;
> + } else if (can_free && !start) {
> + start = page;
> + }
> + }
> +
> + if (start)
> + free_prepared_contig_range(start, page - start);
> +
> + return not_freed;
> +}
> +EXPORT_SYMBOL(__free_contig_range);
> +
> #ifdef CONFIG_CONTIG_ALLOC
> /* Usage: See admin-guide/dynamic-debug-howto.rst */
> static void alloc_contig_dump_pages(struct list_head *page_list)
> @@ -7327,11 +7426,14 @@ EXPORT_SYMBOL(free_contig_frozen_range);
> */
> void free_contig_range(unsigned long pfn, unsigned long nr_pages)
> {
> + unsigned long count;
> +
> if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
> return;
>
> - for (; nr_pages--; pfn++)
> - __free_page(pfn_to_page(pfn));
> + count = __free_contig_range(pfn, nr_pages);
> + WARN(count != 0, "%lu pages are still in use!\n", count);
And we almost certainly don't want this warning. Spurious temporary page
refcount increases (get_page_unless_zero()) can happen e.g. due to memory
compaction pfn scanners. It just might mean that side will then be the last
one to drop the refcount, freeing the order-0 page. For us it only means
that we abort and restart the batching, so we get worse performance, but
functionally it's ok, and it should be very rare anyway.
> +
> }
> EXPORT_SYMBOL(free_contig_range);
> #endif /* CONFIG_CONTIG_ALLOC */
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-16 15:21 ` Vlastimil Babka
@ 2026-03-16 16:02 ` Zi Yan
2026-03-16 16:19 ` Vlastimil Babka (SUSE)
2026-03-16 16:11 ` Muhammad Usama Anjum
1 sibling, 1 reply; 20+ messages in thread
From: Zi Yan @ 2026-03-16 16:02 UTC (permalink / raw)
To: Vlastimil Babka, Muhammad Usama Anjum, Ryan.Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
> On 3/16/26 12:31, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Decompose the range of order-0 pages to be freed into the set of largest
>> possible power-of-2 size and aligned chunks and free them to the pcp or
>> buddy. This improves on the previous approach which freed each order-0
>> page individually in a loop. Testing shows performance to be improved by
>> more than 10x in some cases.
>>
>> Since each page is order-0, we must decrement each page's reference
>> count individually and only consider the page for freeing as part of a
>> high order chunk if the reference count goes to zero. Additionally
>> free_pages_prepare() must be called for each individual order-0 page
>> too, so that the struct page state and global accounting state can be
>> appropriately managed. But once this is done, the resulting high order
>> chunks can be freed as a unit to the pcp or buddy.
>>
>> This significantly speeds up the free operation but also has the side
>> benefit that high order blocks are added to the pcp instead of each page
>> ending up on the pcp order-0 list; memory remains more readily available
>> in high orders.
>>
>> vmalloc will shortly become a user of this new optimized
>> free_contig_range() since it aggressively allocates high order
>> non-compound pages, but then calls split_page() to end up with
>> contiguous order-0 pages. These can now be freed much more efficiently.
>>
>> The execution time of the following function was measured in a server
>> class arm64 machine:
>>
>> static int page_alloc_high_order_test(void)
>> {
>> unsigned int order = HPAGE_PMD_ORDER;
>> struct page *page;
>> int i;
>>
>> for (i = 0; i < 100000; i++) {
>> page = alloc_pages(GFP_KERNEL, order);
>> if (!page)
>> return -1;
>> split_page(page, order);
>> free_contig_range(page_to_pfn(page), 1UL << order);
>> }
>>
>> return 0;
>> }
>>
>> Execution time before: 4097358 usec
>> Execution time after: 729831 usec
>>
>> Perf trace before:
>>
>> 99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
>> |
>> ---kthread
>> 0xffffb33c12a26af8
>> |
>> |--98.13%--0xffffb33c12a26060
>> | |
>> | |--97.37%--free_contig_range
>> | | |
>> | | |--94.93%--___free_pages
>> | | | |
>> | | | |--55.42%--__free_frozen_pages
>> | | | | |
>> | | | | --43.20%--free_frozen_page_commit
>> | | | | |
>> | | | | --35.37%--_raw_spin_unlock_irqrestore
>> | | | |
>> | | | |--11.53%--_raw_spin_trylock
>> | | | |
>> | | | |--8.19%--__preempt_count_dec_and_test
>> | | | |
>> | | | |--5.64%--_raw_spin_unlock
>> | | | |
>> | | | |--2.37%--__get_pfnblock_flags_mask.isra.0
>> | | | |
>> | | | --1.07%--free_frozen_page_commit
>> | | |
>> | | --1.54%--__free_frozen_pages
>> | |
>> | --0.77%--___free_pages
>> |
>> --0.98%--0xffffb33c12a26078
>> alloc_pages_noprof
>>
>> Perf trace after:
>>
>> 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
>> |
>> |--5.52%--__free_contig_range
>> | |
>> | |--5.00%--free_prepared_contig_range
>> | | |
>> | | |--1.43%--__free_frozen_pages
>> | | | |
>> | | | --0.51%--free_frozen_page_commit
>> | | |
>> | | |--1.08%--_raw_spin_trylock
>> | | |
>> | | --0.89%--_raw_spin_unlock
>> | |
>> | --0.52%--free_pages_prepare
>> |
>> --2.90%--ret_from_fork
>> kthread
>> 0xffffae1c12abeaf8
>> 0xffffae1c12abe7a0
>> |
>> --2.69%--vfree
>> __free_contig_range
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v1:
>> - Rebase on mm-new
>> - Move FPI_PREPARED check inside __free_pages_prepare() now that
>> fpi_flags are already being passed.
>> - Add todo (Zi Yan)
>> - Rerun benchmarks
>> - Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
>> - Rework order calculation in free_prepared_contig_range() and use
>> MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
>> be up to internal __free_frozen_pages() how it frees them
>> ---
>> include/linux/gfp.h | 2 +
>> mm/page_alloc.c | 110 ++++++++++++++++++++++++++++++++++++++++++--
>> 2 files changed, 108 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index f82d74a77cad8..96ac7aae370c4 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> #endif
>>
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> +
>> DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
>>
>> #endif /* __LINUX_GFP_H */
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 75ee81445640b..6a9430f720579 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>> /* Free the page without taking locks. Rely on trylock only. */
>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>
>> +/*
>> + * free_pages_prepare() has already been called for page(s) being freed.
>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>> + * (HWPoison, PageNetpp, bad free page).
>> + */
>
> I'm confused, and reading the v1 thread didn't help either. Where would the
> subpages to check come from? AFAICS we start from order-0 pages always.
> __free_contig_range calls free_pages_prepare on every page with order 0
> unconditionally, so we check every page as an order-0 page. If we then free
> the bunch of individually checked pages as a high-order page, there's no
> reason to check those subpages again, no? Am I missing something?
There are two kinds of order > 0 pages, compound and not compound.
free_pages_prepare() checks all tail pages of a compound order > 0 page too.
For non-compound ones, free_pages_prepare() only has the free_page_is_bad()
check on the tail ones.
So my guess is that the TODO is to check all subpages on a non compound
order > 0 one in the same manner. This is based on the assumption that
all non compound order > 0 page users use split_page() after the allocation,
treat each page individually, and free them back altogether. But I am not
sure if this is true for all users allocating non compound order > 0 pages.
And free_pages_prepare_bulk() might be a better name for such functions.
The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
function to fuse back non-compound order > 0 pages and free the fused one
as we are currently doing. But that looks like a pain to implement. Maybe an
alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
subpages if FPI_FREE_BULK is set with
__free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
__free_pages_ok().
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-16 16:02 ` Zi Yan
@ 2026-03-16 16:19 ` Vlastimil Babka (SUSE)
2026-03-17 15:17 ` Zi Yan
0 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-16 16:19 UTC (permalink / raw)
To: Zi Yan, Muhammad Usama Anjum, Ryan.Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
On 3/16/26 17:02, Zi Yan wrote:
> On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
>
>>> +/*
>>> + * free_pages_prepare() has already been called for page(s) being freed.
>>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>>> + * (HWPoison, PageNetpp, bad free page).
>>> + */
>>
>> I'm confused, and reading the v1 thread didn't help either. Where would the
>> subpages to check come from? AFAICS we start from order-0 pages always.
>> __free_contig_range calls free_pages_prepare on every page with order 0
>> unconditionally, so we check every page as an order-0 page. If we then free
>> the bunch of individually checked pages as a high-order page, there's no
>> reason to check those subpages again, no? Am I missing something?
>
> There are two kinds of order > 0 pages, compound and not compound.
> free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
> For non compound ones, free_pages_prepare() only has free_page_is_bad()
> check on tail ones.
>
> So my guess is that the TODO is to check all subpages on a non compound
> order > 0 one in the same manner. This is based on the assumption that
OK but:
1) Why put that TODO specifically on FPI_PREPARED definition, which is for
the case we skip the prepare/check?
2) Why add it in this series which AFAICS doesn't handle non-compound
order>0 anywhere.
3) We'd better work on eliminating the non-compound order>0 usages
altogether, rather than work on supporting them better.
> all non compound order > 0 page users use split_page() after the allocation,
> treat each page individually, and free them back altogether. But I am not
> sure if this is true for all users allocating non compound order > 0 pages.
Maybe as part of the elimination (point 3 above) we should combine the
allocation+split so it's never the first without the second anymore.
> And free_pages_prepare_bulk() might be a better name for such functions.
>
> The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
> function to fuse back non-compound order > 0 pages and free the fused one
> as we are currently doing. But that looks like a pain to implement. Maybe an
Yeah not sure it's worth it either.
> alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
> subpages if FPI_FREE_BULK is set with
> __free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
> __free_pages_ok().
Hmm, maybe...
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-16 16:19 ` Vlastimil Babka (SUSE)
@ 2026-03-17 15:17 ` Zi Yan
2026-03-17 18:48 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 20+ messages in thread
From: Zi Yan @ 2026-03-17 15:17 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Muhammad Usama Anjum, Ryan.Roberts, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
On 16 Mar 2026, at 12:19, Vlastimil Babka (SUSE) wrote:
> On 3/16/26 17:02, Zi Yan wrote:
>> On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
>>
>>>> +/*
>>>> + * free_pages_prepare() has already been called for page(s) being freed.
>>>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>>>> + * (HWPoison, PageNetpp, bad free page).
>>>> + */
>>>
>>> I'm confused, and reading the v1 thread didn't help either. Where would the
>>> subpages to check come from? AFAICS we start from order-0 pages always.
>>> __free_contig_range calls free_pages_prepare on every page with order 0
>>> unconditionally, so we check every page as an order-0 page. If we then free
>>> the bunch of individually checked pages as a high-order page, there's no
>>> reason to check those subpages again, no? Am I missing something?
>>
>> There are two kinds of order > 0 pages, compound and not compound.
>> free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
>> For non compound ones, free_pages_prepare() only has free_page_is_bad()
>> check on tail ones.
>>
>> So my guess is that the TODO is to check all subpages on a non compound
>> order > 0 one in the same manner. This is based on the assumption that
>
> OK but:
>
> 1) Why put that TODO specifically on FPI_PREPARED definition, which is for
> the case we skip the prepare/check?
> 2) Why add it in this series which AFAICS doesn't handle non-compound
> order>0 anywhere.
> 3) We'd better work on eliminating the non-compound order>0 usages
> altogether, rather than work on supporting them better.
I agreed with you when I first saw this. After thinking about it again,
the issue might not be directly related to the allocation but to the free path.
Like the patch title says, it is an optimization of freeing contiguous pages.
These physically contiguous pages happen to come from non-compound order>0
allocations, and that leads to this optimization.
The problem they want to solve is to speed up the page free path by freeing
a group of pages together. They are optimizing for a special situation
where a group of pages is physically contiguous, so that these pages
can be freed via free_pages(page, order /* > 0 */). If we take away
the allocation of non-compound order>0, like you suggested in 3, we basically
remove the optimization opportunity from them. I am not sure that is what
people want.
To think about the problem broadly, how could we optimize a free_page_bulk(),
if it existed? By sorting input pages based on PFNs, so that we can free them in
high orders instead of as individual order-0s. This patch basically says:
hey, the group of pages we are freeing are all contiguous, since that is
how we allocated them, so freeing them as a whole is much quicker than freeing
them individually.
>
>> all non compound order > 0 page users use split_page() after the allocation,
>> treat each page individually, and free them back altogether. But I am not
>> sure if this is true for all users allocating non compound order > 0 pages.
>
> Maybe as part of the elimination (point 3 above) we should combine the
> allocation+split so it's never the first without the second anymore.
>
>> And free_pages_prepare_bulk() might be a better name for such functions.
>>
>> The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
>> function to fuse back non-compound order > 0 pages and free the fused one
>> as we are currently doing. But that looks like a pain to implement. Maybe an
>
> Yeah not sure it's worth it either.
>
>> alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
>> subpages if FPI_FREE_BULK is set with
>> __free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
>> __free_pages_ok().
>
> Hmm, maybe...
Let me know if my reasoning above moves your opinion on FPI_FREE_BULK towards
a positive direction. :)
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-17 15:17 ` Zi Yan
@ 2026-03-17 18:48 ` Vlastimil Babka (SUSE)
2026-03-19 22:07 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-17 18:48 UTC (permalink / raw)
To: Zi Yan
Cc: Muhammad Usama Anjum, Ryan.Roberts, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
On 3/17/26 16:17, Zi Yan wrote:
> On 16 Mar 2026, at 12:19, Vlastimil Babka (SUSE) wrote:
>
>> On 3/16/26 17:02, Zi Yan wrote:
>>> On 16 Mar 2026, at 11:21, Vlastimil Babka wrote:
>>>
>>>>> +/*
>>>>> + * free_pages_prepare() has already been called for page(s) being freed.
>>>>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>>>>> + * (HWPoison, PageNetpp, bad free page).
>>>>> + */
>>>>
>>>> I'm confused, and reading the v1 thread didn't help either. Where would the
>>>> subpages to check come from? AFAICS we start from order-0 pages always.
>>>> __free_contig_range calls free_pages_prepare on every page with order 0
>>>> unconditionally, so we check every page as an order-0 page. If we then free
>>>> the bunch of individually checked pages as a high-order page, there's no
>>>> reason to check those subpages again, no? Am I missing something?
>>>
>>> There are two kinds of order > 0 pages, compound and not compound.
>>> free_pages_prepare() checks all tail pages of a compound order > 0 pages too.
>>> For non compound ones, free_pages_prepare() only has free_page_is_bad()
>>> check on tail ones.
>>>
>>> So my guess is that the TODO is to check all subpages on a non compound
>>> order > 0 one in the same manner. This is based on the assumption that
>>
>> OK but:
>>
>> 1) Why put that TODO specifically on FPI_PREPARED definition, which is for
>> the case we skip the prepare/check?
>> 2) Why add it in this series which AFAICS doesn't handle non-compound
>> order>0 anywhere.
>> 3) We'd better work on eliminating the non-compound order>0 usages
>> altogether, rather than work on supporting them better.
>
> I agreed with you when I first saw this. After I think about it again,
> the issue might not be directly related to the allocation but is the free path.
> Like the patch title said, it is an optimization of free contiguous pages.
> These physically contiguous pages happen to come from alloc non-compound order>0
> and this leads to this optimization.
Sure and this use-case doesn't need the TODO to be solved, or am I mistaken?
That TODO seems to be about a hypothetical other use case with order>0
non-compound pages. Because AFAICS the use-cases in this series are not
about order>0 non-compound pages. Maybe they exist for a brief moment
between allocation and split_page() (in vmalloc() case?), but when we are
freeing them, we start with a contiguous series of order-0 pages (refcounted
or not).
So by my definition we are not freeing an order>0 non-compound page. By
"freeing order>0 non-compound page" I mean specifically what ___free_pages()
is handling in the "else if (!head) {" path, which I'd love to get rid of.
That TODO to me seems like about supporting that case.
> The problem they want to solve is to speed up page free path by freeing
> a group of pages together. They are optimizing for a special situation
> where a group of pages that are physically contiguous, so that these pages
> can be freed via free_pages(page, order /* > 0 */). If we take away
I don't think we want that, as it leads to the case I described above. It
assumes the head is refcounted and the tails are not. I'd rather not overload it
with a case where we have contiguous order-0 pages where each is refcounted (or
none are). Yeah, we can optimize the freeing like this series does, but I'd
not do it via something like "free_pages(page, order /* > 0 */)".
> the allocation of non-compound order>0, like you suggested in 3, we basically
I suggested we'd take it away in the sense of not producing order>0 where
head is refcounted, tails are not, and it's not a compound page. I'd rather
have an API that applies split_page() before and returns it as order-0
refcounted pages, but not the intermediate order>0 non-compound anymore.
> remove the optimization opportunity from them. I am not sure that is what
> people want.
>
> To think about the problem broadly, how can we optimize free_page_bulk(),
> if that exists? Sorting input pages based on PFNs, so that we can free them in
> high orders instead of individual order-0s. This patch basically says,
> hey, the group of pages we are freeing are all contiguous, since that is
> how we allocate them, freeing them as a whole is much quicker than freeing
> them individually.
Yes we can have generalized, perhaps stacked support for the cases used by
the converted callers in this series, but not using a generic API that would
try e.g. sorting pfns even when we know they are already sorted. That means:
- given as contiguous range, frozen (patch 3)
- given as contiguous range, not frozen (patch 1)
- probably contiguous, needs checking, given as array of pages (patch 2)
>>
>>> all non compound order > 0 page users use split_page() after the allocation,
>>> treat each page individually, and free them back altogether. But I am not
>>> sure if this is true for all users allocating non compound order > 0 pages.
>>
>> Maybe as part of the elimination (point 3 above) we should combine the
>> allocation+split so it's never the first without the second anymore.
I elaborated on this above.
>>> And free_pages_prepare_bulk() might be a better name for such functions.
>>>
>>> The above confusion is also a reason I asked Ryan to try adding an unsplit_page()
>>> function to fuse back non-compound order > 0 pages and free the fused one
>>> as we are currently doing. But that looks like a pain to implement. Maybe an
>>
>> Yeah not sure it's worth it either.
>>
>>> alternative to this FPI_PREPARED is to add FPI_FREE_BULK and loop through all
>>> subpages if FPI_FREE_BULK is set with
>>> __free_pages_prepare(page + i, 0, fpi_flags & ~FPI_FREE_BULK) in
>>> __free_pages_ok().
>>
>> Hmm, maybe...
>
> Let me know if my reasoning above moves your opinion on FPI_FREE_BULK towards
> a positive direction. :)
If you can make it work to support the three cases above, without doing
unnecessary work, and with no "free_pages(page, order /* > 0 */)" like API?
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-17 18:48 ` Vlastimil Babka (SUSE)
@ 2026-03-19 22:07 ` David Hildenbrand (Arm)
2026-03-20 8:20 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 22:07 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), Zi Yan
Cc: Muhammad Usama Anjum, Ryan.Roberts, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
>> the allocation of non-compound order>0, like you suggested in 3, we basically
>
> I suggested we'd take it away in the sense of not producing order>0 where
> head is refcounted, tails are not, and it's not a compound page. I'd rather
> have an API that applies split_page() before and returns it as order-0
> refcounted pages, but not the intermediate order>0 non-compound anymore.
Are you talking about external API or internal API?
Regarding external interface: I think the crucial part is that an
external interface (free_contig_range) should always get a range of
individual order-0 pages: neither compound nor non-compound order > 0.
The individual order-0 pages can either be frozen or refcounted
(depending on the interface).
Regarding internal interface: To me that implies that FPI_PREPARED will
never ever have to do any kind of "subpage" (page) free_pages_prepare()
checks. It must already have been performed on all order-0 pages.
So the TODO should indeed be dropped.
I'm not sure I understood whether you think using __free_frozen_pages()
with order > 0 is okay, or whether we need a different (internal)
interface.
--
Cheers,
David
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-19 22:07 ` David Hildenbrand (Arm)
@ 2026-03-20 8:20 ` Vlastimil Babka (SUSE)
2026-03-20 12:46 ` Zi Yan
0 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-20 8:20 UTC (permalink / raw)
To: David Hildenbrand (Arm), Zi Yan
Cc: Muhammad Usama Anjum, Ryan.Roberts, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
On 3/19/26 23:07, David Hildenbrand (Arm) wrote:
>>> the allocation of non-compound order>0, like you suggested in 3, we basically
>>
>> I suggested we'd take it away in the sense of not producing order>0 where
>> head is refcounted, tails are not, and it's not a compound page. I'd rather
>> have an API that applies split_page() before and returns it as order-0
>> refcounted pages, but not the intermediate order>0 non-compound anymore.
>
> Are you talking about external API or internal API?
In this case of alloc+split, external, and that would make sense to me.
In case of freeing, the current free_pages(order>0) is also external and I
would prefer not to augment it for this free_contig_range() usecase.
> Regarding external interface: I think the crucial part is that an
> external interface (free_contig_range) should always get a range of
> individual order-0 pages: neither compound nor non-compound order > 0.
Ack.
> The individual order-0 pages can either be frozen or refcounted
> (depending on the interface).
Ack.
> Regarding internal interface: To me that implies that FPI_PREPARED will
> never ever have to do any kind of "subpage" (page) free_pages_prepare()
> checks. It must already have been performed on all order-0 pages.
>
> So the TODO should indeed be dropped.
Agreed. But maybe I misunderstood Zi, so that's why I tried to add so much
detail about what I mean by what.
> I'm not sure I understood whether you think using the
> __free_frozen_pages() with order > 0 is okay, or whether we need a
> different (internal) interface.
I think this is fine. But I agree with you above that this assumes
FPI_PREPARED and will not have to deal with subpages.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-20 8:20 ` Vlastimil Babka (SUSE)
@ 2026-03-20 12:46 ` Zi Yan
0 siblings, 0 replies; 20+ messages in thread
From: Zi Yan @ 2026-03-20 12:46 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: David Hildenbrand (Arm), Muhammad Usama Anjum, Ryan.Roberts,
Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
david.hildenbrand
On 20 Mar 2026, at 4:20, Vlastimil Babka (SUSE) wrote:
> On 3/19/26 23:07, David Hildenbrand (Arm) wrote:
>>>> the allocation of non-compound order>0, like you suggested in 3, we basically
>>>
>>> I suggested we'd take it away in the sense of not producing order>0 where
>>> head is refcounted, tails are not, and it's not a compound page. I'd rather
>>> have an API that applies split_page() before and returns it as order-0
>>> refcounted pages, but not the intermediate order>0 non-compound anymore.
>>
>> Are you talking about external API or internal API?
>
> In this case of alloc+split, external, and that would make sense to me.
>
> In case of freeing, the current free_pages(order>0) is also external and I
> would prefer not to augment it for this free_contig_range() usecase.
>
>> Regarding external interface: I think the crucial part is that an
>> external interface (free_contig_range) should always get a range of
>> individual order-0 pages: neither compound nor non-compound order > 0.
>
> Ack.
>
>> The individual order-0 pages can either be frozen or refcounted
>> (depending on the interface).
>
> Ack.
>
>> Regarding internal interface: To me that implies that FPI_PREPARED will
>> never ever have to do any kind of "subpage" (page) free_pages_prepare()
>> checks. It must already have been performed on all order-0 pages.
>>
>> So the TODO should indeed be dropped.
>
> Agreed. But maybe I misunderstood Zi, so that's why I tried to add so much
> detail about what I mean by what.
Ack on dropping the TODO.
I was discussing whether we can have a better interface for freeing these
contiguous pages instead of FPI_PREPARED, since FPI_PREPARED adds another
form of freeing pages, where free_pages_prepare() has already been called on
all incoming pages. It might be a separate topic. I will think about it more
and come back later. Sorry for the confusion.
>
>> I'm not sure I understood whether you think using the
>> __free_frozen_pages() with order > 0 is okay, or whether we need a
>> different (internal) interface.
>
> I think this is fine. But I agree with you above that this assumes
> FPI_PREPARED and will not have to deal with subpages.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range()
2026-03-16 15:21 ` Vlastimil Babka
2026-03-16 16:02 ` Zi Yan
@ 2026-03-16 16:11 ` Muhammad Usama Anjum
1 sibling, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-16 16:11 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: usama.anjum
On 16/03/2026 3:21 pm, Vlastimil Babka wrote:
> On 3/16/26 12:31, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Decompose the range of order-0 pages to be freed into the set of largest
>> possible power-of-2 size and aligned chunks and free them to the pcp or
>> buddy. This improves on the previous approach which freed each order-0
>> page individually in a loop. Testing shows performance to be improved by
>> more than 10x in some cases.
>>
>> Since each page is order-0, we must decrement each page's reference
>> count individually and only consider the page for freeing as part of a
>> high order chunk if the reference count goes to zero. Additionally
>> free_pages_prepare() must be called for each individual order-0 page
>> too, so that the struct page state and global accounting state can be
>> appropriately managed. But once this is done, the resulting high order
>> chunks can be freed as a unit to the pcp or buddy.
>>
>> This significantly speeds up the free operation but also has the side
>> benefit that high order blocks are added to the pcp instead of each page
>> ending up on the pcp order-0 list; memory remains more readily available
>> in high orders.
>>
>> vmalloc will shortly become a user of this new optimized
>> free_contig_range() since it aggressively allocates high order
>> non-compound pages, but then calls split_page() to end up with
>> contiguous order-0 pages. These can now be freed much more efficiently.
>>
>> The execution time of the following function was measured in a server
>> class arm64 machine:
>>
>> static int page_alloc_high_order_test(void)
>> {
>> unsigned int order = HPAGE_PMD_ORDER;
>> struct page *page;
>> int i;
>>
>> for (i = 0; i < 100000; i++) {
>> page = alloc_pages(GFP_KERNEL, order);
>> if (!page)
>> return -1;
>> split_page(page, order);
>> free_contig_range(page_to_pfn(page), 1UL << order);
>> }
>>
>> return 0;
>> }
>>
>> Execution time before: 4097358 usec
>> Execution time after: 729831 usec
>>
>> Perf trace before:
>>
>> 99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
>> |
>> ---kthread
>> 0xffffb33c12a26af8
>> |
>> |--98.13%--0xffffb33c12a26060
>> | |
>> | |--97.37%--free_contig_range
>> | | |
>> | | |--94.93%--___free_pages
>> | | | |
>> | | | |--55.42%--__free_frozen_pages
>> | | | | |
>> | | | | --43.20%--free_frozen_page_commit
>> | | | | |
>> | | | | --35.37%--_raw_spin_unlock_irqrestore
>> | | | |
>> | | | |--11.53%--_raw_spin_trylock
>> | | | |
>> | | | |--8.19%--__preempt_count_dec_and_test
>> | | | |
>> | | | |--5.64%--_raw_spin_unlock
>> | | | |
>> | | | |--2.37%--__get_pfnblock_flags_mask.isra.0
>> | | | |
>> | | | --1.07%--free_frozen_page_commit
>> | | |
>> | | --1.54%--__free_frozen_pages
>> | |
>> | --0.77%--___free_pages
>> |
>> --0.98%--0xffffb33c12a26078
>> alloc_pages_noprof
>>
>> Perf trace after:
>>
>> 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
>> |
>> |--5.52%--__free_contig_range
>> | |
>> | |--5.00%--free_prepared_contig_range
>> | | |
>> | | |--1.43%--__free_frozen_pages
>> | | | |
>> | | | --0.51%--free_frozen_page_commit
>> | | |
>> | | |--1.08%--_raw_spin_trylock
>> | | |
>> | | --0.89%--_raw_spin_unlock
>> | |
>> | --0.52%--free_pages_prepare
>> |
>> --2.90%--ret_from_fork
>> kthread
>> 0xffffae1c12abeaf8
>> 0xffffae1c12abe7a0
>> |
>> --2.69%--vfree
>> __free_contig_range
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v1:
>> - Rebase on mm-new
>> - Move FPI_PREPARED check inside __free_pages_prepare() now that
>> fpi_flags are already being passed.
>> - Add todo (Zi Yan)
>> - Rerun benchmarks
>> - Convert VM_BUG_ON_PAGE() to VM_WARN_ON_ONCE()
>> - Rework order calculation in free_prepared_contig_range() and use
>> MAX_PAGE_ORDER as high limit instead of pageblock_order as it must
>> be up to internal __free_frozen_pages() how it frees them
>> ---
>> include/linux/gfp.h | 2 +
>> mm/page_alloc.c | 110 ++++++++++++++++++++++++++++++++++++++++++--
>> 2 files changed, 108 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index f82d74a77cad8..96ac7aae370c4 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> #endif
>>
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> +
>> DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
>>
>> #endif /* __LINUX_GFP_H */
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 75ee81445640b..6a9430f720579 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>> /* Free the page without taking locks. Rely on trylock only. */
>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>
>> +/*
>> + * free_pages_prepare() has already been called for page(s) being freed.
>> + * TODO: Perform per-subpage free_pages_prepare() checks for order > 0 pages
>> + * (HWPoison, PageNetpp, bad free page).
>> + */
>
> I'm confused, and reading the v1 thread didn't help either. Where would the
> subpages to check come from? AFAICS we start from order-0 pages always.
> __free_contig_range calls free_pages_prepare on every page with order 0
> unconditionally, so we check every page as an order-0 page. If we then free
> the bunch of individually checked pages as a high-order page, there's no
> reason to check those subpages again, no? Am I missing something?
Zi Yan replied in a separate thread. Let's continue this discussion there.
>
>> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
>> +
>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>> static DEFINE_MUTEX(pcp_batch_high_lock);
>> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>> @@ -1310,6 +1317,9 @@ __always_inline bool __free_pages_prepare(struct page *page,
>> bool compound = PageCompound(page);
>> struct folio *folio = page_folio(page);
>>
>> + if (fpi_flags & FPI_PREPARED)
>> + return true;
>> +
>> VM_BUG_ON_PAGE(PageTail(page), page);
>>
>> trace_mm_page_free(page, order);
>> @@ -1579,8 +1589,10 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>> unsigned long pfn = page_to_pfn(page);
>> struct zone *zone = page_zone(page);
>>
>> - if (__free_pages_prepare(page, order, fpi_flags))
>> - free_one_page(zone, page, pfn, order, fpi_flags);
>> + if (!__free_pages_prepare(page, order, fpi_flags))
>> + return;
>> +
>> + free_one_page(zone, page, pfn, order, fpi_flags);
>
> This is not a functional change, can we drop it?
Yes, I'll drop it in the next version.
>
>> }
>>
>> void __meminit __free_pages_core(struct page *page, unsigned int order,
>> @@ -6784,6 +6796,93 @@ void __init page_alloc_sysctl_init(void)
>> register_sysctl_init("vm", page_alloc_sysctl_table);
>> }
>>
>> +static void free_prepared_contig_range(struct page *page,
>> + unsigned long nr_pages)
>> +{
>> + while (nr_pages) {
>> + unsigned int order;
>> + unsigned long pfn;
>> +
>> + pfn = page_to_pfn(page);
>> + /* We are limited by the largest buddy order. */
>> + order = pfn ? __ffs(pfn) : MAX_PAGE_ORDER;
>> + /* Don't exceed the number of pages to free. */
>> + order = min(order, ilog2(nr_pages));
>> + order = min_t(unsigned int, order, MAX_PAGE_ORDER);
>> +
>> + /*
>> + * Free the chunk as a single block. Our caller has already
>> + * called free_pages_prepare() for each order-0 page.
>> + */
>> + __free_frozen_pages(page, order, FPI_PREPARED);
>> +
>> + page += 1UL << order;
>> + nr_pages -= 1UL << order;
>> + }
>> +}
>> +
>> +/**
>> + * __free_contig_range - Free contiguous range of order-0 pages.
>> + * @pfn: Page frame number of the first page in the range.
>> + * @nr_pages: Number of pages to free.
>> + *
>> + * For each order-0 struct page in the physically contiguous range, put a
>> + * reference. Free any page whose reference count falls to zero. The
>> + * implementation is functionally equivalent to, but significantly faster
>> + * than, calling __free_page() for each struct page in a loop.
>> + *
>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>> + * can be freed with this API.
>> + *
>> + * Returns the number of pages which were not freed, because their reference
>> + * count did not fall to zero.
>
> We probably don't need this part.
The only user of this return value is free_contig_range(). Your explanation
below makes sense. I'll drop the return value and clean up free_contig_range()
as well.
>
>> + *
>> + * Context: May be called in interrupt context or while holding a normal
>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>> + */
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + unsigned long not_freed = 0;
>> + struct page *start = NULL;
>> + unsigned long i;
>> + bool can_free;
>> +
>> + /*
>> + * Chunk the range into contiguous runs of pages for which the refcount
>> + * went to zero and for which free_pages_prepare() succeeded. If
>> + * free_pages_prepare() fails we consider the page to have been freed;
>> + * deliberately leak it.
>> + *
>> + * Code assumes contiguous PFNs have contiguous struct pages, but not
>> + * vice versa.
>> + */
>> + for (i = 0; i < nr_pages; i++, page++) {
>> + VM_WARN_ON_ONCE(PageHead(page));
>> + VM_WARN_ON_ONCE(PageTail(page));
>> +
>> + can_free = put_page_testzero(page);
>> + if (!can_free)
>> + not_freed++;
>> + else if (!free_pages_prepare(page, 0))
>> + can_free = false;
>> +
>> + if (!can_free && start) {
>> + free_prepared_contig_range(start, page - start);
>> + start = NULL;
>> + } else if (can_free && !start) {
>> + start = page;
>> + }
>> + }
>> +
>> + if (start)
>> + free_prepared_contig_range(start, page - start);
>> +
>> + return not_freed;
>> +}
>> +EXPORT_SYMBOL(__free_contig_range);
>> +
>> #ifdef CONFIG_CONTIG_ALLOC
>> /* Usage: See admin-guide/dynamic-debug-howto.rst */
>> static void alloc_contig_dump_pages(struct list_head *page_list)
>> @@ -7327,11 +7426,14 @@ EXPORT_SYMBOL(free_contig_frozen_range);
>> */
>> void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> {
>> + unsigned long count;
>> +
>> if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
>> return;
>>
>> - for (; nr_pages--; pfn++)
>> - __free_page(pfn_to_page(pfn));
>> + count = __free_contig_range(pfn, nr_pages);
>> + WARN(count != 0, "%lu pages are still in use!\n", count);
>
> And we almost certainly don't want this warning. Spurious temporary page
> refcount increases (get_page_unless_zero()) can happen e.g. due to memory
> compaction pfn scanners. It just might mean that that side will then be the
> last one to drop the refcount and free the order-0 page. For us it means only
> that we abort and restart the batching, so we get worse performance, but
> functionally it's ok, and should be very rare anyway.
>
>> +
>> }
>> EXPORT_SYMBOL(free_contig_range);
>> #endif /* CONFIG_CONTIG_ALLOC */
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 2/3] vmalloc: Optimize vfree
2026-03-16 11:31 [PATCH v2 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
@ 2026-03-16 11:31 ` Muhammad Usama Anjum
2026-03-16 15:49 ` Vlastimil Babka
2026-03-16 11:31 ` [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2 siblings, 1 reply; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-16 11:31 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Ryan Roberts, usama.anjum
From: Ryan Roberts <ryan.roberts@arm.com>
Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
must immediately split_page() to order-0 so that it remains compatible
with users that want to access the underlying struct page.
Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") recently made it much more likely for vmalloc to allocate
high order pages which are subsequently split to order-0.
Unfortunately this had the side effect of causing performance
regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
benchmarks). See Closes: tag. This happens because the high order pages
must be obtained from the buddy allocator, but because they are split to
order-0 they are freed back to the order-0 pcp. Previously the allocations
were order-0, so the pages were recycled from the pcp.
It would be preferable if when vmalloc allocates an (e.g.) order-3 page
that it also frees that order-3 page to the order-3 pcp, then the
regression could be removed.
So let's do exactly that; use the new __free_contig_range() API to
batch-free contiguous ranges of pfns. This not only removes the
regression, but significantly improves performance of vfree beyond the
baseline.
A selection of test_vmalloc benchmarks running on arm64 server class
system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
large order pages from buddy allocator") was added in v6.19-rc1 where we
see regressions. Then with this change performance is much better. (>0
is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):
+-----------------+----------------------------------------------------------+-------------------+--------------------+
| Benchmark | Result Class | mm-new | this series |
+=================+==========================================================+===================+====================+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
| | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
| | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
| | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
| | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
| | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
| | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
| | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
| | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
| | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
| | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
| | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
| | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
| | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
| | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
| | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
| | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
| | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
| | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
+-----------------+----------------------------------------------------------+-------------------+--------------------+
Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
Changes since v1:
- Rebase on mm-new
- Rerun benchmarks
---
mm/vmalloc.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c607307c657a6..8b935395fb068 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3459,18 +3459,34 @@ void vfree(const void *addr)
if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
vm_reset_perms(vm);
- for (i = 0; i < vm->nr_pages; i++) {
- struct page *page = vm->pages[i];
+
+ if (vm->nr_pages) {
+ bool account = !(vm->flags & VM_MAP_PUT_PAGES);
+ unsigned long start_pfn, pfn;
+ struct page *page = vm->pages[0];
+ int nr = 1;
BUG_ON(!page);
- /*
- * High-order allocs for huge vmallocs are split, so
- * can be freed as an array of order-0 allocations
- */
- if (!(vm->flags & VM_MAP_PUT_PAGES))
+ start_pfn = page_to_pfn(page);
+ if (account)
mod_lruvec_page_state(page, NR_VMALLOC, -1);
- __free_page(page);
- cond_resched();
+
+ for (i = 1; i < vm->nr_pages; i++) {
+ page = vm->pages[i];
+ BUG_ON(!page);
+ if (account)
+ mod_lruvec_page_state(page, NR_VMALLOC, -1);
+ pfn = page_to_pfn(page);
+ if (start_pfn + nr == pfn) {
+ nr++;
+ continue;
+ }
+ __free_contig_range(start_pfn, nr);
+ start_pfn = pfn;
+ nr = 1;
+ cond_resched();
+ }
+ __free_contig_range(start_pfn, nr);
}
kvfree(vm->pages);
kfree(vm);
--
2.47.3
^ permalink raw reply related [flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/3] vmalloc: Optimize vfree
2026-03-16 11:31 ` [PATCH v2 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
@ 2026-03-16 15:49 ` Vlastimil Babka
2026-03-17 9:36 ` Muhammad Usama Anjum
2026-03-20 8:39 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 20+ messages in thread
From: Vlastimil Babka @ 2026-03-16 15:49 UTC (permalink / raw)
To: Muhammad Usama Anjum, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/16/26 12:31, Muhammad Usama Anjum wrote:
> From: Ryan Roberts <ryan.roberts@arm.com>
>
> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
> must immediately split_page() to order-0 so that it remains compatible
> with users that want to access the underlying struct page.
> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
> allocator") recently made it much more likely for vmalloc to allocate
> high order pages which are subsequently split to order-0.
>
> Unfortunately this had the side effect of causing performance
> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
> benchmarks). See Closes: tag. This happens because the high order pages
> must be gotten from the buddy but then because they are split to
> order-0, when they are freed they are freed to the order-0 pcp.
> Previously allocation was for order-0 pages so they were recycled from
> the pcp.
>
> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
> that it also frees that order-3 page to the order-3 pcp, then the
> regression could be removed.
>
> So let's do exactly that; use the new __free_contig_range() API to
> batch-free contiguous ranges of pfns. This not only removes the
> regression, but significantly improves performance of vfree beyond the
> baseline.
>
> A selection of test_vmalloc benchmarks running on arm64 server class
> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
> large order pages from buddy allocator") was added in v6.19-rc1 where we
> see regressions. Then with this change performance is much better. (>0
> is faster, <0 is slower, (R)/(I) = statistically significant
> Regression/Improvement):
>
> +-----------------+----------------------------------------------------------+-------------------+--------------------+
> | Benchmark | Result Class | mm-new | this series |
> +=================+==========================================================+===================+====================+
> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>
> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v1:
> - Rebase on mm-new
> - Rerun benchmarks
> ---
> mm/vmalloc.c | 34 +++++++++++++++++++++++++---------
> 1 file changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c607307c657a6..8b935395fb068 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3459,18 +3459,34 @@ void vfree(const void *addr)
>
> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
> vm_reset_perms(vm);
> - for (i = 0; i < vm->nr_pages; i++) {
> - struct page *page = vm->pages[i];
> +
> + if (vm->nr_pages) {
> + bool account = !(vm->flags & VM_MAP_PUT_PAGES);
> + unsigned long start_pfn, pfn;
> + struct page *page = vm->pages[0];
> + int nr = 1;
>
> BUG_ON(!page);
> - /*
> - * High-order allocs for huge vmallocs are split, so
> - * can be freed as an array of order-0 allocations
> - */
> - if (!(vm->flags & VM_MAP_PUT_PAGES))
> + start_pfn = page_to_pfn(page);
> + if (account)
> mod_lruvec_page_state(page, NR_VMALLOC, -1);
> - __free_page(page);
> - cond_resched();
> +
> + for (i = 1; i < vm->nr_pages; i++) {
> + page = vm->pages[i];
> + BUG_ON(!page);
We shouldn't be adding BUG_ON()'s. Rather demote also the pre-existing one
to VM_WARN_ON_ONCE() and skip gracefully.
> + if (account)
> + mod_lruvec_page_state(page, NR_VMALLOC, -1);
I think we should be able to batch this too to use "nr"?
> + pfn = page_to_pfn(page);
> + if (start_pfn + nr == pfn) {
> + nr++;
> + continue;
> + }
> + __free_contig_range(start_pfn, nr);
> + start_pfn = pfn;
> + nr = 1;
> + cond_resched();
> + }
> + __free_contig_range(start_pfn, nr);
> }
> kvfree(vm->pages);
> kfree(vm);
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/3] vmalloc: Optimize vfree
2026-03-16 15:49 ` Vlastimil Babka
@ 2026-03-17 9:36 ` Muhammad Usama Anjum
2026-03-20 8:39 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-17 9:36 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: usama.anjum
On 16/03/2026 3:49 pm, Vlastimil Babka wrote:
> On 3/16/26 12:31, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
>> must immediately split_page() to order-0 so that it remains compatible
>> with users that want to access the underlying struct page.
>> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
>> allocator") recently made it much more likely for vmalloc to allocate
>> high order pages which are subsequently split to order-0.
>>
>> Unfortunately this had the side effect of causing performance
>> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
>> benchmarks). See Closes: tag. This happens because the high order pages
>> must be gotten from the buddy but then because they are split to
>> order-0, when they are freed they are freed to the order-0 pcp.
>> Previously allocation was for order-0 pages so they were recycled from
>> the pcp.
>>
>> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
>> that it also frees that order-3 page to the order-3 pcp, then the
>> regression could be removed.
>>
>> So let's do exactly that; use the new __free_contig_range() API to
>> batch-free contiguous ranges of pfns. This not only removes the
>> regression, but significantly improves performance of vfree beyond the
>> baseline.
>>
>> A selection of test_vmalloc benchmarks running on arm64 server class
>> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
>> large order pages from buddy allocator") was added in v6.19-rc1 where we
>> see regressions. Then with this change performance is much better. (>0
>> is faster, <0 is slower, (R)/(I) = statistically significant
>> Regression/Improvement):
>>
>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>> | Benchmark | Result Class | mm-new | this series |
>> +=================+==========================================================+===================+====================+
>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
>> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
>> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
>> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
>> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
>> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
>> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
>> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
>> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
>> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
>> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
>> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
>> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
>> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
>> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
>> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
>> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
>> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
>> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>>
>> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
>> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v1:
>> - Rebase on mm-new
>> - Rerun benchmarks
>> ---
>> mm/vmalloc.c | 34 +++++++++++++++++++++++++---------
>> 1 file changed, 25 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index c607307c657a6..8b935395fb068 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3459,18 +3459,34 @@ void vfree(const void *addr)
>>
>> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
>> vm_reset_perms(vm);
>> - for (i = 0; i < vm->nr_pages; i++) {
>> - struct page *page = vm->pages[i];
>> +
>> + if (vm->nr_pages) {
>> + bool account = !(vm->flags & VM_MAP_PUT_PAGES);
>> + unsigned long start_pfn, pfn;
>> + struct page *page = vm->pages[0];
>> + int nr = 1;
>>
>> BUG_ON(!page);
>> - /*
>> - * High-order allocs for huge vmallocs are split, so
>> - * can be freed as an array of order-0 allocations
>> - */
>> - if (!(vm->flags & VM_MAP_PUT_PAGES))
>> + start_pfn = page_to_pfn(page);
>> + if (account)
>> mod_lruvec_page_state(page, NR_VMALLOC, -1);
>> - __free_page(page);
>> - cond_resched();
>> +
>> + for (i = 1; i < vm->nr_pages; i++) {
>> + page = vm->pages[i];
>> + BUG_ON(!page);
>
> We shouldn't be adding BUG_ON()'s. Rather demote also the pre-existing one
> to VM_WARN_ON_ONCE() and skip gracefully.
Sure, I'll replace it with WARN_ON_ONCE() instead, which also returns the
condition result for easier skip logic.
>
>> + if (account)
>> + mod_lruvec_page_state(page, NR_VMALLOC, -1);
>
> I think we should be able to batch this too to use "nr"?
Yes, I'll update in the next version.
>
>> + pfn = page_to_pfn(page);
>> + if (start_pfn + nr == pfn) {
>> + nr++;
>> + continue;
>> + }
>> + __free_contig_range(start_pfn, nr);
>> + start_pfn = pfn;
>> + nr = 1;
>> + cond_resched();
>> + }
>> + __free_contig_range(start_pfn, nr);
>> }
>> kvfree(vm->pages);
>> kfree(vm);
>
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH v2 2/3] vmalloc: Optimize vfree
2026-03-16 15:49 ` Vlastimil Babka
2026-03-17 9:36 ` Muhammad Usama Anjum
@ 2026-03-20 8:39 ` David Hildenbrand (Arm)
2026-03-20 14:33 ` Vlastimil Babka (SUSE)
1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-20 8:39 UTC (permalink / raw)
To: Vlastimil Babka, Muhammad Usama Anjum, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/16/26 16:49, Vlastimil Babka wrote:
> On 3/16/26 12:31, Muhammad Usama Anjum wrote:
>> From: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
>> must immediately split_page() to order-0 so that it remains compatible
>> with users that want to access the underlying struct page.
>> Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
>> allocator") recently made it much more likely for vmalloc to allocate
>> high order pages which are subsequently split to order-0.
>>
>> Unfortunately this had the side effect of causing performance
>> regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
>> benchmarks). See Closes: tag. This happens because the high order pages
>> must be gotten from the buddy but then because they are split to
>> order-0, when they are freed they are freed to the order-0 pcp.
>> Previously allocation was for order-0 pages so they were recycled from
>> the pcp.
>>
>> It would be preferable if when vmalloc allocates an (e.g.) order-3 page
>> that it also frees that order-3 page to the order-3 pcp, then the
>> regression could be removed.
>>
>> So let's do exactly that; use the new __free_contig_range() API to
>> batch-free contiguous ranges of pfns. This not only removes the
>> regression, but significantly improves performance of vfree beyond the
>> baseline.
>>
>> A selection of test_vmalloc benchmarks running on arm64 server class
>> system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request
>> large order pages from buddy allocator") was added in v6.19-rc1 where we
>> see regressions. Then with this change performance is much better. (>0
>> is faster, <0 is slower, (R)/(I) = statistically significant
>> Regression/Improvement):
>>
>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>> | Benchmark | Result Class | mm-new | this series |
>> +=================+==========================================================+===================+====================+
>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 1331843.33 | (I) 67.17% |
>> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 415907.33 | -5.14% |
>> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 755448.00 | (I) 53.55% |
>> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 1591331.33 | (I) 57.26% |
>> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 1594345.67 | (I) 68.46% |
>> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 1071826.00 | (I) 79.27% |
>> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 1018385.00 | (I) 84.17% |
>> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 3970899.67 | (I) 77.01% |
>> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 3821788.67 | (I) 89.44% |
>> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 7795968.00 | (I) 82.67% |
>> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 6530169.67 | (I) 118.09% |
>> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 626808.33 | -0.98% |
>> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 532145.67 | -1.68% |
>> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 537032.67 | -0.96% |
>> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 8805069.00 | (I) 74.58% |
>> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 500824.67 | 4.35% |
>> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 1637554.67 | (I) 76.99% |
>> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 4556288.67 | (I) 72.23% |
>> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 107371.00 | -0.70% |
>> +-----------------+----------------------------------------------------------+-------------------+--------------------+
>>
>> Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
>> Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
>> ---
>> Changes since v1:
>> - Rebase on mm-new
>> - Rerun benchmarks
>> ---
>> mm/vmalloc.c | 34 +++++++++++++++++++++++++---------
>> 1 file changed, 25 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index c607307c657a6..8b935395fb068 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3459,18 +3459,34 @@ void vfree(const void *addr)
>>
>> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
>> vm_reset_perms(vm);
>> - for (i = 0; i < vm->nr_pages; i++) {
>> - struct page *page = vm->pages[i];
>> +
>> + if (vm->nr_pages) {
>> + bool account = !(vm->flags & VM_MAP_PUT_PAGES);
>> + unsigned long start_pfn, pfn;
>> + struct page *page = vm->pages[0];
>> + int nr = 1;
>>
>> BUG_ON(!page);
>> - /*
>> - * High-order allocs for huge vmallocs are split, so
>> - * can be freed as an array of order-0 allocations
>> - */
>> - if (!(vm->flags & VM_MAP_PUT_PAGES))
>> + start_pfn = page_to_pfn(page);
>> + if (account)
>> mod_lruvec_page_state(page, NR_VMALLOC, -1);
>> - __free_page(page);
>> - cond_resched();
>> +
>> + for (i = 1; i < vm->nr_pages; i++) {
>> + page = vm->pages[i];
>> + BUG_ON(!page);
>
> We shouldn't be adding BUG_ON()'s. Rather demote also the pre-existing one
> to VM_WARN_ON_ONCE() and skip gracefully.
>
>> + if (account)
>> + mod_lruvec_page_state(page, NR_VMALLOC, -1);
>
> I think we should be able to batch this too to use "nr"?
Are we sure that pages cannot cross nodes etc? It could happen that we
have a contig range that spans zones/nodes/etc ...
Anyhow, should we try to decouple both things, providing a
core-mm function to do the page freeing?
We do have something similar, optimized unpinning of large folios,
in unpin_user_pages_dirty_lock(). This here is a bit different.
So what I am thinking about for this code here to do:
if (!(vm->flags & VM_MAP_PUT_PAGES)) {
for (i = 0; i < vm->nr_pages; i++)
mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
}
free_pages_bulk(vm->pages, vm->nr_pages);
We could optimize the first loop to do batching where possible as well.
free_pages_bulk() would match alloc_pages_bulk()
void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
Internally we'd do the contig handling.
Was that already discussed?
--
Cheers,
David
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/3] vmalloc: Optimize vfree
2026-03-20 8:39 ` David Hildenbrand (Arm)
@ 2026-03-20 14:33 ` Vlastimil Babka (SUSE)
2026-03-23 11:28 ` Muhammad Usama Anjum
0 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-20 14:33 UTC (permalink / raw)
To: David Hildenbrand (Arm), Muhammad Usama Anjum, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/20/26 09:39, David Hildenbrand (Arm) wrote:
> On 3/16/26 16:49, Vlastimil Babka wrote:
>>> mm/vmalloc.c | 34 +++++++++++++++++++++++++---------
>>> 1 file changed, 25 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index c607307c657a6..8b935395fb068 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -3459,18 +3459,34 @@ void vfree(const void *addr)
>>>
>>> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
>>> vm_reset_perms(vm);
>>> - for (i = 0; i < vm->nr_pages; i++) {
>>> - struct page *page = vm->pages[i];
>>> +
>>> + if (vm->nr_pages) {
>>> + bool account = !(vm->flags & VM_MAP_PUT_PAGES);
>>> + unsigned long start_pfn, pfn;
>>> + struct page *page = vm->pages[0];
>>> + int nr = 1;
>>>
>>> BUG_ON(!page);
>>> - /*
>>> - * High-order allocs for huge vmallocs are split, so
>>> - * can be freed as an array of order-0 allocations
>>> - */
>>> - if (!(vm->flags & VM_MAP_PUT_PAGES))
>>> + start_pfn = page_to_pfn(page);
>>> + if (account)
>>> mod_lruvec_page_state(page, NR_VMALLOC, -1);
>>> - __free_page(page);
>>> - cond_resched();
>>> +
>>> + for (i = 1; i < vm->nr_pages; i++) {
>>> + page = vm->pages[i];
>>> + BUG_ON(!page);
>>
>> We shouldn't be adding BUG_ON()'s. Rather demote also the pre-existing one
>> to VM_WARN_ON_ONCE() and skip gracefully.
>>
>>> + if (account)
>>> + mod_lruvec_page_state(page, NR_VMALLOC, -1);
>>
>> I think we should be able to batch this too to use "nr"?
>
> Are we sure that pages cannot cross nodes etc? It could happen that we
> have a contig range that spans zones/nodes/etc ...
Hmm, a single order-3 allocation can't, but we could be unlucky and get the
last order-3 from zone X and the first order-3 from adjacent zone Y.
In that case the loop would need to also check for same zone/node.
> Anyhow, should we try to decouple both things, providing a
> core-mm function to do the page freeing?
>
> We do have something similar, optimized unpinning of large folios,
> in unpin_user_pages_dirty_lock(). This here is a bit different.
>
>
> So what I am thinking about for this code here to do:
>
> if (!(vm->flags & VM_MAP_PUT_PAGES)) {
> for (i = 0; i < vm->nr_pages; i++)
> mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
> }
> free_pages_bulk(vm->pages, vm->nr_pages);
>
>
> We could optimize the first loop to do batching where possible as well.
>
>
> free_pages_bulk() would match alloc_pages_bulk()
>
> void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>
> Internally we'd do the contig handling.
>
> Was that already discussed?
AFAIU some of Zi's replies hinted at this direction. It would make sense, yeah.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 2/3] vmalloc: Optimize vfree
2026-03-20 14:33 ` Vlastimil Babka (SUSE)
@ 2026-03-23 11:28 ` Muhammad Usama Anjum
0 siblings, 0 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-23 11:28 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), David Hildenbrand (Arm)
Cc: usama.anjum, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 20/03/2026 2:33 pm, Vlastimil Babka (SUSE) wrote:
> On 3/20/26 09:39, David Hildenbrand (Arm) wrote:
>> On 3/16/26 16:49, Vlastimil Babka wrote:
>>>> mm/vmalloc.c | 34 +++++++++++++++++++++++++---------
>>>> 1 file changed, 25 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index c607307c657a6..8b935395fb068 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -3459,18 +3459,34 @@ void vfree(const void *addr)
>>>>
>>>> if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
>>>> vm_reset_perms(vm);
>>>> - for (i = 0; i < vm->nr_pages; i++) {
>>>> - struct page *page = vm->pages[i];
>>>> +
>>>> + if (vm->nr_pages) {
>>>> + bool account = !(vm->flags & VM_MAP_PUT_PAGES);
>>>> + unsigned long start_pfn, pfn;
>>>> + struct page *page = vm->pages[0];
>>>> + int nr = 1;
>>>>
>>>> BUG_ON(!page);
>>>> - /*
>>>> - * High-order allocs for huge vmallocs are split, so
>>>> - * can be freed as an array of order-0 allocations
>>>> - */
>>>> - if (!(vm->flags & VM_MAP_PUT_PAGES))
>>>> + start_pfn = page_to_pfn(page);
>>>> + if (account)
>>>> mod_lruvec_page_state(page, NR_VMALLOC, -1);
>>>> - __free_page(page);
>>>> - cond_resched();
>>>> +
>>>> + for (i = 1; i < vm->nr_pages; i++) {
>>>> + page = vm->pages[i];
>>>> + BUG_ON(!page);
>>>
>>> We shouldn't be adding BUG_ON()'s. Rather demote also the pre-existing one
>>> to VM_WARN_ON_ONCE() and skip gracefully.
>>>
>>>> + if (account)
>>>> + mod_lruvec_page_state(page, NR_VMALLOC, -1);
>>>
>>> I think we should be able to batch this too to use "nr"?
>>
>> Are we sure that pages cannot cross nodes etc? It could happen that we
>> have a contig range that spans zones/nodes/etc ...
>
> Hmm, a single order-3 allocation can't, but we could be unlucky and get the
> last order-3 from zone X and the first order-3 from adjacent zone Y.
> In that case the loop would need to also check for same zone/node.
>
>> Anyhow, should we try to decouple both things, providing a
>> core-mm function to do the page freeing?
>>
>> We do have something similar, optimized unpinning of large folios,
>> in unpin_user_pages_dirty_lock(). This here is a bit different.
>>
>>
>> So what I am thinking about for this code here to do:
>>
>> if (!(vm->flags & VM_MAP_PUT_PAGES)) {
>> for (i = 0; i < vm->nr_pages; i++)
>> mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
>> }
>> free_pages_bulk(vm->pages, vm->nr_pages);
>>
>>
>> We could optimize the first loop to do batching where possible as well.
>>
>>
>> free_pages_bulk() would match alloc_pages_bulk()
>>
>> void free_pages_bulk(struct page **page_array, unsigned long nr_pages)
>>
>> Internally we'd do the contig handling.
>>
>> Was that already discussed?
>
> AFAIU some of Zi's replies hinted at this direction. It would make sense, yeah.
I'm updating and will send the next version.
Thanks,
Usama
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-16 11:31 [PATCH v2 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
@ 2026-03-16 11:31 ` Muhammad Usama Anjum
2026-03-16 16:22 ` Vlastimil Babka
2026-03-20 14:26 ` Zi Yan
2 siblings, 2 replies; 20+ messages in thread
From: Muhammad Usama Anjum @ 2026-03-16 11:31 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
Cc: Muhammad Usama Anjum
Apply the same batch-freeing optimization from free_contig_range() to the
frozen page path. The previous __free_contig_frozen_range() freed each
order-0 page individually via free_frozen_pages(), which is slow for the
same reason the old free_contig_range() was: each page goes to the
order-0 pcp list rather than being coalesced into higher-order blocks.
Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
each order-0 page, then batch the prepared pages into the largest
possible power-of-2 aligned chunks via free_prepared_contig_range().
If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
deliberately not freed; it should not be returned to the allocator.
I've tested CMA through debugfs. The test allocates 16384 pages per
allocation for several iterations. There is a 3.5x improvement.
Before: 1406 usec per iteration
After: 402 usec per iteration
Before:
70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
|
|--70.20%--free_contig_frozen_range
| |
| |--46.41%--__free_frozen_pages
| | |
| | --36.18%--free_frozen_page_commit
| | |
| | --29.63%--_raw_spin_unlock_irqrestore
| |
| |--8.76%--_raw_spin_trylock
| |
| |--7.03%--__preempt_count_dec_and_test
| |
| |--4.57%--_raw_spin_unlock
| |
| |--1.96%--__get_pfnblock_flags_mask.isra.0
| |
| --1.15%--free_frozen_page_commit
|
--0.69%--el0t_64_sync
After:
23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
|
---free_contig_frozen_range
|
|--20.45%--__free_contig_frozen_range
| |
| |--17.77%--free_pages_prepare
| |
| --0.72%--free_prepared_contig_range
| |
| --0.55%--__free_frozen_pages
|
--3.12%--free_pages_prepare
Suggested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
---
mm/page_alloc.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6a9430f720579..2e99fa85cdc8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7020,8 +7020,22 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
{
- for (; nr_pages--; pfn++)
- free_frozen_pages(pfn_to_page(pfn), 0);
+ struct page *page = pfn_to_page(pfn);
+ struct page *start = NULL;
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++, page++) {
+ if (free_pages_prepare(page, 0)) {
+ if (!start)
+ start = page;
+ } else if (start) {
+ free_prepared_contig_range(start, page - start);
+ start = NULL;
+ }
+ }
+
+ if (start)
+ free_prepared_contig_range(start, page - start);
}
/**
--
2.47.3
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-16 11:31 ` [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
@ 2026-03-16 16:22 ` Vlastimil Babka
2026-03-20 14:26 ` Zi Yan
1 sibling, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2026-03-16 16:22 UTC (permalink / raw)
To: Muhammad Usama Anjum, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Nick Terrell,
David Sterba, Vishal Moola (Oracle), linux-mm, linux-kernel, bpf,
Ryan.Roberts, david.hildenbrand
On 3/16/26 12:31, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
>
> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
> each order-0 page, then batch the prepared pages into the largest
> possible power-of-2 aligned chunks via free_prepared_contig_range().
> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
> deliberately not freed; it should not be returned to the allocator.
>
> I've tested CMA through debugfs. The test allocates 16384 pages per
> allocation for several iterations. There is a 3.5x improvement.
>
> Before: 1406 usec per iteration
> After: 402 usec per iteration
>
> Before:
>
> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> |--70.20%--free_contig_frozen_range
> | |
> | |--46.41%--__free_frozen_pages
> | | |
> | | --36.18%--free_frozen_page_commit
> | | |
> | | --29.63%--_raw_spin_unlock_irqrestore
> | |
> | |--8.76%--_raw_spin_trylock
> | |
> | |--7.03%--__preempt_count_dec_and_test
> | |
> | |--4.57%--_raw_spin_unlock
> | |
> | |--1.96%--__get_pfnblock_flags_mask.isra.0
> | |
> | --1.15%--free_frozen_page_commit
> |
> --0.69%--el0t_64_sync
>
> After:
>
> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> ---free_contig_frozen_range
> |
> |--20.45%--__free_contig_frozen_range
> | |
> | |--17.77%--free_pages_prepare
> | |
> | --0.72%--free_prepared_contig_range
> | |
> | --0.55%--__free_frozen_pages
> |
> --3.12%--free_pages_prepare
>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
LGTM.
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> mm/page_alloc.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6a9430f720579..2e99fa85cdc8e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7020,8 +7020,22 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>
> static void __free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
> {
> - for (; nr_pages--; pfn++)
> - free_frozen_pages(pfn_to_page(pfn), 0);
> + struct page *page = pfn_to_page(pfn);
> + struct page *start = NULL;
> + unsigned long i;
> +
> + for (i = 0; i < nr_pages; i++, page++) {
> + if (free_pages_prepare(page, 0)) {
> + if (!start)
> + start = page;
> + } else if (start) {
> + free_prepared_contig_range(start, page - start);
> + start = NULL;
> + }
> + }
> +
> + if (start)
> + free_prepared_contig_range(start, page - start);
> }
>
> /**
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range()
2026-03-16 11:31 ` [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2026-03-16 16:22 ` Vlastimil Babka
@ 2026-03-20 14:26 ` Zi Yan
1 sibling, 0 replies; 20+ messages in thread
From: Zi Yan @ 2026-03-20 14:26 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Nick Terrell, David Sterba,
Vishal Moola (Oracle), linux-mm, linux-kernel, bpf, Ryan.Roberts,
david.hildenbrand
On 16 Mar 2026, at 7:31, Muhammad Usama Anjum wrote:
> Apply the same batch-freeing optimization from free_contig_range() to the
> frozen page path. The previous __free_contig_frozen_range() freed each
> order-0 page individually via free_frozen_pages(), which is slow for the
> same reason the old free_contig_range() was: each page goes to the
> order-0 pcp list rather than being coalesced into higher-order blocks.
>
> Rewrite __free_contig_frozen_range() to call free_pages_prepare() for
> each order-0 page, then batch the prepared pages into the largest
> possible power-of-2 aligned chunks via free_prepared_contig_range().
> If free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
> deliberately not freed; it should not be returned to the allocator.
>
> I've tested CMA through debugfs. The test allocates 16384 pages per
> allocation for several iterations. There is a 3.5x improvement.
>
> Before: 1406 usec per iteration
> After: 402 usec per iteration
>
> Before:
>
> 70.89% 0.69% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> |--70.20%--free_contig_frozen_range
> | |
> | |--46.41%--__free_frozen_pages
> | | |
> | | --36.18%--free_frozen_page_commit
> | | |
> | | --29.63%--_raw_spin_unlock_irqrestore
> | |
> | |--8.76%--_raw_spin_trylock
> | |
> | |--7.03%--__preempt_count_dec_and_test
> | |
> | |--4.57%--_raw_spin_unlock
> | |
> | |--1.96%--__get_pfnblock_flags_mask.isra.0
> | |
> | --1.15%--free_frozen_page_commit
> |
> --0.69%--el0t_64_sync
>
> After:
>
> 23.57% 0.00% cma [kernel.kallsyms] [.] free_contig_frozen_range
> |
> ---free_contig_frozen_range
> |
> |--20.45%--__free_contig_frozen_range
> | |
> | |--17.77%--free_pages_prepare
> | |
> | --0.72%--free_prepared_contig_range
> | |
> | --0.55%--__free_frozen_pages
> |
> --3.12%--free_pages_prepare
>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> mm/page_alloc.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
LGTM.
Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2026-03-23 11:29 UTC | newest]
Thread overview: 20+ messages
2026-03-16 11:31 [PATCH v2 0/3] mm: Free contiguous order-0 pages efficiently Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 1/3] mm/page_alloc: Optimize free_contig_range() Muhammad Usama Anjum
2026-03-16 15:21 ` Vlastimil Babka
2026-03-16 16:02 ` Zi Yan
2026-03-16 16:19 ` Vlastimil Babka (SUSE)
2026-03-17 15:17 ` Zi Yan
2026-03-17 18:48 ` Vlastimil Babka (SUSE)
2026-03-19 22:07 ` David Hildenbrand (Arm)
2026-03-20 8:20 ` Vlastimil Babka (SUSE)
2026-03-20 12:46 ` Zi Yan
2026-03-16 16:11 ` Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 2/3] vmalloc: Optimize vfree Muhammad Usama Anjum
2026-03-16 15:49 ` Vlastimil Babka
2026-03-17 9:36 ` Muhammad Usama Anjum
2026-03-20 8:39 ` David Hildenbrand (Arm)
2026-03-20 14:33 ` Vlastimil Babka (SUSE)
2026-03-23 11:28 ` Muhammad Usama Anjum
2026-03-16 11:31 ` [PATCH v2 3/3] mm/page_alloc: Optimize __free_contig_frozen_range() Muhammad Usama Anjum
2026-03-16 16:22 ` Vlastimil Babka
2026-03-20 14:26 ` Zi Yan