[PATCH v4 0/2] mm: improve folio refcount scalability

The Linux Kernel Mailing List

* [PATCH v4 0/2] mm: improve folio refcount scalability
@ 2026-06-08 21:53 Gladyshev Ilya
  2026-06-08 21:54 ` [PATCH v4 1/2] mm: drop page refcount zero state semantics ilya.gladyshev
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Gladyshev Ilya @ 2026-06-08 21:53 UTC (permalink / raw)
  To: ivgorbunov, Liam.Howlett, akpm, apopple, artem.kuzin, baolin.wang,
	david, foxido, harry.yoo, linux-kernel, linux-mm, lorenzo.stoakes,
	mhocko, muchun.song, rppt, surenb, torvalds, vbabka, willy,
	yuzhao, ziy, pfalcato, kirill

This is v4 of the series, fixing some dumb mistakes from v3:
- Fix asserts that were never firing
- Rename set_page_count_as_frozen -> set_page_count_frozen
  (
   I don't really like the proposed "init" in the function name.
   For consistency, we could rename init_page_count -> set_page_count_init().
   However, if anyone insists, I will use the proposed init_...() naming.
  )
- Set proper frozen value in the second patch
- Use VM_BUG_ON_PAGE instead of VM_BUG_ON

Original cover letter posted below:

Intro
=====
This patch optimizes small file read performance and overall folio refcount
scalability by refactoring page_ref_add_unless [core of folio_try_get].
This is an alternative approach to previous attempts to fix small read
performance by avoiding refcount bumps [1][2].

Overview
========
The current refcount implementation uses a zero counter as the locked
(dead/frozen) state, which requires a CAS loop for increments to avoid
temporary unlocks in the try_get functions. These CAS loops become a
serialization point for an otherwise scalable and fast read side.

The proposed implementation separates the "locked" logic from the counting,
allowing the use of an optimistic fetch_add() instead of CAS. For more
details, please refer to the commit message of the patch itself.
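
To make the contrast concrete, here is a minimal userspace sketch using C11
atomics. It is not the kernel code: the model_* names and MODEL_FROZEN_BIT
are hypothetical stand-ins, and memory ordering is left at the C11 default.
It only illustrates why the old scheme needs a CAS loop while the proposed
one can add first and check afterwards.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's page_ref API. */
#define MODEL_FROZEN_BIT 0x80000000u

/* Old scheme: 0 means frozen, so an increment must never start from 0. */
static bool model_add_unless_zero(atomic_uint *ref, unsigned int nr)
{
	unsigned int old = atomic_load(ref);

	do {
		if (old == 0)
			return false;	/* frozen: leave the counter alone */
		/* On failure 'old' is reloaded and the zero check repeats. */
	} while (!atomic_compare_exchange_weak(ref, &old, old + nr));

	return true;
}

/* New scheme: add unconditionally, then inspect the frozen bit. */
static bool model_add_unless_frozen(atomic_uint *ref, unsigned int nr)
{
	unsigned int val = atomic_fetch_add(ref, nr) + nr;

	return !(val & MODEL_FROZEN_BIT);
}
```

In this simplified model a failed model_add_unless_frozen() leaves its
increment in the low bits of a frozen counter; the patch itself bounds that
drift, which this sketch omits.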

The proposed logic maintains the same public API as before, including all
existing memory barrier guarantees.

Performance
===========
Performance was measured using a simple custom benchmark based on
will-it-scale[3]. This benchmark spawns N pinned threads/processes that
execute the following loop:
``
char buf[64];
int fd = open(/* same file in tmpfs */, O_RDONLY);

while (true) {
    pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
}
``
While this is a synthetic load, it does highlight an existing issue and
doesn't differ much from the benchmarking in patch [2].
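
For readers who want to reproduce something similar, the per-worker loop
could be sketched as below. This is an assumption-laden illustration, not
the actual benchmark: small_read_loop() is a hypothetical helper, and the
thread spawning, CPU pinning and timing done by will-it-scale are left out.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical worker body: issue 'iters' small reads at offset 0 and
 * return how many of them returned exactly 'size' bytes.
 */
static long small_read_loop(int fd, size_t size, long iters)
{
	char buf[64];
	long done = 0;

	/* The benchmark reads 64 bytes; never read past the buffer. */
	if (size > sizeof(buf))
		size = sizeof(buf);

	for (long i = 0; i < iters; i++) {
		if (pread(fd, buf, size, 0) == (ssize_t)size)
			done++;
	}
	return done;
}
```

Each worker would call this on a descriptor for the same tmpfs file, so
every read takes and drops a reference on the same folio.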

This benchmark measures operations per second in the inner loop and reports
the results across all workers. Performance was tested on top of the v6.15 kernel
on two platforms. Since threads and processes showed similar performance on
both systems, only the thread results are provided below. The performance
improvement scales linearly between the CPU counts shown.

Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]

#threads | vanilla | patched | boost (%)
       1 | 1343381 | 1344401 |  +0.1
       2 | 2186160 | 2455837 | +12.3
       5 | 5277092 | 6108030 | +15.7
      10 | 5858123 | 7506328 | +28.1
      12 | 6484445 | 8137706 | +25.5
         /* Cross socket NUMA */
      14 | 3145860 | 4247391 | +35.0
      16 | 2350840 | 4262707 | +81.3
      18 | 2378825 | 4121415 | +73.2
      20 | 2438475 | 4683548 | +92.1
      24 | 2325998 | 4529737 | +94.7

Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]

#threads | vanilla | patched | boost (%)
       1 | 1077276 | 1081653 |  +0.4
       5 | 4286838 | 4682513 |  +9.2
      10 | 1698095 | 1902753 | +12.1
      20 | 1662266 | 1921603 | +15.6
      49 | 1486745 | 1828926 | +23.0
      97 | 1617365 | 2052635 | +26.9
         /* Cross socket NUMA */
     105 | 1368319 | 1798862 | +31.5
     136 | 1008071 | 1393055 | +38.2
     168 |  879332 | 1245210 | +41.6
               /* SMT */
     193 |  905432 | 1294833 | +43.0
     289 |  851988 | 1313110 | +54.1
     353 |  771288 | 1347165 | +74.7
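
The "boost (%)" column appears consistent with the usual relative-gain
formula, (patched / vanilla - 1) * 100. A tiny helper (boost_percent() is a
hypothetical name) makes the arithmetic explicit:

```c
/* Relative throughput gain in percent, as in the "boost (%)" column. */
static double boost_percent(double vanilla, double patched)
{
	return (patched / vanilla - 1.0) * 100.0;
}
```

For the 16-thread row on Platform 1, boost_percent(2350840, 4262707) is
roughly 81.3, matching the table.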

[0]: https://lore.kernel.org/lkml/cover.1776350895.git.gorbunov.ivan@h-partners.com/
[1]: https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
[2]: https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
[3]: https://github.com/antonblanchard/will-it-scale

---

Link to v3: https://lore.kernel.org/linux-mm/5dabf3a748fee0c7b142c74367e7586f5db1ed1e@linux.dev/

Gladyshev Ilya (1):
  mm: implement page refcount locking via dedicated bit

Gorbunov Ivan (1):
  mm: drop page refcount zero state semantics

 drivers/pci/p2pdma.c               |  4 +-
 include/linux/mm.h                 |  2 +-
 include/linux/page-flags.h         | 13 +++++++
 include/linux/page_ref.h           | 62 +++++++++++++++++++++++++-----
 kernel/liveupdate/kexec_handover.c |  6 +--
 lib/test_hmm.c                     |  4 +-
 mm/hugetlb.c                       |  2 +-
 mm/internal.h                      |  2 +-
 mm/memremap.c                      |  4 +-
 mm/mm_init.c                       |  6 +--
 mm/page_alloc.c                    |  4 +-
 11 files changed, 82 insertions(+), 27 deletions(-)


base-commit: 2d3090a8aeb596a26935db0955d46c9a5db5c6ce
-- 
2.54.0


* [PATCH v4 1/2] mm: drop page refcount zero state semantics
  2026-06-08 21:53 [PATCH v4 0/2] mm: improve folio refcount scalability Gladyshev Ilya
@ 2026-06-08 21:54 ` ilya.gladyshev
  2026-06-08 21:54 ` [PATCH v4 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
  2026-06-08 22:47 ` [PATCH v4 0/2] mm: improve folio refcount scalability Andrew Morton
  2 siblings, 0 replies; 4+ messages in thread
From: ilya.gladyshev @ 2026-06-08 21:54 UTC (permalink / raw)
  To: ivgorbunov, Liam.Howlett, akpm, apopple, artem.kuzin, baolin.wang,
	david, foxido, harry.yoo, linux-kernel, linux-mm, lorenzo.stoakes,
	mhocko, muchun.song, rppt, surenb, torvalds, vbabka, willy,
	yuzhao, ziy, pfalcato, kirill

From: Gorbunov Ivan <ivgorbunov@me.com>

Some call sites manipulate the page refcount directly via
set_page_count() instead of using a more specific API such as
set_page_count_frozen() / init_page_count().

This conflicts with the next patch, which will stop treating a zeroed
refcount as the indicator of a frozen page. To prepare for that change,
this patch:

- "Deprecates" the internal assumption that a frozen page has refcount=0
(and vice versa). Callers of page_ref_count() still see 0 for frozen
pages.

- Inserts VM_BUG_ON_PAGE() / VM_BUG_ON_FOLIO() checks in every refcount API
  function to catch erroneous behaviour such as the following:

page = alloc_frozen_page() // page is frozen
page_ref_inc(page) // BUG: increment on a frozen page instead of init

- Renames _unless_zero() functions into _unless_frozen()
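
The intent of the new helpers and checks can be modelled in userspace as
follows. This is only a sketch: struct model_page and the model_* functions
are hypothetical, and where the kernel would trip VM_BUG_ON_PAGE() and then
still perform the increment, the model simply refuses and returns false.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page and its _refcount field. */
struct model_page {
	atomic_int refcount;
};

/* In this patch the frozen encoding is still "count == 0". */
static bool model_count_is_frozen(int count)
{
	return count == 0;
}

static void model_set_count_frozen(struct model_page *p)
{
	atomic_store(&p->refcount, 0);
}

static void model_init_count(struct model_page *p)
{
	atomic_store(&p->refcount, 1);
}

/*
 * Returns false where the kernel version would trip VM_BUG_ON_PAGE():
 * a plain increment must never be the first reference on a frozen page.
 */
static bool model_ref_inc_checked(struct model_page *p)
{
	if (model_count_is_frozen(atomic_load(&p->refcount)))
		return false;
	atomic_fetch_add(&p->refcount, 1);
	return true;
}
```

Because callers only go through model_set_count_frozen() and
model_init_count(), the frozen encoding can later change inside
model_count_is_frozen() without touching any call site.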

Reviewed-by: Artem Kuzin <artem.kuzin@huawei.com>
Co-developed-by: Gladyshev Ilya <ilya.gladyshev@linux.dev>
Signed-off-by: Gladyshev Ilya <ilya.gladyshev@linux.dev>
Signed-off-by: Gorbunov Ivan <ivgorbunov@me.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # p2pdma.c
---
 drivers/pci/p2pdma.c               |  4 ++--
 include/linux/mm.h                 |  2 +-
 include/linux/page_ref.h           | 36 ++++++++++++++++++++++++------
 kernel/liveupdate/kexec_handover.c |  6 ++---
 lib/test_hmm.c                     |  4 ++--
 mm/hugetlb.c                       |  2 +-
 mm/internal.h                      |  2 +-
 mm/memremap.c                      |  4 ++--
 mm/mm_init.c                       |  6 ++---
 mm/page_alloc.c                    |  4 ++--
 10 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 7c898542af8d..43ed40a6183b 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -148,7 +148,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		 * using it.
 		 */
 		VM_WARN_ON_ONCE_PAGE(page_ref_count(page), page);
-		set_page_count(page, 1);
+		init_page_count(page);
 		ret = vm_insert_page(vma, vaddr, page);
 		if (ret) {
 			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
@@ -158,7 +158,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 			 * because we don't want to trigger the
 			 * p2pdma_folio_free() path.
 			 */
-			set_page_count(page, 0);
+			set_page_count_frozen(page);
 			percpu_ref_put(ref);
 			return ret;
 		}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fc2acedf0b76..91482c868f66 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1780,7 +1780,7 @@ static inline int folio_put_testzero(struct folio *folio)
  */
 static inline bool get_page_unless_zero(struct page *page)
 {
-	return page_ref_add_unless_zero(page, 1);
+	return page_ref_add_unless_frozen(page, 1);
 }
 
 static inline struct folio *folio_get_nontail_page(struct page *page)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 94d3f0e71c06..f784db6f775a 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -62,6 +62,16 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
 
 #endif
 
+static inline bool __page_count_is_frozen(int count)
+{
+	return count == 0;
+}
+
+static inline bool __page_is_frozen(const struct page *page)
+{
+	return __page_count_is_frozen(atomic_read(&page->_refcount));
+}
+
 static inline int page_ref_count(const struct page *page)
 {
 	return atomic_read(&page->_refcount);
@@ -101,9 +111,9 @@ static inline void set_page_count(struct page *page, int v)
 		__page_ref_set(page, v);
 }
 
-static inline void folio_set_count(struct folio *folio, int v)
+static inline void folio_init_count(struct folio *folio)
 {
-	set_page_count(&folio->page, v);
+	set_page_count(&folio->page, 1);
 }
 
 /*
@@ -115,8 +125,14 @@ static inline void init_page_count(struct page *page)
 	set_page_count(page, 1);
 }
 
+static inline void set_page_count_frozen(struct page *page)
+{
+	set_page_count(page, 0);
+}
+
 static inline void page_ref_add(struct page *page, int nr)
 {
+	VM_BUG_ON_PAGE(__page_is_frozen(page), page);
 	atomic_add(nr, &page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, nr);
@@ -129,6 +145,7 @@ static inline void folio_ref_add(struct folio *folio, int nr)
 
 static inline void page_ref_sub(struct page *page, int nr)
 {
+	VM_BUG_ON_PAGE(__page_is_frozen(page), page);
 	atomic_sub(nr, &page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, -nr);
@@ -142,6 +159,7 @@ static inline void folio_ref_sub(struct folio *folio, int nr)
 static inline int folio_ref_sub_return(struct folio *folio, int nr)
 {
 	int ret = atomic_sub_return(nr, &folio->_refcount);
+	VM_BUG_ON_FOLIO(__page_count_is_frozen(ret + nr), folio);
 
 	if (page_ref_tracepoint_active(page_ref_mod_and_return))
 		__page_ref_mod_and_return(&folio->page, -nr, ret);
@@ -150,6 +168,7 @@ static inline int folio_ref_sub_return(struct folio *folio, int nr)
 
 static inline void page_ref_inc(struct page *page)
 {
+	VM_BUG_ON_PAGE(__page_is_frozen(page), page);
 	atomic_inc(&page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, 1);
@@ -162,6 +181,7 @@ static inline void folio_ref_inc(struct folio *folio)
 
 static inline void page_ref_dec(struct page *page)
 {
+	VM_BUG_ON_PAGE(__page_is_frozen(page), page);
 	atomic_dec(&page->_refcount);
 	if (page_ref_tracepoint_active(page_ref_mod))
 		__page_ref_mod(page, -1);
@@ -189,6 +209,7 @@ static inline int folio_ref_sub_and_test(struct folio *folio, int nr)
 static inline int page_ref_inc_return(struct page *page)
 {
 	int ret = atomic_inc_return(&page->_refcount);
+	VM_BUG_ON_PAGE(__page_count_is_frozen(ret - 1), page);
 
 	if (page_ref_tracepoint_active(page_ref_mod_and_return))
 		__page_ref_mod_and_return(page, 1, ret);
@@ -217,6 +238,7 @@ static inline int folio_ref_dec_and_test(struct folio *folio)
 static inline int page_ref_dec_return(struct page *page)
 {
 	int ret = atomic_dec_return(&page->_refcount);
+	VM_BUG_ON_PAGE(__page_count_is_frozen(ret + 1), page);
 
 	if (page_ref_tracepoint_active(page_ref_mod_and_return))
 		__page_ref_mod_and_return(page, -1, ret);
@@ -228,7 +250,7 @@ static inline int folio_ref_dec_return(struct folio *folio)
 	return page_ref_dec_return(&folio->page);
 }
 
-static inline bool page_ref_add_unless_zero(struct page *page, int nr)
+static inline bool page_ref_add_unless_frozen(struct page *page, int nr)
 {
 	bool ret = atomic_add_unless(&page->_refcount, nr, 0);
 
@@ -237,9 +259,9 @@ static inline bool page_ref_add_unless_zero(struct page *page, int nr)
 	return ret;
 }
 
-static inline bool folio_ref_add_unless_zero(struct folio *folio, int nr)
+static inline bool folio_ref_add_unless_frozen(struct folio *folio, int nr)
 {
-	return page_ref_add_unless_zero(&folio->page, nr);
+	return page_ref_add_unless_frozen(&folio->page, nr);
 }
 
 /**
@@ -255,12 +277,12 @@ static inline bool folio_ref_add_unless_zero(struct folio *folio, int nr)
  */
 static inline bool folio_try_get(struct folio *folio)
 {
-	return folio_ref_add_unless_zero(folio, 1);
+	return folio_ref_add_unless_frozen(folio, 1);
 }
 
 static inline bool folio_ref_try_add(struct folio *folio, int count)
 {
-	return folio_ref_add_unless_zero(folio, count);
+	return folio_ref_add_unless_frozen(folio, count);
 }
 
 static inline int page_ref_freeze(struct page *page, int count)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 1b592d86dc48..d436f6d6913f 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -361,7 +361,7 @@ EXPORT_SYMBOL_GPL(kho_radix_walk_tree);
 static void kho_init_pages(struct page *page, unsigned long nr_pages)
 {
 	for (unsigned long i = 0; i < nr_pages; i++) {
-		set_page_count(page + i, 1);
+		init_page_count(page + i);
 		/* Clear each page's codetag to avoid accounting mismatch. */
 		clear_page_tag_ref(page + i);
 	}
@@ -372,13 +372,13 @@ static void kho_init_folio(struct page *page, unsigned int order)
 	unsigned long nr_pages = (1 << order);
 
 	/* Head page gets refcount of 1. */
-	set_page_count(page, 1);
+	init_page_count(page);
 	/* Clear head page's codetag to avoid accounting mismatch. */
 	clear_page_tag_ref(page);
 
 	/* For higher order folios, tail pages get a page count of zero. */
 	for (unsigned long i = 1; i < nr_pages; i++)
-		set_page_count(page + i, 0);
+		set_page_count_frozen(page + i);
 
 	if (order > 0)
 		prep_compound_page(page, order);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 213504915737..0cbcf9da4911 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1715,7 +1715,7 @@ static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
 	if (tail == NULL) {
 		folio_reset_order(rfolio);
 		rfolio->mapping = NULL;
-		folio_set_count(rfolio, 1);
+		folio_init_count(rfolio);
 		return;
 	}
 
@@ -1729,7 +1729,7 @@ static void dmirror_devmem_folio_split(struct folio *head, struct folio *tail)
 
 	folio_page(tail, 0)->mapping = folio_page(head, 0)->mapping;
 	tail->pgmap = head->pgmap;
-	folio_set_count(page_folio(rpage_tail), 1);
+	folio_init_count(page_folio(rpage_tail));
 }
 
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c921287489de..f2fec6b1b1df 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3133,7 +3133,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
 		__init_single_page(page, pfn, zone, nid);
 		prep_compound_tail(page, &folio->page, order);
-		set_page_count(page, 0);
+		set_page_count_frozen(page);
 	}
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..3f2a91de8a80 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -595,7 +595,7 @@ static inline void set_page_refcounted(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	VM_BUG_ON_PAGE(page_ref_count(page), page);
-	set_page_count(page, 1);
+	init_page_count(page);
 }
 
 static inline void set_pages_refcounted(struct page *page, unsigned long nr_pages)
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..8025cc27b408 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -462,7 +462,7 @@ void free_zone_device_folio(struct folio *folio)
 		 * Reset the refcount to 1 to prepare for handing out the page
 		 * again.
 		 */
-		folio_set_count(folio, 1);
+		folio_init_count(folio);
 		break;
 
 	case MEMORY_DEVICE_FS_DAX:
@@ -519,7 +519,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
 	 * memunmap_pages().
 	 */
 	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
-	set_page_count(page, 1);
+	init_page_count(page);
 	lock_page(page);
 
 	if (order)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..96fcace24b6d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1040,7 +1040,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_COHERENT:
 	case MEMORY_DEVICE_PCI_P2PDMA:
-		set_page_count(page, 0);
+		set_page_count_frozen(page);
 		break;
 
 	case MEMORY_DEVICE_GENERIC:
@@ -1086,7 +1086,7 @@ static void __ref memmap_init_compound(struct page *head,
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 		prep_compound_tail(page, head, order);
-		set_page_count(page, 0);
+		set_page_count_frozen(page);
 	}
 	prep_compound_head(head, order);
 }
@@ -2224,7 +2224,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
 
 	do {
 		__ClearPageReserved(p);
-		set_page_count(p, 0);
+		set_page_count_frozen(p);
 	} while (++p, --i);
 
 	init_pageblock_migratetype(page, MIGRATE_CMA, false);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..730dc6301a07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1599,14 +1599,14 @@ void __meminit __free_pages_core(struct page *page, unsigned int order,
 		for (loop = 0; loop < nr_pages; loop++, p++) {
 			VM_WARN_ON_ONCE(PageReserved(p));
 			__ClearPageOffline(p);
-			set_page_count(p, 0);
+			set_page_count_frozen(p);
 		}
 
 		adjust_managed_page_count(page, nr_pages);
 	} else {
 		for (loop = 0; loop < nr_pages; loop++, p++) {
 			__ClearPageReserved(p);
-			set_page_count(p, 0);
+			set_page_count_frozen(p);
 		}
 
 		/* memblock adjusts totalram_pages() manually. */
-- 
2.54.0


* [PATCH v4 2/2] mm: implement page refcount locking via dedicated bit
  2026-06-08 21:53 [PATCH v4 0/2] mm: improve folio refcount scalability Gladyshev Ilya
  2026-06-08 21:54 ` [PATCH v4 1/2] mm: drop page refcount zero state semantics ilya.gladyshev
@ 2026-06-08 21:54 ` Gladyshev Ilya
  2026-06-08 22:47 ` [PATCH v4 0/2] mm: improve folio refcount scalability Andrew Morton
  2 siblings, 0 replies; 4+ messages in thread
From: Gladyshev Ilya @ 2026-06-08 21:54 UTC (permalink / raw)
  To: ivgorbunov, Liam.Howlett, akpm, apopple, artem.kuzin, baolin.wang,
	david, foxido, harry.yoo, linux-kernel, linux-mm, lorenzo.stoakes,
	mhocko, muchun.song, rppt, surenb, torvalds, vbabka, willy,
	yuzhao, ziy, pfalcato, kirill

The current atomic-based page refcount implementation treats a zero
counter as dead and requires a compare-and-swap loop in folio_try_get()
to prevent incrementing a dead refcount. This CAS loop acts as a
serialization point and can become a significant bottleneck during
high-frequency file read operations.

This patch introduces PAGEREF_FROZEN_BIT to distinguish between a
(temporary) zero refcount and a locked (dead/frozen) state. Because
incrementing the counter no longer affects its locked/unlocked state, it is
possible to use an optimistic atomic_add_return() in
page_ref_add_unless_frozen() that operates independently of the locked bit.
The locked state is handled after the increment attempt, eliminating the
need for the CAS loop.
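
As a rough userspace illustration of the state encoding (C11 atomics,
hypothetical model_* names, default memory ordering rather than the relaxed
cmpxchg used in the patch), dropping the last reference and reading the
count could look like this:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_FROZEN_BIT 0x80000000u	/* mirrors BIT(31) */

/* Frozen values read back as 0, so readers keep seeing the old encoding. */
static unsigned int model_ref_count(atomic_uint *ref)
{
	unsigned int val = atomic_load(ref);

	return (val & MODEL_FROZEN_BIT) ? 0 : val;
}

/*
 * Drop one reference. Only the caller that moves the counter from the
 * transient 0 to the frozen value is told it released the last reference.
 */
static bool model_dec_and_test(atomic_uint *ref)
{
	unsigned int expected = 0;

	/* fetch_sub returns the old value; 1 means we just reached 0. */
	if (atomic_fetch_sub(ref, 1) != 1)
		return false;

	/* A concurrent optimistic increment makes this exchange fail. */
	return atomic_compare_exchange_strong(ref, &expected,
					      MODEL_FROZEN_BIT);
}
```

If another thread wins the race and bumps the counter from 0 to 1, the
exchange fails and the page stays alive with that thread as its owner.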

If the locked state is detected after atomic_add_return() and the counter
has drifted far above the frozen value, the counter is reset with a CAS
loop, eliminating the theoretical possibility of overflow.
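
The reset can be sketched in userspace like this. The names are
hypothetical, the limit mirrors _PAGEREF_FROZEN_LIMIT from the diff, and the
loop exits as soon as the exchange succeeds, which is a small restructuring
of the kernel loop rather than a copy of it.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_FROZEN_BIT   0x80000000u
#define MODEL_FROZEN_LIMIT (MODEL_FROZEN_BIT | (1u << 30))

static bool model_try_add(atomic_uint *ref, unsigned int nr)
{
	unsigned int val = atomic_fetch_add(ref, nr) + nr;
	bool ok = !(val & MODEL_FROZEN_BIT);

	/*
	 * Frozen and far above the base frozen value: fold the stray
	 * increments back so they cannot carry out of the frozen range.
	 * Unfrozen values are always below the limit and skip the loop.
	 */
	while (val >= MODEL_FROZEN_LIMIT) {
		/* On failure 'val' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ref, &val, MODEL_FROZEN_BIT))
			break;
	}
	return ok;
}
```

With this bound, about 2^30 failed increments are needed before a reset is
triggered, so the common failure path stays a single atomic add.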

Reviewed-by: Artem Kuzin <artem.kuzin@huawei.com>
Co-developed-by: Gorbunov Ivan <ivgorbunov@me.com>
Signed-off-by: Gorbunov Ivan <ivgorbunov@me.com>
Signed-off-by: Gladyshev Ilya <ilya.gladyshev@linux.dev>
Acked-by: Linus Torvalds <torvalds@linuxfoundation.org>
---
 include/linux/page-flags.h | 13 +++++++++++++
 include/linux/page_ref.h   | 30 +++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..ea9904a67334 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -196,6 +196,19 @@ enum pageflags {
 
 #define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
 
+/* Most significant bit in page refcount */
+#define PAGEREF_FROZEN_BIT BIT(31)
+
+/* Page reference counter can be in 3 logical states,
+ * which are described below with their value representation
+ *        state              |         value
+ * (1)  safe with  owners    |   1...INT_MAX
+ * (2)  safe with no owners  |         0
+ * (3)  frozen               |  INT_MIN....-1
+ *
+ * State (2) can only occur temporarily inside dec_and_test.
+ */
+
 #ifndef __GENERATING_BOUNDS_H
 
 /*
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index f784db6f775a..64ee0838c52e 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -64,7 +64,7 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
 
 static inline bool __page_count_is_frozen(int count)
 {
-	return count == 0;
+	return count & PAGEREF_FROZEN_BIT;
 }
 
 static inline bool __page_is_frozen(const struct page *page)
@@ -74,7 +74,12 @@ static inline bool __page_is_frozen(const struct page *page)
 
 static inline int page_ref_count(const struct page *page)
 {
-	return atomic_read(&page->_refcount);
+	int val = atomic_read(&page->_refcount);
+
+	if (unlikely(val & PAGEREF_FROZEN_BIT))
+		return 0;
+
+	return val;
 }
 
 /**
@@ -127,7 +132,7 @@ static inline void init_page_count(struct page *page)
 
 static inline void set_page_count_frozen(struct page *page)
 {
-	set_page_count(page, 0);
+	set_page_count(page, PAGEREF_FROZEN_BIT);
 }
 
 static inline void page_ref_add(struct page *page, int nr)
@@ -196,6 +201,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
 {
 	int ret = atomic_sub_and_test(nr, &page->_refcount);
 
+	if (ret)
+		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_FROZEN_BIT);
+
 	if (page_ref_tracepoint_active(page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -nr, ret);
 	return ret;
@@ -225,6 +233,9 @@ static inline int page_ref_dec_and_test(struct page *page)
 {
 	int ret = atomic_dec_and_test(&page->_refcount);
 
+	if (ret)
+		ret = !atomic_cmpxchg_relaxed(&page->_refcount, 0, PAGEREF_FROZEN_BIT);
+
 	if (page_ref_tracepoint_active(page_ref_mod_and_test))
 		__page_ref_mod_and_test(page, -1, ret);
 	return ret;
@@ -250,9 +261,18 @@ static inline int folio_ref_dec_return(struct folio *folio)
 	return page_ref_dec_return(&folio->page);
 }
 
+#define _PAGEREF_FROZEN_LIMIT	((1 << 30) | PAGEREF_FROZEN_BIT)
+
 static inline bool page_ref_add_unless_frozen(struct page *page, int nr)
 {
-	bool ret = atomic_add_unless(&page->_refcount, nr, 0);
+	bool ret = false;
+	int val = atomic_add_return(nr, &page->_refcount);
+	// See PAGEREF_FROZEN_BIT declaration in page-flags.h for details
+	ret = !(val & PAGEREF_FROZEN_BIT);
+
+	/* Undo atomic_add() if counter is locked and scary big */
+	while (unlikely((unsigned int)val >= _PAGEREF_FROZEN_LIMIT))
+		val = atomic_cmpxchg_relaxed(&page->_refcount, val, PAGEREF_FROZEN_BIT);
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))
 		__page_ref_mod_unless(page, nr, ret);
@@ -287,7 +307,7 @@ static inline bool folio_ref_try_add(struct folio *folio, int count)
 
 static inline int page_ref_freeze(struct page *page, int count)
 {
-	int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
+	int ret = likely(atomic_cmpxchg(&page->_refcount, count, PAGEREF_FROZEN_BIT) == count);
 
 	if (page_ref_tracepoint_active(page_ref_freeze))
 		__page_ref_freeze(page, count, ret);
-- 
2.54.0


* Re: [PATCH v4 0/2] mm: improve folio refcount scalability
  2026-06-08 21:53 [PATCH v4 0/2] mm: improve folio refcount scalability Gladyshev Ilya
  2026-06-08 21:54 ` [PATCH v4 1/2] mm: drop page refcount zero state semantics ilya.gladyshev
  2026-06-08 21:54 ` [PATCH v4 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
@ 2026-06-08 22:47 ` Andrew Morton
  2 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2026-06-08 22:47 UTC (permalink / raw)
  To: Gladyshev Ilya
  Cc: ivgorbunov, Liam.Howlett, apopple, artem.kuzin, baolin.wang,
	david, foxido, harry.yoo, linux-kernel, linux-mm, lorenzo.stoakes,
	mhocko, muchun.song, rppt, surenb, torvalds, vbabka, willy,
	yuzhao, ziy, pfalcato, kirill

On Mon, 08 Jun 2026 21:53:01 +0000 "Gladyshev Ilya" <ilya.gladyshev@linux.dev> wrote:

> This patch optimizes small file read performance and overall folio refcount
> scalability by refactoring page_ref_add_unless [core of folio_try_get].
> This is alternative approach to previous attempts to fix small read
> performance by avoiding refcount bumps [1][2].

Thanks.  Nice numbers.

AI review had some things to say:
	https://sashiko.dev/#/patchset/df26082871b4c65b2bd38d409026237c08572836@linux.dev

I'm not sure we want all those new VM_BUG_ON_PAGE() calls in the long
term.  They look like development-time assistance.  Perhaps you could
make those a standalone patch at tail-of-series so we can keep it in
linux-next for a couple of months then throw it away before any
upstreaming?
