The Linux Kernel Mailing List
* [PATCH v5 0/2] mm: improve folio refcount scalability
@ 2026-06-26 18:44 Gladyshev Ilya
  2026-06-26 18:46 ` [PATCH v5 1/2] mm: drop page refcount zero state semantics Gladyshev Ilya
  2026-06-26 18:46 ` [PATCH v5 2/2] mm: implement page refcount locking via dedicated bit Gladyshev Ilya
  0 siblings, 2 replies; 3+ messages in thread
From: Gladyshev Ilya @ 2026-06-26 18:44 UTC
  To: Gladyshev Ilya
  Cc: Linus Torvalds, Andrew Morton, ivgorbunov, Liam.Howlett, apopple,
	artem.kuzin, baolin.wang, foxido, harry.yoo, linux-kernel,
	linux-mm, lorenzo.stoakes, mhocko, muchun.song, rppt, surenb,
	vbabka, yuzhao, ziy, pfalcato, kirill

This is v5 of the series. The core idea is unchanged; changes in this version:
- Fix missing BUG_ON checks on the sub_and_check APIs
- Replace BUG_ON with WARN_ON_ONCE
- Fix virtio-mem incorrect page unfreezing (inc -> init)
   (Note: This would be caught by the WARN_ON checks during testing.)
- Do not loop for an additional cycle during CAS reset

Original cover letter posted below:

Intro
=====
This patch optimizes small-file read performance and overall folio refcount
scalability by refactoring page_ref_add_unless [the core of folio_try_get].
It is an alternative approach to previous attempts to fix small-read
performance by avoiding refcount bumps [1][2].

Overview
========
The current refcount implementation uses a zero counter as the locked
(dead/frozen) state, which requires a CAS loop for increments to avoid
temporary unlocks in try_get functions. These CAS loops become a
serialization point for an otherwise scalable and fast read side.
The proposed implementation separates the "locked" logic from the counting,
allowing the use of an optimistic fetch_add() instead of CAS. For more
details, please refer to the commit message of the patch itself.
The proposed logic maintains the same public API as before, including all
existing memory barrier guarantees.
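
To illustrate the difference described above, here is a minimal userspace
sketch using C11 atomics. It is not the code from the patches: the bit
position (REF_LOCKED), the helper names (try_get_cas, try_get_fetch_add,
freeze, unfreeze) and the single best-effort CAS on the locked path are all
assumptions made for the example, and the default sequentially consistent
ordering stands in for the kernel's actual barrier rules.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: top bit marks "locked", lower bits hold the count. */
#define REF_LOCKED (1u << 31)

/* Zero-as-locked scheme: a blind increment could briefly turn 0 into 1,
 * so every increment has to be a compare-and-swap loop. */
static inline bool try_get_cas(atomic_uint *ref)
{
    unsigned int old = atomic_load(ref);

    do {
        if (old == 0)
            return false;           /* frozen or dead: take no reference */
    } while (!atomic_compare_exchange_weak(ref, &old, old + 1));

    return true;
}

/* Dedicated-bit scheme: increment optimistically, inspect the old value. */
static inline bool try_get_fetch_add(atomic_uint *ref)
{
    unsigned int old = atomic_fetch_add(ref, 1);
    unsigned int expected = old + 1;

    if (!(old & REF_LOCKED))
        return true;                /* fast path: one RMW, no retry loop */

    /* Locked: the stray increment cannot unlock the word. Try once to
     * wipe it (assumption of this sketch); failing is harmless because
     * unfreeze() overwrites the whole word. */
    atomic_compare_exchange_strong(ref, &expected, REF_LOCKED);
    return false;
}

/* Lock the counter if it currently holds exactly `expected` references. */
static inline bool freeze(atomic_uint *ref, unsigned int expected)
{
    assert(!(expected & REF_LOCKED));
    return atomic_compare_exchange_strong(ref, &expected, REF_LOCKED);
}

/* Unlock by storing a fresh count, discarding any stray increments. */
static inline void unfreeze(atomic_uint *ref, unsigned int count)
{
    assert(!(count & REF_LOCKED));
    atomic_store(ref, count);
}
```

In this sketch the successful path of try_get_fetch_add() never retries, so
concurrent readers do not have to win a CAS against each other; only the
rare locked path does extra work.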

Performance
===========
Performance was measured using a simple custom benchmark based on
will-it-scale[3]. This benchmark spawns N pinned threads/processes that
execute the following loop:

``
char buf[64];
fd = open(/* same file in tmpfs */);
while (true) {
    pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
}
``
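
For readers who want to reproduce something similar, the loop above could be
wrapped in a small threaded harness along the following lines. This is only a
sketch, not the benchmark used for the numbers below: run_bench, worker_fn,
struct worker and MAX_WORKERS are hypothetical names, the fixed-duration
timing is an assumption, and CPU pinning is left out to keep the example
portable (on glibc, pthread_setaffinity_np() would be one way to add it).

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define READ_SIZE   64   /* read size used in the loop above */
#define MAX_WORKERS 64   /* hypothetical cap for this sketch */

struct worker {
    pthread_t tid;
    const char *path;
    uint64_t ops;        /* completed pread() calls, written once at exit */
};

static atomic_bool stop_flag;

static void *worker_fn(void *arg)
{
    struct worker *w = arg;
    char buf[READ_SIZE];
    uint64_t ops = 0;    /* local counter avoids false sharing between workers */
    int fd = open(w->path, O_RDONLY);

    if (fd < 0)
        return NULL;

    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        if (pread(fd, buf, sizeof(buf), 0) < 0)
            break;
        ops++;
    }

    close(fd);
    w->ops = ops;
    return NULL;
}

/* Run nthreads readers on path for `seconds`; return total ops per second
 * summed over all workers, or 0 on invalid arguments. */
static inline uint64_t run_bench(const char *path, int nthreads,
                                 unsigned int seconds)
{
    struct worker workers[MAX_WORKERS];
    uint64_t total = 0;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_WORKERS || seconds == 0)
        return 0;

    atomic_store(&stop_flag, false);
    for (int i = 0; i < nthreads; i++) {
        workers[i].path = path;
        workers[i].ops = 0;
        if (pthread_create(&workers[i].tid, NULL, worker_fn, &workers[i]))
            break;
        started++;
    }

    sleep(seconds);
    atomic_store(&stop_flag, true);

    for (int i = 0; i < started; i++) {
        pthread_join(workers[i].tid, NULL);
        total += workers[i].ops;
    }
    return total / seconds;
}
```

Each worker opens the file itself, so all threads hit the same page cache
folio through separate file descriptors.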

While this is a synthetic load, it does highlight an existing issue and
doesn't differ much from the benchmarking in patch [2].
This benchmark measures operations per second in the inner loop and reports
the combined results across all workers. Performance was tested on top of the
v6.15 kernel on two platforms. Since threads and processes showed similar
performance on both systems, only the thread results are provided below. The
performance improvement scales linearly between the CPU counts shown.

Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]
#threads | vanilla | patched | boost (%)
       1 | 1343381 | 1344401 |  +0.1
       2 | 2186160 | 2455837 | +12.3
       5 | 5277092 | 6108030 | +15.7
      10 | 5858123 | 7506328 | +28.1
      12 | 6484445 | 8137706 | +25.5
         /* Cross socket NUMA */
      14 | 3145860 | 4247391 | +35.0
      16 | 2350840 | 4262707 | +81.3
      18 | 2378825 | 4121415 | +73.2
      20 | 2438475 | 4683548 | +92.1
      24 | 2325998 | 4529737 | +94.7

Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]
#threads | vanilla | patched | boost (%)
       1 | 1077276 | 1081653 |  +0.4
       5 | 4286838 | 4682513 |  +9.2
      10 | 1698095 | 1902753 | +12.1
      20 | 1662266 | 1921603 | +15.6
      49 | 1486745 | 1828926 | +23.0
      97 | 1617365 | 2052635 | +26.9
         /* Cross socket NUMA */
     105 | 1368319 | 1798862 | +31.5
     136 | 1008071 | 1393055 | +38.2
     168 |  879332 | 1245210 | +41.6
               /* SMT */
     193 |  905432 | 1294833 | +43.0
     289 |  851988 | 1313110 | +54.1
     353 |  771288 | 1347165 | +74.7

[0]: https://lore.kernel.org/lkml/cover.1776350895.git.gorbunov.ivan@h-partners.com/
[1]: https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
[2]: https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
[3]: https://github.com/antonblanchard/will-it-scale

---

Link to v4: https://lore.kernel.org/linux-mm/df26082871b4c65b2bd38d409026237c08572836@linux.dev/

Gladyshev Ilya (1):
  mm: implement page refcount locking via dedicated bit

Gorbunov Ivan (1):
  mm: drop page refcount zero state semantics

 drivers/pci/p2pdma.c               |  4 +-
 drivers/virtio/virtio_mem.c        |  2 +-
 include/linux/mm.h                 |  2 +-
 include/linux/page-flags.h         | 13 ++++++
 include/linux/page_ref.h           | 68 +++++++++++++++++++++++++-----
 kernel/liveupdate/kexec_handover.c |  6 +--
 lib/test_hmm.c                     |  4 +-
 mm/hugetlb.c                       |  2 +-
 mm/internal.h                      |  2 +-
 mm/memremap.c                      |  4 +-
 mm/mm_init.c                       |  6 +--
 mm/page_alloc.c                    |  4 +-
 12 files changed, 88 insertions(+), 29 deletions(-)


base-commit: 51cb1aa1250c36269474b8b6ca6b6319e170f5a5
-- 
2.54.0
