All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
@ 2026-05-22 16:44 Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep doing
direct reclaim and runs memcg charging, which extends the critical
section and stalls any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.

I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
  per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
  helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
  anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
  struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
  unlocking
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (2):
      fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 | 105 ++++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 5 files changed, 729 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-05-24 16:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-22 16:51   ` Jeff Layton
2026-05-22 17:55     ` Breno Leitao
2026-05-22 19:48   ` Mateusz Guzik
2026-05-23 16:26   ` Oleg Nesterov
2026-05-24 14:30     ` Breno Leitao
2026-05-24 14:48       ` Mateusz Guzik
2026-05-24 16:47         ` Breno Leitao
2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
2026-05-23 16:43   ` Oleg Nesterov
2026-05-23 16:49     ` Oleg Nesterov
2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.