All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock
@ 2026-05-15 10:28 Breno Leitao
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  0 siblings, 2 replies; 9+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

While profiling kernel code on Meta's caching code[1], I found some lock
contention on pipe hot path, and I decide to investigate, and found that
anon_pipe_write() currently calls alloc_page() once per page while
holding pipe->mutex. The allocation can sleep doing direct reclaim and
runs memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex.

I would like to propose that we bulk pre-allocate pages outside
pipe->mutex in anon_pipe_write -- pre-allocates up to PIPE_PREALLOC_MAX
(8) pages with alloc_pages_bulk() before taking the lock

 * the get-page helper then consumes from tmp_page[] / the prealloc
   array first and only falls back to in-lock alloc_page() for writes
   larger than PIPE_PREALLOC_MAX (8) pages, which is rare.

I also vibe coded a microbenchmark to test the pipe benchmark. Basically
it wweeps writers x readers over {1,2,5} x {1,5,10} with 64KB writes
against a 1 MB pipe and prints throughput + latency percentiles per
config.

Measured on arm64 in virtme-ng (16 vCPUs, 64KB writes, 1 MB pipe).

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
      fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 |  40 +++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 351 ++++++++++++++++++++++++++++++
 5 files changed, 400 insertions(+), 2 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-05-22 12:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-15 20:18   ` Mateusz Guzik
2026-05-20 17:09     ` Breno Leitao
2026-05-21 12:58       ` Mateusz Guzik
2026-05-21 15:38         ` Breno Leitao
2026-05-21 17:26           ` Mateusz Guzik
2026-05-22 12:10             ` Breno Leitao
2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.