[PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock

* [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
@ 2026-05-24 14:44 Breno Leitao
  2026-05-24 14:44 ` [PATCH v3 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Breno Leitao @ 2026-05-24 14:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep doing
direct reclaim and runs memcg charging, which extends the critical
section and stalls any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.
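
The lookup order described above can be modelled in userspace. What
follows is a minimal sketch rather than the kernel code: the model_*
names, the two-slot cache size and malloc() standing in for alloc_page()
are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define PREALLOC_MAX 8	/* mirrors PIPE_PREALLOC_MAX from the series */
#define TMP_SLOTS    2	/* assumed size of the per-pipe tmp_page[] cache */

enum page_source { FROM_PREALLOC, FROM_TMP_CACHE, FROM_ALLOCATOR };

/* Hypothetical stand-in for struct anon_pipe_prealloc. */
struct model_prealloc {
	void *pages[PREALLOC_MAX];
	unsigned int count;
};

/* Hypothetical stand-in for the part of the pipe that caches pages. */
struct model_pipe {
	void *tmp_page[TMP_SLOTS];
};

/* Return a page and report which tier supplied it. */
static void *model_get_page(struct model_pipe *pipe,
			    struct model_prealloc *prealloc,
			    enum page_source *src)
{
	/* Tier 1: pages allocated before the lock was taken. */
	if (prealloc->count) {
		*src = FROM_PREALLOC;
		return prealloc->pages[--prealloc->count];
	}

	/* Tier 2: the per-pipe cache of previously released pages. */
	for (size_t i = 0; i < TMP_SLOTS; i++) {
		if (pipe->tmp_page[i]) {
			void *page = pipe->tmp_page[i];

			pipe->tmp_page[i] = NULL;
			*src = FROM_TMP_CACHE;
			return page;
		}
	}

	/* Tier 3: the only tier that enters the allocator under the lock. */
	*src = FROM_ALLOCATOR;
	return malloc(4096);
}
```

Only the third tier can sleep in this model, so a write of up to
PREALLOC_MAX pages whose pre-allocation fully succeeded never reaches
it.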

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.

I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.
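
For readers who want to re-derive the deltas from the tables, a small
sketch of the arithmetic follows. The helper name delta_percent() is
hypothetical and not part of the series. The printed deltas were
presumably computed from unrounded measurements, so recomputing from the
rounded table entries can differ from a printed cell by one point.

```c
#include <stdint.h>

/*
 * Relative change from base to patched, rounded to the nearest percent.
 * base is assumed to be positive.
 */
static int delta_percent(int64_t base, int64_t patched)
{
	int64_t diff = (patched - base) * 100;
	int64_t half = base / 2;

	/* Round half away from zero before the truncating division. */
	return (int)((diff >= 0 ? diff + half : diff - half) / base);
}
```

With the writers=1, readers=1 row, delta_percent(1119, 1354) gives 21
for throughput and delta_percent(55786, 46103) gives -17 for latency,
matching the table.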

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1   1088 -> 1438  (+32%)    996 -> 1477  (+48%)    989 -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)    986 -> 1225  (+24%)    972 -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v3:
- Drop the anon_pipe_free_pages() call from the wait branch so
  leftover prealloc pages survive the wait_event sleep. The loop
  resumes once pipe_writable() becomes true and immediately wants
  pages again; freeing them forced the next iteration back into
  alloc_page() under pipe->mutex, defeating the patch for any write
  large enough to block mid-syscall. The out: label still frees any
  remainder on syscall exit, so nothing leaks. (Suggested by Oleg
  Nesterov.)
- Drop the in-loop anon_pipe_refill_tmp_pages() call in the wait
  branch as well: its only purpose was to rescue pages from the
  free_pages() above. With prealloc surviving the sleep,
  anon_pipe_get_page() drains it directly on the next iteration, and
  the out: label still refills tmp_page[] at syscall exit.
- Link to v2: https://patch.msgid.link/20260522-fix_pipe-v2-0-a8b35a78244e@debian.org

Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
  per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
  helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
  anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
  struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
  unlocking
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (2):
      fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 | 103 ++++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 5 files changed, 727 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>

* [PATCH v3 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-24 14:44 [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
@ 2026-05-24 14:44 ` Breno Leitao
  2026-05-24 14:44 ` [PATCH v3 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-05-24 14:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
thing") and then, from the per-iteration anon_pipe_get_page() helper,
used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
while still holding it.

That allocation can sleep doing direct reclaim and/or runs memcg
charging, which extends the critical section and stalls a concurrent
reader on the very same mutex.

Pre-allocate the required pages into an array before taking the lock,
and pop them from that array while holding it.

This improves pipe throughput by up to 48% and reduces average write
latency by up to 33%; the gain is largest under memory pressure, when
allocations hit direct reclaim.
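
The same locking pattern can be shown outside the kernel. The sketch
below is a userspace analogue under stated assumptions, not the patch
itself: struct shared_queue, queue_push_batch() and the constants are
hypothetical, a pthread mutex stands in for pipe->mutex, and malloc()
stands in for alloc_page().

```c
#include <pthread.h>
#include <stdlib.h>

#define BATCH_MAX   8	/* plays the role of PIPE_PREALLOC_MAX */
#define QUEUE_SLOTS 64	/* arbitrary capacity for the illustration */

struct shared_queue {
	pthread_mutex_t lock;
	void *slots[QUEUE_SLOTS];
	unsigned int used;
};

/* Queue up to n freshly allocated buffers; returns how many were queued. */
static unsigned int queue_push_batch(struct shared_queue *q, unsigned int n)
{
	void *batch[BATCH_MAX];
	unsigned int have = 0, pushed = 0;

	if (n > BATCH_MAX)
		n = BATCH_MAX;

	/* Slow step first: allocate while no lock is held. */
	while (have < n) {
		void *buf = malloc(4096);

		if (!buf)
			break;
		batch[have++] = buf;
	}

	/* The critical section only moves pointers. */
	pthread_mutex_lock(&q->lock);
	while (have && q->used < QUEUE_SLOTS) {
		q->slots[q->used++] = batch[--have];
		pushed++;
	}
	pthread_mutex_unlock(&q->lock);

	/* Buffers that did not fit are released after the unlock. */
	while (have)
		free(batch[--have]);

	return pushed;
}
```

Unlike the patch, this sketch has no in-lock allocation fallback and no
per-queue page cache; it only illustrates keeping both allocation and
freeing outside the critical section.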

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 fs/pipe.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 3 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9cf3..e15795cf0c76 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }
 
-static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+#define PIPE_PREALLOC_MAX 8
+
+struct anon_pipe_prealloc {
+	struct page *pages[PIPE_PREALLOC_MAX];
+	unsigned int count;
+};
+
+/*
+ * Pre-allocate pages outside pipe->mutex for multi-page writes.
+ * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
+ * charging; doing it under the mutex stalls a concurrent reader.
+ *
+ * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
+ * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
+ * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
+ * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
+ * honoured for every page; the per-call overhead is small compared to the
+ * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
+ * in-lock alloc_page() fallback in anon_pipe_get_page().
+ */
+static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
+					size_t total_len)
+{
+	unsigned int want, i;
+	struct page *page;
+
+	prealloc->count = 0;
+	if (total_len <= PAGE_SIZE)
+		return;
+
+	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
+		     PIPE_PREALLOC_MAX);
+
+	for (i = 0; i < want; i++) {
+		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
+		if (!page)
+			break;
+		prealloc->pages[prealloc->count++] = page;
+	}
+}
+
+static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
+{
+	if (!prealloc->count)
+		return NULL;
+
+	prealloc->count--;
+
+	return prealloc->pages[prealloc->count];
+}
+
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
 {
+	struct page *page;
+
+	/* Drain prealloc first to keep tmp_page[] hot for later small writes. */
+	page = anon_pipe_prealloc_pop(prealloc);
+	if (page)
+		return page;
+
 	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
 		if (pipe->tmp_page[i]) {
-			struct page *page = pipe->tmp_page[i];
+			page = pipe->tmp_page[i];
 			pipe->tmp_page[i] = NULL;
 			return page;
 		}
 	}
 
+	/* FWIW: This is called with pipe->mutex held */
 	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 }
 
@@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
 	put_page(page);
 }
 
+/*
+ * Stash leftover prealloc pages in tmp_page[] so the next write to this
+ * pipe gets a hot page without entering the allocator.
+ */
+static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
+{
+	int i, idx;
+
+	if (!prealloc->count)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+		if (pipe->tmp_page[i])
+			continue;
+		if (!prealloc->count)
+			return;
+		idx = --prealloc->count;
+		pipe->tmp_page[i] = prealloc->pages[idx];
+		prealloc->pages[idx] = NULL;
+	}
+}
+
+/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
+static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
+{
+	while (prealloc->count) {
+		prealloc->count--;
+		put_page(prealloc->pages[prealloc->count]);
+	}
+}
+
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
+	struct anon_pipe_prealloc prealloc;
 	unsigned int head;
 	ssize_t ret = 0;
 	size_t total_len = iov_iter_count(from);
@@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;
 
+	anon_pipe_get_page_prealloc(&prealloc, total_len);
+
 	mutex_lock(&pipe->mutex);
 
 	if (!pipe->readers) {
@@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			struct page *page;
 			int copied;
 
-			page = anon_pipe_get_page(pipe);
+			page = anon_pipe_get_page(pipe, &prealloc);
 			if (unlikely(!page)) {
 				if (!ret)
 					ret = -ENOMEM;
@@ -576,9 +671,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		wake_next_writer = true;
 	}
 out:
+	anon_pipe_refill_tmp_pages(pipe, &prealloc);
 	if (pipe_is_full(pipe))
 		wake_next_writer = false;
 	mutex_unlock(&pipe->mutex);
+	anon_pipe_free_pages(&prealloc);
 
 	/*
 	 * If we do do a wakeup event, we do a 'sync' wakeup, because we

-- 
2.54.0


* [PATCH v3 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-24 14:44 [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
  2026-05-24 14:44 ` [PATCH v3 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-24 14:44 ` Breno Leitao
  2026-05-28 12:34 ` [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Christian Brauner
  2026-06-16 20:47 ` Josh Triplett
  3 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-05-24 14:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

Add a small selftest that stresses pipe->mutex contention by spawning N
writer threads that hammer a single pipe with multi-page writes, plus M
reader threads that drain. Each writer records its own write() latency
samples into a log2-bucketed histogram; main aggregates and prints
total writes, throughput, average and percentile (p50/p99) latencies,
and the maximum observed latency.

Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
--vm-method all) for the duration of the run, so alloc_page() in
anon_pipe_write() routinely hits direct reclaim. The flag fails
fast if stress-ng is not on $PATH.

The program prints something like the following for each combination
of writers, readers, msgsize and memory pressure:

	config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576
	memory_pressure=[no|yes]
	writes: total=54451 rate=18150/s
	throughput_MBps: 1134.40
	lat_avg_ns: 275355
	lat_p50_ns_upper: 262143
	lat_p99_ns_upper: 1048575
	lat_max_ns: 2145633
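
The *_upper values in the sample are bucket bounds, not exact
latencies. A short sketch of how a sample maps to a log2 bucket and its
inclusive upper bound is given here for orientation; bucket_of() and
bucket_upper() are hypothetical names that mirror, but are not copied
from, the helpers in pipe_bench.c.

```c
#include <assert.h>
#include <stdint.h>

#define HIST_BUCKETS 32	/* same bucket count as pipe_bench.c */

/* Index of the log2 bucket that a latency sample (in ns) lands in. */
static int bucket_of(uint64_t ns)
{
	int b = 0;

	while (ns >>= 1)
		b++;
	return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1;
}

/* Inclusive upper bound (in ns) of bucket b, i.e. 2^(b+1) - 1. */
static uint64_t bucket_upper(int b)
{
	assert(b >= 0 && b < HIST_BUCKETS);
	return (1ULL << (b + 1)) - 1;
}
```

Under this mapping, lat_p50_ns_upper: 262143 is bucket_upper(17), so the
median write latency lies somewhere in 131072..262143 ns, and
lat_p99_ns_upper: 1048575 is bucket_upper(19).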

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 4 files changed, 627 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..bcd9db9d292c 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -91,6 +91,7 @@ TARGETS += pcie_bwctrl
 TARGETS += perf_events
 TARGETS += pidfd
 TARGETS += pid_namespace
+TARGETS += pipe
 TARGETS += power_supply
 TARGETS += powerpc
 TARGETS += prctl
diff --git a/tools/testing/selftests/pipe/.gitignore b/tools/testing/selftests/pipe/.gitignore
new file mode 100644
index 000000000000..20b549361a15
--- /dev/null
+++ b/tools/testing/selftests/pipe/.gitignore
@@ -0,0 +1 @@
+pipe_bench
diff --git a/tools/testing/selftests/pipe/Makefile b/tools/testing/selftests/pipe/Makefile
new file mode 100644
index 000000000000..1810c680117b
--- /dev/null
+++ b/tools/testing/selftests/pipe/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+# Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+
+CFLAGS += -O2 -Wall -Wextra -pthread
+
+TEST_GEN_PROGS := pipe_bench
+
+include ../lib.mk
diff --git a/tools/testing/selftests/pipe/pipe_bench.c b/tools/testing/selftests/pipe/pipe_bench.c
new file mode 100644
index 000000000000..7e96429b8fb4
--- /dev/null
+++ b/tools/testing/selftests/pipe/pipe_bench.c
@@ -0,0 +1,616 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pipe_bench - exercise concurrent pipe operation
+ *
+ * N writer threads hammer a single pipe with multi-page writes; M reader
+ * threads drain it. Each writer records its own write() latency histogram.
+ * Multi-page writes (msgsize >= PAGE_SIZE) force the loop in
+ * anon_pipe_write() to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) under
+ * pipe->mutex, which is the critical section the patch shrinks.
+ *
+ * By default the benchmark sweeps writers in {1, 2, 5} x readers in
+ * {1, 5, 10} and prints one block per configuration so two runs (e.g.
+ * baseline vs patched) can be diffed directly. Pass -w and -r to run a
+ * single configuration instead. Pass --memory-pressure to spawn stress-ng
+ * alongside the sweep so the per-page alloc_page() path under pipe->mutex
+ * has to dip into reclaim.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <poll.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
+#define HIST_BUCKETS	32
+
+static size_t g_msgsize = 16 * 4096;
+static int g_duration = 3;
+static int g_pipe_size = 1024 * 1024;
+static int g_memory_pressure;
+
+static atomic_int g_stop;
+static int g_pipe[2];
+
+struct wstats {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum_ns;
+	uint64_t lat_max_ns;
+	uint64_t lat_hist[HIST_BUCKETS];
+	char *buf;
+};
+
+struct rstats {
+	char *buf;
+};
+
+struct hist_totals {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum;
+	uint64_t lat_max;
+};
+
+static inline uint64_t now_ns(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
+}
+
+static inline int log2_bucket(uint64_t v)
+{
+	int b = 0;
+
+	if (!v)
+		return 0;
+	while (v >>= 1)
+		b++;
+	return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1;
+}
+
+static void *writer(void *arg)
+{
+	struct wstats *s = arg;
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		uint64_t t0 = now_ns();
+		ssize_t n = write(g_pipe[1], s->buf, g_msgsize);
+		uint64_t dt = now_ns() - t0;
+
+		if (n > 0) {
+			s->writes++;
+			s->bytes += (uint64_t)n;
+			s->lat_sum_ns += dt;
+			if (dt > s->lat_max_ns)
+				s->lat_max_ns = dt;
+			s->lat_hist[log2_bucket(dt)]++;
+		} else if (n < 0 && (errno == EPIPE || errno == EBADF)) {
+			break;
+		}
+	}
+	return NULL;
+}
+
+static void *reader(void *arg)
+{
+	struct rstats *s = arg;
+
+	/*
+	 * Drain until EOF (write end closed by main). g_stop is not checked
+	 * here on purpose: writers may be blocked in write() with the pipe
+	 * full when g_stop is set, so the reader must keep draining until
+	 * main closes the write end.
+	 */
+	for (;;) {
+		ssize_t n = read(g_pipe[0], s->buf, g_msgsize);
+
+		if (n <= 0)
+			break;
+	}
+	return NULL;
+}
+
+/* Sum per-writer stats and per-bucket counts into the caller's aggregates. */
+static void aggregate_wstats(struct wstats *all, int nw,
+			     uint64_t agg[HIST_BUCKETS],
+			     struct hist_totals *t)
+{
+	memset(t, 0, sizeof(*t));
+	for (int i = 0; i < nw; i++) {
+		t->writes += all[i].writes;
+		t->bytes += all[i].bytes;
+		t->lat_sum += all[i].lat_sum_ns;
+		if (all[i].lat_max_ns > t->lat_max)
+			t->lat_max = all[i].lat_max_ns;
+		for (int b = 0; b < HIST_BUCKETS; b++)
+			agg[b] += all[i].lat_hist[b];
+	}
+}
+
+/*
+ * Walk @agg in order, returning the inclusive upper bound (in ns) of the
+ * log2 bucket where the running sum first reaches @target.
+ *
+ * A percentile is undefined with zero samples, and with very low sample
+ * counts integer truncation could make @target zero -- then "cum >= 0"
+ * would latch on the first (possibly empty) bucket. Callers must pass
+ * @target >= 1.
+ */
+static uint64_t bucket_at(const uint64_t agg[HIST_BUCKETS], uint64_t target)
+{
+	uint64_t cum = 0;
+
+	for (int b = 0; b < HIST_BUCKETS; b++) {
+		/* HIST_BUCKETS <= 63, so (b + 1) is always a safe shift. */
+		uint64_t upper = (1ULL << (b + 1)) - 1;
+
+		cum += agg[b];
+		if (cum >= target)
+			return upper;
+	}
+	return 0;
+}
+
+static void compute_p50_p99(const uint64_t agg[HIST_BUCKETS], uint64_t writes,
+			    uint64_t *p50, uint64_t *p99)
+{
+	uint64_t p50_target, p99_target;
+
+	*p50 = *p99 = 0;
+	if (!writes)
+		return;
+
+	p50_target = writes * 50 / 100;
+	p99_target = writes * 99 / 100;
+	if (!p50_target)
+		p50_target = 1;
+	if (!p99_target)
+		p99_target = 1;
+
+	*p50 = bucket_at(agg, p50_target);
+	*p99 = bucket_at(agg, p99_target);
+}
+
+static void print_summary(int nw, int nr, const struct hist_totals *t,
+			  uint64_t p50, uint64_t p99)
+{
+	double sec = g_duration;
+	uint64_t avg_ns = t->writes ? t->lat_sum / t->writes : 0;
+
+	printf("config: writers=%d readers=%d msgsize=%zu duration=%d pipe_size=%d memory_pressure=%s\n",
+	       nw, nr, g_msgsize, g_duration, g_pipe_size,
+	       g_memory_pressure ? "yes" : "no");
+	printf("writes: total=%llu rate=%.0f/s\n",
+	       (unsigned long long)t->writes, (double)t->writes / sec);
+	printf("throughput_MBps: %.2f\n",
+	       ((double)t->bytes / sec) / (1024.0 * 1024.0));
+	printf("lat_avg_ns: %llu\n", (unsigned long long)avg_ns);
+	printf("lat_p50_ns_upper: %llu\n", (unsigned long long)p50);
+	printf("lat_p99_ns_upper: %llu\n", (unsigned long long)p99);
+	printf("lat_max_ns: %llu\n", (unsigned long long)t->lat_max);
+}
+
+static void summarize(struct wstats *all, int nw, int nr)
+{
+	uint64_t agg[HIST_BUCKETS] = {0};
+	struct hist_totals t;
+	uint64_t p50, p99;
+
+	aggregate_wstats(all, nw, agg, &t);
+	compute_p50_p99(agg, t.writes, &p50, &p99);
+	print_summary(nw, nr, &t, p50, p99);
+}
+
+/*
+ * Child branch of fork(): restore SIGPIPE to default (parent ignores it),
+ * exec stress-ng, and on failure write the reason into @hs_wr before
+ * exiting. The parent observes EOF on hs_wr (closed via O_CLOEXEC) when
+ * exec succeeds.
+ */
+static void stress_ng_child(int hs_wr) __attribute__((noreturn));
+static void stress_ng_child(int hs_wr)
+{
+	char errbuf[256];
+
+	signal(SIGPIPE, SIG_DFL);
+	execlp("stress-ng", "stress-ng",
+	       "--vm", "4", "--vm-bytes", "80%",
+	       "--vm-method", "all",
+	       (char *)NULL);
+	snprintf(errbuf, sizeof(errbuf),
+		 "exec stress-ng failed: %s\n", strerror(errno));
+	(void)!write(hs_wr, errbuf, strlen(errbuf));
+	_exit(127);
+}
+
+/*
+ * Read from the O_CLOEXEC handshake pipe. Anything readable means the
+ * child wrote an error before exec; EOF (n == 0) means the write-end
+ * closed because exec succeeded. Returns 0 on exec success, -1 if the
+ * child failed and was reaped.
+ */
+static int stress_ng_wait_handshake(int hs_rd, pid_t pid)
+{
+	struct pollfd pfd = { .fd = hs_rd, .events = POLLIN };
+	char errbuf[256];
+	int status;
+	int ret;
+
+	ret = poll(&pfd, 1, 500);
+	if (ret <= 0)
+		return 0;
+
+	ssize_t n = read(hs_rd, errbuf, sizeof(errbuf) - 1);
+
+	if (n > 0) {
+		errbuf[n] = '\0';
+		fputs(errbuf, stderr);
+		waitpid(pid, &status, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static pid_t spawn_stress_ng(void)
+{
+	int hs[2];
+	pid_t pid;
+
+	/*
+	 * Handshake pipe: child writes an error message and _exit()s on exec
+	 * failure. On exec success the O_CLOEXEC flag closes the write
+	 * end, which the parent observes as EOF. This makes the "is
+	 * stress-ng on $PATH?" check fail fast rather than silently.
+	 */
+	if (pipe2(hs, O_CLOEXEC) < 0) {
+		perror("pipe2");
+		return -1;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		close(hs[0]);
+		close(hs[1]);
+		return -1;
+	}
+	if (pid == 0) {
+		close(hs[0]);
+		stress_ng_child(hs[1]);
+	}
+
+	close(hs[1]);
+	if (stress_ng_wait_handshake(hs[0], pid) < 0) {
+		close(hs[0]);
+		return -1;
+	}
+	close(hs[0]);
+
+	/* Give stress-ng a moment to map its VM regions before measuring. */
+	sleep(1);
+	return pid;
+}
+
+static void kill_stress_ng(pid_t pid)
+{
+	int status;
+
+	if (pid <= 0)
+		return;
+	kill(pid, SIGTERM);
+	for (int i = 0; i < 20; i++) {
+		if (waitpid(pid, &status, WNOHANG) > 0)
+			return;
+		usleep(100 * 1000);
+	}
+	kill(pid, SIGKILL);
+	waitpid(pid, &status, 0);
+}
+
+/*
+ * Allocate per-thread page-aligned buffers in main so a failed
+ * aligned_alloc() aborts the run before any thread starts. Workers used
+ * to allocate their own buffer and return NULL on failure, which left
+ * peers blocked in write()/read() with nobody to unblock them.
+ */
+static int alloc_thread_bufs(struct wstats *ws, int nw,
+			     struct rstats *rs, int nr)
+{
+	for (int i = 0; i < nw; i++) {
+		ws[i].buf = aligned_alloc(4096, g_msgsize);
+		if (!ws[i].buf) {
+			fprintf(stderr, "writer %d: aligned_alloc(%zu) failed\n",
+				i, g_msgsize);
+			return -1;
+		}
+		memset(ws[i].buf, 0xAA, g_msgsize);
+	}
+	for (int i = 0; i < nr; i++) {
+		rs[i].buf = aligned_alloc(4096, g_msgsize);
+		if (!rs[i].buf) {
+			fprintf(stderr, "reader %d: aligned_alloc(%zu) failed\n",
+				i, g_msgsize);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static void free_thread_bufs(struct wstats *ws, int nw,
+			     struct rstats *rs, int nr)
+{
+	if (ws)
+		for (int i = 0; i < nw; i++)
+			free(ws[i].buf);
+	if (rs)
+		for (int i = 0; i < nr; i++)
+			free(rs[i].buf);
+}
+
+static int start_readers(pthread_t *rt, struct rstats *rs, int nr,
+			 int *created)
+{
+	for (int i = 0; i < nr; i++) {
+		int err = pthread_create(&rt[i], NULL, reader, &rs[i]);
+
+		if (err) {
+			fprintf(stderr, "pthread_create reader %d: %s\n",
+				i, strerror(err));
+			return -1;
+		}
+		(*created)++;
+	}
+	return 0;
+}
+
+static int start_writers(pthread_t *wt, struct wstats *ws, int nw,
+			 int *created)
+{
+	for (int i = 0; i < nw; i++) {
+		int err = pthread_create(&wt[i], NULL, writer, &ws[i]);
+
+		if (err) {
+			fprintf(stderr, "pthread_create writer %d: %s\n",
+				i, strerror(err));
+			return -1;
+		}
+		(*created)++;
+	}
+	return 0;
+}
+
+static int open_bench_pipe(void)
+{
+	if (pipe(g_pipe) < 0) {
+		perror("pipe");
+		return -1;
+	}
+	if (fcntl(g_pipe[1], F_SETPIPE_SZ, g_pipe_size) < 0)
+		perror("F_SETPIPE_SZ (continuing)");
+	return 0;
+}
+
+/*
+ * Normal termination: g_stop tells writers to leave the loop after the
+ * current write() returns. Closing the shared write-end fd means once
+ * the in-flight writes drain, readers see EOF and exit. Writers are not
+ * unblocked by EPIPE here -- g_pipe[0] stays open so readers can keep
+ * draining.
+ *
+ * Error path: some threads may have been created and others skipped, so
+ * writers could be blocked in write() with no reader making progress.
+ * Close both ends -- closing the read end is what delivers EPIPE to a
+ * blocked writer.
+ */
+static void stop_and_join(pthread_t *wt, int nw_created,
+			  pthread_t *rt, int nr_created, int rc)
+{
+	atomic_store(&g_stop, 1);
+	close(g_pipe[1]);
+	if (rc < 0)
+		close(g_pipe[0]);
+	for (int i = 0; i < nw_created; i++)
+		pthread_join(wt[i], NULL);
+	for (int i = 0; i < nr_created; i++)
+		pthread_join(rt[i], NULL);
+	if (rc == 0)
+		close(g_pipe[0]);
+}
+
+static int run_one(int nw, int nr)
+{
+	pthread_t *wt = NULL, *rt = NULL;
+	struct wstats *ws = NULL;
+	struct rstats *rs = NULL;
+	int nw_created = 0, nr_created = 0;
+	int rc = 0;
+
+	atomic_store(&g_stop, 0);
+
+	if (open_bench_pipe() < 0)
+		return -1;
+
+	wt = calloc((size_t)nw, sizeof(*wt));
+	rt = calloc((size_t)nr, sizeof(*rt));
+	ws = calloc((size_t)nw, sizeof(*ws));
+	rs = calloc((size_t)nr, sizeof(*rs));
+	if (!wt || !rt || !ws || !rs) {
+		fprintf(stderr, "alloc failed\n");
+		rc = -1;
+		goto teardown;
+	}
+
+	if (alloc_thread_bufs(ws, nw, rs, nr) < 0) {
+		rc = -1;
+		goto teardown;
+	}
+
+	if (start_readers(rt, rs, nr, &nr_created) < 0 ||
+	    start_writers(wt, ws, nw, &nw_created) < 0) {
+		rc = -1;
+		goto teardown;
+	}
+
+	sleep((unsigned int)g_duration);
+
+teardown:
+	stop_and_join(wt, nw_created, rt, nr_created, rc);
+
+	if (rc == 0) {
+		summarize(ws, nw, nr);
+		fflush(stdout);
+	}
+
+	free_thread_bufs(ws, nw, rs, nr);
+	free(wt);
+	free(rt);
+	free(ws);
+	free(rs);
+	return rc;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [-w writers] [-r readers] [-s msgsize] [-d secs] [-p pipe_size] [--memory-pressure]\n"
+		"  default: sweep writers={1,2,5} x readers={1,5,10}\n"
+		"  --memory-pressure: spawn stress-ng (--vm 4 --vm-bytes 80%% --vm-method all) for the run\n",
+		prog);
+}
+
+static int parse_args(int argc, char **argv,
+		      int *writers_override, int *readers_override)
+{
+	static const struct option long_opts[] = {
+		{"memory-pressure", no_argument, NULL, 'M'},
+		{0, 0, 0, 0},
+	};
+	int opt;
+
+	while ((opt = getopt_long(argc, argv, "w:r:s:d:p:",
+				  long_opts, NULL)) != -1) {
+		switch (opt) {
+		case 'w':
+			*writers_override = atoi(optarg);
+			break;
+		case 'r':
+			*readers_override = atoi(optarg);
+			break;
+		case 's':
+			g_msgsize = (size_t)atol(optarg);
+			break;
+		case 'd':
+			g_duration = atoi(optarg);
+			break;
+		case 'p':
+			g_pipe_size = atoi(optarg);
+			break;
+		case 'M':
+			g_memory_pressure = 1;
+			break;
+		default:
+			usage(argv[0]);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * aligned_alloc(4096, size) requires size to be a multiple of the
+ * alignment (C11); glibc returns NULL otherwise, which would make
+ * writer/reader threads silently exit and the run report zero writes.
+ * Validate up front instead.
+ */
+static int validate_args(void)
+{
+	if (g_msgsize == 0 || g_msgsize % 4096 != 0) {
+		fprintf(stderr,
+			"msgsize must be a positive multiple of 4096 (got %zu)\n",
+			g_msgsize);
+		return -1;
+	}
+	if (g_duration <= 0) {
+		fprintf(stderr, "duration must be > 0 seconds (got %d)\n",
+			g_duration);
+		return -1;
+	}
+	if (g_pipe_size <= 0) {
+		fprintf(stderr, "pipe_size must be > 0 bytes (got %d)\n",
+			g_pipe_size);
+		return -1;
+	}
+	return 0;
+}
+
+static int run_sweep(void)
+{
+	static const int writers_sweep[] = {1, 2, 5};
+	static const int readers_sweep[] = {1, 5, 10};
+
+	for (size_t i = 0; i < ARRAY_SIZE(writers_sweep); i++) {
+		for (size_t j = 0; j < ARRAY_SIZE(readers_sweep); j++) {
+			printf("---\n");
+			if (run_one(writers_sweep[i], readers_sweep[j]) < 0)
+				return -1;
+		}
+	}
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int writers_override = 0, readers_override = 0;
+	pid_t stress_pid = -1;
+	int rc = 0;
+
+	if (parse_args(argc, argv, &writers_override, &readers_override) < 0)
+		return 1;
+	if (validate_args() < 0)
+		return 1;
+
+	signal(SIGPIPE, SIG_IGN);
+	setvbuf(stdout, NULL, _IOLBF, 0);
+	setvbuf(stderr, NULL, _IOLBF, 0);
+
+	fprintf(stderr, "pid=%d\n", getpid());
+	fflush(stderr);
+
+	if (g_memory_pressure) {
+		stress_pid = spawn_stress_ng();
+		if (stress_pid < 0) {
+			fprintf(stderr,
+				"memory_pressure requested but stress-ng could not be spawned\n");
+			return 1;
+		}
+	}
+
+	if (writers_override > 0 || readers_override > 0) {
+		int nw = writers_override > 0 ? writers_override : 1;
+		int nr = readers_override > 0 ? readers_override : 1;
+
+		rc = run_one(nw, nr) < 0 ? 1 : 0;
+	} else {
+		rc = run_sweep() < 0 ? 1 : 0;
+	}
+
+	kill_stress_ng(stress_pid);
+	return rc;
+}

-- 
2.54.0
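
The comment above validate_args() explains why the benchmark rejects message sizes that are not a multiple of 4096. As a rough illustration of the alternative (rounding up instead of rejecting, which is not what the patch does), a minimal userspace sketch might look like the following. The helper names round_up_pow2() and alloc_page_aligned() are hypothetical and do not appear in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round size up to the next multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: return a 4096-aligned buffer of at least size bytes,
 * always passing aligned_alloc() a size that is a multiple of the alignment.
 */
static void *alloc_page_aligned(size_t size)
{
	/* Reject zero and sizes that would overflow when rounded up. */
	if (size == 0 || size > SIZE_MAX - 4095)
		return NULL;
	return aligned_alloc(4096, round_up_pow2(size, 4096));
}
```

Rejecting bad sizes up front, as the patch does, keeps the reported message size identical to the size actually written, which is probably the better choice for a benchmark.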



* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-05-24 14:44 [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
  2026-05-24 14:44 ` [PATCH v3 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-24 14:44 ` [PATCH v3 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
@ 2026-05-28 12:34 ` Christian Brauner
  2026-06-16 20:47 ` Josh Triplett
  3 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2026-05-28 12:34 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, oleg, axboe, kernel-team

On Sun, 24 May 2026 07:44:57 -0700, Breno Leitao wrote:
> While profiling Meta's caching code[1], I found pipe->mutex contention
> on the hot path. anon_pipe_write() currently calls alloc_page() once
> per page while holding pipe->mutex. The allocation can sleep doing
> direct reclaim and runs memcg charging, which extends the critical
> section and stalls any concurrent reader on the same mutex.
> 
> This series pre-allocates pages outside pipe->mutex in
> anon_pipe_write(): for writes that span more than one full page, up
> to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> then drains the prealloc array first, falls back to the per-pipe
> tmp_page[] cache, and only enters the allocator under the mutex for
> the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> writes that skip prealloc, or shortfalls when the prealloc loop
> fails). Leftover prealloc pages are recycled into tmp_page[] before
> unlock and any remainder is put_page()'d after unlock, keeping the
> allocator out of the critical section on both sides.
> 
> [...]

Applied to the vfs-7.2.misc branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.2.misc

[1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      https://git.kernel.org/vfs/vfs/c/212ed884a1ae
[2/2] selftests/pipe: add pipe_bench microbenchmark
      https://git.kernel.org/vfs/vfs/c/d29bd8efe162


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-05-24 14:44 [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
                   ` (2 preceding siblings ...)
  2026-05-28 12:34 ` [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Christian Brauner
@ 2026-06-16 20:47 ` Josh Triplett
  2026-06-17  8:52   ` Oleg Nesterov
  3 siblings, 1 reply; 18+ messages in thread
From: Josh Triplett @ 2026-06-16 20:47 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, oleg, axboe, kernel-team

On Sun, May 24, 2026 at 07:44:57AM -0700, Breno Leitao wrote:
> This series pre-allocates pages outside pipe->mutex in
> anon_pipe_write(): for writes that span more than one full page, up
> to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> then drains the prealloc array first, falls back to the per-pipe
> tmp_page[] cache, and only enters the allocator under the mutex for
> the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> writes that skip prealloc, or shortfalls when the prealloc loop
> fails). Leftover prealloc pages are recycled into tmp_page[] before
> unlock and any remainder is put_page()'d after unlock, keeping the
> allocator out of the critical section on both sides.
[...]
> I also vibe-coded a microbenchmark to validate the change. It sweeps
> writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> 1 MB pipe and prints throughput + latency percentiles per config.

How do the numbers compare with 1-byte writes/reads? (It's fine if
they're not *faster*, just want to make sure they don't get any
*worse*. This case comes up a lot with pipes used for synchronization or
event reporting, such as with make.)


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-16 20:47 ` Josh Triplett
@ 2026-06-17  8:52   ` Oleg Nesterov
  2026-06-17 10:23     ` Breno Leitao
  0 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2026-06-17  8:52 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On 06/16, Josh Triplett wrote:
>
> On Sun, May 24, 2026 at 07:44:57AM -0700, Breno Leitao wrote:
> > This series pre-allocates pages outside pipe->mutex in
> > anon_pipe_write(): for writes that span more than one full page, up
> > to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> > alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> > then drains the prealloc array first, falls back to the per-pipe
> > tmp_page[] cache, and only enters the allocator under the mutex for
> > the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> > writes that skip prealloc, or shortfalls when the prealloc loop
> > fails). Leftover prealloc pages are recycled into tmp_page[] before
> > unlock and any remainder is put_page()'d after unlock, keeping the
> > allocator out of the critical section on both sides.
> [...]
> > I also vibe-coded a microbenchmark to validate the change. It sweeps
> > writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> > 1 MB pipe and prints throughput + latency percentiles per config.
>
> How do the numbers compare with 1-byte writes/reads? (It's fine if
> they're not *faster*, just want to make sure they don't get any
> *worse*. This case comes up a lot with pipes used for synchronization or
> event reporting, such as with make.)

Note the "for writes that span more than one full page" above. Pre-allocate
does nothing if total_len <= PAGE_SIZE.

Oleg.



* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17  8:52   ` Oleg Nesterov
@ 2026-06-17 10:23     ` Breno Leitao
  2026-06-17 11:59       ` Mateusz Guzik
  2026-06-17 15:01       ` Mateusz Guzik
  0 siblings, 2 replies; 18+ messages in thread
From: Breno Leitao @ 2026-06-17 10:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Josh Triplett, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 10:52:40AM +0200, Oleg Nesterov wrote:
> On 06/16, Josh Triplett wrote:
> >
> > On Sun, May 24, 2026 at 07:44:57AM -0700, Breno Leitao wrote:
> > > This series pre-allocates pages outside pipe->mutex in
> > > anon_pipe_write(): for writes that span more than one full page, up
> > > to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> > > alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> > > then drains the prealloc array first, falls back to the per-pipe
> > > tmp_page[] cache, and only enters the allocator under the mutex for
> > > the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> > > writes that skip prealloc, or shortfalls when the prealloc loop
> > > fails). Leftover prealloc pages are recycled into tmp_page[] before
> > > unlock and any remainder is put_page()'d after unlock, keeping the
> > > allocator out of the critical section on both sides.
> > [...]
> > > I also vibe-coded a microbenchmark to validate the change. It sweeps
> > > writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> > > 1 MB pipe and prints throughput + latency percentiles per config.
> >
> > How do the numbers compare with 1-byte writes/reads? (It's fine if
> > they're not *faster*, just want to make sure they don't get any
> > *worse*. This case comes up a lot with pipes used for synchronization or
> > event reporting, such as with make.)
> 
> Note the "for writes that span more than one full page" above. Pre-allocate
> does nothing if total_len <= PAGE_SIZE.

Exactly.


The pre-allocation only triggers for multi-page writes:

anon_pipe_get_page_prealloc() returns immediately when total_len <= PAGE_SIZE,
so a 1-byte (or any sub-page) write never enters the new path.

anon_pipe_get_page() then falls through to the existing tmp_page/alloc_page
logic exactly as before; the only added cost is one length check and a NULL
prealloc pop, both trivially predicted.

Measured it to _just be sure_, 1-byte ping-pong (perf bench sched pipe -s 1):

    baseline:  2.674 usecs/op
    patched:   2.710 usecs/op   (+1.3%, within run-to-run noise)

--breno
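
The gating described above can be modelled in userspace. This is a minimal sketch, assuming 4 KiB pages; prealloc_want() is a hypothetical name, and the arithmetic follows the min_t(DIV_ROUND_UP(total_len, PAGE_SIZE), PIPE_PREALLOC_MAX) expression visible in the diffs in this thread rather than the full kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */
#define PIPE_PREALLOC_MAX	8u	/* cap quoted in the cover letter */

/* Number of pages a write of total_len bytes would pre-allocate. */
static unsigned int prealloc_want(size_t total_len)
{
	size_t pages;

	assert(MODEL_PAGE_SIZE != 0);

	/* Writes of at most one page never enter the pre-allocation path. */
	if (total_len <= MODEL_PAGE_SIZE)
		return 0;

	/* DIV_ROUND_UP(total_len, PAGE_SIZE), written to avoid overflow. */
	pages = total_len / MODEL_PAGE_SIZE +
		(total_len % MODEL_PAGE_SIZE != 0);

	/* Clamp to the pre-allocation cap. */
	return pages < PIPE_PREALLOC_MAX ? (unsigned int)pages
					 : PIPE_PREALLOC_MAX;
}
```

With these assumptions a 1-byte write asks for zero pages, a 4097-byte write asks for two, and a 64 KB write is clamped to eight.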


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 10:23     ` Breno Leitao
@ 2026-06-17 11:59       ` Mateusz Guzik
  2026-06-17 14:37         ` Oleg Nesterov
  2026-06-17 16:04         ` Oleg Nesterov
  2026-06-17 15:01       ` Mateusz Guzik
  1 sibling, 2 replies; 18+ messages in thread
From: Mateusz Guzik @ 2026-06-17 11:59 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Oleg Nesterov, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 12:24 PM Breno Leitao <leitao@debian.org> wrote:
>
> On Wed, Jun 17, 2026 at 10:52:40AM +0200, Oleg Nesterov wrote:
> > On 06/16, Josh Triplett wrote:
> > >
> > > On Sun, May 24, 2026 at 07:44:57AM -0700, Breno Leitao wrote:
> > > > This series pre-allocates pages outside pipe->mutex in
> > > > anon_pipe_write(): for writes that span more than one full page, up
> > > > to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> > > > alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> > > > then drains the prealloc array first, falls back to the per-pipe
> > > > tmp_page[] cache, and only enters the allocator under the mutex for
> > > > the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> > > > writes that skip prealloc, or shortfalls when the prealloc loop
> > > > fails). Leftover prealloc pages are recycled into tmp_page[] before
> > > > unlock and any remainder is put_page()'d after unlock, keeping the
> > > > allocator out of the critical section on both sides.
> > > [...]
> > > > I also vibe-coded a microbenchmark to validate the change. It sweeps
> > > > writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> > > > 1 MB pipe and prints throughput + latency percentiles per config.
> > >
> > > How do the numbers compare with 1-byte writes/reads? (It's fine if
> > > they're not *faster*, just want to make sure they don't get any
> > > *worse*. This case comes up a lot with pipes used for synchronization or
> > > event reporting, such as with make.)
> >
> > Note the "for writes that span more than one full page" above. Pre-allocate
> > does nothing if total_len <= PAGE_SIZE.
>
> Exactly.
>
>
> The pre-allocation only triggers for multi-page writes:
>
> anon_pipe_get_page_prealloc() returns immediately when total_len <= PAGE_SIZE,
> so a 1-byte (or any sub-page) write never enters the new path.
>
> anon_pipe_get_page() then falls through to the existing tmp_page/alloc_page
> logic exactly as before; the only added cost is one length check and a NULL
> prealloc pop, both trivially predicted.
>
> Measured it to _just be sure_, 1-byte ping-pong (perf bench sched pipe -s 1):
>
>     baseline:  2.674 usecs/op
>     patched:   2.710 usecs/op   (+1.3%, within run-to-run noise)
>
> --breno

There are trivial touch ups which can be done by adding a bunch of
predicts and inlining kill_fasync if someone can be bothered.


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 11:59       ` Mateusz Guzik
@ 2026-06-17 14:37         ` Oleg Nesterov
  2026-06-17 14:47           ` Breno Leitao
  2026-06-17 14:51           ` Mateusz Guzik
  2026-06-17 16:04         ` Oleg Nesterov
  1 sibling, 2 replies; 18+ messages in thread
From: Oleg Nesterov @ 2026-06-17 14:37 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Breno Leitao, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On 06/17, Mateusz Guzik wrote:
>
> There are trivial touch ups which can be done by adding a bunch of
> predicts and inlining kill_fasync if someone can be bothered.

I was thinking about another change, see below. It assumes that in the
likely case another writer won't steal the pages from ->tmp_page[]
before we take pipe->mutex.

I'm not sure this makes sense, and I have no idea how it would impact
performance in "real" workloads.

Oleg.
---

diff --git a/fs/pipe.c b/fs/pipe.c
index 429b0714ec57..9f07f469830a 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -131,7 +131,8 @@ struct anon_pipe_prealloc {
  * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
  * in-lock alloc_page() fallback in anon_pipe_get_page().
  */
-static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
+static void anon_pipe_get_page_prealloc(struct pipe_inode_info *pipe,
+					struct anon_pipe_prealloc *prealloc,
 					size_t total_len)
 {
 	unsigned int want, i;
@@ -144,6 +145,11 @@ static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
 	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
 		     PIPE_PREALLOC_MAX);
 
+	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+		if (pipe->tmp_page[i] && !--want)
+			return;
+	}
+
 	for (i = 0; i < want; i++) {
 		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 		if (!page)
@@ -548,7 +554,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;
 
-	anon_pipe_get_page_prealloc(&prealloc, total_len);
+	anon_pipe_get_page_prealloc(pipe, &prealloc, total_len);
 
 	mutex_lock(&pipe->mutex);
 



* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 14:37         ` Oleg Nesterov
@ 2026-06-17 14:47           ` Breno Leitao
  2026-06-17 14:57             ` Mateusz Guzik
  2026-06-17 15:45             ` Oleg Nesterov
  2026-06-17 14:51           ` Mateusz Guzik
  1 sibling, 2 replies; 18+ messages in thread
From: Breno Leitao @ 2026-06-17 14:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mateusz Guzik, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

Hello Mateusz,

On Wed, Jun 17, 2026 at 04:37:24PM +0200, Oleg Nesterov wrote:
> On 06/17, Mateusz Guzik wrote:
> >
> > There are trivial touch ups which can be done by adding a bunch of
> > predicts and inlining kill_fasync if someone can be bothered.
> 
> I was thinking about another change, see below. It assumes that in the
> likely case another writer won't steal the pages from ->tmp_page[]
> before we take pipe->mutex.

Do you think we could eventually eliminate the tmp_page[] array and
consolidate everything into the prealloc pages? That would unify the two
page pools currently used in the pipe write path.

When I examined this previously, it appeared non-trivial but potentially
feasible.

What is your view on it?

Thanks,
--breno


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 14:37         ` Oleg Nesterov
  2026-06-17 14:47           ` Breno Leitao
@ 2026-06-17 14:51           ` Mateusz Guzik
  2026-06-17 15:30             ` Oleg Nesterov
  1 sibling, 1 reply; 18+ messages in thread
From: Mateusz Guzik @ 2026-06-17 14:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Breno Leitao, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 4:37 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/17, Mateusz Guzik wrote:
> >
> > There are trivial touch ups which can be done by adding a bunch of
> > predicts and inlining kill_fasync if someone can be bothered.
>
> I was thinking about another change, see below. It assumes that in the
> likely case another writer won't steal the pages from ->tmp_page[]
> before we take pipe->mutex.
>
> I'm not sure this makes sense, and I have no idea how it would impact
> performance in "real" workloads.
>
> Oleg.
> ---
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 429b0714ec57..9f07f469830a 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -131,7 +131,8 @@ struct anon_pipe_prealloc {
>   * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
>   * in-lock alloc_page() fallback in anon_pipe_get_page().
>   */
> -static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +static void anon_pipe_get_page_prealloc(struct pipe_inode_info *pipe,
> +                                       struct anon_pipe_prealloc *prealloc,
>                                         size_t total_len)
>  {
>         unsigned int want, i;
> @@ -144,6 +145,11 @@ static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
>         want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
>                      PIPE_PREALLOC_MAX);
>
> +       for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
> +               if (pipe->tmp_page[i] && !--want)
> +                       return;
> +       }
> +
>         for (i = 0; i < want; i++) {
>                 page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>                 if (!page)
> @@ -548,7 +554,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         if (unlikely(total_len == 0))
>                 return 0;
>
> -       anon_pipe_get_page_prealloc(&prealloc, total_len);
> +       anon_pipe_get_page_prealloc(pipe, &prealloc, total_len);
>
>         mutex_lock(&pipe->mutex);
>
>

As proposed, this guarantees that a big write which fits into the pages
cached in tmp_page, followed by a small write, will have to resort to
an allocation under the mutex, partially defeating the original patch.
So you would need to add some provision to check whether you need to
allocate something even in that case.
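
To make the counting in the proposed loop concrete, here is a minimal userspace model of it, assuming 4 KiB pages and the two-entry tmp_page[] array. The function name and the bool array standing in for tmp_page[] are hypothetical; only the "cached && !--want" test mirrors the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */
#define PIPE_PREALLOC_MAX	8u	/* cap quoted in the cover letter */
#define TMP_PAGE_SLOTS		2u	/* tmp_page[] has two entries */

/*
 * Pages the pre-lock loop would still allocate under the proposed change,
 * given which tmp_page[] slots look populated during the unlocked check.
 */
static unsigned int prealloc_want_with_cache(size_t total_len,
					     const bool cached[TMP_PAGE_SLOTS])
{
	unsigned int want, i;
	size_t pages;

	assert(cached != NULL);

	/* Single-page writes skip pre-allocation entirely. */
	if (total_len <= MODEL_PAGE_SIZE)
		return 0;

	pages = total_len / MODEL_PAGE_SIZE +
		(total_len % MODEL_PAGE_SIZE != 0);
	want = pages < PIPE_PREALLOC_MAX ? (unsigned int)pages
					 : PIPE_PREALLOC_MAX;

	/* Mirrors: if (pipe->tmp_page[i] && !--want) return; */
	for (i = 0; i < TMP_PAGE_SLOTS; i++) {
		if (cached[i] && !--want)
			return 0;
	}
	return want;
}
```

In this model a two-page write with both slots cached pre-allocates nothing and would consume the cached pages. A later sub-page write also returns zero, so with tmp_page[] empty it would have to allocate under the mutex, which is the case described above.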


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 14:47           ` Breno Leitao
@ 2026-06-17 14:57             ` Mateusz Guzik
  2026-06-17 15:26               ` Breno Leitao
  2026-06-17 15:45             ` Oleg Nesterov
  1 sibling, 1 reply; 18+ messages in thread
From: Mateusz Guzik @ 2026-06-17 14:57 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Oleg Nesterov, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 4:47 PM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Mateusz,
>
> On Wed, Jun 17, 2026 at 04:37:24PM +0200, Oleg Nesterov wrote:
> > On 06/17, Mateusz Guzik wrote:
> > >
> > > There are trivial touch ups which can be done by adding a bunch of
> > > predicts and inlining kill_fasync if someone can be bothered.
> >
> > I was thinking about another change, see below. It assumes that in the
> > likely case another writer won't steal the pages from ->tmp_page[]
> > before we take pipe->mutex.
>
> Do you think we could eventually eliminate the tmp_page[] array and
> consolidate everything into the prealloc pages? That would unify the two
> page pools currently used in the pipe write path.
>
> When I examined this previously, it appeared non-trivial but potentially
> feasible.
>

I think I commented on this in my first e-mail.

In order for this to be acceptable there would have to be a way to
reclaim these pages in case of memory shortage. Otherwise, say you
cached 8 pages and the pipe remains unused afterwards -- all the pages
are actively being wasted with no means to free them.

Now that I wrote it though, I think this is very much doable.

Naively you could add a list for all allocated pipes, but that's
terrible from a perf standpoint even if said list is distributed enough
to not have contention on pipe creation/destruction in a real setting.

But suppose pipes are allocated from a dedicated slab. There is
presumably a way to walk all of the created pipes, regardless of whether
they are "allocated" or "freed", in which case such a shrinker could be
implemented without extra overhead for the write side. Someone
familiar with slab (slub?) internals would have to chime in though.


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 10:23     ` Breno Leitao
  2026-06-17 11:59       ` Mateusz Guzik
@ 2026-06-17 15:01       ` Mateusz Guzik
  2026-06-17 17:39         ` Breno Leitao
  1 sibling, 1 reply; 18+ messages in thread
From: Mateusz Guzik @ 2026-06-17 15:01 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Oleg Nesterov, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 03:23:54AM -0700, Breno Leitao wrote:
> Measured it to _just be sure_, 1-byte ping-pong (perf bench sched pipe -s 1):
> 
>     baseline:  2.674 usecs/op
>     patched:   2.710 usecs/op   (+1.3%, within run-to-run noise)
> 

Can you try this:

diff --git a/fs/fcntl.c b/fs/fcntl.c
index c158f082f1da..012d9d87f827 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1118,11 +1118,14 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
-/*
- * rcu_read_lock() is held
- */
-static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
+void __kill_fasync(struct fasync_struct **fp, int sig, int band)
 {
+	struct fasync_struct *fa;
+
+	guard(rcu)();
+
+	fa = rcu_dereference(*fp);
+
 	while (fa) {
 		struct fown_struct *fown;
 		unsigned long flags;
@@ -1148,19 +1151,7 @@ static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 		fa = rcu_dereference(fa->fa_next);
 	}
 }
-
-void kill_fasync(struct fasync_struct **fp, int sig, int band)
-{
-	/* First a quick test without locking: usually
-	 * the list is empty.
-	 */
-	if (*fp) {
-		rcu_read_lock();
-		kill_fasync_rcu(rcu_dereference(*fp), sig, band);
-		rcu_read_unlock();
-	}
-}
-EXPORT_SYMBOL(kill_fasync);
+EXPORT_SYMBOL(__kill_fasync);
 
 static int __init fcntl_init(void)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 429b0714ec57..bea4e92bf0a8 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -426,7 +426,7 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			}
 
 			error = pipe_buf_confirm(pipe, buf);
-			if (error) {
+			if (unlikely(error)) {
 				if (!ret)
 					ret = error;
 				break;
@@ -541,7 +541,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * the same time, we could set up lockdep annotations for that, but
 	 * since we don't actually need that, it's simpler to just bail here.
 	 */
-	if (pipe_has_watch_queue(pipe))
+	if (unlikely(pipe_has_watch_queue(pipe)))
 		return -EXDEV;
 
 	/* Null write succeeds. */
@@ -552,7 +552,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 
 	mutex_lock(&pipe->mutex);
 
-	if (!pipe->readers) {
+	if (unlikely(!pipe->readers)) {
 		if ((iocb->ki_flags & IOCB_NOSIGNAL) == 0)
 			send_sig(SIGPIPE, current, 0);
 		ret = -EPIPE;
@@ -593,7 +593,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	for (;;) {
-		if (!pipe->readers) {
+		if (unlikely(!pipe->readers)) {
 			if ((iocb->ki_flags & IOCB_NOSIGNAL) == 0)
 				send_sig(SIGPIPE, current, 0);
 			if (!ret)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d10897b3a1e3..6f86d1fe7589 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1383,7 +1383,12 @@ extern struct fasync_struct *fasync_alloc(void);
 extern void fasync_free(struct fasync_struct *);
 
 /* can be called from interrupts */
-extern void kill_fasync(struct fasync_struct **, int, int);
+void __kill_fasync(struct fasync_struct **fp, int sig, int band);
+static inline void kill_fasync(struct fasync_struct **fp, int sig, int band)
+{
+	if (unlikely(*fp))
+		__kill_fasync(fp, sig, band);
+}
 
 extern void __f_setown(struct file *filp, struct pid *, enum pid_type, int force);
 extern int f_setown(struct file *filp, int who, int force);
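
The shape of the kill_fasync() change above (an inline emptiness check in the header, with the list walk kept out of line) can be shown with a small userspace analogue. This is only a sketch: struct listener, notify_all() and notify_all_slow() are hypothetical names, and the unlikely() macro is defined locally on top of the GCC/Clang __builtin_expect builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel macro; relies on a GCC/Clang builtin. */
#define unlikely(x)	__builtin_expect(!!(x), 0)

struct listener {
	struct listener *next;
	int notified;
};

/* Out-of-line slow path: walk the list and notify every listener. */
static __attribute__((noinline)) int notify_all_slow(struct listener *head)
{
	int n = 0;

	for (; head; head = head->next) {
		head->notified++;
		n++;
	}
	return n;
}

/* Inline fast path: an empty list costs one branch and no function call. */
static inline int notify_all(struct listener **headp)
{
	assert(headp != NULL);
	if (unlikely(*headp))
		return notify_all_slow(*headp);
	return 0;
}
```

The kernel version additionally takes the RCU read lock inside the slow path via guard(rcu)(); that part has no counterpart in this sketch.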


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 14:57             ` Mateusz Guzik
@ 2026-06-17 15:26               ` Breno Leitao
  0 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-06-17 15:26 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Oleg Nesterov, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 04:57:16PM +0200, Mateusz Guzik wrote:
> On Wed, Jun 17, 2026 at 4:47 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > Hello Mateusz,
> >
> > On Wed, Jun 17, 2026 at 04:37:24PM +0200, Oleg Nesterov wrote:
> > > On 06/17, Mateusz Guzik wrote:
> > > >
> > > > There are trivial touch ups which can be done by adding a bunch of
> > > > predicts and inlining kill_fasync if someone can be bothered.
> > >
> > > I was thinking about another change, see below. It assumes that in the
> > > likely case another writer won't steal the pages from ->tmp_page[]
> > > before we take pipe->mutex.
> >
> > Do you think we could eventually eliminate the tmp_page[] array and
> > consolidate everything into the prealloc pages? That would unify the two
> > page pools currently used in the pipe write path.
> >
> > When I examined this previously, it appeared non-trivial but potentially
> > feasible.
> >
> 
> I think I commented on this in my first e-mail.
> 
> In order for this to be acceptable there would have to be a way to
> reclaim these pages in case of memory shortage.

Hmm, as I understand it, that scenario doesn't apply to the patch that got
accepted. The series doesn't grow the cache: tmp_page[] is still the existing
2-entry array.

anon_pipe_refill_tmp_pages() only fills empty tmp_page[] slots; every
other prealloc page is put_page()'d before the write() returns.

So an idle pipe holds at most 2 cached pages — the same cap mainline
already maintains via anon_pipe_put_page() on the release side, freed in
free_pipe_info().

	/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
	static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
	{
		while (prealloc->count) {
			prealloc->count--;
			put_page(prealloc->pages[prealloc->count]);
		}
	}

Net new unreclaimable footprint from this series is zero.
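
The bound described above can be checked with a small userspace model. This is a sketch under assumptions: malloc()/free() stand in for page allocation and put_page(), the struct and function names are hypothetical, and the refill loop is a guess at the behaviour described for anon_pipe_refill_tmp_pages() (fill empty slots only), not a copy of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TMP_PAGE_SLOTS		2	/* tmp_page[] size stated in the thread */
#define PIPE_PREALLOC_MAX	8	/* cap quoted in the cover letter */

struct model_pipe {
	void *tmp_page[TMP_PAGE_SLOTS];
};

struct model_prealloc {
	void *pages[PIPE_PREALLOC_MAX];
	unsigned int count;
};

/* Under the lock: move leftover pages into empty tmp_page[] slots only. */
static unsigned int model_refill_tmp_pages(struct model_pipe *p,
					   struct model_prealloc *pre)
{
	unsigned int i, moved = 0;

	assert(pre->count <= PIPE_PREALLOC_MAX);
	for (i = 0; i < TMP_PAGE_SLOTS && pre->count; i++) {
		if (!p->tmp_page[i]) {
			p->tmp_page[i] = pre->pages[--pre->count];
			moved++;
		}
	}
	return moved;
}

/* After unlock: release every remaining page; returns how many were freed. */
static unsigned int model_free_pages(struct model_prealloc *pre)
{
	unsigned int freed = 0;

	while (pre->count) {
		pre->count--;
		free(pre->pages[pre->count]);
		freed++;
	}
	return freed;
}
```

Whatever the number of leftover pages, at most TMP_PAGE_SLOTS of them survive the write in this model; the rest are released once the lock is dropped.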


* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 14:51           ` Mateusz Guzik
@ 2026-06-17 15:30             ` Oleg Nesterov
  0 siblings, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2026-06-17 15:30 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Breno Leitao, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On 06/17, Mateusz Guzik wrote:
>
> > -static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> > +static void anon_pipe_get_page_prealloc(struct pipe_inode_info *pipe,
> > +                                       struct anon_pipe_prealloc *prealloc,
> >                                         size_t total_len)
> >  {
> >         unsigned int want, i;
> > @@ -144,6 +145,11 @@ static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> >         want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> >                      PIPE_PREALLOC_MAX);
> >
> > +       for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
> > +               if (pipe->tmp_page[i] && !--want)
> > +                       return;
> > +       }

> As proposed, this guarantees that a big write which fits into the pages
> cached in tmp_page, followed by a small write, will have to resort to
> an allocation under the mutex, partially defeating the original patch.
> So you would need to add some provision to check whether you need to
> allocate something even in that case.

Yes, with the change like this, at least the "total_len <= PAGE_SIZE"
check should be revisited.

But let me repeat: I'm not sure this makes sense, and I have no idea how
it would impact performance in "real" workloads.

In particular, I don't know if the case when another writer steals the
pages from ->tmp_page[] is actually "unlikely".

Oleg.



* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 14:47           ` Breno Leitao
  2026-06-17 14:57             ` Mateusz Guzik
@ 2026-06-17 15:45             ` Oleg Nesterov
  1 sibling, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2026-06-17 15:45 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Mateusz Guzik, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On 06/17, Breno Leitao wrote:
>
> Hello Mateusz,
>
> On Wed, Jun 17, 2026 at 04:37:24PM +0200, Oleg Nesterov wrote:
> > On 06/17, Mateusz Guzik wrote:
> > >
> > > There are trivial touch ups which can be done by adding a bunch of
> > > predicts and inlining kill_fasync if someone can be bothered.
> >
> > I was thinking about another change, see below. It assumes that in the
> > likely case another writer won't steal the pages from ->tmp_page[]
> > before we take pipe->mutex.
>
> Do you think we could eventually eliminate the tmp_page[] array and
> consolidate everything into the prealloc pages? That would unify the two
> page pools currently used in the pipe write path.

Cough... When I saw the 1st version of your patches, my first thought was:
we need to unify prealloc and tmp_page[] somehow.

But nothing simple came to my mind ;)

Oleg.



* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 11:59       ` Mateusz Guzik
  2026-06-17 14:37         ` Oleg Nesterov
@ 2026-06-17 16:04         ` Oleg Nesterov
  1 sibling, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2026-06-17 16:04 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Breno Leitao, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On 06/17, Mateusz Guzik wrote:
>
> There are trivial touch ups which can be done by adding a bunch of
> predicts and inlining kill_fasync if someone can be bothered.

Speaking of trivial touch ups...

anon_pipe_write() does:

	 * Epoll nonsensically wants a wakeup whether the pipe
	 * was already empty or not.
	 */
	if (was_empty || pipe->poll_usage)
		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);

Again, I have no idea if the unnecessary wakeup affects the performance,
probably not. But somehow this "|| poll_usage" condition looks annoying
to me...

Perhaps it makes sense to change pipe_poll() to not set ->poll_usage
unconditionally?
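
To make the interaction concrete, here is a simplified userspace model; the
struct and function names are hypothetical, and the two boolean parameters
stand in for (filp->f_mode & FMODE_READ) and a non-NULL filp->f_ep. It only
illustrates how narrowing the place where ->poll_usage is set would reduce
the cases in which the writer issues a wakeup on a non-empty pipe.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct pipe_inode_info. */
struct pipe_model {
	bool poll_usage;
};

/* Writer side: the condition quoted from anon_pipe_write(). */
static bool writer_wakes_reader(const struct pipe_model *pipe, bool was_empty)
{
	return was_empty || pipe->poll_usage;
}

/*
 * Poll side, modelling the proposed behaviour: the flag is latched only
 * when the poll is on the read end and comes from epoll. It is never
 * cleared again, matching the one-way nature of the original flag.
 */
static void poll_model(struct pipe_model *pipe, bool is_read_end, bool via_epoll)
{
	if (is_read_end && via_epoll && !pipe->poll_usage)
		pipe->poll_usage = true;
}
```

With this model, a plain poll() or select() on the read end, or an epoll
registration of the write end, leaves the writer waking the reader only when
the pipe was empty.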

Oleg.
---

diff --git a/fs/pipe.c b/fs/pipe.c
index 429b0714ec57..a60be1b71eb7 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -760,10 +760,12 @@ pipe_poll(struct file *filp, poll_table *wait)
 	struct pipe_inode_info *pipe = filp->private_data;
 	union pipe_index idx;
 
+#ifdef CONFIG_EPOLL
 	/* Epoll has some historical nasty semantics, this enables them */
-	if (unlikely(!READ_ONCE(pipe->poll_usage)))
+	if ((filp->f_mode & FMODE_READ) && filp->f_ep
+	    && unlikely(!READ_ONCE(pipe->poll_usage)))
 		WRITE_ONCE(pipe->poll_usage, true);
-
+#endif
 	/*
 	 * Reading pipe state only -- no need for acquiring the semaphore.
 	 *



* Re: [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-06-17 15:01       ` Mateusz Guzik
@ 2026-06-17 17:39         ` Breno Leitao
  0 siblings, 0 replies; 18+ messages in thread
From: Breno Leitao @ 2026-06-17 17:39 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Oleg Nesterov, Josh Triplett, Alexander Viro, Christian Brauner,
	Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel,
	linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team

On Wed, Jun 17, 2026 at 05:01:04PM +0200, Mateusz Guzik wrote:
> On Wed, Jun 17, 2026 at 03:23:54AM -0700, Breno Leitao wrote:
> > Measured it to _just be sure_, 1-byte ping-pong (perf bench sched pipe -s 1):
> > 
> >     baseline:  2.674 usecs/op
> >     patched:   2.710 usecs/op   (+1.3%, within run-to-run noise)
> > 
> 
> Can you try this:

I've tested this change, and the results show no improvement beyond
measurement noise.

	patched + your change: ~2.66 usecs/op   (-1.7%, within noise)
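
For reference, the percentages in these results are plain relative
differences. A minimal helper (pct_change() is a hypothetical name, not part
of any benchmark tool) that reproduces the baseline-to-patched figure quoted
above:

```c
/* Relative change of `value` against `base`, in percent. */
static double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

pct_change(2.674, 2.710) evaluates to roughly 1.35, which matches the +1.3%
quoted for the patched kernel. The "~2.66" figure is rounded, so its
percentage is not recomputed here.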

In my experience, branch hints haven't proven particularly effective; at
least, that was the case when I worked on PowerPC some time ago.


Thread overview: 18+ messages
2026-05-24 14:44 [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
2026-05-24 14:44 ` [PATCH v3 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-24 14:44 ` [PATCH v3 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
2026-05-28 12:34 ` [PATCH v3 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Christian Brauner
2026-06-16 20:47 ` Josh Triplett
2026-06-17  8:52   ` Oleg Nesterov
2026-06-17 10:23     ` Breno Leitao
2026-06-17 11:59       ` Mateusz Guzik
2026-06-17 14:37         ` Oleg Nesterov
2026-06-17 14:47           ` Breno Leitao
2026-06-17 14:57             ` Mateusz Guzik
2026-06-17 15:26               ` Breno Leitao
2026-06-17 15:45             ` Oleg Nesterov
2026-06-17 14:51           ` Mateusz Guzik
2026-06-17 15:30             ` Oleg Nesterov
2026-06-17 16:04         ` Oleg Nesterov
2026-06-17 15:01       ` Mateusz Guzik
2026-06-17 17:39         ` Breno Leitao
