Linux filesystem development
 help / color / mirror / Atom feed
* [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
@ 2026-05-22 16:44 Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep doing
direct reclaim and runs memcg charging, which extends the critical
section and stalls any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.

I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
  per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
  helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
  anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
  struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
  unlocking
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (2):
      fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 | 105 ++++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 5 files changed, 729 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
@ 2026-05-22 16:44 ` Breno Leitao
  2026-05-22 16:51   ` Jeff Layton
                     ` (2 more replies)
  2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton
  2 siblings, 3 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
thing") and then, from the per-iteration anon_pipe_get_page() helper,
used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
while still holding it.

That allocation can sleep doing direct reclaim and/or runs memcg
charging, which extends the critical section and stalls a concurrent
reader on the very same mutex.

Just pre-alloc the required pages before the lock in an array and just pop
them inside the lock.

This can improve the pipe throughput up to 48% and reduce the
latency in 33%, easily seen when there is memory pressure and direct
reclaim.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 102 insertions(+), 3 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9cf3e..cff255217bbfe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }
 
-static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+#define PIPE_PREALLOC_MAX 8
+
+struct anon_pipe_prealloc {
+	struct page *pages[PIPE_PREALLOC_MAX];
+	unsigned int count;
+};
+
+/*
+ * Pre-allocate pages outside pipe->mutex for multi-page writes.
+ * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
+ * charging; doing it under the mutex stalls a concurrent reader.
+ *
+ * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
+ * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
+ * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
+ * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
+ * honoured for every page; the per-call overhead is small compared to the
+ * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
+ * in-lock alloc_page() fallback in anon_pipe_get_page().
+ */
+static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
+					size_t total_len)
+{
+	unsigned int want, i;
+	struct page *page;
+
+	prealloc->count = 0;
+	if (total_len <= PAGE_SIZE)
+		return;
+
+	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
+		     PIPE_PREALLOC_MAX);
+
+	for (i = 0; i < want; i++) {
+		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
+		if (!page)
+			break;
+		prealloc->pages[prealloc->count++] = page;
+	}
+}
+
+static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
+{
+	if (!prealloc->count)
+		return NULL;
+
+	prealloc->count--;
+
+	return prealloc->pages[prealloc->count];
+}
+
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
 {
+	struct page *page;
+
+	/* Drain prealloc first to keep tmp_page[] hot for later small writes. */
+	page = anon_pipe_prealloc_pop(prealloc);
+	if (page)
+		return page;
+
 	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
 		if (pipe->tmp_page[i]) {
-			struct page *page = pipe->tmp_page[i];
+			page = pipe->tmp_page[i];
 			pipe->tmp_page[i] = NULL;
 			return page;
 		}
 	}
 
+	/* FWIW: This is called with pipe->mutex held */
 	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 }
 
@@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
 	put_page(page);
 }
 
+/*
+ * Stash leftover prealloc pages in tmp_page[] so the next write to this
+ * pipe gets a hot page without entering the allocator.
+ */
+static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
+{
+	int i, idx;
+
+	if (!prealloc->count)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+		if (pipe->tmp_page[i])
+			continue;
+		if (!prealloc->count)
+			return;
+		idx = --prealloc->count;
+		pipe->tmp_page[i] = prealloc->pages[idx];
+		prealloc->pages[idx] = NULL;
+	}
+}
+
+/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
+static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
+{
+	while (prealloc->count) {
+		prealloc->count--;
+		put_page(prealloc->pages[prealloc->count]);
+	}
+}
+
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
+	struct anon_pipe_prealloc prealloc;
 	unsigned int head;
 	ssize_t ret = 0;
 	size_t total_len = iov_iter_count(from);
@@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;
 
+	anon_pipe_get_page_prealloc(&prealloc, total_len);
+
 	mutex_lock(&pipe->mutex);
 
 	if (!pipe->readers) {
@@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			struct page *page;
 			int copied;
 
-			page = anon_pipe_get_page(pipe);
+			page = anon_pipe_get_page(pipe, &prealloc);
 			if (unlikely(!page)) {
 				if (!ret)
 					ret = -ENOMEM;
@@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * after waiting we need to re-check whether the pipe
 		 * become empty while we dropped the lock.
 		 */
+		anon_pipe_refill_tmp_pages(pipe, &prealloc);
 		mutex_unlock(&pipe->mutex);
+		anon_pipe_free_pages(&prealloc);
 		if (was_empty)
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
@@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		wake_next_writer = true;
 	}
 out:
+	anon_pipe_refill_tmp_pages(pipe, &prealloc);
 	if (pipe_is_full(pipe))
 		wake_next_writer = false;
 	mutex_unlock(&pipe->mutex);
+	anon_pipe_free_pages(&prealloc);
 
 	/*
 	 * If we do do a wakeup event, we do a 'sync' wakeup, because we

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-22 16:44 ` Breno Leitao
  2026-05-23 16:43   ` Oleg Nesterov
  2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton
  2 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

Add a small selftest that stresses pipe->mutex contention by spawning N
writer threads that hammer a single pipe with multi-page writes, plus M
reader threads that drain. Each writer records its own write() latency
samples into a log2-bucketed histogram; main aggregates and prints
total writes, throughput, average and percentile (p50/p99) latencies,
and the maximum observed latency.

Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
--vm-method all) for the duration of the run, so alloc_page() in
anon_pipe_write() routinely hits direct reclaim. The flag fails
fast if stress-ng is not on $PATH.

Program print something like the following, for different writes,
readers, msgsizes and memory pressure:

	config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576
	memory_pressure=[no|yes]
	writes: total=54451 rate=18150/s
	throughput_MBps: 1134.40
	lat_avg_ns: 275355
	lat_p50_ns_upper: 262143
	lat_p99_ns_upper: 1048575
	lat_max_ns: 2145633

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 4 files changed, 627 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e416..bcd9db9d292ca 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -91,6 +91,7 @@ TARGETS += pcie_bwctrl
 TARGETS += perf_events
 TARGETS += pidfd
 TARGETS += pid_namespace
+TARGETS += pipe
 TARGETS += power_supply
 TARGETS += powerpc
 TARGETS += prctl
diff --git a/tools/testing/selftests/pipe/.gitignore b/tools/testing/selftests/pipe/.gitignore
new file mode 100644
index 0000000000000..20b549361a152
--- /dev/null
+++ b/tools/testing/selftests/pipe/.gitignore
@@ -0,0 +1 @@
+pipe_bench
diff --git a/tools/testing/selftests/pipe/Makefile b/tools/testing/selftests/pipe/Makefile
new file mode 100644
index 0000000000000..1810c680117b3
--- /dev/null
+++ b/tools/testing/selftests/pipe/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+# Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+
+CFLAGS += -O2 -Wall -Wextra -pthread
+
+TEST_GEN_PROGS := pipe_bench
+
+include ../lib.mk
diff --git a/tools/testing/selftests/pipe/pipe_bench.c b/tools/testing/selftests/pipe/pipe_bench.c
new file mode 100644
index 0000000000000..7e96429b8fb4d
--- /dev/null
+++ b/tools/testing/selftests/pipe/pipe_bench.c
@@ -0,0 +1,616 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pipe_bench - exercise concurrent pipe operation
+ *
+ * N writer threads hammer a single pipe with multi-page writes; M reader
+ * threads drain it. Each writer records its own write() latency histogram.
+ * Multi-page writes (msgsize >= PAGE_SIZE) force the loop in
+ * anon_pipe_write() to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) under
+ * pipe->mutex, which is the critical section the patch shrinks.
+ *
+ * By default the benchmark sweeps writers in {1, 2, 5} x readers in
+ * {1, 5, 10} and prints one block per configuration so two runs (e.g.
+ * baseline vs patched) can be diffed directly. Pass -w and -r to run a
+ * single configuration instead. Pass --memory-pressure to spawn stress-ng
+ * alongside the sweep so the per-page alloc_page() path under pipe->mutex
+ * has to dip into reclaim.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <poll.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
+#define HIST_BUCKETS	32
+
+static size_t g_msgsize = 16 * 4096;
+static int g_duration = 3;
+static int g_pipe_size = 1024 * 1024;
+static int g_memory_pressure;
+
+static atomic_int g_stop;
+static int g_pipe[2];
+
+struct wstats {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum_ns;
+	uint64_t lat_max_ns;
+	uint64_t lat_hist[HIST_BUCKETS];
+	char *buf;
+};
+
+struct rstats {
+	char *buf;
+};
+
+struct hist_totals {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum;
+	uint64_t lat_max;
+};
+
+static inline uint64_t now_ns(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
+}
+
+static inline int log2_bucket(uint64_t v)
+{
+	int b = 0;
+
+	if (!v)
+		return 0;
+	while (v >>= 1)
+		b++;
+	return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1;
+}
+
+static void *writer(void *arg)
+{
+	struct wstats *s = arg;
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		uint64_t t0 = now_ns();
+		ssize_t n = write(g_pipe[1], s->buf, g_msgsize);
+		uint64_t dt = now_ns() - t0;
+
+		if (n > 0) {
+			s->writes++;
+			s->bytes += (uint64_t)n;
+			s->lat_sum_ns += dt;
+			if (dt > s->lat_max_ns)
+				s->lat_max_ns = dt;
+			s->lat_hist[log2_bucket(dt)]++;
+		} else if (n < 0 && (errno == EPIPE || errno == EBADF)) {
+			break;
+		}
+	}
+	return NULL;
+}
+
+static void *reader(void *arg)
+{
+	struct rstats *s = arg;
+
+	/*
+	 * Drain until EOF (write end closed by main). g_stop is not checked
+	 * here on purpose: writers may be blocked in write() with the pipe
+	 * full when g_stop is set, so the reader must keep draining until
+	 * main closes the write end.
+	 */
+	for (;;) {
+		ssize_t n = read(g_pipe[0], s->buf, g_msgsize);
+
+		if (n <= 0)
+			break;
+	}
+	return NULL;
+}
+
+/* Sum per-writer stats and per-bucket counts into the caller's aggregates. */
+static void aggregate_wstats(struct wstats *all, int nw,
+			     uint64_t agg[HIST_BUCKETS],
+			     struct hist_totals *t)
+{
+	memset(t, 0, sizeof(*t));
+	for (int i = 0; i < nw; i++) {
+		t->writes += all[i].writes;
+		t->bytes += all[i].bytes;
+		t->lat_sum += all[i].lat_sum_ns;
+		if (all[i].lat_max_ns > t->lat_max)
+			t->lat_max = all[i].lat_max_ns;
+		for (int b = 0; b < HIST_BUCKETS; b++)
+			agg[b] += all[i].lat_hist[b];
+	}
+}
+
+/*
+ * Walk @agg in order, returning the inclusive upper bound (in ns) of the
+ * log2 bucket where the running sum first reaches @target.
+ *
+ * A percentile is undefined with zero samples, and with very low sample
+ * counts integer truncation could make @target zero -- then "cum >= 0"
+ * would latch on the first (possibly empty) bucket. Callers must pass
+ * @target >= 1.
+ */
+static uint64_t bucket_at(const uint64_t agg[HIST_BUCKETS], uint64_t target)
+{
+	uint64_t cum = 0;
+
+	for (int b = 0; b < HIST_BUCKETS; b++) {
+		/* HIST_BUCKETS <= 63, so (b + 1) is always a safe shift. */
+		uint64_t upper = (1ULL << (b + 1)) - 1;
+
+		cum += agg[b];
+		if (cum >= target)
+			return upper;
+	}
+	return 0;
+}
+
+static void compute_p50_p99(const uint64_t agg[HIST_BUCKETS], uint64_t writes,
+			    uint64_t *p50, uint64_t *p99)
+{
+	uint64_t p50_target, p99_target;
+
+	*p50 = *p99 = 0;
+	if (!writes)
+		return;
+
+	p50_target = writes * 50 / 100;
+	p99_target = writes * 99 / 100;
+	if (!p50_target)
+		p50_target = 1;
+	if (!p99_target)
+		p99_target = 1;
+
+	*p50 = bucket_at(agg, p50_target);
+	*p99 = bucket_at(agg, p99_target);
+}
+
+static void print_summary(int nw, int nr, const struct hist_totals *t,
+			  uint64_t p50, uint64_t p99)
+{
+	double sec = g_duration;
+	uint64_t avg_ns = t->writes ? t->lat_sum / t->writes : 0;
+
+	printf("config: writers=%d readers=%d msgsize=%zu duration=%d pipe_size=%d memory_pressure=%s\n",
+	       nw, nr, g_msgsize, g_duration, g_pipe_size,
+	       g_memory_pressure ? "yes" : "no");
+	printf("writes: total=%llu rate=%.0f/s\n",
+	       (unsigned long long)t->writes, (double)t->writes / sec);
+	printf("throughput_MBps: %.2f\n",
+	       ((double)t->bytes / sec) / (1024.0 * 1024.0));
+	printf("lat_avg_ns: %llu\n", (unsigned long long)avg_ns);
+	printf("lat_p50_ns_upper: %llu\n", (unsigned long long)p50);
+	printf("lat_p99_ns_upper: %llu\n", (unsigned long long)p99);
+	printf("lat_max_ns: %llu\n", (unsigned long long)t->lat_max);
+}
+
+static void summarize(struct wstats *all, int nw, int nr)
+{
+	uint64_t agg[HIST_BUCKETS] = {0};
+	struct hist_totals t;
+	uint64_t p50, p99;
+
+	aggregate_wstats(all, nw, agg, &t);
+	compute_p50_p99(agg, t.writes, &p50, &p99);
+	print_summary(nw, nr, &t, p50, p99);
+}
+
+/*
+ * Child branch of fork(): restore SIGPIPE to default (parent ignores it),
+ * exec stress-ng, and on failure write the reason into @hs_wr before
+ * exiting. The parent observes EOF on hs_wr (closed via O_CLOEXEC) when
+ * exec succeeds.
+ */
+static void stress_ng_child(int hs_wr) __attribute__((noreturn));
+static void stress_ng_child(int hs_wr)
+{
+	char errbuf[256];
+
+	signal(SIGPIPE, SIG_DFL);
+	execlp("stress-ng", "stress-ng",
+	       "--vm", "4", "--vm-bytes", "80%",
+	       "--vm-method", "all",
+	       (char *)NULL);
+	snprintf(errbuf, sizeof(errbuf),
+		 "exec stress-ng failed: %s\n", strerror(errno));
+	(void)!write(hs_wr, errbuf, strlen(errbuf));
+	_exit(127);
+}
+
+/*
+ * Read from the O_CLOEXEC handshake pipe. Anything readable means the
+ * child wrote an error before exec; EOF (n == 0) means the write-end
+ * closed because exec succeeded. Returns 0 on exec success, -1 if the
+ * child failed and was reaped.
+ */
+static int stress_ng_wait_handshake(int hs_rd, pid_t pid)
+{
+	struct pollfd pfd = { .fd = hs_rd, .events = POLLIN };
+	char errbuf[256];
+	int status;
+	int ret;
+
+	ret = poll(&pfd, 1, 500);
+	if (ret <= 0)
+		return 0;
+
+	ssize_t n = read(hs_rd, errbuf, sizeof(errbuf) - 1);
+
+	if (n > 0) {
+		errbuf[n] = '\0';
+		fputs(errbuf, stderr);
+		waitpid(pid, &status, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static pid_t spawn_stress_ng(void)
+{
+	int hs[2];
+	pid_t pid;
+
+	/*
+	 * Handshake pipe: child writes one byte and _exit()s on exec
+	 * failure. On exec success the O_CLOEXEC flag closes the write
+	 * end, which the parent observes as EOF. This makes the "is
+	 * stress-ng on $PATH?" check fail fast rather than silently.
+	 */
+	if (pipe2(hs, O_CLOEXEC) < 0) {
+		perror("pipe2");
+		return -1;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		close(hs[0]);
+		close(hs[1]);
+		return -1;
+	}
+	if (pid == 0) {
+		close(hs[0]);
+		stress_ng_child(hs[1]);
+	}
+
+	close(hs[1]);
+	if (stress_ng_wait_handshake(hs[0], pid) < 0) {
+		close(hs[0]);
+		return -1;
+	}
+	close(hs[0]);
+
+	/* Give stress-ng a moment to map its VM regions before measuring. */
+	sleep(1);
+	return pid;
+}
+
+static void kill_stress_ng(pid_t pid)
+{
+	int status;
+
+	if (pid <= 0)
+		return;
+	kill(pid, SIGTERM);
+	for (int i = 0; i < 20; i++) {
+		if (waitpid(pid, &status, WNOHANG) > 0)
+			return;
+		usleep(100 * 1000);
+	}
+	kill(pid, SIGKILL);
+	waitpid(pid, &status, 0);
+}
+
+/*
+ * Allocate per-thread page-aligned buffers in main so a failed
+ * aligned_alloc() aborts the run before any thread starts. Workers used
+ * to allocate their own buffer and return NULL on failure, which left
+ * peers blocked in write()/read() with nobody to unblock them.
+ */
+static int alloc_thread_bufs(struct wstats *ws, int nw,
+			     struct rstats *rs, int nr)
+{
+	for (int i = 0; i < nw; i++) {
+		ws[i].buf = aligned_alloc(4096, g_msgsize);
+		if (!ws[i].buf) {
+			fprintf(stderr, "writer %d: aligned_alloc(%zu) failed\n",
+				i, g_msgsize);
+			return -1;
+		}
+		memset(ws[i].buf, 0xAA, g_msgsize);
+	}
+	for (int i = 0; i < nr; i++) {
+		rs[i].buf = aligned_alloc(4096, g_msgsize);
+		if (!rs[i].buf) {
+			fprintf(stderr, "reader %d: aligned_alloc(%zu) failed\n",
+				i, g_msgsize);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static void free_thread_bufs(struct wstats *ws, int nw,
+			     struct rstats *rs, int nr)
+{
+	if (ws)
+		for (int i = 0; i < nw; i++)
+			free(ws[i].buf);
+	if (rs)
+		for (int i = 0; i < nr; i++)
+			free(rs[i].buf);
+}
+
+static int start_readers(pthread_t *rt, struct rstats *rs, int nr,
+			 int *created)
+{
+	for (int i = 0; i < nr; i++) {
+		int err = pthread_create(&rt[i], NULL, reader, &rs[i]);
+
+		if (err) {
+			fprintf(stderr, "pthread_create reader %d: %s\n",
+				i, strerror(err));
+			return -1;
+		}
+		(*created)++;
+	}
+	return 0;
+}
+
+static int start_writers(pthread_t *wt, struct wstats *ws, int nw,
+			 int *created)
+{
+	for (int i = 0; i < nw; i++) {
+		int err = pthread_create(&wt[i], NULL, writer, &ws[i]);
+
+		if (err) {
+			fprintf(stderr, "pthread_create writer %d: %s\n",
+				i, strerror(err));
+			return -1;
+		}
+		(*created)++;
+	}
+	return 0;
+}
+
+static int open_bench_pipe(void)
+{
+	if (pipe(g_pipe) < 0) {
+		perror("pipe");
+		return -1;
+	}
+	if (fcntl(g_pipe[1], F_SETPIPE_SZ, g_pipe_size) < 0)
+		perror("F_SETPIPE_SZ (continuing)");
+	return 0;
+}
+
+/*
+ * Normal termination: g_stop tells writers to leave the loop after the
+ * current write() returns. Closing the shared write-end fd means once
+ * the in-flight writes drain, readers see EOF and exit. Writers are not
+ * unblocked by EPIPE here -- g_pipe[0] stays open so readers can keep
+ * draining.
+ *
+ * Error path: some threads may have been created and others skipped, so
+ * writers could be blocked in write() with no reader making progress.
+ * Close both ends -- closing the read end is what delivers EPIPE to a
+ * blocked writer.
+ */
+static void stop_and_join(pthread_t *wt, int nw_created,
+			  pthread_t *rt, int nr_created, int rc)
+{
+	atomic_store(&g_stop, 1);
+	close(g_pipe[1]);
+	if (rc < 0)
+		close(g_pipe[0]);
+	for (int i = 0; i < nw_created; i++)
+		pthread_join(wt[i], NULL);
+	for (int i = 0; i < nr_created; i++)
+		pthread_join(rt[i], NULL);
+	if (rc == 0)
+		close(g_pipe[0]);
+}
+
+static int run_one(int nw, int nr)
+{
+	pthread_t *wt = NULL, *rt = NULL;
+	struct wstats *ws = NULL;
+	struct rstats *rs = NULL;
+	int nw_created = 0, nr_created = 0;
+	int rc = 0;
+
+	atomic_store(&g_stop, 0);
+
+	if (open_bench_pipe() < 0)
+		return -1;
+
+	wt = calloc((size_t)nw, sizeof(*wt));
+	rt = calloc((size_t)nr, sizeof(*rt));
+	ws = calloc((size_t)nw, sizeof(*ws));
+	rs = calloc((size_t)nr, sizeof(*rs));
+	if (!wt || !rt || !ws || !rs) {
+		fprintf(stderr, "alloc failed\n");
+		rc = -1;
+		goto teardown;
+	}
+
+	if (alloc_thread_bufs(ws, nw, rs, nr) < 0) {
+		rc = -1;
+		goto teardown;
+	}
+
+	if (start_readers(rt, rs, nr, &nr_created) < 0 ||
+	    start_writers(wt, ws, nw, &nw_created) < 0) {
+		rc = -1;
+		goto teardown;
+	}
+
+	sleep((unsigned int)g_duration);
+
+teardown:
+	stop_and_join(wt, nw_created, rt, nr_created, rc);
+
+	if (rc == 0) {
+		summarize(ws, nw, nr);
+		fflush(stdout);
+	}
+
+	free_thread_bufs(ws, nw, rs, nr);
+	free(wt);
+	free(rt);
+	free(ws);
+	free(rs);
+	return rc;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [-w writers] [-r readers] [-s msgsize] [-d secs] [-p pipe_size] [--memory-pressure]\n"
+		"  default: sweep writers={1,2,5} x readers={1,5,10}\n"
+		"  --memory-pressure: spawn stress-ng (--vm 4 --vm-bytes 80%% --vm-method all) for the run\n",
+		prog);
+}
+
+static int parse_args(int argc, char **argv,
+		      int *writers_override, int *readers_override)
+{
+	static const struct option long_opts[] = {
+		{"memory-pressure", no_argument, NULL, 'M'},
+		{0, 0, 0, 0},
+	};
+	int opt;
+
+	while ((opt = getopt_long(argc, argv, "w:r:s:d:p:",
+				  long_opts, NULL)) != -1) {
+		switch (opt) {
+		case 'w':
+			*writers_override = atoi(optarg);
+			break;
+		case 'r':
+			*readers_override = atoi(optarg);
+			break;
+		case 's':
+			g_msgsize = (size_t)atol(optarg);
+			break;
+		case 'd':
+			g_duration = atoi(optarg);
+			break;
+		case 'p':
+			g_pipe_size = atoi(optarg);
+			break;
+		case 'M':
+			g_memory_pressure = 1;
+			break;
+		default:
+			usage(argv[0]);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * aligned_alloc(4096, size) requires size to be a multiple of the
+ * alignment (C11); glibc returns NULL otherwise, which would make
+ * writer/reader threads silently exit and the run report zero writes.
+ * Validate up front instead.
+ */
+static int validate_args(void)
+{
+	if (g_msgsize == 0 || g_msgsize % 4096 != 0) {
+		fprintf(stderr,
+			"msgsize must be a positive multiple of 4096 (got %zu)\n",
+			g_msgsize);
+		return -1;
+	}
+	if (g_duration <= 0) {
+		fprintf(stderr, "duration must be > 0 seconds (got %d)\n",
+			g_duration);
+		return -1;
+	}
+	if (g_pipe_size <= 0) {
+		fprintf(stderr, "pipe_size must be > 0 bytes (got %d)\n",
+			g_pipe_size);
+		return -1;
+	}
+	return 0;
+}
+
+static int run_sweep(void)
+{
+	static const int writers_sweep[] = {1, 2, 5};
+	static const int readers_sweep[] = {1, 5, 10};
+
+	for (size_t i = 0; i < ARRAY_SIZE(writers_sweep); i++) {
+		for (size_t j = 0; j < ARRAY_SIZE(readers_sweep); j++) {
+			printf("---\n");
+			if (run_one(writers_sweep[i], readers_sweep[j]) < 0)
+				return -1;
+		}
+	}
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int writers_override = 0, readers_override = 0;
+	pid_t stress_pid = -1;
+	int rc = 0;
+
+	if (parse_args(argc, argv, &writers_override, &readers_override) < 0)
+		return 1;
+	if (validate_args() < 0)
+		return 1;
+
+	signal(SIGPIPE, SIG_IGN);
+	setvbuf(stdout, NULL, _IOLBF, 0);
+	setvbuf(stderr, NULL, _IOLBF, 0);
+
+	fprintf(stderr, "pid=%d\n", getpid());
+	fflush(stderr);
+
+	if (g_memory_pressure) {
+		stress_pid = spawn_stress_ng();
+		if (stress_pid < 0) {
+			fprintf(stderr,
+				"memory_pressure requested but stress-ng could not be spawned\n");
+			return 1;
+		}
+	}
+
+	if (writers_override > 0 || readers_override > 0) {
+		int nw = writers_override > 0 ? writers_override : 1;
+		int nr = readers_override > 0 ? readers_override : 1;
+
+		rc = run_one(nw, nr) < 0 ? 1 : 0;
+	} else {
+		rc = run_sweep() < 0 ? 1 : 0;
+	}
+
+	kill_stress_ng(stress_pid);
+	return rc;
+}

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-22 16:51   ` Jeff Layton
  2026-05-22 17:55     ` Breno Leitao
  2026-05-22 19:48   ` Mateusz Guzik
  2026-05-23 16:26   ` Oleg Nesterov
  2 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2026-05-22 16:51 UTC (permalink / raw)
  To: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg,
	axboe, kernel-team

On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
> thing") and then, from the per-iteration anon_pipe_get_page() helper,
> used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
> while still holding it.
> 
> That allocation can sleep doing direct reclaim and/or runs memcg
> charging, which extends the critical section and stalls a concurrent
> reader on the very same mutex.
> 
> Just pre-alloc the required pages before the lock in an array and just pop
> them inside the lock.
> 
> This can improve the pipe throughput up to 48% and reduce the
> latency in 33%, easily seen when there is memory pressure and direct
> reclaim.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 102 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..cff255217bbfe 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>  	pipe_lock(pipe2);
>  }
>  
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +struct anon_pipe_prealloc {
> +	struct page *pages[PIPE_PREALLOC_MAX];
> +	unsigned int count;
> +};
> +
> +/*
> + * Pre-allocate pages outside pipe->mutex for multi-page writes.
> + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
> + * charging; doing it under the mutex stalls a concurrent reader.
> + *
> + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
> + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
> + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
> + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
> + * honoured for every page; the per-call overhead is small compared to the
> + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
> + * in-lock alloc_page() fallback in anon_pipe_get_page().
> + */
> +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +					size_t total_len)
> +{
> +	unsigned int want, i;
> +	struct page *page;
> +
> +	prealloc->count = 0;
> +	if (total_len <= PAGE_SIZE)
> +		return;
> +
> +	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> +		     PIPE_PREALLOC_MAX);
> +
> +	for (i = 0; i < want; i++) {
> +		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
> +		if (!page)
> +			break;
> +		prealloc->pages[prealloc->count++] = page;
> +	}

I believe alloc_pages_bulk() is supposed to be more efficient when
allocating several pages at once like this.

> +}
> +
> +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
> +{
> +	if (!prealloc->count)
> +		return NULL;
> +
> +	prealloc->count--;
> +
> +	return prealloc->pages[prealloc->count];
> +}
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +				       struct anon_pipe_prealloc *prealloc)
>  {
> +	struct page *page;
> +
> +	/* Drain prealloc first to keep tmp_page[] hot for later small writes. */
> +	page = anon_pipe_prealloc_pop(prealloc);
> +	if (page)
> +		return page;
> +
>  	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>  		if (pipe->tmp_page[i]) {
> -			struct page *page = pipe->tmp_page[i];
> +			page = pipe->tmp_page[i];
>  			pipe->tmp_page[i] = NULL;
>  			return page;
>  		}
>  	}
>  
> +	/* FWIW: This is called with pipe->mutex held */
>  	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>  
> @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
>  	put_page(page);
>  }
>  
> +/*
> + * Stash leftover prealloc pages in tmp_page[] so the next write to this
> + * pipe gets a hot page without entering the allocator.
> + */
> +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
> +				       struct anon_pipe_prealloc *prealloc)
> +{
> +	int i, idx;
> +
> +	if (!prealloc->count)
> +		return;
> +
> +	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
> +		if (pipe->tmp_page[i])
> +			continue;
> +		if (!prealloc->count)
> +			return;
> +		idx = --prealloc->count;
> +		pipe->tmp_page[i] = prealloc->pages[idx];
> +		prealloc->pages[idx] = NULL;
> +	}
> +}
> +
> +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
> +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
> +{
> +	while (prealloc->count) {
> +		prealloc->count--;
> +		put_page(prealloc->pages[prealloc->count]);
> +	}
> +}
> +
>  static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>  				  struct pipe_buffer *buf)
>  {
> @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *filp = iocb->ki_filp;
>  	struct pipe_inode_info *pipe = filp->private_data;
> +	struct anon_pipe_prealloc prealloc;
>  	unsigned int head;
>  	ssize_t ret = 0;
>  	size_t total_len = iov_iter_count(from);
> @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(total_len == 0))
>  		return 0;
>  
> +	anon_pipe_get_page_prealloc(&prealloc, total_len);
> +
>  	mutex_lock(&pipe->mutex);
>  
>  	if (!pipe->readers) {
> @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			struct page *page;
>  			int copied;
>  
> -			page = anon_pipe_get_page(pipe);
> +			page = anon_pipe_get_page(pipe, &prealloc);
>  			if (unlikely(!page)) {
>  				if (!ret)
>  					ret = -ENOMEM;
> @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		 * after waiting we need to re-check whether the pipe
>  		 * become empty while we dropped the lock.
>  		 */
> +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
>  		mutex_unlock(&pipe->mutex);
> +		anon_pipe_free_pages(&prealloc);
>  		if (was_empty)
>  			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		wake_next_writer = true;
>  	}
>  out:
> +	anon_pipe_refill_tmp_pages(pipe, &prealloc);
>  	if (pipe_is_full(pipe))
>  		wake_next_writer = false;
>  	mutex_unlock(&pipe->mutex);
> +	anon_pipe_free_pages(&prealloc);
>  
>  	/*
>  	 * If we do do a wakeup event, we do a 'sync' wakeup, because we

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:51   ` Jeff Layton
@ 2026-05-22 17:55     ` Breno Leitao
  0 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 17:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, oleg, axboe, kernel-team

On Fri, May 22, 2026 at 12:51:40PM -0400, Jeff Layton wrote:
> On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> > +/*
> > + * Pre-allocate pages outside pipe->mutex for multi-page writes.
> > + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
> > + * charging; doing it under the mutex stalls a concurrent reader.
> > + *
> > + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
> > + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
> > + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
> > + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
> > + * honoured for every page; the per-call overhead is small compared to the
> > + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
> > + * in-lock alloc_page() fallback in anon_pipe_get_page().
> > + */
> > +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> > +					size_t total_len)
> > +{
> > +	unsigned int want, i;
> > +	struct page *page;
> > +
> > +	prealloc->count = 0;
> > +	if (total_len <= PAGE_SIZE)
> > +		return;
> > +
> > +	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> > +		     PIPE_PREALLOC_MAX);
> > +
> > +	for (i = 0; i < want; i++) {
> > +		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
> > +		if (!page)
> > +			break;
> > +		prealloc->pages[prealloc->count++] = page;
> > +	}
> 
> I believe alloc_pages_bulk() is supposed to be more efficient when
> allocating several pages at once like this.

Thanks Jeff. I understand bulk allocators refuses __GFP_ACCOUNT under memcg.
(That is the reason I've CCed Shakeel in this patchset).

Anyway, reading the code it shows me:

  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
                          nodemask_t *nodemask, int nr_pages,
                          struct page **page_array)
  {

	...
        /* Bulk allocator does not support memcg accounting. */
        if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT))
                goto failed;


Thanks for the review,
--breno

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
@ 2026-05-22 19:43 ` Jeff Layton
  2 siblings, 0 replies; 13+ messages in thread
From: Jeff Layton @ 2026-05-22 19:43 UTC (permalink / raw)
  To: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg,
	axboe, kernel-team

On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> While profiling Meta's caching code[1], I found pipe->mutex contention
> on the hot path. anon_pipe_write() currently calls alloc_page() once
> per page while holding pipe->mutex. The allocation can sleep doing
> direct reclaim and runs memcg charging, which extends the critical
> section and stalls any concurrent reader on the same mutex.
> 
> This series pre-allocates pages outside pipe->mutex in
> anon_pipe_write(): for writes that span more than one full page, up
> to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> then drains the prealloc array first, falls back to the per-pipe
> tmp_page[] cache, and only enters the allocator under the mutex for
> the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> writes that skip prealloc, or shortfalls when the prealloc loop
> fails). Leftover prealloc pages are recycled into tmp_page[] before
> unlock and any remainder is put_page()'d after unlock, keeping the
> allocator out of the critical section on both sides.
> 
> alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
> refuses __GFP_ACCOUNT under memcg -- it returns at most one page
> when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
> 8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
> __GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
> task NUMA mempolicy honoured uniformly without open-coding the
> charge.
> 
> I also vibe-coded a microbenchmark to validate the change. It sweeps
> writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> 1 MB pipe and prints throughput + latency percentiles per config.
> 
> Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
> writes, 1 MB pipe). The numbers below were collected on v1
> (alloc_pages_bulk()); v2's per-page loop preserves the dominant
> "allocation outside the mutex" win and is expected to land in the same
> range.
> 
> == No memory pressure (10s per config) ==
> 
>   Throughput in MB/s (baseline -> patched, delta):
>     writers   readers=1              readers=5               readers=10
>           1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
>           2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
>           5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)
> 
>   Avg write latency in ns (baseline -> patched, delta):
>     writers   readers=1                 readers=5                readers=10
>           1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
>           2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
>           5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)
> 
> Throughput improves +6% to +28% and average write latency drops 5%
> to 22% across every configuration.
> 
> == Under memory pressure (--memory-pressure, 6s per config) ==
> 
> stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
> sweep so the alloc_page() calls inside anon_pipe_write() routinely
> hit direct reclaim -- exactly the regime the patch targets.
> 
>   Throughput in MB/s (baseline -> patched, delta):
>     writers   readers=1            readers=5            readers=10
>           1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
>           2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
>           5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)
> 
>   Avg write latency in ns (baseline -> patched, delta):
>     writers   readers=1              readers=5              readers=10
>           1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
>           2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
>           5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)
> 
> Throughput improves +21% to +48% and average write latency drops
> 17% to 33% -- a noticeably bigger win than the no-pressure run.
> 
> That tracks: when alloc_page() has to dip into reclaim, the cost
> of holding pipe->mutex across it is highest, and pulling the
> allocation out of the critical section pays the most.
> 
> Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v2:
> - Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
>   per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
> - Split the prealloc work out of anon_pipe_write() into dedicated
>   helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
>   anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
>   struct anon_pipe_prealloc, so the write path stays readable.
> - Recycle leftover prealloc pages into pipe->tmp_page[] before
>   unlocking
> - Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org
> 
> To: Alexander Viro <viro@zeniv.linux.org.uk>
> To: Christian Brauner <brauner@kernel.org>
> To: Jan Kara <jack@suse.cz>
> To: Shuah Khan <shuah@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> 
> ---
> Breno Leitao (2):
>       fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
>       selftests/pipe: add pipe_bench microbenchmark
> 
>  fs/pipe.c                                 | 105 ++++-
>  tools/testing/selftests/Makefile          |   1 +
>  tools/testing/selftests/pipe/.gitignore   |   1 +
>  tools/testing/selftests/pipe/Makefile     |   9 +
>  tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
>  5 files changed, 729 insertions(+), 3 deletions(-)
> ---
> base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
> change-id: 20260515-fix_pipe-c91677c187e7
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>

Pity this can't use the bulk page allocator, but looks good otherwise.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-22 16:51   ` Jeff Layton
@ 2026-05-22 19:48   ` Mateusz Guzik
  2026-05-23 16:26   ` Oleg Nesterov
  2 siblings, 0 replies; 13+ messages in thread
From: Mateusz Guzik @ 2026-05-22 19:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, kernel-team

On Fri, May 22, 2026 at 6:44 PM Breno Leitao <leitao@debian.org> wrote:
>
> anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
> thing") and then, from the per-iteration anon_pipe_get_page() helper,
> used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
> while still holding it.
>
> That allocation can sleep doing direct reclaim and/or runs memcg
> charging, which extends the critical section and stalls a concurrent
> reader on the very same mutex.
>
> Just pre-alloc the required pages before the lock in an array and just pop
> them inside the lock.
>
> This can improve the pipe throughput up to 48% and reduce the
> latency in 33%, easily seen when there is memory pressure and direct
> reclaim.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 102 insertions(+), 3 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..cff255217bbfe 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>         pipe_lock(pipe2);
>  }
>
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +struct anon_pipe_prealloc {
> +       struct page *pages[PIPE_PREALLOC_MAX];
> +       unsigned int count;
> +};
> +
> +/*
> + * Pre-allocate pages outside pipe->mutex for multi-page writes.
> + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
> + * charging; doing it under the mutex stalls a concurrent reader.
> + *
> + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
> + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
> + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
> + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
> + * honoured for every page; the per-call overhead is small compared to the
> + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
> + * in-lock alloc_page() fallback in anon_pipe_get_page().
> + */
> +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +                                       size_t total_len)
> +{
> +       unsigned int want, i;
> +       struct page *page;
> +
> +       prealloc->count = 0;
> +       if (total_len <= PAGE_SIZE)
> +               return;
> +
> +       want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> +                    PIPE_PREALLOC_MAX);
> +
> +       for (i = 0; i < want; i++) {
> +               page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
> +               if (!page)
> +                       break;
> +               prealloc->pages[prealloc->count++] = page;
> +       }
> +}
> +
> +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
> +{
> +       if (!prealloc->count)
> +               return NULL;
> +
> +       prealloc->count--;
> +
> +       return prealloc->pages[prealloc->count];
> +}
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +                                      struct anon_pipe_prealloc *prealloc)
>  {
> +       struct page *page;
> +
> +       /* Drain prealloc first to keep tmp_page[] hot for later small writes. */
> +       page = anon_pipe_prealloc_pop(prealloc);
> +       if (page)
> +               return page;
> +
>         for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>                 if (pipe->tmp_page[i]) {
> -                       struct page *page = pipe->tmp_page[i];
> +                       page = pipe->tmp_page[i];
>                         pipe->tmp_page[i] = NULL;
>                         return page;
>                 }
>         }
>
> +       /* FWIW: This is called with pipe->mutex held */
>         return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>
> @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
>         put_page(page);
>  }
>
> +/*
> + * Stash leftover prealloc pages in tmp_page[] so the next write to this
> + * pipe gets a hot page without entering the allocator.
> + */
> +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
> +                                      struct anon_pipe_prealloc *prealloc)
> +{
> +       int i, idx;
> +
> +       if (!prealloc->count)
> +               return;
> +
> +       for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
> +               if (pipe->tmp_page[i])
> +                       continue;
> +               if (!prealloc->count)
> +                       return;
> +               idx = --prealloc->count;
> +               pipe->tmp_page[i] = prealloc->pages[idx];
> +               prealloc->pages[idx] = NULL;
> +       }
> +}
> +
> +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
> +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
> +{
> +       while (prealloc->count) {
> +               prealloc->count--;
> +               put_page(prealloc->pages[prealloc->count]);
> +       }
> +}
> +
>  static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>                                   struct pipe_buffer *buf)
>  {
> @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>         struct file *filp = iocb->ki_filp;
>         struct pipe_inode_info *pipe = filp->private_data;
> +       struct anon_pipe_prealloc prealloc;
>         unsigned int head;
>         ssize_t ret = 0;
>         size_t total_len = iov_iter_count(from);
> @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         if (unlikely(total_len == 0))
>                 return 0;
>
> +       anon_pipe_get_page_prealloc(&prealloc, total_len);
> +
>         mutex_lock(&pipe->mutex);
>
>         if (!pipe->readers) {
> @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                         struct page *page;
>                         int copied;
>
> -                       page = anon_pipe_get_page(pipe);
> +                       page = anon_pipe_get_page(pipe, &prealloc);
>                         if (unlikely(!page)) {
>                                 if (!ret)
>                                         ret = -ENOMEM;
> @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                  * after waiting we need to re-check whether the pipe
>                  * become empty while we dropped the lock.
>                  */
> +               anon_pipe_refill_tmp_pages(pipe, &prealloc);
>                 mutex_unlock(&pipe->mutex);
> +               anon_pipe_free_pages(&prealloc);
>                 if (was_empty)
>                         wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>                 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                 wake_next_writer = true;
>         }
>  out:
> +       anon_pipe_refill_tmp_pages(pipe, &prealloc);
>         if (pipe_is_full(pipe))
>                 wake_next_writer = false;

maybe the new call would be nicer moved below pipe_is_full, but that's
just cosmetics

>         mutex_unlock(&pipe->mutex);
> +       anon_pipe_free_pages(&prealloc);
>
>         /*
>          * If we do do a wakeup event, we do a 'sync' wakeup, because we
>
> --
> 2.54.0
>

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-22 16:51   ` Jeff Layton
  2026-05-22 19:48   ` Mateusz Guzik
@ 2026-05-23 16:26   ` Oleg Nesterov
  2026-05-24 14:30     ` Breno Leitao
  2 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-23 16:26 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

To be honest, I didn't read this patch carefully, but let me ask anyway.

On 05/22, Breno Leitao wrote:
>
> @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *filp = iocb->ki_filp;
>  	struct pipe_inode_info *pipe = filp->private_data;
> +	struct anon_pipe_prealloc prealloc;
>  	unsigned int head;
>  	ssize_t ret = 0;
>  	size_t total_len = iov_iter_count(from);
> @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(total_len == 0))
>  		return 0;
>  
> +	anon_pipe_get_page_prealloc(&prealloc, total_len);
> +
>  	mutex_lock(&pipe->mutex);
>  
>  	if (!pipe->readers) {
> @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			struct page *page;
>  			int copied;
>  
> -			page = anon_pipe_get_page(pipe);
> +			page = anon_pipe_get_page(pipe, &prealloc);
>  			if (unlikely(!page)) {
>  				if (!ret)
>  					ret = -ENOMEM;
> @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		 * after waiting we need to re-check whether the pipe
>  		 * become empty while we dropped the lock.
>  		 */
> +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
>  		mutex_unlock(&pipe->mutex);
> +		anon_pipe_free_pages(&prealloc);

Do we really want to call anon_pipe_free_pages() at this point?

The main loop will continue when pipe_writable() becomes true again...

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
@ 2026-05-23 16:43   ` Oleg Nesterov
  2026-05-23 16:49     ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-23 16:43 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On 05/22, Breno Leitao wrote:
>
> Add a small selftest that stresses pipe->mutex contention by spawning N
> writer threads that hammer a single pipe with multi-page writes, plus M
> reader threads that drain. Each writer records its own write() latency
> samples into a log2-bucketed histogram; main aggregates and prints
> total writes, throughput, average and percentile (p50/p99) latencies,
> and the maximum observed latency.
> 
> Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
> --vm-method all) for the duration of the run, so alloc_page() in
> anon_pipe_write() routinely hits direct reclaim. The flag fails
> fast if stress-ng is not on $PATH.
> 
> Program print something like the following, for different writes,
> readers, msgsizes and memory pressure:
> 
> 	config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576
> 	memory_pressure=[no|yes]
> 	writes: total=54451 rate=18150/s
> 	throughput_MBps: 1134.40
> 	lat_avg_ns: 275355
> 	lat_p50_ns_upper: 262143
> 	lat_p99_ns_upper: 1048575
> 	lat_max_ns: 2145633


The more tests the better... but since it comes as 2/2, perhaps the changelog
could mention the difference before/after 1/2 according to this test?

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-23 16:43   ` Oleg Nesterov
@ 2026-05-23 16:49     ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-23 16:49 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On 05/23, Oleg Nesterov wrote:
>
> The more tests the better... but since it comes as 2/2, perhaps the changelog
> could mention the difference before/after 1/2 according to this test?

Ah, I didn't read 0/2, sorry for the noise ;)

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-23 16:26   ` Oleg Nesterov
@ 2026-05-24 14:30     ` Breno Leitao
  2026-05-24 14:48       ` Mateusz Guzik
  0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-05-24 14:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >  		 * after waiting we need to re-check whether the pipe
> >  		 * become empty while we dropped the lock.
> >  		 */
> > +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
> >  		mutex_unlock(&pipe->mutex);
> > +		anon_pipe_free_pages(&prealloc);
> 
> Do we really want to call anon_pipe_free_pages() at this point?
> 
> The main loop will continue when pipe_writable() becomes true again...

I went back and forth on this. The argument for freeing was that
wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
stopped reader), and holding up the prealloc pages felt antisocial --
especially under the memory pressure this series targets, where those pages are
more useful on the freelists than parked on a sleeping task.

On the other side, on wakeup the loop is guaranteed to want pages again, and
re-entering the allocator under the mutex puts us back in the contended state
the patch removes. For any write() large enough to wait mid-syscall (which is
the workload patch 2/2 measures), keeping them strictly wins on throughput /
p99.

I think your read is the better one -- the throughput case is the whole point
of the series, and the memory-hoarding concern is bounded (8 pages per blocked
writer, freed in the out: path on syscall exit). Will drop the free_pages()
call from the wait branch in v3 and keep only the one after the out: label,
together with the nit from Mateusz.

Thanks for the review,
--breno

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-24 14:30     ` Breno Leitao
@ 2026-05-24 14:48       ` Mateusz Guzik
  2026-05-24 16:47         ` Breno Leitao
  0 siblings, 1 reply; 13+ messages in thread
From: Mateusz Guzik @ 2026-05-24 14:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Oleg Nesterov, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote:
>
> On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > >              * after waiting we need to re-check whether the pipe
> > >              * become empty while we dropped the lock.
> > >              */
> > > +           anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > >             mutex_unlock(&pipe->mutex);
> > > +           anon_pipe_free_pages(&prealloc);
> >
> > Do we really want to call anon_pipe_free_pages() at this point?
> >
> > The main loop will continue when pipe_writable() becomes true again...
>
> I went back and forth on this. The argument for freeing was that
> wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> stopped reader), and holding up the prealloc pages felt antisocial --
> especially under the memory pressure this series targets, where those pages are
> more useful on the freelists than parked on a sleeping task.
>
> On the other side, on wakeup the loop is guaranteed to want pages again, and
> re-entering the allocator under the mutex puts us back in the contended state
> the patch removes. For any write() large enough to wait mid-syscall (which is
> the workload patch 2/2 measures), keeping them strictly wins on throughput /
> p99.
>

You can still prealloc after wakeup for whatever reminder you got
though, but I can agree dropping these frees is a sensible way out and
it is easier and I'm not going to insist on one way or the other.

However, I think it would be prudent to add a tracepoint to some
machines on your fleet to find out how often they allocate pages under
the mutex (and for what i/o size). Initial alloc for the first write <
PAGE_SIZE definitely happens under the mutex which is probably not a
problem, but for anything later? The tracepoint can have a trivial
indicator if this is the first write if that matters. One can
speculate all day, nothing beats checking what happens.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-24 14:48       ` Mateusz Guzik
@ 2026-05-24 16:47         ` Breno Leitao
  0 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-24 16:47 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Oleg Nesterov, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

Hello Mateusz,

On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote:
> On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > > >              * after waiting we need to re-check whether the pipe
> > > >              * become empty while we dropped the lock.
> > > >              */
> > > > +           anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > > >             mutex_unlock(&pipe->mutex);
> > > > +           anon_pipe_free_pages(&prealloc);
> > >
> > > Do we really want to call anon_pipe_free_pages() at this point?
> > >
> > > The main loop will continue when pipe_writable() becomes true again...
> >
> > I went back and forth on this. The argument for freeing was that
> > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> > stopped reader), and holding up the prealloc pages felt antisocial --
> > especially under the memory pressure this series targets, where those pages are
> > more useful on the freelists than parked on a sleeping task.
> >
> > On the other side, on wakeup the loop is guaranteed to want pages again, and
> > re-entering the allocator under the mutex puts us back in the contended state
> > the patch removes. For any write() large enough to wait mid-syscall (which is
> > the workload patch 2/2 measures), keeping them strictly wins on throughput /
> > p99.
> >
> 
> You can still prealloc after wakeup for whatever reminder you got
> though, but I can agree dropping these frees is a sensible way out and
> it is easier and I'm not going to insist on one way or the other.

Ack. I've sent a v3 with anon_pipe_free_pages() and
anon_pipe_refill_tmp_pages() dropped.

> However, I think it would be prudent to add a tracepoint to some
> machines on your fleet to find out how often they allocate pages under
> the mutex (and for what i/o size). Initial alloc for the first write <
> PAGE_SIZE definitely happens under the mutex which is probably not a
> problem, but for anything later? 

> The tracepoint can have a trivial
> indicator if this is the first write if that matters. One can

Isn't this what I've reported earlier?

https://lore.kernel.org/all/ag3Ty3T24wjn1aFw@gmail.com/

Adding a tracepoint is harder than usual, given kernel rollout takes ages.

But I hacked a bpftrace script and ran it on a random sample of fleet hosts (5
min each). 

As reported earlier, multi-page pipe writes are not uncommon: on one
host a single long-running process produced 196,476 under-mutex alloc_page()
calls in 5 minutes, with allocs-per-write distributions reaching 16+ -- exactly
the pattern this patch removes. 

Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page
first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips
anyway, so the win is concentrated on the workloads that actually need it.

None of the allocs hit reclaim during the trace I ran, but I would expect
direct reclaim to happen with the lock held.

Thanks for the review and direction,
--breno

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-05-24 16:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-22 16:51   ` Jeff Layton
2026-05-22 17:55     ` Breno Leitao
2026-05-22 19:48   ` Mateusz Guzik
2026-05-23 16:26   ` Oleg Nesterov
2026-05-24 14:30     ` Breno Leitao
2026-05-24 14:48       ` Mateusz Guzik
2026-05-24 16:47         ` Breno Leitao
2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
2026-05-23 16:43   ` Oleg Nesterov
2026-05-23 16:49     ` Oleg Nesterov
2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox