All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
@ 2026-05-22 16:44 Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep doing
direct reclaim and runs memcg charging, which extends the critical
section and stalls any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.

I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
  per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
  helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
  anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
  struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
  unlocking
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org

To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Breno Leitao (2):
      fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 | 105 ++++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 5 files changed, 729 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
@ 2026-05-22 16:44 ` Breno Leitao
  2026-05-22 16:51   ` Jeff Layton
                     ` (2 more replies)
  2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton
  2 siblings, 3 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
thing") and then, from the per-iteration anon_pipe_get_page() helper,
used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
while still holding it.

That allocation can sleep doing direct reclaim and/or runs memcg
charging, which extends the critical section and stalls a concurrent
reader on the very same mutex.

Just pre-alloc the required pages before the lock in an array and just pop
them inside the lock.

This can improve the pipe throughput up to 48% and reduce the
latency in 33%, easily seen when there is memory pressure and direct
reclaim.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 102 insertions(+), 3 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9cf3e..cff255217bbfe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }
 
-static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+#define PIPE_PREALLOC_MAX 8
+
+struct anon_pipe_prealloc {
+	struct page *pages[PIPE_PREALLOC_MAX];
+	unsigned int count;
+};
+
+/*
+ * Pre-allocate pages outside pipe->mutex for multi-page writes.
+ * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
+ * charging; doing it under the mutex stalls a concurrent reader.
+ *
+ * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
+ * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
+ * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
+ * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
+ * honoured for every page; the per-call overhead is small compared to the
+ * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
+ * in-lock alloc_page() fallback in anon_pipe_get_page().
+ */
+static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
+					size_t total_len)
+{
+	unsigned int want, i;
+	struct page *page;
+
+	prealloc->count = 0;
+	if (total_len <= PAGE_SIZE)
+		return;
+
+	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
+		     PIPE_PREALLOC_MAX);
+
+	for (i = 0; i < want; i++) {
+		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
+		if (!page)
+			break;
+		prealloc->pages[prealloc->count++] = page;
+	}
+}
+
+static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
+{
+	if (!prealloc->count)
+		return NULL;
+
+	prealloc->count--;
+
+	return prealloc->pages[prealloc->count];
+}
+
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
 {
+	struct page *page;
+
+	/* Drain prealloc first to keep tmp_page[] hot for later small writes. */
+	page = anon_pipe_prealloc_pop(prealloc);
+	if (page)
+		return page;
+
 	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
 		if (pipe->tmp_page[i]) {
-			struct page *page = pipe->tmp_page[i];
+			page = pipe->tmp_page[i];
 			pipe->tmp_page[i] = NULL;
 			return page;
 		}
 	}
 
+	/* FWIW: This is called with pipe->mutex held */
 	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 }
 
@@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
 	put_page(page);
 }
 
+/*
+ * Stash leftover prealloc pages in tmp_page[] so the next write to this
+ * pipe gets a hot page without entering the allocator.
+ */
+static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
+{
+	int i, idx;
+
+	if (!prealloc->count)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+		if (pipe->tmp_page[i])
+			continue;
+		if (!prealloc->count)
+			return;
+		idx = --prealloc->count;
+		pipe->tmp_page[i] = prealloc->pages[idx];
+		prealloc->pages[idx] = NULL;
+	}
+}
+
+/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
+static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
+{
+	while (prealloc->count) {
+		prealloc->count--;
+		put_page(prealloc->pages[prealloc->count]);
+	}
+}
+
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
+	struct anon_pipe_prealloc prealloc;
 	unsigned int head;
 	ssize_t ret = 0;
 	size_t total_len = iov_iter_count(from);
@@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;
 
+	anon_pipe_get_page_prealloc(&prealloc, total_len);
+
 	mutex_lock(&pipe->mutex);
 
 	if (!pipe->readers) {
@@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			struct page *page;
 			int copied;
 
-			page = anon_pipe_get_page(pipe);
+			page = anon_pipe_get_page(pipe, &prealloc);
 			if (unlikely(!page)) {
 				if (!ret)
 					ret = -ENOMEM;
@@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * after waiting we need to re-check whether the pipe
 		 * become empty while we dropped the lock.
 		 */
+		anon_pipe_refill_tmp_pages(pipe, &prealloc);
 		mutex_unlock(&pipe->mutex);
+		anon_pipe_free_pages(&prealloc);
 		if (was_empty)
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
@@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		wake_next_writer = true;
 	}
 out:
+	anon_pipe_refill_tmp_pages(pipe, &prealloc);
 	if (pipe_is_full(pipe))
 		wake_next_writer = false;
 	mutex_unlock(&pipe->mutex);
+	anon_pipe_free_pages(&prealloc);
 
 	/*
 	 * If we do do a wakeup event, we do a 'sync' wakeup, because we

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-22 16:44 ` Breno Leitao
  2026-05-23 16:43   ` Oleg Nesterov
  2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton
  2 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, Breno Leitao, kernel-team

Add a small selftest that stresses pipe->mutex contention by spawning N
writer threads that hammer a single pipe with multi-page writes, plus M
reader threads that drain. Each writer records its own write() latency
samples into a log2-bucketed histogram; main aggregates and prints
total writes, throughput, average and percentile (p50/p99) latencies,
and the maximum observed latency.

Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
--vm-method all) for the duration of the run, so alloc_page() in
anon_pipe_write() routinely hits direct reclaim. The flag fails
fast if stress-ng is not on $PATH.

Program print something like the following, for different writes,
readers, msgsizes and memory pressure:

	config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576
	memory_pressure=[no|yes]
	writes: total=54451 rate=18150/s
	throughput_MBps: 1134.40
	lat_avg_ns: 275355
	lat_p50_ns_upper: 262143
	lat_p99_ns_upper: 1048575
	lat_max_ns: 2145633

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
 4 files changed, 627 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e416..bcd9db9d292ca 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -91,6 +91,7 @@ TARGETS += pcie_bwctrl
 TARGETS += perf_events
 TARGETS += pidfd
 TARGETS += pid_namespace
+TARGETS += pipe
 TARGETS += power_supply
 TARGETS += powerpc
 TARGETS += prctl
diff --git a/tools/testing/selftests/pipe/.gitignore b/tools/testing/selftests/pipe/.gitignore
new file mode 100644
index 0000000000000..20b549361a152
--- /dev/null
+++ b/tools/testing/selftests/pipe/.gitignore
@@ -0,0 +1 @@
+pipe_bench
diff --git a/tools/testing/selftests/pipe/Makefile b/tools/testing/selftests/pipe/Makefile
new file mode 100644
index 0000000000000..1810c680117b3
--- /dev/null
+++ b/tools/testing/selftests/pipe/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+# Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+
+CFLAGS += -O2 -Wall -Wextra -pthread
+
+TEST_GEN_PROGS := pipe_bench
+
+include ../lib.mk
diff --git a/tools/testing/selftests/pipe/pipe_bench.c b/tools/testing/selftests/pipe/pipe_bench.c
new file mode 100644
index 0000000000000..7e96429b8fb4d
--- /dev/null
+++ b/tools/testing/selftests/pipe/pipe_bench.c
@@ -0,0 +1,616 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pipe_bench - exercise concurrent pipe operation
+ *
+ * N writer threads hammer a single pipe with multi-page writes; M reader
+ * threads drain it. Each writer records its own write() latency histogram.
+ * Multi-page writes (msgsize >= PAGE_SIZE) force the loop in
+ * anon_pipe_write() to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) under
+ * pipe->mutex, which is the critical section the patch shrinks.
+ *
+ * By default the benchmark sweeps writers in {1, 2, 5} x readers in
+ * {1, 5, 10} and prints one block per configuration so two runs (e.g.
+ * baseline vs patched) can be diffed directly. Pass -w and -r to run a
+ * single configuration instead. Pass --memory-pressure to spawn stress-ng
+ * alongside the sweep so the per-page alloc_page() path under pipe->mutex
+ * has to dip into reclaim.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <poll.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
+#define HIST_BUCKETS	32
+
+static size_t g_msgsize = 16 * 4096;
+static int g_duration = 3;
+static int g_pipe_size = 1024 * 1024;
+static int g_memory_pressure;
+
+static atomic_int g_stop;
+static int g_pipe[2];
+
+struct wstats {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum_ns;
+	uint64_t lat_max_ns;
+	uint64_t lat_hist[HIST_BUCKETS];
+	char *buf;
+};
+
+struct rstats {
+	char *buf;
+};
+
+struct hist_totals {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum;
+	uint64_t lat_max;
+};
+
+static inline uint64_t now_ns(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
+}
+
+static inline int log2_bucket(uint64_t v)
+{
+	int b = 0;
+
+	if (!v)
+		return 0;
+	while (v >>= 1)
+		b++;
+	return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1;
+}
+
+static void *writer(void *arg)
+{
+	struct wstats *s = arg;
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		uint64_t t0 = now_ns();
+		ssize_t n = write(g_pipe[1], s->buf, g_msgsize);
+		uint64_t dt = now_ns() - t0;
+
+		if (n > 0) {
+			s->writes++;
+			s->bytes += (uint64_t)n;
+			s->lat_sum_ns += dt;
+			if (dt > s->lat_max_ns)
+				s->lat_max_ns = dt;
+			s->lat_hist[log2_bucket(dt)]++;
+		} else if (n < 0 && (errno == EPIPE || errno == EBADF)) {
+			break;
+		}
+	}
+	return NULL;
+}
+
+static void *reader(void *arg)
+{
+	struct rstats *s = arg;
+
+	/*
+	 * Drain until EOF (write end closed by main). g_stop is not checked
+	 * here on purpose: writers may be blocked in write() with the pipe
+	 * full when g_stop is set, so the reader must keep draining until
+	 * main closes the write end.
+	 */
+	for (;;) {
+		ssize_t n = read(g_pipe[0], s->buf, g_msgsize);
+
+		if (n <= 0)
+			break;
+	}
+	return NULL;
+}
+
+/* Sum per-writer stats and per-bucket counts into the caller's aggregates. */
+static void aggregate_wstats(struct wstats *all, int nw,
+			     uint64_t agg[HIST_BUCKETS],
+			     struct hist_totals *t)
+{
+	memset(t, 0, sizeof(*t));
+	for (int i = 0; i < nw; i++) {
+		t->writes += all[i].writes;
+		t->bytes += all[i].bytes;
+		t->lat_sum += all[i].lat_sum_ns;
+		if (all[i].lat_max_ns > t->lat_max)
+			t->lat_max = all[i].lat_max_ns;
+		for (int b = 0; b < HIST_BUCKETS; b++)
+			agg[b] += all[i].lat_hist[b];
+	}
+}
+
+/*
+ * Walk @agg in order, returning the inclusive upper bound (in ns) of the
+ * log2 bucket where the running sum first reaches @target.
+ *
+ * A percentile is undefined with zero samples, and with very low sample
+ * counts integer truncation could make @target zero -- then "cum >= 0"
+ * would latch on the first (possibly empty) bucket. Callers must pass
+ * @target >= 1.
+ */
+static uint64_t bucket_at(const uint64_t agg[HIST_BUCKETS], uint64_t target)
+{
+	uint64_t cum = 0;
+
+	for (int b = 0; b < HIST_BUCKETS; b++) {
+		/* HIST_BUCKETS <= 63, so (b + 1) is always a safe shift. */
+		uint64_t upper = (1ULL << (b + 1)) - 1;
+
+		cum += agg[b];
+		if (cum >= target)
+			return upper;
+	}
+	return 0;
+}
+
+static void compute_p50_p99(const uint64_t agg[HIST_BUCKETS], uint64_t writes,
+			    uint64_t *p50, uint64_t *p99)
+{
+	uint64_t p50_target, p99_target;
+
+	*p50 = *p99 = 0;
+	if (!writes)
+		return;
+
+	p50_target = writes * 50 / 100;
+	p99_target = writes * 99 / 100;
+	if (!p50_target)
+		p50_target = 1;
+	if (!p99_target)
+		p99_target = 1;
+
+	*p50 = bucket_at(agg, p50_target);
+	*p99 = bucket_at(agg, p99_target);
+}
+
+static void print_summary(int nw, int nr, const struct hist_totals *t,
+			  uint64_t p50, uint64_t p99)
+{
+	double sec = g_duration;
+	uint64_t avg_ns = t->writes ? t->lat_sum / t->writes : 0;
+
+	printf("config: writers=%d readers=%d msgsize=%zu duration=%d pipe_size=%d memory_pressure=%s\n",
+	       nw, nr, g_msgsize, g_duration, g_pipe_size,
+	       g_memory_pressure ? "yes" : "no");
+	printf("writes: total=%llu rate=%.0f/s\n",
+	       (unsigned long long)t->writes, (double)t->writes / sec);
+	printf("throughput_MBps: %.2f\n",
+	       ((double)t->bytes / sec) / (1024.0 * 1024.0));
+	printf("lat_avg_ns: %llu\n", (unsigned long long)avg_ns);
+	printf("lat_p50_ns_upper: %llu\n", (unsigned long long)p50);
+	printf("lat_p99_ns_upper: %llu\n", (unsigned long long)p99);
+	printf("lat_max_ns: %llu\n", (unsigned long long)t->lat_max);
+}
+
+static void summarize(struct wstats *all, int nw, int nr)
+{
+	uint64_t agg[HIST_BUCKETS] = {0};
+	struct hist_totals t;
+	uint64_t p50, p99;
+
+	aggregate_wstats(all, nw, agg, &t);
+	compute_p50_p99(agg, t.writes, &p50, &p99);
+	print_summary(nw, nr, &t, p50, p99);
+}
+
+/*
+ * Child branch of fork(): restore SIGPIPE to default (parent ignores it),
+ * exec stress-ng, and on failure write the reason into @hs_wr before
+ * exiting. The parent observes EOF on hs_wr (closed via O_CLOEXEC) when
+ * exec succeeds.
+ */
+static void stress_ng_child(int hs_wr) __attribute__((noreturn));
+static void stress_ng_child(int hs_wr)
+{
+	char errbuf[256];
+
+	signal(SIGPIPE, SIG_DFL);
+	execlp("stress-ng", "stress-ng",
+	       "--vm", "4", "--vm-bytes", "80%",
+	       "--vm-method", "all",
+	       (char *)NULL);
+	snprintf(errbuf, sizeof(errbuf),
+		 "exec stress-ng failed: %s\n", strerror(errno));
+	(void)!write(hs_wr, errbuf, strlen(errbuf));
+	_exit(127);
+}
+
+/*
+ * Read from the O_CLOEXEC handshake pipe. Anything readable means the
+ * child wrote an error before exec; EOF (n == 0) means the write-end
+ * closed because exec succeeded. Returns 0 on exec success, -1 if the
+ * child failed and was reaped.
+ */
+static int stress_ng_wait_handshake(int hs_rd, pid_t pid)
+{
+	struct pollfd pfd = { .fd = hs_rd, .events = POLLIN };
+	char errbuf[256];
+	int status;
+	int ret;
+
+	ret = poll(&pfd, 1, 500);
+	if (ret <= 0)
+		return 0;
+
+	ssize_t n = read(hs_rd, errbuf, sizeof(errbuf) - 1);
+
+	if (n > 0) {
+		errbuf[n] = '\0';
+		fputs(errbuf, stderr);
+		waitpid(pid, &status, 0);
+		return -1;
+	}
+	return 0;
+}
+
+static pid_t spawn_stress_ng(void)
+{
+	int hs[2];
+	pid_t pid;
+
+	/*
+	 * Handshake pipe: child writes one byte and _exit()s on exec
+	 * failure. On exec success the O_CLOEXEC flag closes the write
+	 * end, which the parent observes as EOF. This makes the "is
+	 * stress-ng on $PATH?" check fail fast rather than silently.
+	 */
+	if (pipe2(hs, O_CLOEXEC) < 0) {
+		perror("pipe2");
+		return -1;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		close(hs[0]);
+		close(hs[1]);
+		return -1;
+	}
+	if (pid == 0) {
+		close(hs[0]);
+		stress_ng_child(hs[1]);
+	}
+
+	close(hs[1]);
+	if (stress_ng_wait_handshake(hs[0], pid) < 0) {
+		close(hs[0]);
+		return -1;
+	}
+	close(hs[0]);
+
+	/* Give stress-ng a moment to map its VM regions before measuring. */
+	sleep(1);
+	return pid;
+}
+
+static void kill_stress_ng(pid_t pid)
+{
+	int status;
+
+	if (pid <= 0)
+		return;
+	kill(pid, SIGTERM);
+	for (int i = 0; i < 20; i++) {
+		if (waitpid(pid, &status, WNOHANG) > 0)
+			return;
+		usleep(100 * 1000);
+	}
+	kill(pid, SIGKILL);
+	waitpid(pid, &status, 0);
+}
+
+/*
+ * Allocate per-thread page-aligned buffers in main so a failed
+ * aligned_alloc() aborts the run before any thread starts. Workers used
+ * to allocate their own buffer and return NULL on failure, which left
+ * peers blocked in write()/read() with nobody to unblock them.
+ */
+static int alloc_thread_bufs(struct wstats *ws, int nw,
+			     struct rstats *rs, int nr)
+{
+	for (int i = 0; i < nw; i++) {
+		ws[i].buf = aligned_alloc(4096, g_msgsize);
+		if (!ws[i].buf) {
+			fprintf(stderr, "writer %d: aligned_alloc(%zu) failed\n",
+				i, g_msgsize);
+			return -1;
+		}
+		memset(ws[i].buf, 0xAA, g_msgsize);
+	}
+	for (int i = 0; i < nr; i++) {
+		rs[i].buf = aligned_alloc(4096, g_msgsize);
+		if (!rs[i].buf) {
+			fprintf(stderr, "reader %d: aligned_alloc(%zu) failed\n",
+				i, g_msgsize);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static void free_thread_bufs(struct wstats *ws, int nw,
+			     struct rstats *rs, int nr)
+{
+	if (ws)
+		for (int i = 0; i < nw; i++)
+			free(ws[i].buf);
+	if (rs)
+		for (int i = 0; i < nr; i++)
+			free(rs[i].buf);
+}
+
+static int start_readers(pthread_t *rt, struct rstats *rs, int nr,
+			 int *created)
+{
+	for (int i = 0; i < nr; i++) {
+		int err = pthread_create(&rt[i], NULL, reader, &rs[i]);
+
+		if (err) {
+			fprintf(stderr, "pthread_create reader %d: %s\n",
+				i, strerror(err));
+			return -1;
+		}
+		(*created)++;
+	}
+	return 0;
+}
+
+static int start_writers(pthread_t *wt, struct wstats *ws, int nw,
+			 int *created)
+{
+	for (int i = 0; i < nw; i++) {
+		int err = pthread_create(&wt[i], NULL, writer, &ws[i]);
+
+		if (err) {
+			fprintf(stderr, "pthread_create writer %d: %s\n",
+				i, strerror(err));
+			return -1;
+		}
+		(*created)++;
+	}
+	return 0;
+}
+
+static int open_bench_pipe(void)
+{
+	if (pipe(g_pipe) < 0) {
+		perror("pipe");
+		return -1;
+	}
+	if (fcntl(g_pipe[1], F_SETPIPE_SZ, g_pipe_size) < 0)
+		perror("F_SETPIPE_SZ (continuing)");
+	return 0;
+}
+
+/*
+ * Normal termination: g_stop tells writers to leave the loop after the
+ * current write() returns. Closing the shared write-end fd means once
+ * the in-flight writes drain, readers see EOF and exit. Writers are not
+ * unblocked by EPIPE here -- g_pipe[0] stays open so readers can keep
+ * draining.
+ *
+ * Error path: some threads may have been created and others skipped, so
+ * writers could be blocked in write() with no reader making progress.
+ * Close both ends -- closing the read end is what delivers EPIPE to a
+ * blocked writer.
+ */
+static void stop_and_join(pthread_t *wt, int nw_created,
+			  pthread_t *rt, int nr_created, int rc)
+{
+	atomic_store(&g_stop, 1);
+	close(g_pipe[1]);
+	if (rc < 0)
+		close(g_pipe[0]);
+	for (int i = 0; i < nw_created; i++)
+		pthread_join(wt[i], NULL);
+	for (int i = 0; i < nr_created; i++)
+		pthread_join(rt[i], NULL);
+	if (rc == 0)
+		close(g_pipe[0]);
+}
+
+static int run_one(int nw, int nr)
+{
+	pthread_t *wt = NULL, *rt = NULL;
+	struct wstats *ws = NULL;
+	struct rstats *rs = NULL;
+	int nw_created = 0, nr_created = 0;
+	int rc = 0;
+
+	atomic_store(&g_stop, 0);
+
+	if (open_bench_pipe() < 0)
+		return -1;
+
+	wt = calloc((size_t)nw, sizeof(*wt));
+	rt = calloc((size_t)nr, sizeof(*rt));
+	ws = calloc((size_t)nw, sizeof(*ws));
+	rs = calloc((size_t)nr, sizeof(*rs));
+	if (!wt || !rt || !ws || !rs) {
+		fprintf(stderr, "alloc failed\n");
+		rc = -1;
+		goto teardown;
+	}
+
+	if (alloc_thread_bufs(ws, nw, rs, nr) < 0) {
+		rc = -1;
+		goto teardown;
+	}
+
+	if (start_readers(rt, rs, nr, &nr_created) < 0 ||
+	    start_writers(wt, ws, nw, &nw_created) < 0) {
+		rc = -1;
+		goto teardown;
+	}
+
+	sleep((unsigned int)g_duration);
+
+teardown:
+	stop_and_join(wt, nw_created, rt, nr_created, rc);
+
+	if (rc == 0) {
+		summarize(ws, nw, nr);
+		fflush(stdout);
+	}
+
+	free_thread_bufs(ws, nw, rs, nr);
+	free(wt);
+	free(rt);
+	free(ws);
+	free(rs);
+	return rc;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [-w writers] [-r readers] [-s msgsize] [-d secs] [-p pipe_size] [--memory-pressure]\n"
+		"  default: sweep writers={1,2,5} x readers={1,5,10}\n"
+		"  --memory-pressure: spawn stress-ng (--vm 4 --vm-bytes 80%% --vm-method all) for the run\n",
+		prog);
+}
+
+static int parse_args(int argc, char **argv,
+		      int *writers_override, int *readers_override)
+{
+	static const struct option long_opts[] = {
+		{"memory-pressure", no_argument, NULL, 'M'},
+		{0, 0, 0, 0},
+	};
+	int opt;
+
+	while ((opt = getopt_long(argc, argv, "w:r:s:d:p:",
+				  long_opts, NULL)) != -1) {
+		switch (opt) {
+		case 'w':
+			*writers_override = atoi(optarg);
+			break;
+		case 'r':
+			*readers_override = atoi(optarg);
+			break;
+		case 's':
+			g_msgsize = (size_t)atol(optarg);
+			break;
+		case 'd':
+			g_duration = atoi(optarg);
+			break;
+		case 'p':
+			g_pipe_size = atoi(optarg);
+			break;
+		case 'M':
+			g_memory_pressure = 1;
+			break;
+		default:
+			usage(argv[0]);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * aligned_alloc(4096, size) requires size to be a multiple of the
+ * alignment (C11); glibc returns NULL otherwise, which would make
+ * writer/reader threads silently exit and the run report zero writes.
+ * Validate up front instead.
+ */
+static int validate_args(void)
+{
+	if (g_msgsize == 0 || g_msgsize % 4096 != 0) {
+		fprintf(stderr,
+			"msgsize must be a positive multiple of 4096 (got %zu)\n",
+			g_msgsize);
+		return -1;
+	}
+	if (g_duration <= 0) {
+		fprintf(stderr, "duration must be > 0 seconds (got %d)\n",
+			g_duration);
+		return -1;
+	}
+	if (g_pipe_size <= 0) {
+		fprintf(stderr, "pipe_size must be > 0 bytes (got %d)\n",
+			g_pipe_size);
+		return -1;
+	}
+	return 0;
+}
+
+static int run_sweep(void)
+{
+	static const int writers_sweep[] = {1, 2, 5};
+	static const int readers_sweep[] = {1, 5, 10};
+
+	for (size_t i = 0; i < ARRAY_SIZE(writers_sweep); i++) {
+		for (size_t j = 0; j < ARRAY_SIZE(readers_sweep); j++) {
+			printf("---\n");
+			if (run_one(writers_sweep[i], readers_sweep[j]) < 0)
+				return -1;
+		}
+	}
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int writers_override = 0, readers_override = 0;
+	pid_t stress_pid = -1;
+	int rc = 0;
+
+	if (parse_args(argc, argv, &writers_override, &readers_override) < 0)
+		return 1;
+	if (validate_args() < 0)
+		return 1;
+
+	signal(SIGPIPE, SIG_IGN);
+	setvbuf(stdout, NULL, _IOLBF, 0);
+	setvbuf(stderr, NULL, _IOLBF, 0);
+
+	fprintf(stderr, "pid=%d\n", getpid());
+	fflush(stderr);
+
+	if (g_memory_pressure) {
+		stress_pid = spawn_stress_ng();
+		if (stress_pid < 0) {
+			fprintf(stderr,
+				"memory_pressure requested but stress-ng could not be spawned\n");
+			return 1;
+		}
+	}
+
+	if (writers_override > 0 || readers_override > 0) {
+		int nw = writers_override > 0 ? writers_override : 1;
+		int nr = readers_override > 0 ? readers_override : 1;
+
+		rc = run_one(nw, nr) < 0 ? 1 : 0;
+	} else {
+		rc = run_sweep() < 0 ? 1 : 0;
+	}
+
+	kill_stress_ng(stress_pid);
+	return rc;
+}

-- 
2.54.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-22 16:51   ` Jeff Layton
  2026-05-22 17:55     ` Breno Leitao
  2026-05-22 19:48   ` Mateusz Guzik
  2026-05-23 16:26   ` Oleg Nesterov
  2 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2026-05-22 16:51 UTC (permalink / raw)
  To: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg,
	axboe, kernel-team

On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
> thing") and then, from the per-iteration anon_pipe_get_page() helper,
> used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
> while still holding it.
> 
> That allocation can sleep doing direct reclaim and/or runs memcg
> charging, which extends the critical section and stalls a concurrent
> reader on the very same mutex.
> 
> Just pre-alloc the required pages before the lock in an array and just pop
> them inside the lock.
> 
> This can improve the pipe throughput up to 48% and reduce the
> latency in 33%, easily seen when there is memory pressure and direct
> reclaim.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 102 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..cff255217bbfe 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>  	pipe_lock(pipe2);
>  }
>  
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +struct anon_pipe_prealloc {
> +	struct page *pages[PIPE_PREALLOC_MAX];
> +	unsigned int count;
> +};
> +
> +/*
> + * Pre-allocate pages outside pipe->mutex for multi-page writes.
> + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
> + * charging; doing it under the mutex stalls a concurrent reader.
> + *
> + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
> + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
> + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
> + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
> + * honoured for every page; the per-call overhead is small compared to the
> + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
> + * in-lock alloc_page() fallback in anon_pipe_get_page().
> + */
> +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +					size_t total_len)
> +{
> +	unsigned int want, i;
> +	struct page *page;
> +
> +	prealloc->count = 0;
> +	if (total_len <= PAGE_SIZE)
> +		return;
> +
> +	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> +		     PIPE_PREALLOC_MAX);
> +
> +	for (i = 0; i < want; i++) {
> +		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
> +		if (!page)
> +			break;
> +		prealloc->pages[prealloc->count++] = page;
> +	}

I believe alloc_pages_bulk() is supposed to be more efficient when
allocating several pages at once like this.

> +}
> +
> +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
> +{
> +	if (!prealloc->count)
> +		return NULL;
> +
> +	prealloc->count--;
> +
> +	return prealloc->pages[prealloc->count];
> +}
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +				       struct anon_pipe_prealloc *prealloc)
>  {
> +	struct page *page;
> +
> +	/* Drain prealloc first to keep tmp_page[] hot for later small writes. */
> +	page = anon_pipe_prealloc_pop(prealloc);
> +	if (page)
> +		return page;
> +
>  	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>  		if (pipe->tmp_page[i]) {
> -			struct page *page = pipe->tmp_page[i];
> +			page = pipe->tmp_page[i];
>  			pipe->tmp_page[i] = NULL;
>  			return page;
>  		}
>  	}
>  
> +	/* FWIW: This is called with pipe->mutex held */
>  	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>  
> @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
>  	put_page(page);
>  }
>  
> +/*
> + * Stash leftover prealloc pages in tmp_page[] so the next write to this
> + * pipe gets a hot page without entering the allocator.
> + */
> +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
> +				       struct anon_pipe_prealloc *prealloc)
> +{
> +	int i, idx;
> +
> +	if (!prealloc->count)
> +		return;
> +
> +	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
> +		if (pipe->tmp_page[i])
> +			continue;
> +		if (!prealloc->count)
> +			return;
> +		idx = --prealloc->count;
> +		pipe->tmp_page[i] = prealloc->pages[idx];
> +		prealloc->pages[idx] = NULL;
> +	}
> +}
> +
> +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
> +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
> +{
> +	while (prealloc->count) {
> +		prealloc->count--;
> +		put_page(prealloc->pages[prealloc->count]);
> +	}
> +}
> +
>  static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>  				  struct pipe_buffer *buf)
>  {
> @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *filp = iocb->ki_filp;
>  	struct pipe_inode_info *pipe = filp->private_data;
> +	struct anon_pipe_prealloc prealloc;
>  	unsigned int head;
>  	ssize_t ret = 0;
>  	size_t total_len = iov_iter_count(from);
> @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(total_len == 0))
>  		return 0;
>  
> +	anon_pipe_get_page_prealloc(&prealloc, total_len);
> +
>  	mutex_lock(&pipe->mutex);
>  
>  	if (!pipe->readers) {
> @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			struct page *page;
>  			int copied;
>  
> -			page = anon_pipe_get_page(pipe);
> +			page = anon_pipe_get_page(pipe, &prealloc);
>  			if (unlikely(!page)) {
>  				if (!ret)
>  					ret = -ENOMEM;
> @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		 * after waiting we need to re-check whether the pipe
>  		 * become empty while we dropped the lock.
>  		 */
> +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
>  		mutex_unlock(&pipe->mutex);
> +		anon_pipe_free_pages(&prealloc);
>  		if (was_empty)
>  			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		wake_next_writer = true;
>  	}
>  out:
> +	anon_pipe_refill_tmp_pages(pipe, &prealloc);
>  	if (pipe_is_full(pipe))
>  		wake_next_writer = false;
>  	mutex_unlock(&pipe->mutex);
> +	anon_pipe_free_pages(&prealloc);
>  
>  	/*
>  	 * If we do do a wakeup event, we do a 'sync' wakeup, because we

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:51   ` Jeff Layton
@ 2026-05-22 17:55     ` Breno Leitao
  0 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 17:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, oleg, axboe, kernel-team

On Fri, May 22, 2026 at 12:51:40PM -0400, Jeff Layton wrote:
> On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> > +/*
> > + * Pre-allocate pages outside pipe->mutex for multi-page writes.
> > + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
> > + * charging; doing it under the mutex stalls a concurrent reader.
> > + *
> > + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
> > + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
> > + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
> > + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
> > + * honoured for every page; the per-call overhead is small compared to the
> > + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
> > + * in-lock alloc_page() fallback in anon_pipe_get_page().
> > + */
> > +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> > +					size_t total_len)
> > +{
> > +	unsigned int want, i;
> > +	struct page *page;
> > +
> > +	prealloc->count = 0;
> > +	if (total_len <= PAGE_SIZE)
> > +		return;
> > +
> > +	want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> > +		     PIPE_PREALLOC_MAX);
> > +
> > +	for (i = 0; i < want; i++) {
> > +		page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
> > +		if (!page)
> > +			break;
> > +		prealloc->pages[prealloc->count++] = page;
> > +	}
> 
> I believe alloc_pages_bulk() is supposed to be more efficient when
> allocating several pages at once like this.

Thanks Jeff. I understand bulk allocators refuses __GFP_ACCOUNT under memcg.
(That is the reason I've CCed Shakeel in this patchset).

Anyway, reading the code it shows me:

  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
                          nodemask_t *nodemask, int nr_pages,
                          struct page **page_array)
  {

	...
        /* Bulk allocator does not support memcg accounting. */
        if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT))
                goto failed;


Thanks for the review,
--breno

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
  2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
@ 2026-05-22 19:43 ` Jeff Layton
  2 siblings, 0 replies; 13+ messages in thread
From: Jeff Layton @ 2026-05-22 19:43 UTC (permalink / raw)
  To: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, Mateusz Guzik
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg,
	axboe, kernel-team

On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote:
> While profiling Meta's caching code[1], I found pipe->mutex contention
> on the hot path. anon_pipe_write() currently calls alloc_page() once
> per page while holding pipe->mutex. The allocation can sleep doing
> direct reclaim and runs memcg charging, which extends the critical
> section and stalls any concurrent reader on the same mutex.
> 
> This series pre-allocates pages outside pipe->mutex in
> anon_pipe_write(): for writes that span more than one full page, up
> to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
> alloc_page() loop before the mutex is taken. anon_pipe_get_page()
> then drains the prealloc array first, falls back to the per-pipe
> tmp_page[] cache, and only enters the allocator under the mutex for
> the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
> writes that skip prealloc, or shortfalls when the prealloc loop
> fails). Leftover prealloc pages are recycled into tmp_page[] before
> unlock and any remainder is put_page()'d after unlock, keeping the
> allocator out of the critical section on both sides.
> 
> alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
> refuses __GFP_ACCOUNT under memcg -- it returns at most one page
> when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
> 8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
> __GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
> task NUMA mempolicy honoured uniformly without open-coding the
> charge.
> 
> I also vibe-coded a microbenchmark to validate the change. It sweeps
> writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
> 1 MB pipe and prints throughput + latency percentiles per config.
> 
> Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
> writes, 1 MB pipe). The numbers below were collected on v1
> (alloc_pages_bulk()); v2's per-page loop preserves the dominant
> "allocation outside the mutex" win and is expected to land in the same
> range.
> 
> == No memory pressure (10s per config) ==
> 
>   Throughput in MB/s (baseline -> patched, delta):
>     writers   readers=1              readers=5               readers=10
>           1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
>           2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
>           5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)
> 
>   Avg write latency in ns (baseline -> patched, delta):
>     writers   readers=1                 readers=5                readers=10
>           1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
>           2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
>           5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)
> 
> Throughput improves +6% to +28% and average write latency drops 5%
> to 22% across every configuration.
> 
> == Under memory pressure (--memory-pressure, 6s per config) ==
> 
> stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
> sweep so the alloc_page() calls inside anon_pipe_write() routinely
> hit direct reclaim -- exactly the regime the patch targets.
> 
>   Throughput in MB/s (baseline -> patched, delta):
>     writers   readers=1            readers=5            readers=10
>           1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
>           2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
>           5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)
> 
>   Avg write latency in ns (baseline -> patched, delta):
>     writers   readers=1              readers=5              readers=10
>           1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
>           2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
>           5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)
> 
> Throughput improves +21% to +48% and average write latency drops
> 17% to 33% -- a noticeably bigger win than the no-pressure run.
> 
> That tracks: when alloc_page() has to dip into reclaim, the cost
> of holding pipe->mutex across it is highest, and pulling the
> allocation out of the critical section pays the most.
> 
> Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v2:
> - Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
>   per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
> - Split the prealloc work out of anon_pipe_write() into dedicated
>   helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
>   anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
>   struct anon_pipe_prealloc, so the write path stays readable.
> - Recycle leftover prealloc pages into pipe->tmp_page[] before
>   unlocking
> - Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org
> 
> To: Alexander Viro <viro@zeniv.linux.org.uk>
> To: Christian Brauner <brauner@kernel.org>
> To: Jan Kara <jack@suse.cz>
> To: Shuah Khan <shuah@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> 
> ---
> Breno Leitao (2):
>       fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
>       selftests/pipe: add pipe_bench microbenchmark
> 
>  fs/pipe.c                                 | 105 ++++-
>  tools/testing/selftests/Makefile          |   1 +
>  tools/testing/selftests/pipe/.gitignore   |   1 +
>  tools/testing/selftests/pipe/Makefile     |   9 +
>  tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
>  5 files changed, 729 insertions(+), 3 deletions(-)
> ---
> base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
> change-id: 20260515-fix_pipe-c91677c187e7
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>

Pity this can't use the bulk page allocator, but looks good otherwise.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-22 16:51   ` Jeff Layton
@ 2026-05-22 19:48   ` Mateusz Guzik
  2026-05-23 16:26   ` Oleg Nesterov
  2 siblings, 0 replies; 13+ messages in thread
From: Mateusz Guzik @ 2026-05-22 19:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
	jlayton, oleg, axboe, kernel-team

On Fri, May 22, 2026 at 6:44 PM Breno Leitao <leitao@debian.org> wrote:
>
> anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
> thing") and then, from the per-iteration anon_pipe_get_page() helper,
> used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
> while still holding it.
>
> That allocation can sleep doing direct reclaim and/or runs memcg
> charging, which extends the critical section and stalls a concurrent
> reader on the very same mutex.
>
> Just pre-alloc the required pages before the lock in an array and just pop
> them inside the lock.
>
> This can improve the pipe throughput up to 48% and reduce the
> latency in 33%, easily seen when there is memory pressure and direct
> reclaim.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 102 insertions(+), 3 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..cff255217bbfe 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>         pipe_lock(pipe2);
>  }
>
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +struct anon_pipe_prealloc {
> +       struct page *pages[PIPE_PREALLOC_MAX];
> +       unsigned int count;
> +};
> +
> +/*
> + * Pre-allocate pages outside pipe->mutex for multi-page writes.
> + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg
> + * charging; doing it under the mutex stalls a concurrent reader.
> + *
> + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses
> + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc:
> + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single
> + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy
> + * honoured for every page; the per-call overhead is small compared to the
> + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the
> + * in-lock alloc_page() fallback in anon_pipe_get_page().
> + */
> +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +                                       size_t total_len)
> +{
> +       unsigned int want, i;
> +       struct page *page;
> +
> +       prealloc->count = 0;
> +       if (total_len <= PAGE_SIZE)
> +               return;
> +
> +       want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE),
> +                    PIPE_PREALLOC_MAX);
> +
> +       for (i = 0; i < want; i++) {
> +               page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
> +               if (!page)
> +                       break;
> +               prealloc->pages[prealloc->count++] = page;
> +       }
> +}
> +
> +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc)
> +{
> +       if (!prealloc->count)
> +               return NULL;
> +
> +       prealloc->count--;
> +
> +       return prealloc->pages[prealloc->count];
> +}
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +                                      struct anon_pipe_prealloc *prealloc)
>  {
> +       struct page *page;
> +
> +       /* Drain prealloc first to keep tmp_page[] hot for later small writes. */
> +       page = anon_pipe_prealloc_pop(prealloc);
> +       if (page)
> +               return page;
> +
>         for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>                 if (pipe->tmp_page[i]) {
> -                       struct page *page = pipe->tmp_page[i];
> +                       page = pipe->tmp_page[i];
>                         pipe->tmp_page[i] = NULL;
>                         return page;
>                 }
>         }
>
> +       /* FWIW: This is called with pipe->mutex held */
>         return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>
> @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
>         put_page(page);
>  }
>
> +/*
> + * Stash leftover prealloc pages in tmp_page[] so the next write to this
> + * pipe gets a hot page without entering the allocator.
> + */
> +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe,
> +                                      struct anon_pipe_prealloc *prealloc)
> +{
> +       int i, idx;
> +
> +       if (!prealloc->count)
> +               return;
> +
> +       for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
> +               if (pipe->tmp_page[i])
> +                       continue;
> +               if (!prealloc->count)
> +                       return;
> +               idx = --prealloc->count;
> +               pipe->tmp_page[i] = prealloc->pages[idx];
> +               prealloc->pages[idx] = NULL;
> +       }
> +}
> +
> +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */
> +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
> +{
> +       while (prealloc->count) {
> +               prealloc->count--;
> +               put_page(prealloc->pages[prealloc->count]);
> +       }
> +}
> +
>  static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>                                   struct pipe_buffer *buf)
>  {
> @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>         struct file *filp = iocb->ki_filp;
>         struct pipe_inode_info *pipe = filp->private_data;
> +       struct anon_pipe_prealloc prealloc;
>         unsigned int head;
>         ssize_t ret = 0;
>         size_t total_len = iov_iter_count(from);
> @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         if (unlikely(total_len == 0))
>                 return 0;
>
> +       anon_pipe_get_page_prealloc(&prealloc, total_len);
> +
>         mutex_lock(&pipe->mutex);
>
>         if (!pipe->readers) {
> @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                         struct page *page;
>                         int copied;
>
> -                       page = anon_pipe_get_page(pipe);
> +                       page = anon_pipe_get_page(pipe, &prealloc);
>                         if (unlikely(!page)) {
>                                 if (!ret)
>                                         ret = -ENOMEM;
> @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                  * after waiting we need to re-check whether the pipe
>                  * become empty while we dropped the lock.
>                  */
> +               anon_pipe_refill_tmp_pages(pipe, &prealloc);
>                 mutex_unlock(&pipe->mutex);
> +               anon_pipe_free_pages(&prealloc);
>                 if (was_empty)
>                         wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>                 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                 wake_next_writer = true;
>         }
>  out:
> +       anon_pipe_refill_tmp_pages(pipe, &prealloc);
>         if (pipe_is_full(pipe))
>                 wake_next_writer = false;

maybe the new call would be nicer moved below pipe_is_full, but that's
just cosmetics

>         mutex_unlock(&pipe->mutex);
> +       anon_pipe_free_pages(&prealloc);
>
>         /*
>          * If we do do a wakeup event, we do a 'sync' wakeup, because we
>
> --
> 2.54.0
>

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-22 16:51   ` Jeff Layton
  2026-05-22 19:48   ` Mateusz Guzik
@ 2026-05-23 16:26   ` Oleg Nesterov
  2026-05-24 14:30     ` Breno Leitao
  2 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-23 16:26 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

To be honest, I didn't read this patch carefully, but let me ask anyway.

On 05/22, Breno Leitao wrote:
>
> @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *filp = iocb->ki_filp;
>  	struct pipe_inode_info *pipe = filp->private_data;
> +	struct anon_pipe_prealloc prealloc;
>  	unsigned int head;
>  	ssize_t ret = 0;
>  	size_t total_len = iov_iter_count(from);
> @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(total_len == 0))
>  		return 0;
>  
> +	anon_pipe_get_page_prealloc(&prealloc, total_len);
> +
>  	mutex_lock(&pipe->mutex);
>  
>  	if (!pipe->readers) {
> @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			struct page *page;
>  			int copied;
>  
> -			page = anon_pipe_get_page(pipe);
> +			page = anon_pipe_get_page(pipe, &prealloc);
>  			if (unlikely(!page)) {
>  				if (!ret)
>  					ret = -ENOMEM;
> @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		 * after waiting we need to re-check whether the pipe
>  		 * become empty while we dropped the lock.
>  		 */
> +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
>  		mutex_unlock(&pipe->mutex);
> +		anon_pipe_free_pages(&prealloc);

Do we really want to call anon_pipe_free_pages() at this point?

The main loop will continue when pipe_writable() becomes true again...

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
@ 2026-05-23 16:43   ` Oleg Nesterov
  2026-05-23 16:49     ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-23 16:43 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On 05/22, Breno Leitao wrote:
>
> Add a small selftest that stresses pipe->mutex contention by spawning N
> writer threads that hammer a single pipe with multi-page writes, plus M
> reader threads that drain. Each writer records its own write() latency
> samples into a log2-bucketed histogram; main aggregates and prints
> total writes, throughput, average and percentile (p50/p99) latencies,
> and the maximum observed latency.
> 
> Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
> --vm-method all) for the duration of the run, so alloc_page() in
> anon_pipe_write() routinely hits direct reclaim. The flag fails
> fast if stress-ng is not on $PATH.
> 
> Program print something like the following, for different writes,
> readers, msgsizes and memory pressure:
> 
> 	config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576
> 	memory_pressure=[no|yes]
> 	writes: total=54451 rate=18150/s
> 	throughput_MBps: 1134.40
> 	lat_avg_ns: 275355
> 	lat_p50_ns_upper: 262143
> 	lat_p99_ns_upper: 1048575
> 	lat_max_ns: 2145633


The more tests the better... but since it comes as 2/2, perhaps the changelog
could mention the difference before/after 1/2 according to this test?

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-23 16:43   ` Oleg Nesterov
@ 2026-05-23 16:49     ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-23 16:49 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On 05/23, Oleg Nesterov wrote:
>
> The more tests the better... but since it comes as 2/2, perhaps the changelog
> could mention the difference before/after 1/2 according to this test?

Ah, I didn't read 0/2, sorry for the noise ;)

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-23 16:26   ` Oleg Nesterov
@ 2026-05-24 14:30     ` Breno Leitao
  2026-05-24 14:48       ` Mateusz Guzik
  0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-05-24 14:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >  		 * after waiting we need to re-check whether the pipe
> >  		 * become empty while we dropped the lock.
> >  		 */
> > +		anon_pipe_refill_tmp_pages(pipe, &prealloc);
> >  		mutex_unlock(&pipe->mutex);
> > +		anon_pipe_free_pages(&prealloc);
> 
> Do we really want to call anon_pipe_free_pages() at this point?
> 
> The main loop will continue when pipe_writable() becomes true again...

I went back and forth on this. The argument for freeing was that
wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
stopped reader), and holding up the prealloc pages felt antisocial --
especially under the memory pressure this series targets, where those pages are
more useful on the freelists than parked on a sleeping task.

On the other side, on wakeup the loop is guaranteed to want pages again, and
re-entering the allocator under the mutex puts us back in the contended state
the patch removes. For any write() large enough to wait mid-syscall (which is
the workload patch 2/2 measures), keeping them strictly wins on throughput /
p99.

I think your read is the better one -- the throughput case is the whole point
of the series, and the memory-hoarding concern is bounded (8 pages per blocked
writer, freed in the out: path on syscall exit). Will drop the free_pages()
call from the wait branch in v3 and keep only the one after the out: label,
together with the nit from Mateusz.

Thanks for the review,
--breno

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-24 14:30     ` Breno Leitao
@ 2026-05-24 14:48       ` Mateusz Guzik
  2026-05-24 16:47         ` Breno Leitao
  0 siblings, 1 reply; 13+ messages in thread
From: Mateusz Guzik @ 2026-05-24 14:48 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Oleg Nesterov, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote:
>
> On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > >              * after waiting we need to re-check whether the pipe
> > >              * become empty while we dropped the lock.
> > >              */
> > > +           anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > >             mutex_unlock(&pipe->mutex);
> > > +           anon_pipe_free_pages(&prealloc);
> >
> > Do we really want to call anon_pipe_free_pages() at this point?
> >
> > The main loop will continue when pipe_writable() becomes true again...
>
> I went back and forth on this. The argument for freeing was that
> wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> stopped reader), and holding up the prealloc pages felt antisocial --
> especially under the memory pressure this series targets, where those pages are
> more useful on the freelists than parked on a sleeping task.
>
> On the other side, on wakeup the loop is guaranteed to want pages again, and
> re-entering the allocator under the mutex puts us back in the contended state
> the patch removes. For any write() large enough to wait mid-syscall (which is
> the workload patch 2/2 measures), keeping them strictly wins on throughput /
> p99.
>

You can still prealloc after wakeup for whatever reminder you got
though, but I can agree dropping these frees is a sensible way out and
it is easier and I'm not going to insist on one way or the other.

However, I think it would be prudent to add a tracepoint to some
machines on your fleet to find out how often they allocate pages under
the mutex (and for what i/o size). Initial alloc for the first write <
PAGE_SIZE definitely happens under the mutex which is probably not a
problem, but for anything later? The tracepoint can have a trivial
indicator if this is the first write if that matters. One can
speculate all day, nothing beats checking what happens.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-24 14:48       ` Mateusz Guzik
@ 2026-05-24 16:47         ` Breno Leitao
  0 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-24 16:47 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Oleg Nesterov, Alexander Viro, Christian Brauner, Jan Kara,
	Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest,
	shakeel.butt, jlayton, axboe, kernel-team

Hello Mateusz,

On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote:
> On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > > >              * after waiting we need to re-check whether the pipe
> > > >              * become empty while we dropped the lock.
> > > >              */
> > > > +           anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > > >             mutex_unlock(&pipe->mutex);
> > > > +           anon_pipe_free_pages(&prealloc);
> > >
> > > Do we really want to call anon_pipe_free_pages() at this point?
> > >
> > > The main loop will continue when pipe_writable() becomes true again...
> >
> > I went back and forth on this. The argument for freeing was that
> > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> > stopped reader), and holding up the prealloc pages felt antisocial --
> > especially under the memory pressure this series targets, where those pages are
> > more useful on the freelists than parked on a sleeping task.
> >
> > On the other side, on wakeup the loop is guaranteed to want pages again, and
> > re-entering the allocator under the mutex puts us back in the contended state
> > the patch removes. For any write() large enough to wait mid-syscall (which is
> > the workload patch 2/2 measures), keeping them strictly wins on throughput /
> > p99.
> >
> 
> You can still prealloc after wakeup for whatever reminder you got
> though, but I can agree dropping these frees is a sensible way out and
> it is easier and I'm not going to insist on one way or the other.

Ack. I've sent a v3 with anon_pipe_free_pages() and
anon_pipe_refill_tmp_pages() dropped.

> However, I think it would be prudent to add a tracepoint to some
> machines on your fleet to find out how often they allocate pages under
> the mutex (and for what i/o size). Initial alloc for the first write <
> PAGE_SIZE definitely happens under the mutex which is probably not a
> problem, but for anything later? 

> The tracepoint can have a trivial
> indicator if this is the first write if that matters. One can

Isn't this what I've reported earlier?

https://lore.kernel.org/all/ag3Ty3T24wjn1aFw@gmail.com/

Adding a tracepoint is harder than usual, given kernel rollout takes ages.

But I hacked a bpftrace script and ran it on a random sample of fleet hosts (5
min each). 

As reported earlier, multi-page pipe writes are not uncommon: on one
host a single long-running process produced 196,476 under-mutex alloc_page()
calls in 5 minutes, with allocs-per-write distributions reaching 16+ -- exactly
the pattern this patch removes. 

Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page
first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips
anyway, so the win is concentrated on the workloads that actually need it.

None of the allocs hit reclaim during the trace I ran, but I would expect
direct reclaim to happen with the lock held.

Thanks for the review and direction,
--breno

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-05-24 16:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao
2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-22 16:51   ` Jeff Layton
2026-05-22 17:55     ` Breno Leitao
2026-05-22 19:48   ` Mateusz Guzik
2026-05-23 16:26   ` Oleg Nesterov
2026-05-24 14:30     ` Breno Leitao
2026-05-24 14:48       ` Mateusz Guzik
2026-05-24 16:47         ` Breno Leitao
2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
2026-05-23 16:43   ` Oleg Nesterov
2026-05-23 16:49     ` Oleg Nesterov
2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.