[PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock
@ 2026-05-15 10:28 Breno Leitao
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  0 siblings, 2 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

While profiling kernel code on Meta's caching code[1], I found some lock
contention on pipe hot path, and I decide to investigate, and found that
anon_pipe_write() currently calls alloc_page() once per page while
holding pipe->mutex. The allocation can sleep doing direct reclaim and
runs memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex.

I would like to propose that we bulk pre-allocate pages outside
pipe->mutex in anon_pipe_write -- pre-allocates up to PIPE_PREALLOC_MAX
(8) pages with alloc_pages_bulk() before taking the lock

 * the get-page helper then consumes from tmp_page[] / the prealloc
   array first and only falls back to in-lock alloc_page() for writes
   larger than PIPE_PREALLOC_MAX (8) pages, which is rare.

I also vibe coded a microbenchmark to test the pipe benchmark. Basically
it wweeps writers x readers over {1,2,5} x {1,5,10} with 64KB writes
against a 1 MB pipe and prints throughput + latency percentiles per
config.

Measured on arm64 in virtme-ng (16 vCPUs, 64KB writes, 1 MB pipe).

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
      fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 |  40 +++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 351 ++++++++++++++++++++++++++++++
 5 files changed, 400 insertions(+), 2 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
@ 2026-05-15 10:28 ` Breno Leitao
  2026-05-15 20:18   ` Mateusz Guzik
  2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  1 sibling, 1 reply; 5+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

anon_pipe_write() takes pipe->mutex and then, from the per-iteration
anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
__GFP_ACCOUNT) once per page while still holding it. That allocation
can sleep doing direct reclaim and/or runs memcg charging, which extends
the critical section and stalls a concurrent reader on the very same
mutex.

Bulk pre-allocate DIV_ROUND_UP(total_len, PAGE_SIZE) pages (up to
PIPE_PREALLOC_MAX (8)) pages outside the mutex when total_len >=
PAGE_SIZE, using alloc_pages_bulk().  (Under memcg, alloc_pages_bulk()
with __GFP_ACCOUNT might return less pages than requested, but, this is
still a win, given some pages allocation is moved outside of the
lock).

Pass the array into anon_pipe_get_page(), which now consumes from
tmp_page[] first, then from the prealloc array, and only as a last
resort falls back to alloc_page() under the mutex (reached only for
writes larger than 8 pages, where the prealloc cap is exhausted).

Doing this in one bulk call before the lock keeps the fast path's
mutex held for a single, write-bounded critical section -- no extra
mutex_unlock/_lock cycles -- so it avoids the per-page lock-handoff
overhead that a per-page drop-and-retake design would introduce, while
still moving the typical multi-page allocation off the critical
section. Unused prealloc pages are pushed to the pipe's tmp_page[]
cache (or freed) before unlock, so a subsequent write to the same
pipe gets a hot cached page rather than paying for an alloc again.

Sub-PAGE_SIZE writes are unchanged: the merge path handles them
without ever needing a fresh page, so it is not worth speculatively
allocating for them.

This can improve the pipe throughput up to 48% and reduce the latency in
33%.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 fs/pipe.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9cf3e..7a1517d15107a 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -111,7 +111,11 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }

-static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+#define PIPE_PREALLOC_MAX 8
+
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
+				       struct page **prealloc,
+				       unsigned int *prealloc_n)
 {
 	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
 		if (pipe->tmp_page[i]) {
@@ -121,6 +125,14 @@ static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
 		}
 	}

+	if (*prealloc_n) {
+		unsigned int idx = --(*prealloc_n);
+		struct page *page = prealloc[idx];
+
+		prealloc[idx] = NULL;
+		return page;
+	}
+
 	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 }

@@ -438,6 +450,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t chars;
 	bool was_empty = false;
 	bool wake_next_writer = false;
+	struct page *prealloc[PIPE_PREALLOC_MAX] = { NULL };
+	unsigned int prealloc_n = 0;

 	/*
 	 * Reject writing to watch queue pipes before the point where we lock
@@ -455,6 +469,26 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;

+	/*
+	 * Bulk pre-allocate up to PIPE_PREALLOC_MAX pages outside pipe->mutex
+	 * for writes that span at least one full page. alloc_page() with
+	 * GFP_HIGHUSER may sleep doing reclaim and runs memcg charging, so
+	 * doing it under the mutex extends the critical section and stalls
+	 * the reader. The merge path handles sub-PAGE_SIZE writes without
+	 * needing a fresh page; for writes larger than PIPE_PREALLOC_MAX
+	 * pages, anon_pipe_get_page() falls back to a single alloc_page()
+	 * under the mutex for the remainder. Unused prealloc pages are
+	 * returned to the pipe's tmp_page[] cache (or freed) before unlock.
+	 */
+	if (total_len >= PAGE_SIZE) {
+		unsigned int want = min_t(unsigned int,
+					  DIV_ROUND_UP(total_len, PAGE_SIZE),
+					  PIPE_PREALLOC_MAX);
+
+		prealloc_n = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
+					      want, prealloc);
+	}
+
 	mutex_lock(&pipe->mutex);

 	if (!pipe->readers) {
@@ -512,7 +546,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			struct page *page;
 			int copied;

-			page = anon_pipe_get_page(pipe);
+			page = anon_pipe_get_page(pipe, prealloc, &prealloc_n);
 			if (unlikely(!page)) {
 				if (!ret)
 					ret = -ENOMEM;
@@ -576,6 +610,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		wake_next_writer = true;
 	}
 out:
+	while (prealloc_n)
+		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
 	if (pipe_is_full(pipe))
 		wake_next_writer = false;
 	mutex_unlock(&pipe->mutex);

-- 
2.53.0-Meta

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-15 10:28 ` Breno Leitao
  1 sibling, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

Add a small selftest that stresses pipe->mutex contention by spawning N
writer threads that hammer a single pipe with multi-page writes, plus M
reader threads that drain. Each writer records its own write() latency
samples into a log2-bucketed histogram; main aggregates and prints
total writes, throughput, average and percentile (p50/p99/p99.9)
latencies, and the maximum observed latency.

This was used to validate "fs/pipe: bulk pre-allocate pages outside
pipe->mutex in anon_pipe_write". By default it sweeps over
writers={1,2,5} x readers={1,5,10} using 64KB writes for 3s on a 1 MB
pipe (~27s total); -w/-r switch to a single configuration and -s/-d/-p
tune msgsize/duration/pipe size. Output is one-line-per-metric with a
"---" separator between configurations so two runs (e.g. baseline vs
patched) can be diffed directly.

Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
--vm-method all) for the duration of the run, so alloc_page() in
anon_pipe_write() routinely hits direct reclaim. The flag fails
fast if stress-ng is not on $PATH.

The program exits 0 on a clean run and reports its results to stdout,
so it integrates with the kselftest framework via TEST_GEN_PROGS.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 351 ++++++++++++++++++++++++++++++
 4 files changed, 362 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e416..bcd9db9d292ca 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -91,6 +91,7 @@ TARGETS += pcie_bwctrl
 TARGETS += perf_events
 TARGETS += pidfd
 TARGETS += pid_namespace
+TARGETS += pipe
 TARGETS += power_supply
 TARGETS += powerpc
 TARGETS += prctl
diff --git a/tools/testing/selftests/pipe/.gitignore b/tools/testing/selftests/pipe/.gitignore
new file mode 100644
index 0000000000000..20b549361a152
--- /dev/null
+++ b/tools/testing/selftests/pipe/.gitignore
@@ -0,0 +1 @@
+pipe_bench
diff --git a/tools/testing/selftests/pipe/Makefile b/tools/testing/selftests/pipe/Makefile
new file mode 100644
index 0000000000000..1810c680117b3
--- /dev/null
+++ b/tools/testing/selftests/pipe/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+# Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+
+CFLAGS += -O2 -Wall -Wextra -pthread
+
+TEST_GEN_PROGS := pipe_bench
+
+include ../lib.mk
diff --git a/tools/testing/selftests/pipe/pipe_bench.c b/tools/testing/selftests/pipe/pipe_bench.c
new file mode 100644
index 0000000000000..4b4ee6c8c0ced
--- /dev/null
+++ b/tools/testing/selftests/pipe/pipe_bench.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pipe_bench - exercise pipe->mutex contention under concurrent writers.
+ *
+ * N writer threads hammer a single pipe with multi-page writes; M reader
+ * threads drain it. Each writer records its own write() latency histogram.
+ * Multi-page writes (msgsize >= PAGE_SIZE) force the loop in
+ * anon_pipe_write() to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) under
+ * pipe->mutex, which is the critical section the patch shrinks.
+ *
+ * By default the benchmark sweeps writers in {1, 2, 5} x readers in
+ * {1, 5, 10} and prints one block per configuration so two runs (e.g.
+ * baseline vs patched) can be diffed directly. Pass -w and -r to run a
+ * single configuration instead. Pass --memory-pressure to spawn stress-ng
+ * alongside the sweep so the per-page alloc_page() path under pipe->mutex
+ * has to dip into reclaim.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
+#define HIST_BUCKETS	32
+
+static size_t g_msgsize = 16 * 4096;
+static int g_duration = 3;
+static int g_pipe_size = 1024 * 1024;
+static int g_memory_pressure;
+
+static atomic_int g_stop;
+static int g_pipe[2];
+
+struct wstats {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum_ns;
+	uint64_t lat_max_ns;
+	uint64_t lat_hist[HIST_BUCKETS];
+};
+
+static inline uint64_t now_ns(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static inline int log2_bucket(uint64_t v)
+{
+	int b = 0;
+
+	if (!v)
+		return 0;
+	while (v >>= 1)
+		b++;
+	return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1;
+}
+
+static void *writer(void *arg)
+{
+	struct wstats *s = arg;
+	char *buf = aligned_alloc(4096, g_msgsize);
+
+	if (!buf)
+		return NULL;
+	memset(buf, 0xAA, g_msgsize);
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		uint64_t t0 = now_ns();
+		ssize_t n = write(g_pipe[1], buf, g_msgsize);
+		uint64_t dt = now_ns() - t0;
+
+		if (n > 0) {
+			s->writes++;
+			s->bytes += n;
+			s->lat_sum_ns += dt;
+			if (dt > s->lat_max_ns)
+				s->lat_max_ns = dt;
+			s->lat_hist[log2_bucket(dt)]++;
+		} else if (n < 0 && errno == EPIPE) {
+			break;
+		}
+	}
+	free(buf);
+	return NULL;
+}
+
+static void *reader(void *arg)
+{
+	char *buf = aligned_alloc(4096, g_msgsize);
+
+	(void)arg;
+	if (!buf)
+		return NULL;
+	/*
+	 * Drain until EOF (write end closed by main). g_stop is not checked
+	 * here on purpose: writers may be blocked in write() with the pipe
+	 * full when g_stop is set, so the reader must keep draining until
+	 * main closes the write end.
+	 */
+	for (;;) {
+		ssize_t n = read(g_pipe[0], buf, g_msgsize);
+
+		if (n <= 0)
+			break;
+	}
+	free(buf);
+	return NULL;
+}
+
+static void summarize(struct wstats *all, int nw, int nr)
+{
+	uint64_t total_writes = 0, total_bytes = 0, total_lat = 0;
+	uint64_t max_lat = 0;
+	uint64_t agg[HIST_BUCKETS] = {0};
+	uint64_t p50_target, p99_target, p999_target;
+	uint64_t cum = 0, p50 = 0, p99 = 0, p999 = 0;
+	uint64_t avg_ns;
+	double sec;
+
+	for (int i = 0; i < nw; i++) {
+		total_writes += all[i].writes;
+		total_bytes += all[i].bytes;
+		total_lat += all[i].lat_sum_ns;
+		if (all[i].lat_max_ns > max_lat)
+			max_lat = all[i].lat_max_ns;
+		for (int b = 0; b < HIST_BUCKETS; b++)
+			agg[b] += all[i].lat_hist[b];
+	}
+
+	p50_target = total_writes * 50 / 100;
+	p99_target = total_writes * 99 / 100;
+	p999_target = total_writes * 999 / 1000;
+
+	for (int b = 0; b < HIST_BUCKETS; b++) {
+		cum += agg[b];
+		if (!p50 && cum >= p50_target)
+			p50 = 1ULL << b;
+		if (!p99 && cum >= p99_target)
+			p99 = 1ULL << b;
+		if (!p999 && cum >= p999_target)
+			p999 = 1ULL << b;
+	}
+
+	sec = g_duration;
+	avg_ns = total_writes ? total_lat / total_writes : 0;
+
+	printf("config: writers=%d readers=%d msgsize=%zu duration=%d pipe_size=%d memory_pressure=%s\n",
+	       nw, nr, g_msgsize, g_duration, g_pipe_size,
+	       g_memory_pressure ? "yes" : "no");
+	printf("writes: total=%llu rate=%.0f/s\n",
+	       (unsigned long long)total_writes, total_writes / sec);
+	printf("throughput_MBps: %.2f\n",
+	       (total_bytes / sec) / (1024.0 * 1024.0));
+	printf("lat_avg_ns: %llu\n", (unsigned long long)avg_ns);
+	printf("lat_p50_ns_upper: %llu\n", (unsigned long long)p50);
+	printf("lat_p99_ns_upper: %llu\n", (unsigned long long)p99);
+	printf("lat_p999_ns_upper: %llu\n", (unsigned long long)p999);
+	printf("lat_max_ns: %llu\n", (unsigned long long)max_lat);
+}
+
+static pid_t spawn_stress_ng(void)
+{
+	pid_t pid = fork();
+
+	if (pid < 0) {
+		perror("fork");
+		return -1;
+	}
+	if (pid == 0) {
+		execlp("stress-ng", "stress-ng",
+		       "--vm", "4", "--vm-bytes", "80%",
+		       "--vm-method", "all",
+		       (char *)NULL);
+		fprintf(stderr, "exec stress-ng failed: %s\n",
+			strerror(errno));
+		_exit(127);
+	}
+	/* Give stress-ng a moment to map its VM regions before measuring. */
+	sleep(1);
+	return pid;
+}
+
+static void kill_stress_ng(pid_t pid)
+{
+	int status;
+
+	if (pid <= 0)
+		return;
+	kill(pid, SIGTERM);
+	for (int i = 0; i < 20; i++) {
+		if (waitpid(pid, &status, WNOHANG) > 0)
+			return;
+		usleep(100 * 1000);
+	}
+	kill(pid, SIGKILL);
+	waitpid(pid, &status, 0);
+}
+
+static int run_one(int nw, int nr)
+{
+	pthread_t *wt, *rt;
+	struct wstats *ws;
+
+	atomic_store(&g_stop, 0);
+
+	if (pipe(g_pipe) < 0) {
+		perror("pipe");
+		return -1;
+	}
+	if (fcntl(g_pipe[1], F_SETPIPE_SZ, g_pipe_size) < 0)
+		perror("F_SETPIPE_SZ (continuing)");
+
+	wt = calloc(nw, sizeof(*wt));
+	rt = calloc(nr, sizeof(*rt));
+	ws = calloc(nw, sizeof(*ws));
+
+	if (!wt || !rt || !ws) {
+		fprintf(stderr, "alloc failed\n");
+		free(wt);
+		free(rt);
+		free(ws);
+		close(g_pipe[0]);
+		close(g_pipe[1]);
+		return -1;
+	}
+
+	for (int i = 0; i < nr; i++)
+		pthread_create(&rt[i], NULL, reader, NULL);
+	for (int i = 0; i < nw; i++)
+		pthread_create(&wt[i], NULL, writer, &ws[i]);
+
+	sleep(g_duration);
+	atomic_store(&g_stop, 1);
+
+	/*
+	 * Close write end first so any writer blocked in write() gets EPIPE
+	 * and exits, and so the readers see EOF after draining.
+	 */
+	close(g_pipe[1]);
+	for (int i = 0; i < nw; i++)
+		pthread_join(wt[i], NULL);
+	for (int i = 0; i < nr; i++)
+		pthread_join(rt[i], NULL);
+	close(g_pipe[0]);
+
+	summarize(ws, nw, nr);
+	fflush(stdout);
+
+	free(wt);
+	free(rt);
+	free(ws);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	static const int writers_sweep[] = {1, 2, 5};
+	static const int readers_sweep[] = {1, 5, 10};
+	static const struct option long_opts[] = {
+		{"memory-pressure", no_argument, NULL, 'M'},
+		{0, 0, 0, 0},
+	};
+	int writers_override = 0, readers_override = 0;
+	pid_t stress_pid = -1;
+	int rc = 0, opt;
+
+	while ((opt = getopt_long(argc, argv, "w:r:s:d:p:",
+				  long_opts, NULL)) != -1) {
+		switch (opt) {
+		case 'w':
+			writers_override = atoi(optarg);
+			break;
+		case 'r':
+			readers_override = atoi(optarg);
+			break;
+		case 's':
+			g_msgsize = atol(optarg);
+			break;
+		case 'd':
+			g_duration = atoi(optarg);
+			break;
+		case 'p':
+			g_pipe_size = atoi(optarg);
+			break;
+		case 'M':
+			g_memory_pressure = 1;
+			break;
+		default:
+			fprintf(stderr,
+				"usage: %s [-w writers] [-r readers] [-s msgsize] [-d secs] [-p pipe_size] [--memory-pressure]\n"
+				"  default: sweep writers={1,2,5} x readers={1,5,10}\n"
+				"  --memory-pressure: spawn stress-ng (--vm 4 --vm-bytes 80%% --vm-method all) for the run\n",
+				argv[0]);
+			return 1;
+		}
+	}
+
+	signal(SIGPIPE, SIG_IGN);
+	setvbuf(stdout, NULL, _IOLBF, 0);
+	setvbuf(stderr, NULL, _IOLBF, 0);
+
+	fprintf(stderr, "pid=%d\n", getpid());
+	fflush(stderr);
+
+	if (g_memory_pressure) {
+		stress_pid = spawn_stress_ng();
+		if (stress_pid < 0) {
+			fprintf(stderr,
+				"memory_pressure requested but stress-ng could not be spawned\n");
+			return 1;
+		}
+	}
+
+	if (writers_override > 0 || readers_override > 0) {
+		int nw = writers_override > 0 ? writers_override : 1;
+		int nr = readers_override > 0 ? readers_override : 1;
+
+		rc = run_one(nw, nr) < 0 ? 1 : 0;
+		goto out;
+	}
+
+	for (size_t i = 0; i < ARRAY_SIZE(writers_sweep); i++) {
+		for (size_t j = 0; j < ARRAY_SIZE(readers_sweep); j++) {
+			printf("---\n");
+			if (run_one(writers_sweep[i], readers_sweep[j]) < 0) {
+				rc = 1;
+				goto out;
+			}
+		}
+	}
+out:
+	kill_stress_ng(stress_pid);
+	return rc;
+}

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-15 20:18   ` Mateusz Guzik
  2026-05-20 17:09     ` Breno Leitao
  0 siblings, 1 reply; 5+ messages in thread
From: Mateusz Guzik @ 2026-05-15 20:18 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

On Fri, May 15, 2026 at 03:28:44AM -0700, Breno Leitao wrote:
> anon_pipe_write() takes pipe->mutex and then, from the per-iteration
> anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
> __GFP_ACCOUNT) once per page while still holding it. That allocation
> can sleep doing direct reclaim and/or runs memcg charging, which extends
> the critical section and stalls a concurrent reader on the very same
> mutex.
> 
> Bulk pre-allocate DIV_ROUND_UP(total_len, PAGE_SIZE) pages (up to
> PIPE_PREALLOC_MAX (8)) pages outside the mutex when total_len >=
> PAGE_SIZE, using alloc_pages_bulk().  (Under memcg, alloc_pages_bulk()
> with __GFP_ACCOUNT might return less pages than requested, but, this is
> still a win, given some pages allocation is moved outside of the
> lock).
> 
> Pass the array into anon_pipe_get_page(), which now consumes from
> tmp_page[] first, then from the prealloc array, and only as a last
> resort falls back to alloc_page() under the mutex (reached only for
> writes larger than 8 pages, where the prealloc cap is exhausted).
> 
> Doing this in one bulk call before the lock keeps the fast path's
> mutex held for a single, write-bounded critical section -- no extra
> mutex_unlock/_lock cycles -- so it avoids the per-page lock-handoff
> overhead that a per-page drop-and-retake design would introduce, while
> still moving the typical multi-page allocation off the critical
> section. Unused prealloc pages are pushed to the pipe's tmp_page[]
> cache (or freed) before unlock, so a subsequent write to the same
> pipe gets a hot cached page rather than paying for an alloc again.
> 
> Sub-PAGE_SIZE writes are unchanged: the merge path handles them
> without ever needing a fresh page, so it is not worth speculatively
> allocating for them.
> 
> This can improve the pipe throughput up to 48% and reduce the latency in
> 33%.


This change is quite a hammer.

Do you have a real workload where there are multiple processes issuing
large ops to the same pipe on both reader and writer side?

The only workload I'm aware of sharing the pipe between more than just
one reader and one writer is anything with make, but that is mostly
operating on 1-byte ops.

I'm asking because this would be better implemented taking into account
how many pages are hanging out in the tmp_page array. With only one
writer and one reader there is no concern someone will steal the pages.

Preferably the tmp_page array would grow to accomodate more memory, but
admittedly this runs into a problem where a pipe can linger unused
indefinitely while hoarding several pages. As in, some shrinking
mechanism would be needed so that's probably a no-go and preallocation
is the way to go for now.

Even then, this can definitely be integrated much better instead of
being open-coded.

> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  fs/pipe.c | 40 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 38 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..7a1517d15107a 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,7 +111,11 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>  	pipe_lock(pipe2);
>  }
>  
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +				       struct page **prealloc,
> +				       unsigned int *prealloc_n)
>  {
>  	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>  		if (pipe->tmp_page[i]) {
> @@ -121,6 +125,14 @@ static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
>  		}
>  	}
>  
> +	if (*prealloc_n) {
> +		unsigned int idx = --(*prealloc_n);

This is some nasty golfing.

Instead, you could have something like:
struct tmp_page_prealloc {
	struct page *pages;
	unsigned int count;
};


then...

> +		struct page *page = prealloc[idx];
> +
> +		prealloc[idx] = NULL;
> +		return page;
> +	}
> +
>  	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>  
> @@ -438,6 +450,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t chars;
>  	bool was_empty = false;
>  	bool wake_next_writer = false;
> +	struct page *prealloc[PIPE_PREALLOC_MAX] = { NULL };
> +	unsigned int prealloc_n = 0;
>  

	struct tmp_page_prealloc tpp = {};

>  	/*
>  	 * Reject writing to watch queue pipes before the point where we lock
> @@ -455,6 +469,26 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(total_len == 0))
>  		return 0;
>  
> +	/*
> +	 * Bulk pre-allocate up to PIPE_PREALLOC_MAX pages outside pipe->mutex
> +	 * for writes that span at least one full page. alloc_page() with
> +	 * GFP_HIGHUSER may sleep doing reclaim and runs memcg charging, so
> +	 * doing it under the mutex extends the critical section and stalls
> +	 * the reader. The merge path handles sub-PAGE_SIZE writes without
> +	 * needing a fresh page; for writes larger than PIPE_PREALLOC_MAX
> +	 * pages, anon_pipe_get_page() falls back to a single alloc_page()
> +	 * under the mutex for the remainder. Unused prealloc pages are
> +	 * returned to the pipe's tmp_page[] cache (or freed) before unlock.
> +	 */
> +	if (total_len >= PAGE_SIZE) {
> +		unsigned int want = min_t(unsigned int,
> +					  DIV_ROUND_UP(total_len, PAGE_SIZE),
> +					  PIPE_PREALLOC_MAX);
> +
> +		prealloc_n = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
> +					      want, prealloc);
> +	}
> +

This should move to a dedicated routine, perhaps:
	anon_get_page_prealloc(pipe, &tmp_page_prealloc);

then it can also easily decide how much to prealloc based on existing
state

>  	mutex_lock(&pipe->mutex);
>  
>  	if (!pipe->readers) {
> @@ -512,7 +546,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			struct page *page;
>  			int copied;
>  
> -			page = anon_pipe_get_page(pipe);
> +			page = anon_pipe_get_page(pipe, prealloc, &prealloc_n);
>  			if (unlikely(!page)) {
>  				if (!ret)
>  					ret = -ENOMEM;
> @@ -576,6 +610,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		wake_next_writer = true;
>  	}
>  out:
> +	while (prealloc_n)
> +		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);

this results in put_page() calls under the mutex, still extending the hold time

you could split this into 2 ops, for example:
+	anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
  	if (pipe_is_full(pipe))
  		wake_next_writer = false;
  	mutex_unlock(&pipe->mutex);
+	anon_pipe_free_pages(pipe, &tmp_page_prealloc);

All that aside, I have no idea about performance impact on arm64, but
there is *definitely* some throughput lost based on the fact that memory
copy is restarted per page. I presume arm64 has an equivalent of amd64's
SMAP which again will not be free and is paid for every time as well.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-15 20:18   ` Mateusz Guzik
@ 2026-05-20 17:09     ` Breno Leitao
  0 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-20 17:09 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

Hello Mateusz,

Thanks for the careful review. Let me address the points inline,
with data I collected on three Meta production hosts: two aarch64 NVIDIA Grace
and one x86_64 Bergamo / EPYC 9D64, using bpftrace and perf lock contention.

On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> This change is quite a hammer.

Are you seeing any downside/trade-off with this current approach?

>
> Do you have a real workload where there are multiple processes issuing
> large ops to the same pipe on both reader and writer side?
>
> The only workload I'm aware of sharing the pipe between more than just
> one reader and one writer is anything with make, but that is mostly
> operating on 1-byte ops.

You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):

    size                writes      %
    <128 B              293602    98.0    (merge path, no alloc)
    128 B - 4 KB          1108     0.4
    4 - 8 KB              2149     0.72   <-- patch start helps from here
    8 - 16 KB                3
    16 - 32 KB               0
    32 - 64 KB               5
    64 - 128 KB              2            <-- up to 72KB writes

So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
contention count is a Meta fleet workload spraying sub-128-byte
metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
just hits the mutex very often.

But the writers that *do* take the alloc-under-mutex path are
real. Lock-wait data confirms it shows up:

  perf lock contention -ab (30s, anon_pipe_* only):

                                contentions  total wait  max wait
    arm64 (NVIDIA Grace)
      anon_pipe_write                 7660    4.39 ms     50 us
      anon_pipe_read                     7   55.78 us     17 us
    x86_64 (Bergamo)
      anon_pipe_write                 3154    4.53 ms     79 us
      anon_pipe_read                   817    1.72 ms     82 us

The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
the symptom the patch targets: a multi-page writer holding
pipe->mutex across alloc_page() stalls the concurrent reader on
the same mutex.

  bpftrace, mutex_lock duration inside anon_pipe_write, 300s:

                  Grace-A  Grace-B  Bergamo
    [4K,  8K)      1090     2799      980
    [8K, 16K)      1785     1353      953
    [16K,32K)       755      623      530
    [32K,64K)       153      172      184
    [64K,128K)       13        8       59
    [128K,256K)      -         -        4
    [256K,512K)      -         -        1    <-- ~500us blocked

So the patch's value is shortening the long critical sections that
the 99% of small writers are blocked on -- the contention everyone
sees drops because the few multi-page writers do their alloc work
outside the lock.

>= 64KB, which are not common but existent.

> I'm asking because this would be better implemented taking into account
> how many pages are hanging out in the tmp_page array. With only one
> writer and one reader there is no concern someone will steal the pages.
>
> Preferably the tmp_page array would grow to accomodate more memory, but
> admittedly this runs into a problem where a pipe can linger unused
> indefinitely while hoarding several pages. As in, some shrinking
> mechanism would be needed so that's probably a no-go and preallocation
> is the way to go for now.
>
> Even then, this can definitely be integrated much better instead of
> being open-coded.

> This is some nasty golfing.
>
> Instead, you could have something like:
> struct tmp_page_prealloc {
> 	struct page *pages;
> 	unsigned int count;
> };

Agreed -- v2 will wrap the array+counter in
struct tmp_page_prealloc and drop the open-coded indexing.

> 	struct tmp_page_prealloc tpp = {};
> [...]
> This should move to a dedicated routine, perhaps:
> 	anon_get_page_prealloc(pipe, &tmp_page_prealloc);
>
> then it can also easily decide how much to prealloc based on existing
> state

Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
keep the policy in one place.

> > +	while (prealloc_n)
> > +		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
>
> this results in put_page() calls under the mutex, still extending the hold time
>
> you could split this into 2 ops, for example:
> +	anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
>   	if (pipe_is_full(pipe))
>   		wake_next_writer = false;
>   	mutex_unlock(&pipe->mutex);
> +	anon_pipe_free_pages(pipe, &tmp_page_prealloc);

Well spotted!

> All that aside, I have no idea about performance impact on arm64, but
> there is *definitely* some throughput lost based on the fact that memory
> copy is restarted per page. I presume arm64 has an equivalent of amd64's
> SMAP which again will not be free and is paid for every time as well.

Can you say a bit more about what you'd want to see here? I want to make sure
I'm reading the suggestion the same way you are.

My understanding: anon_pipe_write loops once per page calling
copy_page_from_iter(), and each call re-enters the iov_iter
machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).

So an N-page write pays the user-access bracket N times instead of
once, on top of resetting the per-page memcpy and losing the
hardware prefetch streak.

The way I see it, is opening a single user_read_access_begin() region around
the page loop and use unsafe_copy_from_user() inside, with a labeled
fault-fallback path that closes the region and finishes via the slow path.

Is this what you meant?

Really thanks for your review,
--breno

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-20 17:09 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-15 20:18   ` Mateusz Guzik
2026-05-20 17:09     ` Breno Leitao
2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox