[PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock
@ 2026-05-15 10:28 Breno Leitao
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
  2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  0 siblings, 2 replies; 9+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

While profiling kernel code on Meta's caching code[1], I found some lock
contention on pipe hot path, and I decide to investigate, and found that
anon_pipe_write() currently calls alloc_page() once per page while
holding pipe->mutex. The allocation can sleep doing direct reclaim and
runs memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex.

I would like to propose that we bulk pre-allocate pages outside
pipe->mutex in anon_pipe_write -- pre-allocates up to PIPE_PREALLOC_MAX
(8) pages with alloc_pages_bulk() before taking the lock

 * the get-page helper then consumes from tmp_page[] / the prealloc
   array first and only falls back to in-lock alloc_page() for writes
   larger than PIPE_PREALLOC_MAX (8) pages, which is rare.

I also vibe coded a microbenchmark to test the pipe benchmark. Basically
it wweeps writers x readers over {1,2,5} x {1,5,10} with 64KB writes
against a 1 MB pipe and prints throughput + latency percentiles per
config.

Measured on arm64 in virtme-ng (16 vCPUs, 64KB writes, 1 MB pipe).

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
      fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
      selftests/pipe: add pipe_bench microbenchmark

 fs/pipe.c                                 |  40 +++-
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 351 ++++++++++++++++++++++++++++++
 5 files changed, 400 insertions(+), 2 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7

Best regards,
--  
Breno Leitao <leitao@debian.org>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
@ 2026-05-15 10:28 ` Breno Leitao
  2026-05-15 20:18   ` Mateusz Guzik
  2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
  1 sibling, 1 reply; 9+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

anon_pipe_write() takes pipe->mutex and then, from the per-iteration
anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
__GFP_ACCOUNT) once per page while still holding it. That allocation
can sleep doing direct reclaim and/or runs memcg charging, which extends
the critical section and stalls a concurrent reader on the very same
mutex.

Bulk pre-allocate DIV_ROUND_UP(total_len, PAGE_SIZE) pages (up to
PIPE_PREALLOC_MAX (8)) pages outside the mutex when total_len >=
PAGE_SIZE, using alloc_pages_bulk().  (Under memcg, alloc_pages_bulk()
with __GFP_ACCOUNT might return less pages than requested, but, this is
still a win, given some pages allocation is moved outside of the
lock).

Pass the array into anon_pipe_get_page(), which now consumes from
tmp_page[] first, then from the prealloc array, and only as a last
resort falls back to alloc_page() under the mutex (reached only for
writes larger than 8 pages, where the prealloc cap is exhausted).

Doing this in one bulk call before the lock keeps the fast path's
mutex held for a single, write-bounded critical section -- no extra
mutex_unlock/_lock cycles -- so it avoids the per-page lock-handoff
overhead that a per-page drop-and-retake design would introduce, while
still moving the typical multi-page allocation off the critical
section. Unused prealloc pages are pushed to the pipe's tmp_page[]
cache (or freed) before unlock, so a subsequent write to the same
pipe gets a hot cached page rather than paying for an alloc again.

Sub-PAGE_SIZE writes are unchanged: the merge path handles them
without ever needing a fresh page, so it is not worth speculatively
allocating for them.

This can improve the pipe throughput up to 48% and reduce the latency in
33%.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 fs/pipe.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9cf3e..7a1517d15107a 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -111,7 +111,11 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }

-static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+#define PIPE_PREALLOC_MAX 8
+
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
+				       struct page **prealloc,
+				       unsigned int *prealloc_n)
 {
 	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
 		if (pipe->tmp_page[i]) {
@@ -121,6 +125,14 @@ static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
 		}
 	}

+	if (*prealloc_n) {
+		unsigned int idx = --(*prealloc_n);
+		struct page *page = prealloc[idx];
+
+		prealloc[idx] = NULL;
+		return page;
+	}
+
 	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 }

@@ -438,6 +450,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t chars;
 	bool was_empty = false;
 	bool wake_next_writer = false;
+	struct page *prealloc[PIPE_PREALLOC_MAX] = { NULL };
+	unsigned int prealloc_n = 0;

 	/*
 	 * Reject writing to watch queue pipes before the point where we lock
@@ -455,6 +469,26 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;

+	/*
+	 * Bulk pre-allocate up to PIPE_PREALLOC_MAX pages outside pipe->mutex
+	 * for writes that span at least one full page. alloc_page() with
+	 * GFP_HIGHUSER may sleep doing reclaim and runs memcg charging, so
+	 * doing it under the mutex extends the critical section and stalls
+	 * the reader. The merge path handles sub-PAGE_SIZE writes without
+	 * needing a fresh page; for writes larger than PIPE_PREALLOC_MAX
+	 * pages, anon_pipe_get_page() falls back to a single alloc_page()
+	 * under the mutex for the remainder. Unused prealloc pages are
+	 * returned to the pipe's tmp_page[] cache (or freed) before unlock.
+	 */
+	if (total_len >= PAGE_SIZE) {
+		unsigned int want = min_t(unsigned int,
+					  DIV_ROUND_UP(total_len, PAGE_SIZE),
+					  PIPE_PREALLOC_MAX);
+
+		prealloc_n = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
+					      want, prealloc);
+	}
+
 	mutex_lock(&pipe->mutex);

 	if (!pipe->readers) {
@@ -512,7 +546,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			struct page *page;
 			int copied;

-			page = anon_pipe_get_page(pipe);
+			page = anon_pipe_get_page(pipe, prealloc, &prealloc_n);
 			if (unlikely(!page)) {
 				if (!ret)
 					ret = -ENOMEM;
@@ -576,6 +610,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		wake_next_writer = true;
 	}
 out:
+	while (prealloc_n)
+		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
 	if (pipe_is_full(pipe))
 		wake_next_writer = false;
 	mutex_unlock(&pipe->mutex);

-- 
2.53.0-Meta

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-15 20:18   ` Mateusz Guzik
  2026-05-20 17:09     ` Breno Leitao
  0 siblings, 1 reply; 9+ messages in thread
From: Mateusz Guzik @ 2026-05-15 20:18 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

On Fri, May 15, 2026 at 03:28:44AM -0700, Breno Leitao wrote:
> anon_pipe_write() takes pipe->mutex and then, from the per-iteration
> anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
> __GFP_ACCOUNT) once per page while still holding it. That allocation
> can sleep doing direct reclaim and/or runs memcg charging, which extends
> the critical section and stalls a concurrent reader on the very same
> mutex.
> 
> Bulk pre-allocate DIV_ROUND_UP(total_len, PAGE_SIZE) pages (up to
> PIPE_PREALLOC_MAX (8)) pages outside the mutex when total_len >=
> PAGE_SIZE, using alloc_pages_bulk().  (Under memcg, alloc_pages_bulk()
> with __GFP_ACCOUNT might return less pages than requested, but, this is
> still a win, given some pages allocation is moved outside of the
> lock).
> 
> Pass the array into anon_pipe_get_page(), which now consumes from
> tmp_page[] first, then from the prealloc array, and only as a last
> resort falls back to alloc_page() under the mutex (reached only for
> writes larger than 8 pages, where the prealloc cap is exhausted).
> 
> Doing this in one bulk call before the lock keeps the fast path's
> mutex held for a single, write-bounded critical section -- no extra
> mutex_unlock/_lock cycles -- so it avoids the per-page lock-handoff
> overhead that a per-page drop-and-retake design would introduce, while
> still moving the typical multi-page allocation off the critical
> section. Unused prealloc pages are pushed to the pipe's tmp_page[]
> cache (or freed) before unlock, so a subsequent write to the same
> pipe gets a hot cached page rather than paying for an alloc again.
> 
> Sub-PAGE_SIZE writes are unchanged: the merge path handles them
> without ever needing a fresh page, so it is not worth speculatively
> allocating for them.
> 
> This can improve the pipe throughput up to 48% and reduce the latency in
> 33%.


This change is quite a hammer.

Do you have a real workload where there are multiple processes issuing
large ops to the same pipe on both reader and writer side?

The only workload I'm aware of sharing the pipe between more than just
one reader and one writer is anything with make, but that is mostly
operating on 1-byte ops.

I'm asking because this would be better implemented taking into account
how many pages are hanging out in the tmp_page array. With only one
writer and one reader there is no concern someone will steal the pages.

Preferably the tmp_page array would grow to accomodate more memory, but
admittedly this runs into a problem where a pipe can linger unused
indefinitely while hoarding several pages. As in, some shrinking
mechanism would be needed so that's probably a no-go and preallocation
is the way to go for now.

Even then, this can definitely be integrated much better instead of
being open-coded.

> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  fs/pipe.c | 40 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 38 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..7a1517d15107a 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,7 +111,11 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>  	pipe_lock(pipe2);
>  }
>  
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +				       struct page **prealloc,
> +				       unsigned int *prealloc_n)
>  {
>  	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>  		if (pipe->tmp_page[i]) {
> @@ -121,6 +125,14 @@ static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
>  		}
>  	}
>  
> +	if (*prealloc_n) {
> +		unsigned int idx = --(*prealloc_n);

This is some nasty golfing.

Instead, you could have something like:
struct tmp_page_prealloc {
	struct page *pages;
	unsigned int count;
};


then...

> +		struct page *page = prealloc[idx];
> +
> +		prealloc[idx] = NULL;
> +		return page;
> +	}
> +
>  	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>  
> @@ -438,6 +450,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t chars;
>  	bool was_empty = false;
>  	bool wake_next_writer = false;
> +	struct page *prealloc[PIPE_PREALLOC_MAX] = { NULL };
> +	unsigned int prealloc_n = 0;
>  

	struct tmp_page_prealloc tpp = {};

>  	/*
>  	 * Reject writing to watch queue pipes before the point where we lock
> @@ -455,6 +469,26 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(total_len == 0))
>  		return 0;
>  
> +	/*
> +	 * Bulk pre-allocate up to PIPE_PREALLOC_MAX pages outside pipe->mutex
> +	 * for writes that span at least one full page. alloc_page() with
> +	 * GFP_HIGHUSER may sleep doing reclaim and runs memcg charging, so
> +	 * doing it under the mutex extends the critical section and stalls
> +	 * the reader. The merge path handles sub-PAGE_SIZE writes without
> +	 * needing a fresh page; for writes larger than PIPE_PREALLOC_MAX
> +	 * pages, anon_pipe_get_page() falls back to a single alloc_page()
> +	 * under the mutex for the remainder. Unused prealloc pages are
> +	 * returned to the pipe's tmp_page[] cache (or freed) before unlock.
> +	 */
> +	if (total_len >= PAGE_SIZE) {
> +		unsigned int want = min_t(unsigned int,
> +					  DIV_ROUND_UP(total_len, PAGE_SIZE),
> +					  PIPE_PREALLOC_MAX);
> +
> +		prealloc_n = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
> +					      want, prealloc);
> +	}
> +

This should move to a dedicated routine, perhaps:
	anon_get_page_prealloc(pipe, &tmp_page_prealloc);

then it can also easily decide how much to prealloc based on existing
state

>  	mutex_lock(&pipe->mutex);
>  
>  	if (!pipe->readers) {
> @@ -512,7 +546,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  			struct page *page;
>  			int copied;
>  
> -			page = anon_pipe_get_page(pipe);
> +			page = anon_pipe_get_page(pipe, prealloc, &prealloc_n);
>  			if (unlikely(!page)) {
>  				if (!ret)
>  					ret = -ENOMEM;
> @@ -576,6 +610,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		wake_next_writer = true;
>  	}
>  out:
> +	while (prealloc_n)
> +		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);

this results in put_page() calls under the mutex, still extending the hold time

you could split this into 2 ops, for example:
+	anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
  	if (pipe_is_full(pipe))
  		wake_next_writer = false;
  	mutex_unlock(&pipe->mutex);
+	anon_pipe_free_pages(pipe, &tmp_page_prealloc);

All that aside, I have no idea about performance impact on arm64, but
there is *definitely* some throughput lost based on the fact that memory
copy is restarted per page. I presume arm64 has an equivalent of amd64's
SMAP which again will not be free and is paid for every time as well.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-15 20:18   ` Mateusz Guzik
@ 2026-05-20 17:09     ` Breno Leitao
  2026-05-21 12:58       ` Mateusz Guzik
  0 siblings, 1 reply; 9+ messages in thread
From: Breno Leitao @ 2026-05-20 17:09 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

Hello Mateusz,

Thanks for the careful review. Let me address the points inline,
with data I collected on three Meta production hosts: two aarch64 NVIDIA Grace
and one x86_64 Bergamo / EPYC 9D64, using bpftrace and perf lock contention.

On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> This change is quite a hammer.

Are you seeing any downside/trade-off with this current approach?

>
> Do you have a real workload where there are multiple processes issuing
> large ops to the same pipe on both reader and writer side?
>
> The only workload I'm aware of sharing the pipe between more than just
> one reader and one writer is anything with make, but that is mostly
> operating on 1-byte ops.

You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):

    size                writes      %
    <128 B              293602    98.0    (merge path, no alloc)
    128 B - 4 KB          1108     0.4
    4 - 8 KB              2149     0.72   <-- patch start helps from here
    8 - 16 KB                3
    16 - 32 KB               0
    32 - 64 KB               5
    64 - 128 KB              2            <-- up to 72KB writes

So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
contention count is a Meta fleet workload spraying sub-128-byte
metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
just hits the mutex very often.

But the writers that *do* take the alloc-under-mutex path are
real. Lock-wait data confirms it shows up:

  perf lock contention -ab (30s, anon_pipe_* only):

                                contentions  total wait  max wait
    arm64 (NVIDIA Grace)
      anon_pipe_write                 7660    4.39 ms     50 us
      anon_pipe_read                     7   55.78 us     17 us
    x86_64 (Bergamo)
      anon_pipe_write                 3154    4.53 ms     79 us
      anon_pipe_read                   817    1.72 ms     82 us

The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
the symptom the patch targets: a multi-page writer holding
pipe->mutex across alloc_page() stalls the concurrent reader on
the same mutex.

  bpftrace, mutex_lock duration inside anon_pipe_write, 300s:

                  Grace-A  Grace-B  Bergamo
    [4K,  8K)      1090     2799      980
    [8K, 16K)      1785     1353      953
    [16K,32K)       755      623      530
    [32K,64K)       153      172      184
    [64K,128K)       13        8       59
    [128K,256K)      -         -        4
    [256K,512K)      -         -        1    <-- ~500us blocked

So the patch's value is shortening the long critical sections that
the 99% of small writers are blocked on -- the contention everyone
sees drops because the few multi-page writers do their alloc work
outside the lock.

>= 64KB, which are not common but existent.

> I'm asking because this would be better implemented taking into account
> how many pages are hanging out in the tmp_page array. With only one
> writer and one reader there is no concern someone will steal the pages.
>
> Preferably the tmp_page array would grow to accomodate more memory, but
> admittedly this runs into a problem where a pipe can linger unused
> indefinitely while hoarding several pages. As in, some shrinking
> mechanism would be needed so that's probably a no-go and preallocation
> is the way to go for now.
>
> Even then, this can definitely be integrated much better instead of
> being open-coded.

> This is some nasty golfing.
>
> Instead, you could have something like:
> struct tmp_page_prealloc {
> 	struct page *pages;
> 	unsigned int count;
> };

Agreed -- v2 will wrap the array+counter in
struct tmp_page_prealloc and drop the open-coded indexing.

> 	struct tmp_page_prealloc tpp = {};
> [...]
> This should move to a dedicated routine, perhaps:
> 	anon_get_page_prealloc(pipe, &tmp_page_prealloc);
>
> then it can also easily decide how much to prealloc based on existing
> state

Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
keep the policy in one place.

> > +	while (prealloc_n)
> > +		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
>
> this results in put_page() calls under the mutex, still extending the hold time
>
> you could split this into 2 ops, for example:
> +	anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
>   	if (pipe_is_full(pipe))
>   		wake_next_writer = false;
>   	mutex_unlock(&pipe->mutex);
> +	anon_pipe_free_pages(pipe, &tmp_page_prealloc);

Well spotted!

> All that aside, I have no idea about performance impact on arm64, but
> there is *definitely* some throughput lost based on the fact that memory
> copy is restarted per page. I presume arm64 has an equivalent of amd64's
> SMAP which again will not be free and is paid for every time as well.

Can you say a bit more about what you'd want to see here? I want to make sure
I'm reading the suggestion the same way you are.

My understanding: anon_pipe_write loops once per page calling
copy_page_from_iter(), and each call re-enters the iov_iter
machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).

So an N-page write pays the user-access bracket N times instead of
once, on top of resetting the per-page memcpy and losing the
hardware prefetch streak.

The way I see it, is opening a single user_read_access_begin() region around
the page loop and use unsafe_copy_from_user() inside, with a labeled
fault-fallback path that closes the region and finishes via the slow path.

Is this what you meant?

Really thanks for your review,
--breno

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-20 17:09     ` Breno Leitao
@ 2026-05-21 12:58       ` Mateusz Guzik
  2026-05-21 15:38         ` Breno Leitao
  0 siblings, 1 reply; 9+ messages in thread
From: Mateusz Guzik @ 2026-05-21 12:58 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

On Wed, May 20, 2026 at 7:09 PM Breno Leitao <leitao@debian.org> wrote:
> On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> > This change is quite a hammer.
>
> Are you seeing any downside/trade-off with this current approach?
>

Well there is a lot to say about the current code to begin with, I'm
going to refrain from writing long rants here as that's of no use in
this thread.

The idea behind the patch has a lot of merit and I'm definitely not
protesting it or denying there is a problem.
> >
> > Do you have a real workload where there are multiple processes issuing
> > large ops to the same pipe on both reader and writer side?
> >
> > The only workload I'm aware of sharing the pipe between more than just
> > one reader and one writer is anything with make, but that is mostly
> > operating on 1-byte ops.
>
> You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
> anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):
>
>     size                writes      %
>     <128 B              293602    98.0    (merge path, no alloc)
>     128 B - 4 KB          1108     0.4
>     4 - 8 KB              2149     0.72   <-- patch start helps from here
>     8 - 16 KB                3
>     16 - 32 KB               0
>     32 - 64 KB               5
>     64 - 128 KB              2            <-- up to 72KB writes
>
> So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
> contention count is a Meta fleet workload spraying sub-128-byte
> metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
> just hits the mutex very often.
>
> But the writers that *do* take the alloc-under-mutex path are
> real. Lock-wait data confirms it shows up:
>
>   perf lock contention -ab (30s, anon_pipe_* only):
>
>                                 contentions  total wait  max wait
>     arm64 (NVIDIA Grace)
>       anon_pipe_write                 7660    4.39 ms     50 us
>       anon_pipe_read                     7   55.78 us     17 us
>     x86_64 (Bergamo)
>       anon_pipe_write                 3154    4.53 ms     79 us
>       anon_pipe_read                   817    1.72 ms     82 us
>
> The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
> the symptom the patch targets: a multi-page writer holding
> pipe->mutex across alloc_page() stalls the concurrent reader on
> the same mutex.
>
>   bpftrace, mutex_lock duration inside anon_pipe_write, 300s:
>
>                   Grace-A  Grace-B  Bergamo
>     [4K,  8K)      1090     2799      980
>     [8K, 16K)      1785     1353      953
>     [16K,32K)       755      623      530
>     [32K,64K)       153      172      184
>     [64K,128K)       13        8       59
>     [128K,256K)      -         -        4
>     [256K,512K)      -         -        1    <-- ~500us blocked
>
> So the patch's value is shortening the long critical sections that
> the 99% of small writers are blocked on -- the contention everyone
> sees drops because the few multi-page writers do their alloc work
> outside the lock.
>
> >= 64KB, which are not common but existent.
>

Ye this makes sense to me.

> > This should move to a dedicated routine, perhaps:
> >       anon_get_page_prealloc(pipe, &tmp_page_prealloc);
> >
> > then it can also easily decide how much to prealloc based on existing
> > state
>
> Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
> top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
> keep the policy in one place.
>

Note I lost the 'pipe' word. This definitely needs to be anon_pipe_-prefixed.

> > All that aside, I have no idea about performance impact on arm64, but
> > there is *definitely* some throughput lost based on the fact that memory
> > copy is restarted per page. I presume arm64 has an equivalent of amd64's
> > SMAP which again will not be free and is paid for every time as well.
>
> Can you say a bit more about what you'd want to see here? I want to make sure
> I'm reading the suggestion the same way you are.
>
> My understanding: anon_pipe_write loops once per page calling
> copy_page_from_iter(), and each call re-enters the iov_iter
> machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).

I was told arm64 has an explicit instruction set to deal with user
memory, so it might be this is implemented differently there and
fixing the problem might happen to not alter perf on that arch. It
will definitely help on amd64.

> So an N-page write pays the user-access bracket N times instead of
> once, on top of resetting the per-page memcpy and losing the
> hardware prefetch streak.
>

This bit is unavoidable with the current implementation as backing
pages are not guaranteed to be even virtually contiguous.

The extra SMAP trips however can be avoided.

> The way I see it, is opening a single user_read_access_begin() region around
> the page loop and use unsafe_copy_from_user() inside, with a labeled
> fault-fallback path that closes the region and finishes via the slow path.
>
> Is this what you meant?
>

Yes indeed. This is however optional and even if implemented it should
be a separate patch.

As a nit you could sprinkle some predicts in a separate commit,  for
example "if (!pipe->readers) {" definitely warrants an unlikely().

The real deal is the following: the wakeup policy is to *always*
wakeup *all* readers if any data is available and *all* writers if any
space is available. In case of heavily shared usage this leads to the
thundering herd problem (and it is causing performance issues with the
BSD and GNU makes). While this cannot be changed by default, I was
thinking about an opt-in fcntl which changes to policy to only wake up
one waiter.

In case of a microbenchmark like the one your provided here such a
policy might end up reducing throughput as there will be more time
spent off cpu by the benchmark. However, in case of real workloads
which presumably have stuff to do, it should be a win.

That is to say, if you have 1 reader and 1 writer, this is irrelevant.
But if you have multiple and they block on the pipe, there might be a
win hiding here for your actual workload by utilizing the above.
Something to consider.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-21 12:58       ` Mateusz Guzik
@ 2026-05-21 15:38         ` Breno Leitao
  2026-05-21 17:26           ` Mateusz Guzik
  0 siblings, 1 reply; 9+ messages in thread
From: Breno Leitao @ 2026-05-21 15:38 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

On Thu, May 21, 2026 at 02:58:48PM +0200, Mateusz Guzik wrote:
> On Wed, May 20, 2026 at 7:09 PM Breno Leitao <leitao@debian.org> wrote:
> > On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> > > This change is quite a hammer.
> >
> > Are you seeing any downside/trade-off with this current approach?
> >
> 
> Well there is a lot to say about the current code to begin with, I'm
> going to refrain from writing long rants here as that's of no use in
> this thread.
> 
> The idea behind the patch has a lot of merit and I'm definitely not
> protesting it or denying there is a problem.

Ack, thanks for the support.

> > Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
> > top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
> > keep the policy in one place.
> >
> 
> Note I lost the 'pipe' word. This definitely needs to be anon_pipe_-prefixed.

Ack!

> As a nit you could sprinkle some predicts in a separate commit,  for
> example "if (!pipe->readers) {" definitely warrants an unlikely().
> 
> The real deal is the following: the wakeup policy is to *always*
> wakeup *all* readers if any data is available and *all* writers if any
> space is available. In case of heavily shared usage this leads to the
> thundering herd problem (and it is causing performance issues with the
> BSD and GNU makes). While this cannot be changed by default, I was
> thinking about an opt-in fcntl which changes to policy to only wake up
> one waiter.

Happy to look at it as a follow-up series once this one lands.

Also, I am not reusing the leftover pages back to pipe->tmp_page, given
I really don't expect to have much leftovers, but, I am happy to do so,
in case you think this would be useful.

Anyway, my reading of the discussion above changed my approach to the
following patch. Is it any better?

Thanks for you suggestions so far,
--breno

commit 2394068de8ee0ab9e582ce9776990ed695397553
Author: Breno Leitao <leitao@debian.org>
Date:   Fri May 15 01:35:13 2026 -0700

    fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
    
    anon_pipe_write() takes pipe->mutex and then, from the per-iteration
    anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
    __GFP_ACCOUNT) once per page while still holding it. That allocation
    can sleep doing direct reclaim and/or runs memcg charging, which extends
    the critical section and stalls a concurrent reader on the very same
    mutex.
    
    For writes that span more than one full page, bulk pre-allocate up to
    PIPE_PREALLOC_MAX (8) pages with alloc_pages_bulk() before taking
    pipe->mutex. anon_pipe_get_page() consumes from pipe->tmp_page[] first,
    then from the prealloc array, and only falls back to alloc_page() under
    the mutex for writes larger than PIPE_PREALLOC_MAX pages (rare) or for
    single-page writes. Sub-PAGE_SIZE writes are unchanged: the merge path
    handles them without needing a fresh page.
    
    The prealloc array, its count, and the helpers operating on it live in
    struct anon_pipe_prealloc:
    
      - anon_pipe_get_page_prealloc() issues a single alloc_pages_bulk()
        call before the mutex is taken.
      - anon_pipe_free_pages() runs after mutex_unlock() and put_page()s
        any leftover prealloc pages, keeping put_page() off the critical
        section. Leftovers happen when the per-pipe tmp_page[] cache or the
        merge path covered some of what we preallocated, or when the write
        exits early.
    
    This can improve the pipe throughput up to 48% and reduce the latency in
    33%.
    
    Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/fs/pipe.c b/fs/pipe.c
index 9841648c9cf3e..a6441c44f1e3f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -111,7 +111,32 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }
 
-static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+#define PIPE_PREALLOC_MAX 8
+
+struct anon_pipe_prealloc {
+	struct page *pages[PIPE_PREALLOC_MAX];
+	unsigned int count;
+};
+
+/*
+ * Bulk pre-allocate pages outside pipe->mutex. Callers must gate on
+ * total_len > PAGE_SIZE; single-page writes are handled by the in-lock
+ * alloc_page() fallback. Any pages not consumed by the write loop are
+ * freed by anon_pipe_free_pages() after the mutex is dropped.
+ */
+static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
+					size_t total_len)
+{
+	unsigned int want = min_t(unsigned int,
+				  DIV_ROUND_UP(total_len, PAGE_SIZE),
+				  PIPE_PREALLOC_MAX);
+
+	prealloc->count = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
+					   want, prealloc->pages);
+}
+
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
+				       struct anon_pipe_prealloc *prealloc)
 {
 	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
 		if (pipe->tmp_page[i]) {
@@ -121,6 +146,15 @@ static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
 		}
 	}
 
+	if (prealloc->count) {
+		unsigned int idx = prealloc->count - 1;
+		struct page *page = prealloc->pages[idx];
+
+		prealloc->pages[idx] = NULL;
+		prealloc->count = idx;
+		return page;
+	}
+
 	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
 }
 
@@ -139,6 +173,17 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
 	put_page(page);
 }
 
+/* Called after pipe->mutex is dropped, to keep put_page() off the critical section. */
+static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
+{
+	while (prealloc->count) {
+		unsigned int idx = prealloc->count - 1;
+
+		put_page(prealloc->pages[idx]);
+		prealloc->count = idx;
+	}
+}
+
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -438,6 +483,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t chars;
 	bool was_empty = false;
 	bool wake_next_writer = false;
+	struct anon_pipe_prealloc prealloc = {};
 
 	/*
 	 * Reject writing to watch queue pipes before the point where we lock
@@ -455,6 +501,18 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(total_len == 0))
 		return 0;
 
+	/*
+	 * Bulk pre-allocate pages outside pipe->mutex for writes that span more
+	 * than one full page. alloc_page() with GFP_HIGHUSER may sleep doing
+	 * reclaim and runs memcg charging, so doing it under the mutex
+	 * extends the critical section and stalls the reader. The merge path
+	 * handles sub-PAGE_SIZE writes without a fresh page; single-page and
+	 * >PIPE_PREALLOC_MAX-page writes fall back to alloc_page() under the
+	 * mutex for the remainder.
+	 */
+	if (total_len > PAGE_SIZE)
+		anon_pipe_get_page_prealloc(&prealloc, total_len);
+
 	mutex_lock(&pipe->mutex);
 
 	if (!pipe->readers) {
@@ -512,7 +570,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			struct page *page;
 			int copied;
 
-			page = anon_pipe_get_page(pipe);
+			page = anon_pipe_get_page(pipe, &prealloc);
 			if (unlikely(!page)) {
 				if (!ret)
 					ret = -ENOMEM;
@@ -579,6 +637,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	if (pipe_is_full(pipe))
 		wake_next_writer = false;
 	mutex_unlock(&pipe->mutex);
+	anon_pipe_free_pages(&prealloc);
 
 	/*
 	 * If we do do a wakeup event, we do a 'sync' wakeup, because we

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-21 15:38         ` Breno Leitao
@ 2026-05-21 17:26           ` Mateusz Guzik
  2026-05-22 12:10             ` Breno Leitao
  0 siblings, 1 reply; 9+ messages in thread
From: Mateusz Guzik @ 2026-05-21 17:26 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

On Thu, May 21, 2026 at 5:38 PM Breno Leitao <leitao@debian.org> wrote:
>
> Also, I am not reusing the leftover pages back to pipe->tmp_page, given
> I really don't expect to have much leftovers, but, I am happy to do so,
> in case you think this would be useful.
>

Since on the allocation side you do take from tmp_page, a big write
ends up depleting it, afterwards you free up whatever preallocated
pages.

If this is followed by a small write, that small write now has to
suffer allocation *with* the lock held, bringing the original problem.
Easiest way out is to take from preallocated pages before looking at
tmp_page.

I do suspect the best way forward in the long run is repopulate the
tmp_page array if you have pages to do so.

> commit 2394068de8ee0ab9e582ce9776990ed695397553
> Author: Breno Leitao <leitao@debian.org>
> Date:   Fri May 15 01:35:13 2026 -0700
>
>     fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
>
>     anon_pipe_write() takes pipe->mutex and then, from the per-iteration
>     anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER |
>     __GFP_ACCOUNT) once per page while still holding it. That allocation
>     can sleep doing direct reclaim and/or runs memcg charging, which extends
>     the critical section and stalls a concurrent reader on the very same
>     mutex.
>
>     For writes that span more than one full page, bulk pre-allocate up to
>     PIPE_PREALLOC_MAX (8) pages with alloc_pages_bulk() before taking
>     pipe->mutex. anon_pipe_get_page() consumes from pipe->tmp_page[] first,
>     then from the prealloc array, and only falls back to alloc_page() under
>     the mutex for writes larger than PIPE_PREALLOC_MAX pages (rare) or for
>     single-page writes. Sub-PAGE_SIZE writes are unchanged: the merge path
>     handles them without needing a fresh page.
>
>     The prealloc array, its count, and the helpers operating on it live in
>     struct anon_pipe_prealloc:
>
>       - anon_pipe_get_page_prealloc() issues a single alloc_pages_bulk()
>         call before the mutex is taken.
>       - anon_pipe_free_pages() runs after mutex_unlock() and put_page()s
>         any leftover prealloc pages, keeping put_page() off the critical
>         section. Leftovers happen when the per-pipe tmp_page[] cache or the
>         merge path covered some of what we preallocated, or when the write
>         exits early.
>
>     This can improve the pipe throughput up to 48% and reduce the latency in
>     33%.
>
>     Suggested-by: Mateusz Guzik <mjguzik@gmail.com>

I have not suggested preallocation, so this credit is not warranted.

>     Signed-off-by: Breno Leitao <leitao@debian.org>
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 9841648c9cf3e..a6441c44f1e3f 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -111,7 +111,32 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>         pipe_lock(pipe2);
>  }
>
> -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
> +#define PIPE_PREALLOC_MAX 8
> +
> +struct anon_pipe_prealloc {
> +       struct page *pages[PIPE_PREALLOC_MAX];
> +       unsigned int count;
> +};
> +
> +/*
> + * Bulk pre-allocate pages outside pipe->mutex. Callers must gate on
> + * total_len > PAGE_SIZE; single-page writes are handled by the in-lock
> + * alloc_page() fallback. Any pages not consumed by the write loop are
> + * freed by anon_pipe_free_pages() after the mutex is dropped.
> + */
> +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
> +                                       size_t total_len)
> +{
> +       unsigned int want = min_t(unsigned int,
> +                                 DIV_ROUND_UP(total_len, PAGE_SIZE),
> +                                 PIPE_PREALLOC_MAX);
> +
> +       prealloc->count = alloc_pages_bulk(GFP_HIGHUSER | __GFP_ACCOUNT,
> +                                          want, prealloc->pages);
> +}
> +
> +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe,
> +                                      struct anon_pipe_prealloc *prealloc)
>  {
>         for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
>                 if (pipe->tmp_page[i]) {
> @@ -121,6 +146,15 @@ static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
>                 }
>         }
>
> +       if (prealloc->count) {
> +               unsigned int idx = prealloc->count - 1;
> +               struct page *page = prealloc->pages[idx];
> +
> +               prealloc->pages[idx] = NULL;
> +               prealloc->count = idx;
> +               return page;
> +       }

I would drop the idx var.

*maybe* swap macro can help here

> +
>         return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
>  }
>
> @@ -139,6 +173,17 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe,
>         put_page(page);
>  }
>
> +/* Called after pipe->mutex is dropped, to keep put_page() off the critical section. */
> +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc)
> +{
> +       while (prealloc->count) {
> +               unsigned int idx = prealloc->count - 1;
> +
> +               put_page(prealloc->pages[idx]);
> +               prealloc->count = idx;
> +       }
> +}
> +
>  static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
>                                   struct pipe_buffer *buf)
>  {
> @@ -438,6 +483,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         ssize_t chars;
>         bool was_empty = false;
>         bool wake_next_writer = false;
> +       struct anon_pipe_prealloc prealloc = {};
>

I forgot this bit is going to be a problem with gcc. I verified it
emits rep stosq in place, which is going to completely unnecessarily
slow things down especially on older uarchs. This is a known bug with
gcc doing terrible job optimizing this.

The problem will be avoided by merely initializing the count to 0.
which looks kind of ugly if done here, but see below.

>         /*
>          * Reject writing to watch queue pipes before the point where we lock
> @@ -455,6 +501,18 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         if (unlikely(total_len == 0))
>                 return 0;
>
> +       /*
> +        * Bulk pre-allocate pages outside pipe->mutex for writes that span more
> +        * than one full page. alloc_page() with GFP_HIGHUSER may sleep doing
> +        * reclaim and runs memcg charging, so doing it under the mutex
> +        * extends the critical section and stalls the reader. The merge path
> +        * handles sub-PAGE_SIZE writes without a fresh page; single-page and
> +        * >PIPE_PREALLOC_MAX-page writes fall back to alloc_page() under the
> +        * mutex for the remainder.
> +        */
> +       if (total_len > PAGE_SIZE)
> +               anon_pipe_get_page_prealloc(&prealloc, total_len);
> +

I don't think this comment belongs here, it should move above the
prealloc routine.

How about this: anon_pipe_get_page_prealloc gets called
unconditionally and expects an uninitialized prealloc struct. For
total_len > PAGE_SIZE you roll with the current code. Otherwise you
just set ->count to 0, which prevents the ->pages array from being
looked at. You can even pre-set to 0 on entry, just don't memset the
entire obj.

>         mutex_lock(&pipe->mutex);
>
>         if (!pipe->readers) {
> @@ -512,7 +570,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                         struct page *page;
>                         int copied;
>
> -                       page = anon_pipe_get_page(pipe);
> +                       page = anon_pipe_get_page(pipe, &prealloc);
>                         if (unlikely(!page)) {
>                                 if (!ret)
>                                         ret = -ENOMEM;
> @@ -579,6 +637,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         if (pipe_is_full(pipe))
>                 wake_next_writer = false;
>         mutex_unlock(&pipe->mutex);
> +       anon_pipe_free_pages(&prealloc);
>
>         /*
>          * If we do do a wakeup event, we do a 'sync' wakeup, because we

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
  2026-05-21 17:26           ` Mateusz Guzik
@ 2026-05-22 12:10             ` Breno Leitao
  0 siblings, 0 replies; 9+ messages in thread
From: Breno Leitao @ 2026-05-22 12:10 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
	linux-fsdevel, linux-kernel, linux-kselftest, usama.arif,
	kernel-team

On Thu, May 21, 2026 at 07:26:48PM +0200, Mateusz Guzik wrote:
> On Thu, May 21, 2026 at 5:38 PM Breno Leitao <leitao@debian.org> wrote:
>
> I do suspect the best way forward in the long run is repopulate the
> tmp_page array if you have pages to do so.

Ack!

> > +       struct anon_pipe_prealloc prealloc = {};
> >
> 
> I forgot this bit is going to be a problem with gcc. I verified it
> emits rep stosq in place, which is going to completely unnecessarily
> slow things down especially on older uarchs. This is a known bug with
> gcc doing terrible job optimizing this.
> 
> The problem will be avoided by merely initializing the count to 0.
> which looks kind of ugly if done here, but see below.

Ack, I can do it inside anon_pipe_get_page_prealloc()

	static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc,
						size_t total_len)
	{
		prealloc->count = 0;
		if (total_len <= PAGE_SIZE)
			return;
			...
	}

> > +       /*
> > +        * Bulk pre-allocate pages outside pipe->mutex for writes that span more
> > +        * than one full page. alloc_page() with GFP_HIGHUSER may sleep doing
> > +        * reclaim and runs memcg charging, so doing it under the mutex
> > +        * extends the critical section and stalls the reader. The merge path
> > +        * handles sub-PAGE_SIZE writes without a fresh page; single-page and
> > +        * >PIPE_PREALLOC_MAX-page writes fall back to alloc_page() under the
> > +        * mutex for the remainder.
> > +        */
> > +       if (total_len > PAGE_SIZE)
> > +               anon_pipe_get_page_prealloc(&prealloc, total_len);
> > +
> 
> I don't think this comment belongs here, it should move above the
> prealloc routine.
> 
> How about this: anon_pipe_get_page_prealloc gets called
> unconditionally and expects an uninitialized prealloc struct. For
> total_len > PAGE_SIZE you roll with the current code. Otherwise you
> just set ->count to 0, which prevents the ->pages array from being
> looked at. You can even pre-set to 0 on entry, just don't memset the
> entire obj.

Thanks, let me respin a v2.
--breno

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark
  2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
  2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
@ 2026-05-15 10:28 ` Breno Leitao
  1 sibling, 0 replies; 9+ messages in thread
From: Breno Leitao @ 2026-05-15 10:28 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan
  Cc: linux-fsdevel, linux-kernel, linux-kselftest, Breno Leitao,
	usama.arif, kernel-team

Add a small selftest that stresses pipe->mutex contention by spawning N
writer threads that hammer a single pipe with multi-page writes, plus M
reader threads that drain. Each writer records its own write() latency
samples into a log2-bucketed histogram; main aggregates and prints
total writes, throughput, average and percentile (p50/p99/p99.9)
latencies, and the maximum observed latency.

This was used to validate "fs/pipe: bulk pre-allocate pages outside
pipe->mutex in anon_pipe_write". By default it sweeps over
writers={1,2,5} x readers={1,5,10} using 64KB writes for 3s on a 1 MB
pipe (~27s total); -w/-r switch to a single configuration and -s/-d/-p
tune msgsize/duration/pipe size. Output is one-line-per-metric with a
"---" separator between configurations so two runs (e.g. baseline vs
patched) can be diffed directly.

Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
--vm-method all) for the duration of the run, so alloc_page() in
anon_pipe_write() routinely hits direct reclaim. The flag fails
fast if stress-ng is not on $PATH.

The program exits 0 on a clean run and reports its results to stdout,
so it integrates with the kselftest framework via TEST_GEN_PROGS.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/pipe/.gitignore   |   1 +
 tools/testing/selftests/pipe/Makefile     |   9 +
 tools/testing/selftests/pipe/pipe_bench.c | 351 ++++++++++++++++++++++++++++++
 4 files changed, 362 insertions(+)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e416..bcd9db9d292ca 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -91,6 +91,7 @@ TARGETS += pcie_bwctrl
 TARGETS += perf_events
 TARGETS += pidfd
 TARGETS += pid_namespace
+TARGETS += pipe
 TARGETS += power_supply
 TARGETS += powerpc
 TARGETS += prctl
diff --git a/tools/testing/selftests/pipe/.gitignore b/tools/testing/selftests/pipe/.gitignore
new file mode 100644
index 0000000000000..20b549361a152
--- /dev/null
+++ b/tools/testing/selftests/pipe/.gitignore
@@ -0,0 +1 @@
+pipe_bench
diff --git a/tools/testing/selftests/pipe/Makefile b/tools/testing/selftests/pipe/Makefile
new file mode 100644
index 0000000000000..1810c680117b3
--- /dev/null
+++ b/tools/testing/selftests/pipe/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+# Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+
+CFLAGS += -O2 -Wall -Wextra -pthread
+
+TEST_GEN_PROGS := pipe_bench
+
+include ../lib.mk
diff --git a/tools/testing/selftests/pipe/pipe_bench.c b/tools/testing/selftests/pipe/pipe_bench.c
new file mode 100644
index 0000000000000..4b4ee6c8c0ced
--- /dev/null
+++ b/tools/testing/selftests/pipe/pipe_bench.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pipe_bench - exercise pipe->mutex contention under concurrent writers.
+ *
+ * N writer threads hammer a single pipe with multi-page writes; M reader
+ * threads drain it. Each writer records its own write() latency histogram.
+ * Multi-page writes (msgsize >= PAGE_SIZE) force the loop in
+ * anon_pipe_write() to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) under
+ * pipe->mutex, which is the critical section the patch shrinks.
+ *
+ * By default the benchmark sweeps writers in {1, 2, 5} x readers in
+ * {1, 5, 10} and prints one block per configuration so two runs (e.g.
+ * baseline vs patched) can be diffed directly. Pass -w and -r to run a
+ * single configuration instead. Pass --memory-pressure to spawn stress-ng
+ * alongside the sweep so the per-page alloc_page() path under pipe->mutex
+ * has to dip into reclaim.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
+#define HIST_BUCKETS	32
+
+static size_t g_msgsize = 16 * 4096;
+static int g_duration = 3;
+static int g_pipe_size = 1024 * 1024;
+static int g_memory_pressure;
+
+static atomic_int g_stop;
+static int g_pipe[2];
+
+struct wstats {
+	uint64_t writes;
+	uint64_t bytes;
+	uint64_t lat_sum_ns;
+	uint64_t lat_max_ns;
+	uint64_t lat_hist[HIST_BUCKETS];
+};
+
+static inline uint64_t now_ns(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static inline int log2_bucket(uint64_t v)
+{
+	int b = 0;
+
+	if (!v)
+		return 0;
+	while (v >>= 1)
+		b++;
+	return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1;
+}
+
+static void *writer(void *arg)
+{
+	struct wstats *s = arg;
+	char *buf = aligned_alloc(4096, g_msgsize);
+
+	if (!buf)
+		return NULL;
+	memset(buf, 0xAA, g_msgsize);
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		uint64_t t0 = now_ns();
+		ssize_t n = write(g_pipe[1], buf, g_msgsize);
+		uint64_t dt = now_ns() - t0;
+
+		if (n > 0) {
+			s->writes++;
+			s->bytes += n;
+			s->lat_sum_ns += dt;
+			if (dt > s->lat_max_ns)
+				s->lat_max_ns = dt;
+			s->lat_hist[log2_bucket(dt)]++;
+		} else if (n < 0 && errno == EPIPE) {
+			break;
+		}
+	}
+	free(buf);
+	return NULL;
+}
+
+static void *reader(void *arg)
+{
+	char *buf = aligned_alloc(4096, g_msgsize);
+
+	(void)arg;
+	if (!buf)
+		return NULL;
+	/*
+	 * Drain until EOF (write end closed by main). g_stop is not checked
+	 * here on purpose: writers may be blocked in write() with the pipe
+	 * full when g_stop is set, so the reader must keep draining until
+	 * main closes the write end.
+	 */
+	for (;;) {
+		ssize_t n = read(g_pipe[0], buf, g_msgsize);
+
+		if (n <= 0)
+			break;
+	}
+	free(buf);
+	return NULL;
+}
+
+static void summarize(struct wstats *all, int nw, int nr)
+{
+	uint64_t total_writes = 0, total_bytes = 0, total_lat = 0;
+	uint64_t max_lat = 0;
+	uint64_t agg[HIST_BUCKETS] = {0};
+	uint64_t p50_target, p99_target, p999_target;
+	uint64_t cum = 0, p50 = 0, p99 = 0, p999 = 0;
+	uint64_t avg_ns;
+	double sec;
+
+	for (int i = 0; i < nw; i++) {
+		total_writes += all[i].writes;
+		total_bytes += all[i].bytes;
+		total_lat += all[i].lat_sum_ns;
+		if (all[i].lat_max_ns > max_lat)
+			max_lat = all[i].lat_max_ns;
+		for (int b = 0; b < HIST_BUCKETS; b++)
+			agg[b] += all[i].lat_hist[b];
+	}
+
+	p50_target = total_writes * 50 / 100;
+	p99_target = total_writes * 99 / 100;
+	p999_target = total_writes * 999 / 1000;
+
+	for (int b = 0; b < HIST_BUCKETS; b++) {
+		cum += agg[b];
+		if (!p50 && cum >= p50_target)
+			p50 = 1ULL << b;
+		if (!p99 && cum >= p99_target)
+			p99 = 1ULL << b;
+		if (!p999 && cum >= p999_target)
+			p999 = 1ULL << b;
+	}
+
+	sec = g_duration;
+	avg_ns = total_writes ? total_lat / total_writes : 0;
+
+	printf("config: writers=%d readers=%d msgsize=%zu duration=%d pipe_size=%d memory_pressure=%s\n",
+	       nw, nr, g_msgsize, g_duration, g_pipe_size,
+	       g_memory_pressure ? "yes" : "no");
+	printf("writes: total=%llu rate=%.0f/s\n",
+	       (unsigned long long)total_writes, total_writes / sec);
+	printf("throughput_MBps: %.2f\n",
+	       (total_bytes / sec) / (1024.0 * 1024.0));
+	printf("lat_avg_ns: %llu\n", (unsigned long long)avg_ns);
+	printf("lat_p50_ns_upper: %llu\n", (unsigned long long)p50);
+	printf("lat_p99_ns_upper: %llu\n", (unsigned long long)p99);
+	printf("lat_p999_ns_upper: %llu\n", (unsigned long long)p999);
+	printf("lat_max_ns: %llu\n", (unsigned long long)max_lat);
+}
+
+static pid_t spawn_stress_ng(void)
+{
+	pid_t pid = fork();
+
+	if (pid < 0) {
+		perror("fork");
+		return -1;
+	}
+	if (pid == 0) {
+		execlp("stress-ng", "stress-ng",
+		       "--vm", "4", "--vm-bytes", "80%",
+		       "--vm-method", "all",
+		       (char *)NULL);
+		fprintf(stderr, "exec stress-ng failed: %s\n",
+			strerror(errno));
+		_exit(127);
+	}
+	/* Give stress-ng a moment to map its VM regions before measuring. */
+	sleep(1);
+	return pid;
+}
+
+static void kill_stress_ng(pid_t pid)
+{
+	int status;
+
+	if (pid <= 0)
+		return;
+	kill(pid, SIGTERM);
+	for (int i = 0; i < 20; i++) {
+		if (waitpid(pid, &status, WNOHANG) > 0)
+			return;
+		usleep(100 * 1000);
+	}
+	kill(pid, SIGKILL);
+	waitpid(pid, &status, 0);
+}
+
+static int run_one(int nw, int nr)
+{
+	pthread_t *wt, *rt;
+	struct wstats *ws;
+
+	atomic_store(&g_stop, 0);
+
+	if (pipe(g_pipe) < 0) {
+		perror("pipe");
+		return -1;
+	}
+	if (fcntl(g_pipe[1], F_SETPIPE_SZ, g_pipe_size) < 0)
+		perror("F_SETPIPE_SZ (continuing)");
+
+	wt = calloc(nw, sizeof(*wt));
+	rt = calloc(nr, sizeof(*rt));
+	ws = calloc(nw, sizeof(*ws));
+
+	if (!wt || !rt || !ws) {
+		fprintf(stderr, "alloc failed\n");
+		free(wt);
+		free(rt);
+		free(ws);
+		close(g_pipe[0]);
+		close(g_pipe[1]);
+		return -1;
+	}
+
+	for (int i = 0; i < nr; i++)
+		pthread_create(&rt[i], NULL, reader, NULL);
+	for (int i = 0; i < nw; i++)
+		pthread_create(&wt[i], NULL, writer, &ws[i]);
+
+	sleep(g_duration);
+	atomic_store(&g_stop, 1);
+
+	/*
+	 * Close write end first so any writer blocked in write() gets EPIPE
+	 * and exits, and so the readers see EOF after draining.
+	 */
+	close(g_pipe[1]);
+	for (int i = 0; i < nw; i++)
+		pthread_join(wt[i], NULL);
+	for (int i = 0; i < nr; i++)
+		pthread_join(rt[i], NULL);
+	close(g_pipe[0]);
+
+	summarize(ws, nw, nr);
+	fflush(stdout);
+
+	free(wt);
+	free(rt);
+	free(ws);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	static const int writers_sweep[] = {1, 2, 5};
+	static const int readers_sweep[] = {1, 5, 10};
+	static const struct option long_opts[] = {
+		{"memory-pressure", no_argument, NULL, 'M'},
+		{0, 0, 0, 0},
+	};
+	int writers_override = 0, readers_override = 0;
+	pid_t stress_pid = -1;
+	int rc = 0, opt;
+
+	while ((opt = getopt_long(argc, argv, "w:r:s:d:p:",
+				  long_opts, NULL)) != -1) {
+		switch (opt) {
+		case 'w':
+			writers_override = atoi(optarg);
+			break;
+		case 'r':
+			readers_override = atoi(optarg);
+			break;
+		case 's':
+			g_msgsize = atol(optarg);
+			break;
+		case 'd':
+			g_duration = atoi(optarg);
+			break;
+		case 'p':
+			g_pipe_size = atoi(optarg);
+			break;
+		case 'M':
+			g_memory_pressure = 1;
+			break;
+		default:
+			fprintf(stderr,
+				"usage: %s [-w writers] [-r readers] [-s msgsize] [-d secs] [-p pipe_size] [--memory-pressure]\n"
+				"  default: sweep writers={1,2,5} x readers={1,5,10}\n"
+				"  --memory-pressure: spawn stress-ng (--vm 4 --vm-bytes 80%% --vm-method all) for the run\n",
+				argv[0]);
+			return 1;
+		}
+	}
+
+	signal(SIGPIPE, SIG_IGN);
+	setvbuf(stdout, NULL, _IOLBF, 0);
+	setvbuf(stderr, NULL, _IOLBF, 0);
+
+	fprintf(stderr, "pid=%d\n", getpid());
+	fflush(stderr);
+
+	if (g_memory_pressure) {
+		stress_pid = spawn_stress_ng();
+		if (stress_pid < 0) {
+			fprintf(stderr,
+				"memory_pressure requested but stress-ng could not be spawned\n");
+			return 1;
+		}
+	}
+
+	if (writers_override > 0 || readers_override > 0) {
+		int nw = writers_override > 0 ? writers_override : 1;
+		int nr = readers_override > 0 ? readers_override : 1;
+
+		rc = run_one(nw, nr) < 0 ? 1 : 0;
+		goto out;
+	}
+
+	for (size_t i = 0; i < ARRAY_SIZE(writers_sweep); i++) {
+		for (size_t j = 0; j < ARRAY_SIZE(readers_sweep); j++) {
+			printf("---\n");
+			if (run_one(writers_sweep[i], readers_sweep[j]) < 0) {
+				rc = 1;
+				goto out;
+			}
+		}
+	}
+out:
+	kill_stress_ng(stress_pid);
+	return rc;
+}

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-05-22 12:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-15 20:18   ` Mateusz Guzik
2026-05-20 17:09     ` Breno Leitao
2026-05-21 12:58       ` Mateusz Guzik
2026-05-21 15:38         ` Breno Leitao
2026-05-21 17:26           ` Mateusz Guzik
2026-05-22 12:10             ` Breno Leitao
2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.