* [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock
@ 2026-05-22 16:44 Breno Leitao
2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan,
Mateusz Guzik
Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt,
jlayton, oleg, axboe, Breno Leitao, kernel-team
While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep doing
direct reclaim and runs memcg charging, which extends the critical
section and stalls any concurrent reader on the same mutex.
This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.
alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.
I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.
Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.
== No memory pressure (10s per config) ==
Throughput in MB/s (baseline -> patched, delta):
writers readers=1 readers=5 readers=10
1 1119 -> 1354 (+21%) 1132 -> 1195 (+6%) 1060 -> 1240 (+17%)
2 1162 -> 1487 (+28%) 1034 -> 1285 (+24%) 1069 -> 1213 (+14%)
5 1152 -> 1357 (+18%) 1021 -> 1164 (+14%) 997 -> 1239 (+24%)
Avg write latency in ns (baseline -> patched, delta):
writers readers=1 readers=5 readers=10
1 55786 -> 46103 (-17%) 55164 -> 52260 (-5%) 58906 -> 50370 (-14%)
2 107546 -> 84011 (-22%) 120837 -> 97206 (-20%) 116860 -> 103036 (-12%)
5 271293 -> 230170 (-15%) 306089 -> 268429 (-12%) 313300 -> 252232 (-19%)
Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.
== Under memory pressure (--memory-pressure, 6s per config) ==
stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.
Throughput in MB/s (baseline -> patched, delta):
writers readers=1 readers=5 readers=10
1 1088 -> 1438 (+32%) 996 -> 1477 (+48%) 989 -> 1194 (+21%)
2 1076 -> 1378 (+28%) 1007 -> 1269 (+26%) 1018 -> 1234 (+21%)
5 1052 -> 1311 (+25%) 986 -> 1225 (+24%) 972 -> 1249 (+29%)
Avg write latency in ns (baseline -> patched, delta):
writers readers=1 readers=5 readers=10
1 57397 -> 43406 (-24%) 62690 -> 42272 (-33%) 63136 -> 52272 (-17%)
2 116121 -> 90700 (-22%) 124098 -> 98481 (-21%) 122754 -> 101217 (-18%)
5 297122 -> 238322 (-20%) 316836 -> 255095 (-19%) 321496 -> 250189 (-22%)
Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.
That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.
Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Switch the prealloc path from alloc_pages_bulk_mempolicy() to a
per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop.
- Split the prealloc work out of anon_pipe_write() into dedicated
helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop /
anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in
struct anon_pipe_prealloc, so the write path stays readable.
- Recycle leftover prealloc pages into pipe->tmp_page[] before
unlocking
- Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org
To: Alexander Viro <viro@zeniv.linux.org.uk>
To: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Breno Leitao (2):
fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
selftests/pipe: add pipe_bench microbenchmark
fs/pipe.c | 105 ++++-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/pipe/.gitignore | 1 +
tools/testing/selftests/pipe/Makefile | 9 +
tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++
5 files changed, 729 insertions(+), 3 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-fix_pipe-c91677c187e7
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 13+ messages in thread* [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao @ 2026-05-22 16:44 ` Breno Leitao 2026-05-22 16:51 ` Jeff Layton ` (2 more replies) 2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao 2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton 2 siblings, 3 replies; 13+ messages in thread From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, oleg, axboe, Breno Leitao, kernel-team anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole thing") and then, from the per-iteration anon_pipe_get_page() helper, used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page while still holding it. That allocation can sleep doing direct reclaim and/or runs memcg charging, which extends the critical section and stalls a concurrent reader on the very same mutex. Just pre-alloc the required pages before the lock in an array and just pop them inside the lock. This can improve the pipe throughput up to 48% and reduce the latency in 33%, easily seen when there is memory pressure and direct reclaim. Signed-off-by: Breno Leitao <leitao@debian.org> --- fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 102 insertions(+), 3 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 9841648c9cf3e..cff255217bbfe 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, pipe_lock(pipe2); } -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe) +#define PIPE_PREALLOC_MAX 8 + +struct anon_pipe_prealloc { + struct page *pages[PIPE_PREALLOC_MAX]; + unsigned int count; +}; + +/* + * Pre-allocate pages outside pipe->mutex for multi-page writes. + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg + * charging; doing it under the mutex stalls a concurrent reader. + * + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc: + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy + * honoured for every page; the per-call overhead is small compared to the + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the + * in-lock alloc_page() fallback in anon_pipe_get_page(). + */ +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc, + size_t total_len) +{ + unsigned int want, i; + struct page *page; + + prealloc->count = 0; + if (total_len <= PAGE_SIZE) + return; + + want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE), + PIPE_PREALLOC_MAX); + + for (i = 0; i < want; i++) { + page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); + if (!page) + break; + prealloc->pages[prealloc->count++] = page; + } +} + +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc) +{ + if (!prealloc->count) + return NULL; + + prealloc->count--; + + return prealloc->pages[prealloc->count]; +} + +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe, + struct anon_pipe_prealloc *prealloc) { + struct page *page; + + /* Drain prealloc first to keep tmp_page[] hot for later small writes. */ + page = anon_pipe_prealloc_pop(prealloc); + if (page) + return page; + for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) { if (pipe->tmp_page[i]) { - struct page *page = pipe->tmp_page[i]; + page = pipe->tmp_page[i]; pipe->tmp_page[i] = NULL; return page; } } + /* FWIW: This is called with pipe->mutex held */ return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); } @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe, put_page(page); } +/* + * Stash leftover prealloc pages in tmp_page[] so the next write to this + * pipe gets a hot page without entering the allocator. + */ +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe, + struct anon_pipe_prealloc *prealloc) +{ + int i, idx; + + if (!prealloc->count) + return; + + for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) { + if (pipe->tmp_page[i]) + continue; + if (!prealloc->count) + return; + idx = --prealloc->count; + pipe->tmp_page[i] = prealloc->pages[idx]; + prealloc->pages[idx] = NULL; + } +} + +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */ +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc) +{ + while (prealloc->count) { + prealloc->count--; + put_page(prealloc->pages[prealloc->count]); + } +} + static void anon_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) { struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; + struct anon_pipe_prealloc prealloc; unsigned int head; ssize_t ret = 0; size_t total_len = iov_iter_count(from); @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) if (unlikely(total_len == 0)) return 0; + anon_pipe_get_page_prealloc(&prealloc, total_len); + mutex_lock(&pipe->mutex); if (!pipe->readers) { @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) struct page *page; int copied; - page = anon_pipe_get_page(pipe); + page = anon_pipe_get_page(pipe, &prealloc); if (unlikely(!page)) { if (!ret) ret = -ENOMEM; @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) * after waiting we need to re-check whether the pipe * become empty while we dropped the lock. */ + anon_pipe_refill_tmp_pages(pipe, &prealloc); mutex_unlock(&pipe->mutex); + anon_pipe_free_pages(&prealloc); if (was_empty) wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) wake_next_writer = true; } out: + anon_pipe_refill_tmp_pages(pipe, &prealloc); if (pipe_is_full(pipe)) wake_next_writer = false; mutex_unlock(&pipe->mutex); + anon_pipe_free_pages(&prealloc); /* * If we do do a wakeup event, we do a 'sync' wakeup, because we -- 2.54.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao @ 2026-05-22 16:51 ` Jeff Layton 2026-05-22 17:55 ` Breno Leitao 2026-05-22 19:48 ` Mateusz Guzik 2026-05-23 16:26 ` Oleg Nesterov 2 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2026-05-22 16:51 UTC (permalink / raw) To: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg, axboe, kernel-team On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote: > anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole > thing") and then, from the per-iteration anon_pipe_get_page() helper, > used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page > while still holding it. > > That allocation can sleep doing direct reclaim and/or runs memcg > charging, which extends the critical section and stalls a concurrent > reader on the very same mutex. > > Just pre-alloc the required pages before the lock in an array and just pop > them inside the lock. > > This can improve the pipe throughput up to 48% and reduce the > latency in 33%, easily seen when there is memory pressure and direct > reclaim. > > Signed-off-by: Breno Leitao <leitao@debian.org> > --- > fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 102 insertions(+), 3 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 9841648c9cf3e..cff255217bbfe 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, > pipe_lock(pipe2); > } > > -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe) > +#define PIPE_PREALLOC_MAX 8 > + > +struct anon_pipe_prealloc { > + struct page *pages[PIPE_PREALLOC_MAX]; > + unsigned int count; > +}; > + > +/* > + * Pre-allocate pages outside pipe->mutex for multi-page writes. > + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg > + * charging; doing it under the mutex stalls a concurrent reader. > + * > + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses > + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc: > + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single > + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy > + * honoured for every page; the per-call overhead is small compared to the > + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the > + * in-lock alloc_page() fallback in anon_pipe_get_page(). > + */ > +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc, > + size_t total_len) > +{ > + unsigned int want, i; > + struct page *page; > + > + prealloc->count = 0; > + if (total_len <= PAGE_SIZE) > + return; > + > + want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE), > + PIPE_PREALLOC_MAX); > + > + for (i = 0; i < want; i++) { > + page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); > + if (!page) > + break; > + prealloc->pages[prealloc->count++] = page; > + } I believe alloc_pages_bulk() is supposed to be more efficient when allocating several pages at once like this. > +} > + > +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc) > +{ > + if (!prealloc->count) > + return NULL; > + > + prealloc->count--; > + > + return prealloc->pages[prealloc->count]; > +} > + > +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe, > + struct anon_pipe_prealloc *prealloc) > { > + struct page *page; > + > + /* Drain prealloc first to keep tmp_page[] hot for later small writes. */ > + page = anon_pipe_prealloc_pop(prealloc); > + if (page) > + return page; > + > for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) { > if (pipe->tmp_page[i]) { > - struct page *page = pipe->tmp_page[i]; > + page = pipe->tmp_page[i]; > pipe->tmp_page[i] = NULL; > return page; > } > } > > + /* FWIW: This is called with pipe->mutex held */ > return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); > } > > @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe, > put_page(page); > } > > +/* > + * Stash leftover prealloc pages in tmp_page[] so the next write to this > + * pipe gets a hot page without entering the allocator. > + */ > +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe, > + struct anon_pipe_prealloc *prealloc) > +{ > + int i, idx; > + > + if (!prealloc->count) > + return; > + > + for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) { > + if (pipe->tmp_page[i]) > + continue; > + if (!prealloc->count) > + return; > + idx = --prealloc->count; > + pipe->tmp_page[i] = prealloc->pages[idx]; > + prealloc->pages[idx] = NULL; > + } > +} > + > +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */ > +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc) > +{ > + while (prealloc->count) { > + prealloc->count--; > + put_page(prealloc->pages[prealloc->count]); > + } > +} > + > static void anon_pipe_buf_release(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > { > struct file *filp = iocb->ki_filp; > struct pipe_inode_info *pipe = filp->private_data; > + struct anon_pipe_prealloc prealloc; > unsigned int head; > ssize_t ret = 0; > size_t total_len = iov_iter_count(from); > @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > if (unlikely(total_len == 0)) > return 0; > > + anon_pipe_get_page_prealloc(&prealloc, total_len); > + > mutex_lock(&pipe->mutex); > > if (!pipe->readers) { > @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > struct page *page; > int copied; > > - page = anon_pipe_get_page(pipe); > + page = anon_pipe_get_page(pipe, &prealloc); > if (unlikely(!page)) { > if (!ret) > ret = -ENOMEM; > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > * after waiting we need to re-check whether the pipe > * become empty while we dropped the lock. > */ > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > mutex_unlock(&pipe->mutex); > + anon_pipe_free_pages(&prealloc); > if (was_empty) > wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > wake_next_writer = true; > } > out: > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > if (pipe_is_full(pipe)) > wake_next_writer = false; > mutex_unlock(&pipe->mutex); > + anon_pipe_free_pages(&prealloc); > > /* > * If we do do a wakeup event, we do a 'sync' wakeup, because we -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-22 16:51 ` Jeff Layton @ 2026-05-22 17:55 ` Breno Leitao 0 siblings, 0 replies; 13+ messages in thread From: Breno Leitao @ 2026-05-22 17:55 UTC (permalink / raw) To: Jeff Layton Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg, axboe, kernel-team On Fri, May 22, 2026 at 12:51:40PM -0400, Jeff Layton wrote: > On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote: > > +/* > > + * Pre-allocate pages outside pipe->mutex for multi-page writes. > > + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg > > + * charging; doing it under the mutex stalls a concurrent reader. > > + * > > + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses > > + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc: > > + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single > > + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy > > + * honoured for every page; the per-call overhead is small compared to the > > + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the > > + * in-lock alloc_page() fallback in anon_pipe_get_page(). > > + */ > > +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc, > > + size_t total_len) > > +{ > > + unsigned int want, i; > > + struct page *page; > > + > > + prealloc->count = 0; > > + if (total_len <= PAGE_SIZE) > > + return; > > + > > + want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE), > > + PIPE_PREALLOC_MAX); > > + > > + for (i = 0; i < want; i++) { > > + page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); > > + if (!page) > > + break; > > + prealloc->pages[prealloc->count++] = page; > > + } > > I believe alloc_pages_bulk() is supposed to be more efficient when > allocating several pages at once like this. Thanks Jeff. I understand bulk allocators refuses __GFP_ACCOUNT under memcg. (That is the reason I've CCed Shakeel in this patchset). Anyway, reading the code it shows me: unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, nodemask_t *nodemask, int nr_pages, struct page **page_array) { ... /* Bulk allocator does not support memcg accounting. */ if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT)) goto failed; Thanks for the review, --breno ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao 2026-05-22 16:51 ` Jeff Layton @ 2026-05-22 19:48 ` Mateusz Guzik 2026-05-23 16:26 ` Oleg Nesterov 2 siblings, 0 replies; 13+ messages in thread From: Mateusz Guzik @ 2026-05-22 19:48 UTC (permalink / raw) To: Breno Leitao Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, oleg, axboe, kernel-team On Fri, May 22, 2026 at 6:44 PM Breno Leitao <leitao@debian.org> wrote: > > anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole > thing") and then, from the per-iteration anon_pipe_get_page() helper, > used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page > while still holding it. > > That allocation can sleep doing direct reclaim and/or runs memcg > charging, which extends the critical section and stalls a concurrent > reader on the very same mutex. > > Just pre-alloc the required pages before the lock in an array and just pop > them inside the lock. > > This can improve the pipe throughput up to 48% and reduce the > latency in 33%, easily seen when there is memory pressure and direct > reclaim. > > Signed-off-by: Breno Leitao <leitao@debian.org> > --- > fs/pipe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 102 insertions(+), 3 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 9841648c9cf3e..cff255217bbfe 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -111,16 +111,76 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, > pipe_lock(pipe2); > } > > -static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe) > +#define PIPE_PREALLOC_MAX 8 > + > +struct anon_pipe_prealloc { > + struct page *pages[PIPE_PREALLOC_MAX]; > + unsigned int count; > +}; > + > +/* > + * Pre-allocate pages outside pipe->mutex for multi-page writes. > + * alloc_page() with GFP_HIGHUSER can sleep in reclaim and runs memcg > + * charging; doing it under the mutex stalls a concurrent reader. > + * > + * Loop alloc_page() instead of alloc_pages_bulk_*(): the bulk path refuses > + * __GFP_ACCOUNT under memcg (see commit 8dcb3060d81d "memcg: page_alloc: > + * skip bulk allocator for __GFP_ACCOUNT") and silently degrades to a single > + * page. A per-page loop keeps memcg accounting and the task NUMA mempolicy > + * honoured for every page; the per-call overhead is small compared to the > + * pipe->mutex hold-time being shrunk. Any shortfall is covered by the > + * in-lock alloc_page() fallback in anon_pipe_get_page(). > + */ > +static void anon_pipe_get_page_prealloc(struct anon_pipe_prealloc *prealloc, > + size_t total_len) > +{ > + unsigned int want, i; > + struct page *page; > + > + prealloc->count = 0; > + if (total_len <= PAGE_SIZE) > + return; > + > + want = min_t(unsigned int, DIV_ROUND_UP(total_len, PAGE_SIZE), > + PIPE_PREALLOC_MAX); > + > + for (i = 0; i < want; i++) { > + page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); > + if (!page) > + break; > + prealloc->pages[prealloc->count++] = page; > + } > +} > + > +static struct page *anon_pipe_prealloc_pop(struct anon_pipe_prealloc *prealloc) > +{ > + if (!prealloc->count) > + return NULL; > + > + prealloc->count--; > + > + return prealloc->pages[prealloc->count]; > +} > + > +static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe, > + struct anon_pipe_prealloc *prealloc) > { > + struct page *page; > + > + /* Drain prealloc first to keep tmp_page[] hot for later small writes. */ > + page = anon_pipe_prealloc_pop(prealloc); > + if (page) > + return page; > + > for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) { > if (pipe->tmp_page[i]) { > - struct page *page = pipe->tmp_page[i]; > + page = pipe->tmp_page[i]; > pipe->tmp_page[i] = NULL; > return page; > } > } > > + /* FWIW: This is called with pipe->mutex held */ > return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT); > } > > @@ -139,6 +199,38 @@ static void anon_pipe_put_page(struct pipe_inode_info *pipe, > put_page(page); > } > > +/* > + * Stash leftover prealloc pages in tmp_page[] so the next write to this > + * pipe gets a hot page without entering the allocator. > + */ > +static void anon_pipe_refill_tmp_pages(struct pipe_inode_info *pipe, > + struct anon_pipe_prealloc *prealloc) > +{ > + int i, idx; > + > + if (!prealloc->count) > + return; > + > + for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) { > + if (pipe->tmp_page[i]) > + continue; > + if (!prealloc->count) > + return; > + idx = --prealloc->count; > + pipe->tmp_page[i] = prealloc->pages[idx]; > + prealloc->pages[idx] = NULL; > + } > +} > + > +/* Runs after mutex_unlock() to keep put_page() out of the critical section. */ > +static void anon_pipe_free_pages(struct anon_pipe_prealloc *prealloc) > +{ > + while (prealloc->count) { > + prealloc->count--; > + put_page(prealloc->pages[prealloc->count]); > + } > +} > + > static void anon_pipe_buf_release(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > { > struct file *filp = iocb->ki_filp; > struct pipe_inode_info *pipe = filp->private_data; > + struct anon_pipe_prealloc prealloc; > unsigned int head; > ssize_t ret = 0; > size_t total_len = iov_iter_count(from); > @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > if (unlikely(total_len == 0)) > return 0; > > + anon_pipe_get_page_prealloc(&prealloc, total_len); > + > mutex_lock(&pipe->mutex); > > if (!pipe->readers) { > @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > struct page *page; > int copied; > > - page = anon_pipe_get_page(pipe); > + page = anon_pipe_get_page(pipe, &prealloc); > if (unlikely(!page)) { > if (!ret) > ret = -ENOMEM; > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > * after waiting we need to re-check whether the pipe > * become empty while we dropped the lock. > */ > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > mutex_unlock(&pipe->mutex); > + anon_pipe_free_pages(&prealloc); > if (was_empty) > wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > @@ -576,9 +673,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > wake_next_writer = true; > } > out: > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > if (pipe_is_full(pipe)) > wake_next_writer = false; maybe the new call would be nicer moved below pipe_is_full, but that's just cosmetics > mutex_unlock(&pipe->mutex); > + anon_pipe_free_pages(&prealloc); > > /* > * If we do do a wakeup event, we do a 'sync' wakeup, because we > > -- > 2.54.0 > Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao 2026-05-22 16:51 ` Jeff Layton 2026-05-22 19:48 ` Mateusz Guzik @ 2026-05-23 16:26 ` Oleg Nesterov 2026-05-24 14:30 ` Breno Leitao 2 siblings, 1 reply; 13+ messages in thread From: Oleg Nesterov @ 2026-05-23 16:26 UTC (permalink / raw) To: Breno Leitao Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team To be honest, I didn't read this patch carefully, but let me ask anyway. On 05/22, Breno Leitao wrote: > > @@ -432,6 +524,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > { > struct file *filp = iocb->ki_filp; > struct pipe_inode_info *pipe = filp->private_data; > + struct anon_pipe_prealloc prealloc; > unsigned int head; > ssize_t ret = 0; > size_t total_len = iov_iter_count(from); > @@ -455,6 +548,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > if (unlikely(total_len == 0)) > return 0; > > + anon_pipe_get_page_prealloc(&prealloc, total_len); > + > mutex_lock(&pipe->mutex); > > if (!pipe->readers) { > @@ -512,7 +607,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > struct page *page; > int copied; > > - page = anon_pipe_get_page(pipe); > + page = anon_pipe_get_page(pipe, &prealloc); > if (unlikely(!page)) { > if (!ret) > ret = -ENOMEM; > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > * after waiting we need to re-check whether the pipe > * become empty while we dropped the lock. > */ > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > mutex_unlock(&pipe->mutex); > + anon_pipe_free_pages(&prealloc); Do we really want to call anon_pipe_free_pages() at this point? The main loop will continue when pipe_writable() becomes true again... Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-23 16:26 ` Oleg Nesterov @ 2026-05-24 14:30 ` Breno Leitao 2026-05-24 14:48 ` Mateusz Guzik 0 siblings, 1 reply; 13+ messages in thread From: Breno Leitao @ 2026-05-24 14:30 UTC (permalink / raw) To: Oleg Nesterov Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote: > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > > * after waiting we need to re-check whether the pipe > > * become empty while we dropped the lock. > > */ > > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > > mutex_unlock(&pipe->mutex); > > + anon_pipe_free_pages(&prealloc); > > Do we really want to call anon_pipe_free_pages() at this point? > > The main loop will continue when pipe_writable() becomes true again... I went back and forth on this. The argument for freeing was that wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or stopped reader), and holding up the prealloc pages felt antisocial -- especially under the memory pressure this series targets, where those pages are more useful on the freelists than parked on a sleeping task. On the other side, on wakeup the loop is guaranteed to want pages again, and re-entering the allocator under the mutex puts us back in the contended state the patch removes. For any write() large enough to wait mid-syscall (which is the workload patch 2/2 measures), keeping them strictly wins on throughput / p99. I think your read is the better one -- the throughput case is the whole point of the series, and the memory-hoarding concern is bounded (8 pages per blocked writer, freed in the out: path on syscall exit). Will drop the free_pages() call from the wait branch in v3 and keep only the one after the out: label, together with the nit from Mateusz. Thanks for the review, --breno ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-24 14:30 ` Breno Leitao @ 2026-05-24 14:48 ` Mateusz Guzik 2026-05-24 16:47 ` Breno Leitao 0 siblings, 1 reply; 13+ messages in thread From: Mateusz Guzik @ 2026-05-24 14:48 UTC (permalink / raw) To: Breno Leitao Cc: Oleg Nesterov, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote: > > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote: > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > > > * after waiting we need to re-check whether the pipe > > > * become empty while we dropped the lock. > > > */ > > > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > > > mutex_unlock(&pipe->mutex); > > > + anon_pipe_free_pages(&prealloc); > > > > Do we really want to call anon_pipe_free_pages() at this point? > > > > The main loop will continue when pipe_writable() becomes true again... > > I went back and forth on this. The argument for freeing was that > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or > stopped reader), and holding up the prealloc pages felt antisocial -- > especially under the memory pressure this series targets, where those pages are > more useful on the freelists than parked on a sleeping task. > > On the other side, on wakeup the loop is guaranteed to want pages again, and > re-entering the allocator under the mutex puts us back in the contended state > the patch removes. For any write() large enough to wait mid-syscall (which is > the workload patch 2/2 measures), keeping them strictly wins on throughput / > p99. > You can still prealloc after wakeup for whatever reminder you got though, but I can agree dropping these frees is a sensible way out and it is easier and I'm not going to insist on one way or the other. However, I think it would be prudent to add a tracepoint to some machines on your fleet to find out how often they allocate pages under the mutex (and for what i/o size). Initial alloc for the first write < PAGE_SIZE definitely happens under the mutex which is probably not a problem, but for anything later? The tracepoint can have a trivial indicator if this is the first write if that matters. One can speculate all day, nothing beats checking what happens. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write 2026-05-24 14:48 ` Mateusz Guzik @ 2026-05-24 16:47 ` Breno Leitao 0 siblings, 0 replies; 13+ messages in thread From: Breno Leitao @ 2026-05-24 16:47 UTC (permalink / raw) To: Mateusz Guzik Cc: Oleg Nesterov, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team Hello Mateusz, On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote: > On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote: > > > > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote: > > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > > > > * after waiting we need to re-check whether the pipe > > > > * become empty while we dropped the lock. > > > > */ > > > > + anon_pipe_refill_tmp_pages(pipe, &prealloc); > > > > mutex_unlock(&pipe->mutex); > > > > + anon_pipe_free_pages(&prealloc); > > > > > > Do we really want to call anon_pipe_free_pages() at this point? > > > > > > The main loop will continue when pipe_writable() becomes true again... > > > > I went back and forth on this. The argument for freeing was that > > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or > > stopped reader), and holding up the prealloc pages felt antisocial -- > > especially under the memory pressure this series targets, where those pages are > > more useful on the freelists than parked on a sleeping task. > > > > On the other side, on wakeup the loop is guaranteed to want pages again, and > > re-entering the allocator under the mutex puts us back in the contended state > > the patch removes. For any write() large enough to wait mid-syscall (which is > > the workload patch 2/2 measures), keeping them strictly wins on throughput / > > p99. > > > > You can still prealloc after wakeup for whatever reminder you got > though, but I can agree dropping these frees is a sensible way out and > it is easier and I'm not going to insist on one way or the other. Ack. I've sent a v3 with anon_pipe_free_pages() and anon_pipe_refill_tmp_pages() dropped. > However, I think it would be prudent to add a tracepoint to some > machines on your fleet to find out how often they allocate pages under > the mutex (and for what i/o size). Initial alloc for the first write < > PAGE_SIZE definitely happens under the mutex which is probably not a > problem, but for anything later? > The tracepoint can have a trivial > indicator if this is the first write if that matters. One can Isn't this what I've reported earlier? https://lore.kernel.org/all/ag3Ty3T24wjn1aFw@gmail.com/ Adding a tracepoint is harder than usual, given kernel rollout takes ages. But I hacked a bpftrace script and ran it on a random sample of fleet hosts (5 min each). As reported earlier, multi-page pipe writes are not uncommon: on one host a single long-running process produced 196,476 under-mutex alloc_page() calls in 5 minutes, with allocs-per-write distributions reaching 16+ -- exactly the pattern this patch removes. Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips anyway, so the win is concentrated on the workloads that actually need it. None of the allocs hit reclaim during the trace I ran, but I would expect direct reclaim to happen with the lock held. Thanks for the review and direction, --breno ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark 2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao 2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao @ 2026-05-22 16:44 ` Breno Leitao 2026-05-23 16:43 ` Oleg Nesterov 2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton 2 siblings, 1 reply; 13+ messages in thread From: Breno Leitao @ 2026-05-22 16:44 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, oleg, axboe, Breno Leitao, kernel-team Add a small selftest that stresses pipe->mutex contention by spawning N writer threads that hammer a single pipe with multi-page writes, plus M reader threads that drain. Each writer records its own write() latency samples into a log2-bucketed histogram; main aggregates and prints total writes, throughput, average and percentile (p50/p99) latencies, and the maximum observed latency. Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80% --vm-method all) for the duration of the run, so alloc_page() in anon_pipe_write() routinely hits direct reclaim. The flag fails fast if stress-ng is not on $PATH. Program print something like the following, for different writes, readers, msgsizes and memory pressure: config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576 memory_pressure=[no|yes] writes: total=54451 rate=18150/s throughput_MBps: 1134.40 lat_avg_ns: 275355 lat_p50_ns_upper: 262143 lat_p99_ns_upper: 1048575 lat_max_ns: 2145633 Signed-off-by: Breno Leitao <leitao@debian.org> --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/pipe/.gitignore | 1 + tools/testing/selftests/pipe/Makefile | 9 + tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++ 4 files changed, 627 insertions(+) diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 6e59b8f63e416..bcd9db9d292ca 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -91,6 +91,7 @@ TARGETS += pcie_bwctrl TARGETS += perf_events TARGETS += pidfd TARGETS += pid_namespace +TARGETS += pipe TARGETS += power_supply TARGETS += powerpc TARGETS += prctl diff --git a/tools/testing/selftests/pipe/.gitignore b/tools/testing/selftests/pipe/.gitignore new file mode 100644 index 0000000000000..20b549361a152 --- /dev/null +++ b/tools/testing/selftests/pipe/.gitignore @@ -0,0 +1 @@ +pipe_bench diff --git a/tools/testing/selftests/pipe/Makefile b/tools/testing/selftests/pipe/Makefile new file mode 100644 index 0000000000000..1810c680117b3 --- /dev/null +++ b/tools/testing/selftests/pipe/Makefile @@ -0,0 +1,9 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (c) 2026 Meta Platforms, Inc. and affiliates +# Copyright (c) 2026 Breno Leitao <leitao@debian.org> + +CFLAGS += -O2 -Wall -Wextra -pthread + +TEST_GEN_PROGS := pipe_bench + +include ../lib.mk diff --git a/tools/testing/selftests/pipe/pipe_bench.c b/tools/testing/selftests/pipe/pipe_bench.c new file mode 100644 index 0000000000000..7e96429b8fb4d --- /dev/null +++ b/tools/testing/selftests/pipe/pipe_bench.c @@ -0,0 +1,616 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * pipe_bench - exercise concurrent pipe operation + * + * N writer threads hammer a single pipe with multi-page writes; M reader + * threads drain it. Each writer records its own write() latency histogram. + * Multi-page writes (msgsize >= PAGE_SIZE) force the loop in + * anon_pipe_write() to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) under + * pipe->mutex, which is the critical section the patch shrinks. + * + * By default the benchmark sweeps writers in {1, 2, 5} x readers in + * {1, 5, 10} and prints one block per configuration so two runs (e.g. + * baseline vs patched) can be diffed directly. Pass -w and -r to run a + * single configuration instead. Pass --memory-pressure to spawn stress-ng + * alongside the sweep so the per-page alloc_page() path under pipe->mutex + * has to dip into reclaim. + * + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates + * Copyright (c) 2026 Breno Leitao <leitao@debian.org> + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <fcntl.h> +#include <getopt.h> +#include <poll.h> +#include <pthread.h> +#include <signal.h> +#include <stdatomic.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/wait.h> +#include <time.h> +#include <unistd.h> + +#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0])) +#define HIST_BUCKETS 32 + +static size_t g_msgsize = 16 * 4096; +static int g_duration = 3; +static int g_pipe_size = 1024 * 1024; +static int g_memory_pressure; + +static atomic_int g_stop; +static int g_pipe[2]; + +struct wstats { + uint64_t writes; + uint64_t bytes; + uint64_t lat_sum_ns; + uint64_t lat_max_ns; + uint64_t lat_hist[HIST_BUCKETS]; + char *buf; +}; + +struct rstats { + char *buf; +}; + +struct hist_totals { + uint64_t writes; + uint64_t bytes; + uint64_t lat_sum; + uint64_t lat_max; +}; + +static inline uint64_t now_ns(void) +{ + struct timespec ts; + + clock_gettime(CLOCK_MONOTONIC, &ts); + return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec; +} + +static inline int log2_bucket(uint64_t v) +{ + int b = 0; + + if (!v) + return 0; + while (v >>= 1) + b++; + return b < HIST_BUCKETS ? b : HIST_BUCKETS - 1; +} + +static void *writer(void *arg) +{ + struct wstats *s = arg; + + while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) { + uint64_t t0 = now_ns(); + ssize_t n = write(g_pipe[1], s->buf, g_msgsize); + uint64_t dt = now_ns() - t0; + + if (n > 0) { + s->writes++; + s->bytes += (uint64_t)n; + s->lat_sum_ns += dt; + if (dt > s->lat_max_ns) + s->lat_max_ns = dt; + s->lat_hist[log2_bucket(dt)]++; + } else if (n < 0 && (errno == EPIPE || errno == EBADF)) { + break; + } + } + return NULL; +} + +static void *reader(void *arg) +{ + struct rstats *s = arg; + + /* + * Drain until EOF (write end closed by main). g_stop is not checked + * here on purpose: writers may be blocked in write() with the pipe + * full when g_stop is set, so the reader must keep draining until + * main closes the write end. + */ + for (;;) { + ssize_t n = read(g_pipe[0], s->buf, g_msgsize); + + if (n <= 0) + break; + } + return NULL; +} + +/* Sum per-writer stats and per-bucket counts into the caller's aggregates. */ +static void aggregate_wstats(struct wstats *all, int nw, + uint64_t agg[HIST_BUCKETS], + struct hist_totals *t) +{ + memset(t, 0, sizeof(*t)); + for (int i = 0; i < nw; i++) { + t->writes += all[i].writes; + t->bytes += all[i].bytes; + t->lat_sum += all[i].lat_sum_ns; + if (all[i].lat_max_ns > t->lat_max) + t->lat_max = all[i].lat_max_ns; + for (int b = 0; b < HIST_BUCKETS; b++) + agg[b] += all[i].lat_hist[b]; + } +} + +/* + * Walk @agg in order, returning the inclusive upper bound (in ns) of the + * log2 bucket where the running sum first reaches @target. + * + * A percentile is undefined with zero samples, and with very low sample + * counts integer truncation could make @target zero -- then "cum >= 0" + * would latch on the first (possibly empty) bucket. Callers must pass + * @target >= 1. + */ +static uint64_t bucket_at(const uint64_t agg[HIST_BUCKETS], uint64_t target) +{ + uint64_t cum = 0; + + for (int b = 0; b < HIST_BUCKETS; b++) { + /* HIST_BUCKETS <= 63, so (b + 1) is always a safe shift. */ + uint64_t upper = (1ULL << (b + 1)) - 1; + + cum += agg[b]; + if (cum >= target) + return upper; + } + return 0; +} + +static void compute_p50_p99(const uint64_t agg[HIST_BUCKETS], uint64_t writes, + uint64_t *p50, uint64_t *p99) +{ + uint64_t p50_target, p99_target; + + *p50 = *p99 = 0; + if (!writes) + return; + + p50_target = writes * 50 / 100; + p99_target = writes * 99 / 100; + if (!p50_target) + p50_target = 1; + if (!p99_target) + p99_target = 1; + + *p50 = bucket_at(agg, p50_target); + *p99 = bucket_at(agg, p99_target); +} + +static void print_summary(int nw, int nr, const struct hist_totals *t, + uint64_t p50, uint64_t p99) +{ + double sec = g_duration; + uint64_t avg_ns = t->writes ? t->lat_sum / t->writes : 0; + + printf("config: writers=%d readers=%d msgsize=%zu duration=%d pipe_size=%d memory_pressure=%s\n", + nw, nr, g_msgsize, g_duration, g_pipe_size, + g_memory_pressure ? "yes" : "no"); + printf("writes: total=%llu rate=%.0f/s\n", + (unsigned long long)t->writes, (double)t->writes / sec); + printf("throughput_MBps: %.2f\n", + ((double)t->bytes / sec) / (1024.0 * 1024.0)); + printf("lat_avg_ns: %llu\n", (unsigned long long)avg_ns); + printf("lat_p50_ns_upper: %llu\n", (unsigned long long)p50); + printf("lat_p99_ns_upper: %llu\n", (unsigned long long)p99); + printf("lat_max_ns: %llu\n", (unsigned long long)t->lat_max); +} + +static void summarize(struct wstats *all, int nw, int nr) +{ + uint64_t agg[HIST_BUCKETS] = {0}; + struct hist_totals t; + uint64_t p50, p99; + + aggregate_wstats(all, nw, agg, &t); + compute_p50_p99(agg, t.writes, &p50, &p99); + print_summary(nw, nr, &t, p50, p99); +} + +/* + * Child branch of fork(): restore SIGPIPE to default (parent ignores it), + * exec stress-ng, and on failure write the reason into @hs_wr before + * exiting. The parent observes EOF on hs_wr (closed via O_CLOEXEC) when + * exec succeeds. + */ +static void stress_ng_child(int hs_wr) __attribute__((noreturn)); +static void stress_ng_child(int hs_wr) +{ + char errbuf[256]; + + signal(SIGPIPE, SIG_DFL); + execlp("stress-ng", "stress-ng", + "--vm", "4", "--vm-bytes", "80%", + "--vm-method", "all", + (char *)NULL); + snprintf(errbuf, sizeof(errbuf), + "exec stress-ng failed: %s\n", strerror(errno)); + (void)!write(hs_wr, errbuf, strlen(errbuf)); + _exit(127); +} + +/* + * Read from the O_CLOEXEC handshake pipe. Anything readable means the + * child wrote an error before exec; EOF (n == 0) means the write-end + * closed because exec succeeded. Returns 0 on exec success, -1 if the + * child failed and was reaped. + */ +static int stress_ng_wait_handshake(int hs_rd, pid_t pid) +{ + struct pollfd pfd = { .fd = hs_rd, .events = POLLIN }; + char errbuf[256]; + int status; + int ret; + + ret = poll(&pfd, 1, 500); + if (ret <= 0) + return 0; + + ssize_t n = read(hs_rd, errbuf, sizeof(errbuf) - 1); + + if (n > 0) { + errbuf[n] = '\0'; + fputs(errbuf, stderr); + waitpid(pid, &status, 0); + return -1; + } + return 0; +} + +static pid_t spawn_stress_ng(void) +{ + int hs[2]; + pid_t pid; + + /* + * Handshake pipe: child writes one byte and _exit()s on exec + * failure. On exec success the O_CLOEXEC flag closes the write + * end, which the parent observes as EOF. This makes the "is + * stress-ng on $PATH?" check fail fast rather than silently. + */ + if (pipe2(hs, O_CLOEXEC) < 0) { + perror("pipe2"); + return -1; + } + + pid = fork(); + if (pid < 0) { + perror("fork"); + close(hs[0]); + close(hs[1]); + return -1; + } + if (pid == 0) { + close(hs[0]); + stress_ng_child(hs[1]); + } + + close(hs[1]); + if (stress_ng_wait_handshake(hs[0], pid) < 0) { + close(hs[0]); + return -1; + } + close(hs[0]); + + /* Give stress-ng a moment to map its VM regions before measuring. */ + sleep(1); + return pid; +} + +static void kill_stress_ng(pid_t pid) +{ + int status; + + if (pid <= 0) + return; + kill(pid, SIGTERM); + for (int i = 0; i < 20; i++) { + if (waitpid(pid, &status, WNOHANG) > 0) + return; + usleep(100 * 1000); + } + kill(pid, SIGKILL); + waitpid(pid, &status, 0); +} + +/* + * Allocate per-thread page-aligned buffers in main so a failed + * aligned_alloc() aborts the run before any thread starts. Workers used + * to allocate their own buffer and return NULL on failure, which left + * peers blocked in write()/read() with nobody to unblock them. + */ +static int alloc_thread_bufs(struct wstats *ws, int nw, + struct rstats *rs, int nr) +{ + for (int i = 0; i < nw; i++) { + ws[i].buf = aligned_alloc(4096, g_msgsize); + if (!ws[i].buf) { + fprintf(stderr, "writer %d: aligned_alloc(%zu) failed\n", + i, g_msgsize); + return -1; + } + memset(ws[i].buf, 0xAA, g_msgsize); + } + for (int i = 0; i < nr; i++) { + rs[i].buf = aligned_alloc(4096, g_msgsize); + if (!rs[i].buf) { + fprintf(stderr, "reader %d: aligned_alloc(%zu) failed\n", + i, g_msgsize); + return -1; + } + } + return 0; +} + +static void free_thread_bufs(struct wstats *ws, int nw, + struct rstats *rs, int nr) +{ + if (ws) + for (int i = 0; i < nw; i++) + free(ws[i].buf); + if (rs) + for (int i = 0; i < nr; i++) + free(rs[i].buf); +} + +static int start_readers(pthread_t *rt, struct rstats *rs, int nr, + int *created) +{ + for (int i = 0; i < nr; i++) { + int err = pthread_create(&rt[i], NULL, reader, &rs[i]); + + if (err) { + fprintf(stderr, "pthread_create reader %d: %s\n", + i, strerror(err)); + return -1; + } + (*created)++; + } + return 0; +} + +static int start_writers(pthread_t *wt, struct wstats *ws, int nw, + int *created) +{ + for (int i = 0; i < nw; i++) { + int err = pthread_create(&wt[i], NULL, writer, &ws[i]); + + if (err) { + fprintf(stderr, "pthread_create writer %d: %s\n", + i, strerror(err)); + return -1; + } + (*created)++; + } + return 0; +} + +static int open_bench_pipe(void) +{ + if (pipe(g_pipe) < 0) { + perror("pipe"); + return -1; + } + if (fcntl(g_pipe[1], F_SETPIPE_SZ, g_pipe_size) < 0) + perror("F_SETPIPE_SZ (continuing)"); + return 0; +} + +/* + * Normal termination: g_stop tells writers to leave the loop after the + * current write() returns. Closing the shared write-end fd means once + * the in-flight writes drain, readers see EOF and exit. Writers are not + * unblocked by EPIPE here -- g_pipe[0] stays open so readers can keep + * draining. + * + * Error path: some threads may have been created and others skipped, so + * writers could be blocked in write() with no reader making progress. + * Close both ends -- closing the read end is what delivers EPIPE to a + * blocked writer. + */ +static void stop_and_join(pthread_t *wt, int nw_created, + pthread_t *rt, int nr_created, int rc) +{ + atomic_store(&g_stop, 1); + close(g_pipe[1]); + if (rc < 0) + close(g_pipe[0]); + for (int i = 0; i < nw_created; i++) + pthread_join(wt[i], NULL); + for (int i = 0; i < nr_created; i++) + pthread_join(rt[i], NULL); + if (rc == 0) + close(g_pipe[0]); +} + +static int run_one(int nw, int nr) +{ + pthread_t *wt = NULL, *rt = NULL; + struct wstats *ws = NULL; + struct rstats *rs = NULL; + int nw_created = 0, nr_created = 0; + int rc = 0; + + atomic_store(&g_stop, 0); + + if (open_bench_pipe() < 0) + return -1; + + wt = calloc((size_t)nw, sizeof(*wt)); + rt = calloc((size_t)nr, sizeof(*rt)); + ws = calloc((size_t)nw, sizeof(*ws)); + rs = calloc((size_t)nr, sizeof(*rs)); + if (!wt || !rt || !ws || !rs) { + fprintf(stderr, "alloc failed\n"); + rc = -1; + goto teardown; + } + + if (alloc_thread_bufs(ws, nw, rs, nr) < 0) { + rc = -1; + goto teardown; + } + + if (start_readers(rt, rs, nr, &nr_created) < 0 || + start_writers(wt, ws, nw, &nw_created) < 0) { + rc = -1; + goto teardown; + } + + sleep((unsigned int)g_duration); + +teardown: + stop_and_join(wt, nw_created, rt, nr_created, rc); + + if (rc == 0) { + summarize(ws, nw, nr); + fflush(stdout); + } + + free_thread_bufs(ws, nw, rs, nr); + free(wt); + free(rt); + free(ws); + free(rs); + return rc; +} + +static void usage(const char *prog) +{ + fprintf(stderr, + "usage: %s [-w writers] [-r readers] [-s msgsize] [-d secs] [-p pipe_size] [--memory-pressure]\n" + " default: sweep writers={1,2,5} x readers={1,5,10}\n" + " --memory-pressure: spawn stress-ng (--vm 4 --vm-bytes 80%% --vm-method all) for the run\n", + prog); +} + +static int parse_args(int argc, char **argv, + int *writers_override, int *readers_override) +{ + static const struct option long_opts[] = { + {"memory-pressure", no_argument, NULL, 'M'}, + {0, 0, 0, 0}, + }; + int opt; + + while ((opt = getopt_long(argc, argv, "w:r:s:d:p:", + long_opts, NULL)) != -1) { + switch (opt) { + case 'w': + *writers_override = atoi(optarg); + break; + case 'r': + *readers_override = atoi(optarg); + break; + case 's': + g_msgsize = (size_t)atol(optarg); + break; + case 'd': + g_duration = atoi(optarg); + break; + case 'p': + g_pipe_size = atoi(optarg); + break; + case 'M': + g_memory_pressure = 1; + break; + default: + usage(argv[0]); + return -1; + } + } + return 0; +} + +/* + * aligned_alloc(4096, size) requires size to be a multiple of the + * alignment (C11); glibc returns NULL otherwise, which would make + * writer/reader threads silently exit and the run report zero writes. + * Validate up front instead. + */ +static int validate_args(void) +{ + if (g_msgsize == 0 || g_msgsize % 4096 != 0) { + fprintf(stderr, + "msgsize must be a positive multiple of 4096 (got %zu)\n", + g_msgsize); + return -1; + } + if (g_duration <= 0) { + fprintf(stderr, "duration must be > 0 seconds (got %d)\n", + g_duration); + return -1; + } + if (g_pipe_size <= 0) { + fprintf(stderr, "pipe_size must be > 0 bytes (got %d)\n", + g_pipe_size); + return -1; + } + return 0; +} + +static int run_sweep(void) +{ + static const int writers_sweep[] = {1, 2, 5}; + static const int readers_sweep[] = {1, 5, 10}; + + for (size_t i = 0; i < ARRAY_SIZE(writers_sweep); i++) { + for (size_t j = 0; j < ARRAY_SIZE(readers_sweep); j++) { + printf("---\n"); + if (run_one(writers_sweep[i], readers_sweep[j]) < 0) + return -1; + } + } + return 0; +} + +int main(int argc, char **argv) +{ + int writers_override = 0, readers_override = 0; + pid_t stress_pid = -1; + int rc = 0; + + if (parse_args(argc, argv, &writers_override, &readers_override) < 0) + return 1; + if (validate_args() < 0) + return 1; + + signal(SIGPIPE, SIG_IGN); + setvbuf(stdout, NULL, _IOLBF, 0); + setvbuf(stderr, NULL, _IOLBF, 0); + + fprintf(stderr, "pid=%d\n", getpid()); + fflush(stderr); + + if (g_memory_pressure) { + stress_pid = spawn_stress_ng(); + if (stress_pid < 0) { + fprintf(stderr, + "memory_pressure requested but stress-ng could not be spawned\n"); + return 1; + } + } + + if (writers_override > 0 || readers_override > 0) { + int nw = writers_override > 0 ? writers_override : 1; + int nr = readers_override > 0 ? readers_override : 1; + + rc = run_one(nw, nr) < 0 ? 1 : 0; + } else { + rc = run_sweep() < 0 ? 1 : 0; + } + + kill_stress_ng(stress_pid); + return rc; +} -- 2.54.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark 2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao @ 2026-05-23 16:43 ` Oleg Nesterov 2026-05-23 16:49 ` Oleg Nesterov 0 siblings, 1 reply; 13+ messages in thread From: Oleg Nesterov @ 2026-05-23 16:43 UTC (permalink / raw) To: Breno Leitao Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team On 05/22, Breno Leitao wrote: > > Add a small selftest that stresses pipe->mutex contention by spawning N > writer threads that hammer a single pipe with multi-page writes, plus M > reader threads that drain. Each writer records its own write() latency > samples into a log2-bucketed histogram; main aggregates and prints > total writes, throughput, average and percentile (p50/p99) latencies, > and the maximum observed latency. > > Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80% > --vm-method all) for the duration of the run, so alloc_page() in > anon_pipe_write() routinely hits direct reclaim. The flag fails > fast if stress-ng is not on $PATH. > > Program print something like the following, for different writes, > readers, msgsizes and memory pressure: > > config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576 > memory_pressure=[no|yes] > writes: total=54451 rate=18150/s > throughput_MBps: 1134.40 > lat_avg_ns: 275355 > lat_p50_ns_upper: 262143 > lat_p99_ns_upper: 1048575 > lat_max_ns: 2145633 The more tests the better... but since it comes as 2/2, perhaps the changelog could mention the difference before/after 1/2 according to this test? Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark 2026-05-23 16:43 ` Oleg Nesterov @ 2026-05-23 16:49 ` Oleg Nesterov 0 siblings, 0 replies; 13+ messages in thread From: Oleg Nesterov @ 2026-05-23 16:49 UTC (permalink / raw) To: Breno Leitao Cc: Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik, linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, jlayton, axboe, kernel-team On 05/23, Oleg Nesterov wrote: > > The more tests the better... but since it comes as 2/2, perhaps the changelog > could mention the difference before/after 1/2 according to this test? Ah, I didn't read 0/2, sorry for the noise ;) Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock 2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao 2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao 2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao @ 2026-05-22 19:43 ` Jeff Layton 2 siblings, 0 replies; 13+ messages in thread From: Jeff Layton @ 2026-05-22 19:43 UTC (permalink / raw) To: Breno Leitao, Alexander Viro, Christian Brauner, Jan Kara, Shuah Khan, Mateusz Guzik Cc: linux-fsdevel, linux-kernel, linux-kselftest, shakeel.butt, oleg, axboe, kernel-team On Fri, 2026-05-22 at 09:44 -0700, Breno Leitao wrote: > While profiling Meta's caching code[1], I found pipe->mutex contention > on the hot path. anon_pipe_write() currently calls alloc_page() once > per page while holding pipe->mutex. The allocation can sleep doing > direct reclaim and runs memcg charging, which extends the critical > section and stalls any concurrent reader on the same mutex. > > This series pre-allocates pages outside pipe->mutex in > anon_pipe_write(): for writes that span more than one full page, up > to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page > alloc_page() loop before the mutex is taken. anon_pipe_get_page() > then drains the prealloc array first, falls back to the per-pipe > tmp_page[] cache, and only enters the allocator under the mutex for > the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page > writes that skip prealloc, or shortfalls when the prealloc loop > fails). Leftover prealloc pages are recycled into tmp_page[] before > unlock and any remainder is put_page()'d after unlock, keeping the > allocator out of the critical section on both sides. > > alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator > refuses __GFP_ACCOUNT under memcg -- it returns at most one page > when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit > 8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for > __GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the > task NUMA mempolicy honoured uniformly without open-coding the > charge. > > I also vibe-coded a microbenchmark to validate the change. It sweeps > writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a > 1 MB pipe and prints throughput + latency percentiles per config. > > Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB > writes, 1 MB pipe). The numbers below were collected on v1 > (alloc_pages_bulk()); v2's per-page loop preserves the dominant > "allocation outside the mutex" win and is expected to land in the same > range. > > == No memory pressure (10s per config) == > > Throughput in MB/s (baseline -> patched, delta): > writers readers=1 readers=5 readers=10 > 1 1119 -> 1354 (+21%) 1132 -> 1195 (+6%) 1060 -> 1240 (+17%) > 2 1162 -> 1487 (+28%) 1034 -> 1285 (+24%) 1069 -> 1213 (+14%) > 5 1152 -> 1357 (+18%) 1021 -> 1164 (+14%) 997 -> 1239 (+24%) > > Avg write latency in ns (baseline -> patched, delta): > writers readers=1 readers=5 readers=10 > 1 55786 -> 46103 (-17%) 55164 -> 52260 (-5%) 58906 -> 50370 (-14%) > 2 107546 -> 84011 (-22%) 120837 -> 97206 (-20%) 116860 -> 103036 (-12%) > 5 271293 -> 230170 (-15%) 306089 -> 268429 (-12%) 313300 -> 252232 (-19%) > > Throughput improves +6% to +28% and average write latency drops 5% > to 22% across every configuration. > > == Under memory pressure (--memory-pressure, 6s per config) == > > stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the > sweep so the alloc_page() calls inside anon_pipe_write() routinely > hit direct reclaim -- exactly the regime the patch targets. > > Throughput in MB/s (baseline -> patched, delta): > writers readers=1 readers=5 readers=10 > 1 1088 -> 1438 (+32%) 996 -> 1477 (+48%) 989 -> 1194 (+21%) > 2 1076 -> 1378 (+28%) 1007 -> 1269 (+26%) 1018 -> 1234 (+21%) > 5 1052 -> 1311 (+25%) 986 -> 1225 (+24%) 972 -> 1249 (+29%) > > Avg write latency in ns (baseline -> patched, delta): > writers readers=1 readers=5 readers=10 > 1 57397 -> 43406 (-24%) 62690 -> 42272 (-33%) 63136 -> 52272 (-17%) > 2 116121 -> 90700 (-22%) 124098 -> 98481 (-21%) 122754 -> 101217 (-18%) > 5 297122 -> 238322 (-20%) 316836 -> 255095 (-19%) 321496 -> 250189 (-22%) > > Throughput improves +21% to +48% and average write latency drops > 17% to 33% -- a noticeably bigger win than the no-pressure run. > > That tracks: when alloc_page() has to dip into reclaim, the cost > of holding pipe->mutex across it is highest, and pulling the > allocation out of the critical section pays the most. > > Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1] > > Signed-off-by: Breno Leitao <leitao@debian.org> > --- > Changes in v2: > - Switch the prealloc path from alloc_pages_bulk_mempolicy() to a > per-page alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) loop. > - Split the prealloc work out of anon_pipe_write() into dedicated > helpers (anon_pipe_get_page_prealloc / anon_pipe_prealloc_pop / > anon_pipe_refill_tmp_pages / anon_pipe_free_pages) gathered in > struct anon_pipe_prealloc, so the write path stays readable. > - Recycle leftover prealloc pages into pipe->tmp_page[] before > unlocking > - Link to v1: https://patch.msgid.link/20260515-fix_pipe-v1-0-b14c840c7555@debian.org > > To: Alexander Viro <viro@zeniv.linux.org.uk> > To: Christian Brauner <brauner@kernel.org> > To: Jan Kara <jack@suse.cz> > To: Shuah Khan <shuah@kernel.org> > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-kselftest@vger.kernel.org > > --- > Breno Leitao (2): > fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write > selftests/pipe: add pipe_bench microbenchmark > > fs/pipe.c | 105 ++++- > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/pipe/.gitignore | 1 + > tools/testing/selftests/pipe/Makefile | 9 + > tools/testing/selftests/pipe/pipe_bench.c | 616 ++++++++++++++++++++++++++++++ > 5 files changed, 729 insertions(+), 3 deletions(-) > --- > base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83 > change-id: 20260515-fix_pipe-c91677c187e7 > > Best regards, > -- > Breno Leitao <leitao@debian.org> Pity this can't use the bulk page allocator, but looks good otherwise. Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-05-24 16:47 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-22 16:44 [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Breno Leitao 2026-05-22 16:44 ` [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao 2026-05-22 16:51 ` Jeff Layton 2026-05-22 17:55 ` Breno Leitao 2026-05-22 19:48 ` Mateusz Guzik 2026-05-23 16:26 ` Oleg Nesterov 2026-05-24 14:30 ` Breno Leitao 2026-05-24 14:48 ` Mateusz Guzik 2026-05-24 16:47 ` Breno Leitao 2026-05-22 16:44 ` [PATCH v2 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao 2026-05-23 16:43 ` Oleg Nesterov 2026-05-23 16:49 ` Oleg Nesterov 2026-05-22 19:43 ` [PATCH v2 0/2] fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock Jeff Layton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox