From: Breno Leitao <leitao@debian.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Shuah Khan <shuah@kernel.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org, usama.arif@linux.dev,
kernel-team@meta.com
Subject: Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
Date: Wed, 20 May 2026 10:09:00 -0700 [thread overview]
Message-ID: <ag3Ty3T24wjn1aFw@gmail.com> (raw)
In-Reply-To: <q37bmptbjr2gysl2g67bqo5axxqor4lsrrnf7ajixxjdrumbup@tgu3kz6kien5>
Hello Mateusz,
Thanks for the careful review. Let me address the points inline,
with data I collected on three Meta production hosts: two aarch64 NVIDIA Grace
and one x86_64 Bergamo / EPYC 9D64, using bpftrace and perf lock contention.
On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> This change is quite a hammer.
Are you seeing any downside/trade-off with this current approach?
>
> Do you have a real workload where there are multiple processes issuing
> large ops to the same pipe on both reader and writer side?
>
> The only workload I'm aware of sharing the pipe between more than just
> one reader and one writer is anything with make, but that is mostly
> operating on 1-byte ops.
You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):
size writes %
<128 B 293602 98.0 (merge path, no alloc)
128 B - 4 KB 1108 0.4
4 - 8 KB 2149 0.72 <-- patch start helps from here
8 - 16 KB 3
16 - 32 KB 0
32 - 64 KB 5
64 - 128 KB 2 <-- up to 72KB writes
So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
contention count is a Meta fleet workload spraying sub-128-byte
metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
just hits the mutex very often.
But the writers that *do* take the alloc-under-mutex path are
real. Lock-wait data confirms it shows up:
perf lock contention -ab (30s, anon_pipe_* only):
contentions total wait max wait
arm64 (NVIDIA Grace)
anon_pipe_write 7660 4.39 ms 50 us
anon_pipe_read 7 55.78 us 17 us
x86_64 (Bergamo)
anon_pipe_write 3154 4.53 ms 79 us
anon_pipe_read 817 1.72 ms 82 us
The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
the symptom the patch targets: a multi-page writer holding
pipe->mutex across alloc_page() stalls the concurrent reader on
the same mutex.
bpftrace, mutex_lock duration inside anon_pipe_write, 300s:
Grace-A Grace-B Bergamo
[4K, 8K) 1090 2799 980
[8K, 16K) 1785 1353 953
[16K,32K) 755 623 530
[32K,64K) 153 172 184
[64K,128K) 13 8 59
[128K,256K) - - 4
[256K,512K) - - 1 <-- ~500us blocked
So the patch's value is shortening the long critical sections that
the 99% of small writers are blocked on -- the contention everyone
sees drops because the few multi-page writers do their alloc work
outside the lock.
>= 64KB, which are not common but existent.
> I'm asking because this would be better implemented taking into account
> how many pages are hanging out in the tmp_page array. With only one
> writer and one reader there is no concern someone will steal the pages.
>
> Preferably the tmp_page array would grow to accomodate more memory, but
> admittedly this runs into a problem where a pipe can linger unused
> indefinitely while hoarding several pages. As in, some shrinking
> mechanism would be needed so that's probably a no-go and preallocation
> is the way to go for now.
>
> Even then, this can definitely be integrated much better instead of
> being open-coded.
> This is some nasty golfing.
>
> Instead, you could have something like:
> struct tmp_page_prealloc {
> struct page *pages;
> unsigned int count;
> };
Agreed -- v2 will wrap the array+counter in
struct tmp_page_prealloc and drop the open-coded indexing.
> struct tmp_page_prealloc tpp = {};
> [...]
> This should move to a dedicated routine, perhaps:
> anon_get_page_prealloc(pipe, &tmp_page_prealloc);
>
> then it can also easily decide how much to prealloc based on existing
> state
Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
keep the policy in one place.
> > + while (prealloc_n)
> > + anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
>
> this results in put_page() calls under the mutex, still extending the hold time
>
> you could split this into 2 ops, for example:
> + anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
> if (pipe_is_full(pipe))
> wake_next_writer = false;
> mutex_unlock(&pipe->mutex);
> + anon_pipe_free_pages(pipe, &tmp_page_prealloc);
Well spotted!
> All that aside, I have no idea about performance impact on arm64, but
> there is *definitely* some throughput lost based on the fact that memory
> copy is restarted per page. I presume arm64 has an equivalent of amd64's
> SMAP which again will not be free and is paid for every time as well.
Can you say a bit more about what you'd want to see here? I want to make sure
I'm reading the suggestion the same way you are.
My understanding: anon_pipe_write loops once per page calling
copy_page_from_iter(), and each call re-enters the iov_iter
machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).
So an N-page write pays the user-access bracket N times instead of
once, on top of resetting the per-page memcpy and losing the
hardware prefetch streak.
The way I see it, is opening a single user_read_access_begin() region around
the page loop and use unsafe_copy_from_user() inside, with a labeled
fault-fallback path that closes the region and finishes via the slow path.
Is this what you meant?
Really thanks for your review,
--breno
next prev parent reply other threads:[~2026-05-20 17:09 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-15 20:18 ` Mateusz Guzik
2026-05-20 17:09 ` Breno Leitao [this message]
2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ag3Ty3T24wjn1aFw@gmail.com \
--to=leitao@debian.org \
--cc=brauner@kernel.org \
--cc=jack@suse.cz \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=mjguzik@gmail.com \
--cc=shuah@kernel.org \
--cc=usama.arif@linux.dev \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox