The Linux Kernel Mailing List
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	 Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Shuah Khan <shuah@kernel.org>,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,  usama.arif@linux.dev,
	kernel-team@meta.com
Subject: Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
Date: Wed, 20 May 2026 10:09:00 -0700	[thread overview]
Message-ID: <ag3Ty3T24wjn1aFw@gmail.com> (raw)
In-Reply-To: <q37bmptbjr2gysl2g67bqo5axxqor4lsrrnf7ajixxjdrumbup@tgu3kz6kien5>


Hello Mateusz,

Thanks for the careful review. Let me address the points inline,
with data I collected on three Meta production hosts: two aarch64 NVIDIA Grace
and one x86_64 Bergamo / EPYC 9D64, using bpftrace and perf lock contention.

On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> This change is quite a hammer.

Are you seeing any downside/trade-off with this current approach?

>
> Do you have a real workload where there are multiple processes issuing
> large ops to the same pipe on both reader and writer side?
>
> The only workload I'm aware of sharing the pipe between more than just
> one reader and one writer is anything with make, but that is mostly
> operating on 1-byte ops.

You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):

    size                writes      %
    <128 B              293602    98.0    (merge path, no alloc)
    128 B - 4 KB          1108     0.4
    4 - 8 KB              2149     0.72   <-- patch start helps from here
    8 - 16 KB                3
    16 - 32 KB               0
    32 - 64 KB               5
    64 - 128 KB              2            <-- up to 72KB writes

So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
contention count is a Meta fleet workload spraying sub-128-byte
metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
just hits the mutex very often.

But the writers that *do* take the alloc-under-mutex path are
real. Lock-wait data confirms it shows up:

  perf lock contention -ab (30s, anon_pipe_* only):

                                contentions  total wait  max wait
    arm64 (NVIDIA Grace)
      anon_pipe_write                 7660    4.39 ms     50 us
      anon_pipe_read                     7   55.78 us     17 us
    x86_64 (Bergamo)
      anon_pipe_write                 3154    4.53 ms     79 us
      anon_pipe_read                   817    1.72 ms     82 us

The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
the symptom the patch targets: a multi-page writer holding
pipe->mutex across alloc_page() stalls the concurrent reader on
the same mutex.

  bpftrace, mutex_lock duration inside anon_pipe_write, 300s:

                  Grace-A  Grace-B  Bergamo
    [4K,  8K)      1090     2799      980
    [8K, 16K)      1785     1353      953
    [16K,32K)       755      623      530
    [32K,64K)       153      172      184
    [64K,128K)       13        8       59
    [128K,256K)      -         -        4
    [256K,512K)      -         -        1    <-- ~500us blocked

So the patch's value is shortening the long critical sections that
the 99% of small writers are blocked on -- the contention everyone
sees drops because the few multi-page writers do their alloc work
outside the lock.

>= 64KB, which are not common but existent.

> I'm asking because this would be better implemented taking into account
> how many pages are hanging out in the tmp_page array. With only one
> writer and one reader there is no concern someone will steal the pages.
>
> Preferably the tmp_page array would grow to accomodate more memory, but
> admittedly this runs into a problem where a pipe can linger unused
> indefinitely while hoarding several pages. As in, some shrinking
> mechanism would be needed so that's probably a no-go and preallocation
> is the way to go for now.
>
> Even then, this can definitely be integrated much better instead of
> being open-coded.

> This is some nasty golfing.
>
> Instead, you could have something like:
> struct tmp_page_prealloc {
> 	struct page *pages;
> 	unsigned int count;
> };

Agreed -- v2 will wrap the array+counter in
struct tmp_page_prealloc and drop the open-coded indexing.

> 	struct tmp_page_prealloc tpp = {};
> [...]
> This should move to a dedicated routine, perhaps:
> 	anon_get_page_prealloc(pipe, &tmp_page_prealloc);
>
> then it can also easily decide how much to prealloc based on existing
> state

Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
keep the policy in one place.

> > +	while (prealloc_n)
> > +		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
>
> this results in put_page() calls under the mutex, still extending the hold time
>
> you could split this into 2 ops, for example:
> +	anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
>   	if (pipe_is_full(pipe))
>   		wake_next_writer = false;
>   	mutex_unlock(&pipe->mutex);
> +	anon_pipe_free_pages(pipe, &tmp_page_prealloc);

Well spotted!


> All that aside, I have no idea about performance impact on arm64, but
> there is *definitely* some throughput lost based on the fact that memory
> copy is restarted per page. I presume arm64 has an equivalent of amd64's
> SMAP which again will not be free and is paid for every time as well.

Can you say a bit more about what you'd want to see here? I want to make sure
I'm reading the suggestion the same way you are.

My understanding: anon_pipe_write loops once per page calling
copy_page_from_iter(), and each call re-enters the iov_iter
machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).

So an N-page write pays the user-access bracket N times instead of
once, on top of resetting the per-page memcpy and losing the
hardware prefetch streak.

The way I see it, is opening a single user_read_access_begin() region around
the page loop and use unsafe_copy_from_user() inside, with a labeled
fault-fallback path that closes the region and finishes via the slow path.

Is this what you meant?

Really thanks for your review,
--breno

  reply	other threads:[~2026-05-20 17:09 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-15 20:18   ` Mateusz Guzik
2026-05-20 17:09     ` Breno Leitao [this message]
2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ag3Ty3T24wjn1aFw@gmail.com \
    --to=leitao@debian.org \
    --cc=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=mjguzik@gmail.com \
    --cc=shuah@kernel.org \
    --cc=usama.arif@linux.dev \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox