All of lore.kernel.org
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	 Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Shuah Khan <shuah@kernel.org>,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,  usama.arif@linux.dev,
	kernel-team@meta.com
Subject: Re: [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write
Date: Wed, 20 May 2026 10:09:00 -0700	[thread overview]
Message-ID: <ag3Ty3T24wjn1aFw@gmail.com> (raw)
In-Reply-To: <q37bmptbjr2gysl2g67bqo5axxqor4lsrrnf7ajixxjdrumbup@tgu3kz6kien5>


Hello Mateusz,

Thanks for the careful review. Let me address the points inline,
with data I collected on three Meta production hosts: two aarch64 NVIDIA Grace
and one x86_64 Bergamo / EPYC 9D64, using bpftrace and perf lock contention.

On Fri, May 15, 2026 at 10:18:19PM +0200, Mateusz Guzik wrote:
> This change is quite a hammer.

Are you seeing any downside/trade-off with this current approach?

>
> Do you have a real workload where there are multiple processes issuing
> large ops to the same pipe on both reader and writer side?
>
> The only workload I'm aware of sharing the pipe between more than just
> one reader and one writer is anything with make, but that is mostly
> operating on 1-byte ops.

You're right -- most pipe writes really are tiny. On Bergamo, 120s sample of
anon_pipe_write, 299629 calls bucketed by iov_iter_count(from):

    size                writes      %
    <128 B              293602    98.0    (merge path, no alloc)
    128 B - 4 KB          1108     0.4
    4 - 8 KB              2149     0.72   <-- patch start helps from here
    8 - 16 KB                3
    16 - 32 KB               0
    32 - 64 KB               5
    64 - 128 KB              2            <-- up to 72KB writes

So only ~0.72% of writes are >= PAGE_SIZE. The dominant comm by
contention count is a Meta fleet workload spraying sub-128-byte
metric strings -- it does *zero* writes >= PAGE_SIZE itself; it
just hits the mutex very often.

But the writers that *do* take the alloc-under-mutex path are
real. Lock-wait data confirms it shows up:

  perf lock contention -ab (30s, anon_pipe_* only):

                                contentions  total wait  max wait
    arm64 (NVIDIA Grace)
      anon_pipe_write                 7660    4.39 ms     50 us
      anon_pipe_read                     7   55.78 us     17 us
    x86_64 (Bergamo)
      anon_pipe_write                 3154    4.53 ms     79 us
      anon_pipe_read                   817    1.72 ms     82 us

The reader side stands out on x86 (817 vs ~7 on arm64) -- that's
the symptom the patch targets: a multi-page writer holding
pipe->mutex across alloc_page() stalls the concurrent reader on
the same mutex.

  bpftrace, mutex_lock duration inside anon_pipe_write, 300s:

                  Grace-A  Grace-B  Bergamo
    [4K,  8K)      1090     2799      980
    [8K, 16K)      1785     1353      953
    [16K,32K)       755      623      530
    [32K,64K)       153      172      184
    [64K,128K)       13        8       59
    [128K,256K)      -         -        4
    [256K,512K)      -         -        1    <-- ~500us blocked

So the patch's value is shortening the long critical sections that
the 99% of small writers are blocked on -- the contention everyone
sees drops because the few multi-page writers do their alloc work
outside the lock.

>= 64KB, which are not common but existent.

> I'm asking because this would be better implemented taking into account
> how many pages are hanging out in the tmp_page array. With only one
> writer and one reader there is no concern someone will steal the pages.
>
> Preferably the tmp_page array would grow to accomodate more memory, but
> admittedly this runs into a problem where a pipe can linger unused
> indefinitely while hoarding several pages. As in, some shrinking
> mechanism would be needed so that's probably a no-go and preallocation
> is the way to go for now.
>
> Even then, this can definitely be integrated much better instead of
> being open-coded.

> This is some nasty golfing.
>
> Instead, you could have something like:
> struct tmp_page_prealloc {
> 	struct page *pages;
> 	unsigned int count;
> };

Agreed -- v2 will wrap the array+counter in
struct tmp_page_prealloc and drop the open-coded indexing.

> 	struct tmp_page_prealloc tpp = {};
> [...]
> This should move to a dedicated routine, perhaps:
> 	anon_get_page_prealloc(pipe, &tmp_page_prealloc);
>
> then it can also easily decide how much to prealloc based on existing
> state

Yes. v2 will have anon_get_page_prealloc(pipe, &tpp) doing both:
top up to PIPE_PREALLOC_MAX based on tmp_page[] occupancy, and
keep the policy in one place.

> > +	while (prealloc_n)
> > +		anon_pipe_put_page(pipe, prealloc[--prealloc_n]);
>
> this results in put_page() calls under the mutex, still extending the hold time
>
> you could split this into 2 ops, for example:
> +	anon_pipe_cache_pages(pipe, &tmp_page_prealloc);
>   	if (pipe_is_full(pipe))
>   		wake_next_writer = false;
>   	mutex_unlock(&pipe->mutex);
> +	anon_pipe_free_pages(pipe, &tmp_page_prealloc);

Well spotted!


> All that aside, I have no idea about performance impact on arm64, but
> there is *definitely* some throughput lost based on the fact that memory
> copy is restarted per page. I presume arm64 has an equivalent of amd64's
> SMAP which again will not be free and is paid for every time as well.

Can you say a bit more about what you'd want to see here? I want to make sure
I'm reading the suggestion the same way you are.

My understanding: anon_pipe_write loops once per page calling
copy_page_from_iter(), and each call re-enters the iov_iter
machinery and pays a STAC/CLAC pair on x86 (PAN toggle on arm64).

So an N-page write pays the user-access bracket N times instead of
once, on top of resetting the per-page memcpy and losing the
hardware prefetch streak.

The way I see it, is opening a single user_read_access_begin() region around
the page loop and use unsafe_copy_from_user() inside, with a labeled
fault-fallback path that closes the region and finishes via the slow path.

Is this what you meant?

Really thanks for your review,
--breno

  reply	other threads:[~2026-05-20 17:09 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-15 10:28 [PATCH 0/2] fs/pipe: reduce pipe->mutex contention by bulk-allocating outside the lock Breno Leitao
2026-05-15 10:28 ` [PATCH 1/2] fs/pipe: bulk pre-allocate pages outside pipe->mutex in anon_pipe_write Breno Leitao
2026-05-15 20:18   ` Mateusz Guzik
2026-05-20 17:09     ` Breno Leitao [this message]
2026-05-21 12:58       ` Mateusz Guzik
2026-05-21 15:38         ` Breno Leitao
2026-05-21 17:26           ` Mateusz Guzik
2026-05-22 12:10             ` Breno Leitao
2026-05-15 10:28 ` [PATCH 2/2] selftests/pipe: add pipe_bench microbenchmark Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ag3Ty3T24wjn1aFw@gmail.com \
    --to=leitao@debian.org \
    --cc=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=mjguzik@gmail.com \
    --cc=shuah@kernel.org \
    --cc=usama.arif@linux.dev \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.