From: Breno Leitao <leitao@debian.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Shuah Khan <shuah@kernel.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org, shakeel.butt@linux.dev,
jlayton@kernel.org, axboe@kernel.dk, kernel-team@meta.com
Subject: Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
Date: Sun, 24 May 2026 09:47:09 -0700
Message-ID: <ahMYl8ExhnSudJ33@gmail.com>
In-Reply-To: <CAGudoHEPj-aOxqBsh5y4JFfONLnZfzgw_UUs5hqK6BpBcgHO5Q@mail.gmail.com>
Hello Mateusz,
On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote:
> On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > > > * after waiting we need to re-check whether the pipe
> > > > * become empty while we dropped the lock.
> > > > */
> > > > + anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > > > mutex_unlock(&pipe->mutex);
> > > > + anon_pipe_free_pages(&prealloc);
> > >
> > > Do we really want to call anon_pipe_free_pages() at this point?
> > >
> > > The main loop will continue when pipe_writable() becomes true again...
> >
> > I went back and forth on this. The argument for freeing was that
> > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> > stopped reader), and holding up the prealloc pages felt antisocial --
> > especially under the memory pressure this series targets, where those pages are
> > more useful on the freelists than parked on a sleeping task.
> >
> > On the other side, on wakeup the loop is guaranteed to want pages again, and
> > re-entering the allocator under the mutex puts us back in the contended state
> > the patch removes. For any write() large enough to wait mid-syscall (which is
> > the workload patch 2/2 measures), keeping them strictly wins on throughput /
> > p99.
> >
>
> You can still prealloc after wakeup for whatever remainder you got
> though, but I can agree dropping these frees is a sensible way out and
> it is easier and I'm not going to insist on one way or the other.
Ack. I've sent a v3 with anon_pipe_free_pages() and
anon_pipe_refill_tmp_pages() dropped.
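
For anyone skimming the thread, the pattern under discussion can be illustrated with a minimal userspace sketch. It is not the kernel patch: `toy_pipe`, `prealloc_fill()`, `prealloc_take()`, `prealloc_release()`, and the `MAX_PREALLOC` cap are all hypothetical names chosen for illustration, `malloc()` stands in for `alloc_page()`, and a pthread mutex stands in for pipe->mutex. The sketch does not model the sleep on a full pipe; it only shows the allocator being called before the lock is taken and never while it is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define PAGE_SZ      4096   /* assumed page size for the sketch */
#define MAX_PREALLOC 8      /* hypothetical cap, not taken from the patch */
#define PIPE_SLOTS   16     /* hypothetical ring size */

/* Small stash of buffers owned by the writer, not by the pipe. */
struct prealloc {
	void *pages[MAX_PREALLOC];
	int count;
};

/* Fill the stash up to `want` buffers. Called WITHOUT the lock held. */
static int prealloc_fill(struct prealloc *p, int want)
{
	assert(p->count >= 0 && p->count <= MAX_PREALLOC);
	if (want > MAX_PREALLOC)
		want = MAX_PREALLOC;
	while (p->count < want) {
		void *page = malloc(PAGE_SZ);
		if (!page)
			break;          /* partial fill is acceptable */
		p->pages[p->count++] = page;
	}
	return p->count;
}

/* Pop one buffer. Safe under the lock because it never allocates. */
static void *prealloc_take(struct prealloc *p)
{
	return p->count > 0 ? p->pages[--p->count] : NULL;
}

/* Return unused buffers to the allocator. Called WITHOUT the lock held. */
static void prealloc_release(struct prealloc *p)
{
	while (p->count > 0)
		free(p->pages[--p->count]);
}

struct toy_pipe {
	pthread_mutex_t lock;
	void *slots[PIPE_SLOTS];
	int used;
};

/* Queue up to `npages` buffers; returns how many were queued. */
static int toy_write(struct toy_pipe *pipe, int npages)
{
	struct prealloc pre = { .count = 0 };
	int queued = 0;

	prealloc_fill(&pre, npages);            /* slow part, lock not held */

	pthread_mutex_lock(&pipe->lock);
	while (queued < npages && pipe->used < PIPE_SLOTS) {
		void *page = prealloc_take(&pre);
		if (!page)
			break;  /* stash empty; the sketch stops instead of allocating under the lock */
		pipe->slots[pipe->used++] = page;
		queued++;
	}
	pthread_mutex_unlock(&pipe->lock);

	prealloc_release(&pre);                 /* leftovers freed, lock not held */
	return queued;
}
```

Calling `toy_write(&pipe, 3)` on an empty `toy_pipe` queues three buffers, and the only work done under the lock is pointer shuffling. Whether to keep or free the stash across a sleep is the question discussed above; this sketch sidesteps it by never sleeping.
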
> However, I think it would be prudent to add a tracepoint to some
> machines on your fleet to find out how often they allocate pages under
> the mutex (and for what i/o size). Initial alloc for the first write <
> PAGE_SIZE definitely happens under the mutex which is probably not a
> problem, but for anything later?
> The tracepoint can have a trivial
> indicator if this is the first write if that matters. One can
Isn't this what I've reported earlier?
https://lore.kernel.org/all/ag3Ty3T24wjn1aFw@gmail.com/
Adding a tracepoint is harder than usual here, given that a kernel rollout
takes ages. Instead, I hacked up a bpftrace script and ran it on a random
sample of fleet hosts (5 minutes each).
As reported earlier, multi-page pipe writes are not uncommon: on one
host a single long-running process produced 196,476 under-mutex alloc_page()
calls in 5 minutes (roughly 655 per second), with allocs-per-write
distributions reaching 16+ -- exactly the pattern this patch removes.
Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page
first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips
anyway, so the win is concentrated on the workloads that actually need it.
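
To make the split between the two populations concrete, here is a small sketch of the sizing decision. Only the `total_len <= PAGE_SIZE` early return comes from the patch as described above; the helper name `pages_to_prealloc()`, the rounding rule, the `max_pages` cap, and the fixed 4096-byte `SKETCH_PAGE_SIZE` are assumptions made for illustration and may not match what the series actually does.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for the sketch */

/*
 * Hypothetical helper: how many pages a writer would pre-allocate
 * before taking the lock for a write of `total_len` bytes.
 */
static size_t pages_to_prealloc(size_t total_len, size_t max_pages)
{
	size_t pages;

	assert(max_pages > 0);

	/* One-page (or smaller) writes skip pre-allocation entirely. */
	if (total_len <= SKETCH_PAGE_SIZE)
		return 0;

	/* Round up to whole pages without overflowing on huge lengths. */
	pages = total_len / SKETCH_PAGE_SIZE
	      + (total_len % SKETCH_PAGE_SIZE != 0);

	/* Assumed cap so a single write cannot stash unbounded memory. */
	return pages < max_pages ? pages : max_pages;
}
```

Under these assumptions the common one-page first-writes cost nothing extra, while a 64 KiB write with a cap of 16 would pre-allocate all 16 pages outside the mutex.
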
None of the allocs hit reclaim during the traces I ran, but under memory
pressure I would expect direct reclaim to run with the mutex held.
Thanks for the review and direction,
--breno