From: Breno Leitao <leitao@debian.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Shuah Khan <shuah@kernel.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org, shakeel.butt@linux.dev,
jlayton@kernel.org, axboe@kernel.dk, kernel-team@meta.com
Subject: Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
Date: Sun, 24 May 2026 09:47:09 -0700
Message-ID: <ahMYl8ExhnSudJ33@gmail.com>
In-Reply-To: <CAGudoHEPj-aOxqBsh5y4JFfONLnZfzgw_UUs5hqK6BpBcgHO5Q@mail.gmail.com>
Hello Mateusz,
On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote:
> On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > > > * after waiting we need to re-check whether the pipe
> > > > * become empty while we dropped the lock.
> > > > */
> > > > + anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > > > mutex_unlock(&pipe->mutex);
> > > > + anon_pipe_free_pages(&prealloc);
> > >
> > > Do we really want to call anon_pipe_free_pages() at this point?
> > >
> > > The main loop will continue when pipe_writable() becomes true again...
> >
> > I went back and forth on this. The argument for freeing was that
> > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> > stopped reader), and holding up the prealloc pages felt antisocial --
> > especially under the memory pressure this series targets, where those pages are
> > more useful on the freelists than parked on a sleeping task.
> >
> > On the other side, on wakeup the loop is guaranteed to want pages again, and
> > re-entering the allocator under the mutex puts us back in the contended state
> > the patch removes. For any write() large enough to wait mid-syscall (which is
> > the workload patch 2/2 measures), keeping them strictly wins on throughput /
> > p99.
> >
>
> You can still prealloc after wakeup for whatever remainder you got
> though, but I can agree dropping these frees is a sensible way out and
> it is easier and I'm not going to insist on one way or the other.
Ack. I've sent a v3 with anon_pipe_free_pages() and
anon_pipe_refill_tmp_pages() dropped.
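
For illustration only, here is a minimal userspace sketch of the pattern under discussion: fill a small page cache before taking the lock, consume from it while the lock is held, and fall back to allocating under the lock only when the cache runs dry. This is not the kernel code. The names `struct prealloc`, `prealloc_fill()`, `prealloc_take()`, `prealloc_free()` and `sketch_write()`, the 4096-byte page size and the cap of 16 are all assumptions made for the example, with `malloc()` standing in for alloc_page() and a pthread mutex standing in for pipe->mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */
#define SKETCH_MAX_PREALLOC 16u  /* hypothetical cap on cached pages */

/* Hypothetical per-call cache of pages allocated before taking the lock. */
struct prealloc {
    void *pages[SKETCH_MAX_PREALLOC];
    unsigned int count;
};

/* Top up the cache to at most `want` pages. Call with the lock NOT held. */
static void prealloc_fill(struct prealloc *pa, unsigned int want)
{
    assert(pa->count <= SKETCH_MAX_PREALLOC);
    if (want > SKETCH_MAX_PREALLOC)
        want = SKETCH_MAX_PREALLOC;
    while (pa->count < want) {
        void *page = malloc(SKETCH_PAGE_SIZE);
        if (!page)
            break; /* best effort: the locked path can still allocate */
        pa->pages[pa->count++] = page;
    }
}

/* Take one cached page, or NULL if the cache is empty. Never allocates. */
static void *prealloc_take(struct prealloc *pa)
{
    return pa->count ? pa->pages[--pa->count] : NULL;
}

/* Release whatever is still cached. Call with the lock NOT held. */
static void prealloc_free(struct prealloc *pa)
{
    while (pa->count)
        free(pa->pages[--pa->count]);
}

/*
 * Consume `npages` pages under `lock`. Returns the number of pages consumed
 * and stores in *fallbacks how many had to be allocated with the lock held.
 */
static unsigned int sketch_write(pthread_mutex_t *lock, unsigned int npages,
                                 unsigned int *fallbacks)
{
    struct prealloc pa = { .count = 0 };
    unsigned int done = 0;

    *fallbacks = 0;
    prealloc_fill(&pa, npages);              /* allocator runs unlocked */

    pthread_mutex_lock(lock);
    while (done < npages) {
        void *page = prealloc_take(&pa);
        if (!page) {
            page = malloc(SKETCH_PAGE_SIZE); /* slow path, lock held */
            if (!page)
                break;
            (*fallbacks)++;
        }
        free(page); /* stand-in for handing the page to a pipe buffer */
        done++;
    }
    pthread_mutex_unlock(lock);

    prealloc_free(&pa);                      /* leftovers, if any, unlocked */
    return done;
}
```

In this sketch a request for 4 pages never allocates under the lock, while a request for 20 pages exceeds the assumed cap and takes the slow path for the last 4. Keeping the cache alive across a sleep, as v3 does, would amount to not calling `prealloc_free()` until the write returns.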
> However, I think it would be prudent to add a tracepoint to some
> machines on your fleet to find out how often they allocate pages under
> the mutex (and for what i/o size). Initial alloc for the first write <
> PAGE_SIZE definitely happens under the mutex which is probably not a
> problem, but for anything later?
> The tracepoint can have a trivial
> indicator if this is the first write if that matters. One can
Isn't this what I reported earlier?
https://lore.kernel.org/all/ag3Ty3T24wjn1aFw@gmail.com/

Adding a tracepoint is harder than usual here, since a kernel rollout takes
ages. Instead, I hacked up a bpftrace script and ran it on a random sample of
fleet hosts (5 min each).
As reported earlier, multi-page pipe writes are not uncommon: on one
host a single long-running process produced 196,476 under-mutex alloc_page()
calls in 5 minutes, with allocs-per-write distributions reaching 16+ -- exactly
the pattern this patch removes.
Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page
first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips
anyway, so the win is concentrated on the workloads that actually need it.
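
As a rough sketch of that early return (not the patch itself): a write of at most one page preallocates nothing, and a larger write preallocates one page per page of payload, up to a cap. Only the `total_len <= PAGE_SIZE` check comes from the patch as described above. The helper name `pages_to_prealloc()`, the 4096-byte page size, the rounding and the cap of 16 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */
#define SKETCH_MAX_PREALLOC 16u  /* hypothetical cap, not taken from the patch */

/* How many pages a write of `total_len` bytes would preallocate. */
static unsigned int pages_to_prealloc(size_t total_len)
{
    size_t pages;

    /* One-page writes skip preallocation entirely. */
    if (total_len <= SKETCH_PAGE_SIZE)
        return 0;

    /* Round up to whole pages, then clamp to the cap. */
    pages = total_len / SKETCH_PAGE_SIZE + (total_len % SKETCH_PAGE_SIZE != 0);
    assert(pages >= 2);
    if (pages > SKETCH_MAX_PREALLOC)
        pages = SKETCH_MAX_PREALLOC;
    return (unsigned int)pages;
}
```

Under these assumptions a 100-byte write preallocates nothing, a 4097-byte write preallocates 2 pages, and a 64 KiB write preallocates 16.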
None of the allocations hit reclaim during the traces I ran, but under memory
pressure I would expect direct reclaim to run with the lock held.
Thanks for the review and direction,
--breno