From: "Darrick J. Wong" <djwong@kernel.org>
To: Bernd Schubert <bernd@bsbernd.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
jim.harris@nvidia.com, linux-fsdevel@vger.kernel.org,
Miklos Szeredi <miklos@szeredi.hu>,
Max Gurtovoy <mgurtovoy@nvidia.com>, Idan Zach <izach@nvidia.com>,
Konrad Sztyber <ksztyber@nvidia.com>,
German Maglione <gmaglione@redhat.com>,
hreitz@redhat.com
Subject: Re: fuse: buffered reads limited to 256KB regardless of negotiated max_pages
Date: Tue, 24 Mar 2026 10:11:38 -0700
Message-ID: <20260324171138.GS6202@frogsfrogsfrogs>
In-Reply-To: <30a271bd-eaba-4248-af66-df18a61864ca@bsbernd.com>
On Tue, Mar 24, 2026 at 05:05:45PM +0100, Bernd Schubert wrote:
>
>
> On 3/24/26 14:21, Stefan Hajnoczi wrote:
> > On Mon, Mar 16, 2026 at 07:54:35AM -0700, jim.harris@nvidia.com wrote:
> >> Hi all,
> >>
> >> We have a FUSE server that advertises max_write=1MB and max_pages=256
> >> in the FUSE_INIT response. Buffered sequential writes arrive at the
> >> server at the full 1MB as expected. However, buffered sequential reads
> >> are capped at 256KB per FUSE READ request.
> >>
> >> The cap comes from the BDI readahead window. bdi->ra_pages defaults to
> >> VM_READAHEAD_PAGES (32 pages / 128KB). For sequential access (e.g. cp),
> >> posix_fadvise(POSIX_FADV_SEQUENTIAL) doubles the per-file readahead
> >> window to 2 * bdi->ra_pages (256KB), producing the observed 256KB
> >> limit. A 1MB application read() results in four sequential 256KB
> >> round trips to the FUSE server instead of one.
> >
> > Hi Jim,
> > Thanks for sharing this issue. I am CCing German Maglione and Hanna
> > Czenczek who work on virtiofsd and are also becoming more involved in
> > the virtiofs kernel driver.
> >
> >> In process_init_reply(), the kernel processes the
> >> server's max_readahead response like this:
> >>
> >> ra_pages = arg->max_readahead / PAGE_SIZE;
> >> fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> >>
> >> Since bdi->ra_pages starts at VM_READAHEAD_PAGES (128KB), and the
> >> kernel sends this value as init_in->max_readahead, the server can only
> >> decrease readahead -- never increase it. Even if the server responds
> >> with max_readahead=1MB, the min() clamps it back to 128KB.
> >>
> >> Other filesystems set ra_pages or io_pages based on server/device
> >> capabilities:
> >>
> >> - SMB/CIFS sets ra_pages directly (2 * rsize, or from mount option)
> >> - Ceph sets ra_pages directly from mount option
> >> - 9P sets both ra_pages and io_pages from maxdata
> >> - NFS sets io_pages from rsize
> >>
> >> I see two possible approaches and would like feedback:
> >>
> >> Option A: Fix the max_readahead negotiation
> >>
> >> Replace the current:
> >>
> >> fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> >>
> >> with:
> >>
> >> fm->sb->s_bdi->ra_pages = min(ra_pages, fc->max_pages);
> >>
> >> This uses the server's max_readahead response directly, capped by
> >> fc->max_pages for safety. I think this is backward compatible:
> >> existing servers that echo the kernel's 128KB value get the same
> >> result. Servers that return a lower value still reduce it. Only
> >> servers that return a higher value see changed behavior.
> >>
> >> FUSE servers can opt in by advertising a larger max_readahead in the
> >> FUSE_INIT response.
> >>
> >> Option B: Set io_pages from max_pages
> >>
> >> Set bdi->io_pages after FUSE_INIT negotiation:
> >>
> >> fm->sb->s_bdi->io_pages = fc->max_pages;
> >>
> >> This matches what NFS does (setting io_pages from rsize). The
> >> readahead code uses max(bdi->io_pages, ra->ra_pages) to determine
> >> the maximum readahead size, so a large io_pages would allow larger
> >> readahead submissions.
> >>
> >> This is simpler since no server-side change is needed. However, it
> >> bypasses the max_readahead protocol field, making max_readahead
> >> effectively meaningless for any device with large max_pages.
> >>
> >> In both cases, fc->max_pages is already clamped by
> >> fc->max_pages_limit, which for virtio-fs accounts for the virtqueue
> >> descriptor count.
> >>
> >> Thoughts?
> >
> > I think this is a question for Miklos. You could also send a patch with
> > your preferred solution to expedite this.
> >
> > Thank you for looking into this - it will be nice to remove this
> > performance limitation.
>
>
> We have this patch in our branch:
>
> commit 763c96da4bd6d1bb95d8e6bb7fd352389f3a17b9
> Author: Bernd Schubert <bschubert@ddn.com>
> Date: Wed May 7 23:39:00 2025 +0200
>
> fuse: Use fuse-server provided read-ahead for CAP_SYS_ADMIN
>
> read-ahead is currently limited to bdi->ra_pages. One can change
> that after the mount with something like
>
> minor=$(stat -c "%d" /path/to/fuse)
> echo 1024 > /sys/class/bdi/0:${minor}/read_ahead_kb
>
> The issue is that the fuse server cannot do that from its ->init
> method: it would need to know the device minor, which is not
> available until init is complete.
>
> Fuse already sets the bdi value, but the upper limit is the current
> bdi value. For CAP_SYS_ADMIN we can allow higher values.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index af8404f539a3..417598bb3575 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1369,7 +1369,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> fc->no_flock = 1;
> }
>
> - fm->sb->s_bdi->ra_pages =
> + if (CAP_SYS_ADMIN)
I really hope your branch has:
if (has_capability_noaudit(current, CAP_SYS_ADMIN))
and not what's written there. ;)
--D
> + fm->sb->s_bdi->ra_pages = ra_pages;
> + else
> + fm->sb->s_bdi->ra_pages =
> min(fm->sb->s_bdi->ra_pages, ra_pages);
> fc->minor = arg->minor;
> fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
>
>
> Maybe we can increase the limit for non-privileged servers, but
> not to fc->max_pages, because an admin might tune fuse_max_pages_limit
> to a rather large value. I was able to kind of crash the system
> with >16MB read-ahead in early 6.x kernels (I think 6.1).
Thread overview: 5+ messages
2026-03-16 14:54 fuse: buffered reads limited to 256KB regardless of negotiated max_pages jim.harris
2026-03-24 13:21 ` Stefan Hajnoczi
2026-03-24 16:05 ` Bernd Schubert
2026-03-24 17:11 ` Darrick J. Wong [this message]
2026-03-24 17:40 ` Bernd Schubert