From: Bernd Schubert <bernd@bsbernd.com>
To: Stefan Hajnoczi <stefanha@redhat.com>, jim.harris@nvidia.com
Cc: linux-fsdevel@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>,
Max Gurtovoy <mgurtovoy@nvidia.com>, Idan Zach <izach@nvidia.com>,
Konrad Sztyber <ksztyber@nvidia.com>,
German Maglione <gmaglione@redhat.com>,
hreitz@redhat.com
Subject: Re: fuse: buffered reads limited to 256KB regardless of negotiated max_pages
Date: Tue, 24 Mar 2026 17:05:45 +0100
Message-ID: <30a271bd-eaba-4248-af66-df18a61864ca@bsbernd.com>
In-Reply-To: <20260324132149.GB668717@fedora>

On 3/24/26 14:21, Stefan Hajnoczi wrote:
> On Mon, Mar 16, 2026 at 07:54:35AM -0700, jim.harris@nvidia.com wrote:
>> Hi all,
>>
>> We have a FUSE server that advertises max_write=1MB and max_pages=256
>> in the FUSE_INIT response. Buffered sequential writes arrive at the
>> server at the full 1MB as expected. However, buffered sequential reads
>> are capped at 256KB per FUSE READ request.
>>
>> The cap comes from the BDI readahead window. bdi->ra_pages defaults to
>> VM_READAHEAD_PAGES (32 pages / 128KB). For sequential access (e.g. cp),
>> posix_fadvise(POSIX_FADV_SEQUENTIAL) doubles the per-file readahead
>> window to 2 * bdi->ra_pages (256KB), producing the observed 256KB
>> limit. A 1MB application read() results in four sequential 256KB
>> round trips to the FUSE server instead of one.
>
> Hi Jim,
> Thanks for sharing this issue. I am CCing Geman Maglione and Hanna
> Czenczek who work on virtiofsd and are also becoming more involved in
> the virtiofs kernel driver.
>
>> In process_init_reply(), the kernel processes the
>> server's max_readahead response like this:
>>
>> ra_pages = arg->max_readahead / PAGE_SIZE;
>> fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
>>
>> Since bdi->ra_pages starts at VM_READAHEAD_PAGES (128KB), and the
>> kernel sends this value as init_in->max_readahead, the server can only
>> decrease readahead -- never increase it. Even if the server responds
>> with max_readahead=1MB, the min() clamps it back to 128KB.
>>
>> Other filesystems set ra_pages or io_pages based on server/device
>> capabilities:
>>
>> - SMB/CIFS sets ra_pages directly (2 * rsize, or from mount option)
>> - Ceph sets ra_pages directly from mount option
>> - 9P sets both ra_pages and io_pages from maxdata
>> - NFS sets io_pages from rsize
>>
>> I see two possible approaches and would like feedback:
>>
>> Option A: Fix the max_readahead negotiation
>>
>> Replace the current:
>>
>> fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
>>
>> with:
>>
>> fm->sb->s_bdi->ra_pages = min(ra_pages, fc->max_pages);
>>
>> This uses the server's max_readahead response directly, capped by
>> fc->max_pages for safety. I think this is backward compatible:
>> existing servers that echo the kernel's 128KB value get the same
>> result. Servers that return a lower value still reduce it. Only
>> servers that return a higher value see changed behavior.
>>
>> FUSE servers can opt in by advertising a larger max_readahead in the
>> FUSE_INIT response.
>>
>> Option B: Set io_pages from max_pages
>>
>> Set bdi->io_pages after FUSE_INIT negotiation:
>>
>> fm->sb->s_bdi->io_pages = fc->max_pages;
>>
>> This matches what NFS does (setting io_pages from rsize). The
>> readahead code uses max(bdi->io_pages, ra->ra_pages) to determine
>> the maximum readahead size, so a large io_pages would allow larger
>> readahead submissions.
>>
>> This is simpler since no server-side change is needed. However, it
>> bypasses the max_readahead protocol field, making max_readahead
>> effectively meaningless for any device with large max_pages.
>>
>> In both cases, fc->max_pages is already clamped by
>> fc->max_pages_limit, which for virtio-fs accounts for the virtqueue
>> descriptor count.
>>
>> Thoughts?
>
> I think this is a question for Miklos. You could also send a patch with
> your preferred solution to expedite this.
>
> Thank you for looking into this - it will be nice to remove this
> performance limitation.

We have this patch in our branch:

commit 763c96da4bd6d1bb95d8e6bb7fd352389f3a17b9
Author: Bernd Schubert <bschubert@ddn.com>
Date:   Wed May 7 23:39:00 2025 +0200

    fuse: Use fuse-server provided read-ahead for CAP_SYS_ADMIN

    Read-ahead is currently limited to bdi->ra_pages. One can change
    that after the mount with something like:

        minor=$(stat -c "%d" /path/to/fuse)
        echo 1024 > /sys/class/bdi/0:$minor/read_ahead_kb

    The issue is that the fuse server cannot do this from its ->init
    method: it would need the device minor, and stat() on the mount
    point blocks until init is complete.

    Fuse already sets the bdi value, but the upper limit is the
    current bdi value. For CAP_SYS_ADMIN we can allow higher values.

    Signed-off-by: Bernd Schubert <bschubert@ddn.com>
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index af8404f539a3..417598bb3575 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1369,7 +1369,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			fc->no_flock = 1;
 		}
 
-		fm->sb->s_bdi->ra_pages =
+		if (capable(CAP_SYS_ADMIN))
+			fm->sb->s_bdi->ra_pages = ra_pages;
+		else
+			fm->sb->s_bdi->ra_pages =
 				min(fm->sb->s_bdi->ra_pages, ra_pages);
 		fc->minor = arg->minor;
 		fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;

Maybe we can also raise the limit for unprivileged servers, but not all
the way to fc->max_pages, because an admin might tune fuse_max_pages_limit
to a rather large value. I was able to more or less crash the system
with >16MB read-ahead on early 6.x kernels (6.1, I think).
Thread overview: 5+ messages
2026-03-16 14:54 fuse: buffered reads limited to 256KB regardless of negotiated max_pages jim.harris
2026-03-24 13:21 ` Stefan Hajnoczi
2026-03-24 16:05 ` Bernd Schubert [this message]
2026-03-24 17:11 ` Darrick J. Wong
2026-03-24 17:40 ` Bernd Schubert