public inbox for linux-fsdevel@vger.kernel.org
From: Bernd Schubert <bernd@bsbernd.com>
To: Stefan Hajnoczi <stefanha@redhat.com>, jim.harris@nvidia.com
Cc: linux-fsdevel@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>,
	Max Gurtovoy <mgurtovoy@nvidia.com>, Idan Zach <izach@nvidia.com>,
	Konrad Sztyber <ksztyber@nvidia.com>,
	German Maglione <gmaglione@redhat.com>,
	hreitz@redhat.com
Subject: Re: fuse: buffered reads limited to 256KB regardless of negotiated max_pages
Date: Tue, 24 Mar 2026 17:05:45 +0100	[thread overview]
Message-ID: <30a271bd-eaba-4248-af66-df18a61864ca@bsbernd.com> (raw)
In-Reply-To: <20260324132149.GB668717@fedora>



On 3/24/26 14:21, Stefan Hajnoczi wrote:
> On Mon, Mar 16, 2026 at 07:54:35AM -0700, jim.harris@nvidia.com wrote:
>> Hi all,
>>
>> We have a FUSE server that advertises max_write=1MB and max_pages=256
>> in the FUSE_INIT response. Buffered sequential writes arrive at the
>> server at the full 1MB as expected. However, buffered sequential reads
>> are capped at 256KB per FUSE READ request.
>>
>> The cap comes from the BDI readahead window. bdi->ra_pages defaults to
>> VM_READAHEAD_PAGES (32 pages / 128KB). For sequential access (e.g. cp),
>> posix_fadvise(POSIX_FADV_SEQUENTIAL) doubles the per-file readahead
>> window to 2 * bdi->ra_pages (256KB), producing the observed 256KB
>> limit. A 1MB application read() results in four sequential 256KB
>> round trips to the FUSE server instead of one.
> 
> Hi Jim,
> Thanks for sharing this issue. I am CCing German Maglione and Hanna
> Czenczek who work on virtiofsd and are also becoming more involved in
> the virtiofs kernel driver.
> 
>> In process_init_reply(), the kernel processes the
>> server's max_readahead response like this:
>>
>>     ra_pages = arg->max_readahead / PAGE_SIZE;
>>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
>>
>> Since bdi->ra_pages starts at VM_READAHEAD_PAGES (128KB), and the
>> kernel sends this value as init_in->max_readahead, the server can only
>> decrease readahead -- never increase it. Even if the server responds
>> with max_readahead=1MB, the min() clamps it back to 128KB.
>>
>> Other filesystems set ra_pages or io_pages based on server/device
>> capabilities:
>>
>>   - SMB/CIFS sets ra_pages directly (2 * rsize, or from mount option)
>>   - Ceph sets ra_pages directly from mount option
>>   - 9P sets both ra_pages and io_pages from maxdata
>>   - NFS sets io_pages from rsize
>>
>> I see two possible approaches and would like feedback:
>>
>> Option A: Fix the max_readahead negotiation
>>
>> Replace the current:
>>
>>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
>>
>> with:
>>
>>     fm->sb->s_bdi->ra_pages = min(ra_pages, fc->max_pages);
>>
>> This uses the server's max_readahead response directly, capped by
>> fc->max_pages for safety. I think this is backward compatible:
>> existing servers that echo the kernel's 128KB value get the same
>> result. Servers that return a lower value still reduce it. Only
>> servers that return a higher value see changed behavior.
>>
>> FUSE servers can opt in by advertising a larger max_readahead in the
>> FUSE_INIT response.
>>
>> Option B: Set io_pages from max_pages
>>
>> Set bdi->io_pages after FUSE_INIT negotiation:
>>
>>     fm->sb->s_bdi->io_pages = fc->max_pages;
>>
>> This matches what NFS does (setting io_pages from rsize). The
>> readahead code uses max(bdi->io_pages, ra->ra_pages) to determine
>> the maximum readahead size, so a large io_pages would allow larger
>> readahead submissions.
>>
>> This is simpler since no server-side change is needed. However, it
>> bypasses the max_readahead protocol field, making max_readahead
>> effectively meaningless for any device with large max_pages.
>>
>> In both cases, fc->max_pages is already clamped by
>> fc->max_pages_limit, which for virtio-fs accounts for the virtqueue
>> descriptor count.
>>
>> Thoughts?
> 
> I think this is a question for Miklos. You could also send a patch with
> your preferred solution to expedite this.
> 
> Thank you for looking into this - it will be nice to remove this
> performance limitation.


We have this patch in our branch:

commit 763c96da4bd6d1bb95d8e6bb7fd352389f3a17b9
Author: Bernd Schubert <bschubert@ddn.com>
Date:   Wed May 7 23:39:00 2025 +0200

    fuse: Use fuse-server provided read-ahead for CAP_SYS_ADMIN
    
    read-ahead is currently limited to bdi->ra_pages. One can change
    that after the mount with something like
    
    minor=$(stat -c "%d" /path/to/fuse)
    echo 1024 > /sys/class/bdi/0:${minor}/read_ahead_kb
    
    The issue is that the fuse server cannot do that from its ->init
    method: it would need to know the device minor, but the mount
    blocks until init is complete.
    
    Fuse already sets the bdi value, but the upper limit is the current
    bdi value. For CAP_SYS_ADMIN servers we can allow higher values.
    
    Signed-off-by: Bernd Schubert <bschubert@ddn.com>

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index af8404f539a3..417598bb3575 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1369,7 +1369,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			fc->no_flock = 1;
 		}
 
-		fm->sb->s_bdi->ra_pages =
+		if (capable(CAP_SYS_ADMIN))
+			fm->sb->s_bdi->ra_pages = ra_pages;
+		else
+			fm->sb->s_bdi->ra_pages =
 				min(fm->sb->s_bdi->ra_pages, ra_pages);
 		fc->minor = arg->minor;
 		fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;




Maybe we can increase the limit for non-privileged servers too, but
not all the way to fc->max_pages, because an admin might tune
fuse_max_pages_limit to a rather large value. I was able to more or
less crash the system with >16MB read-ahead on early 6.x kernels
(6.1, I think).




Thread overview: 5+ messages
2026-03-16 14:54 fuse: buffered reads limited to 256KB regardless of negotiated max_pages jim.harris
2026-03-24 13:21 ` Stefan Hajnoczi
2026-03-24 16:05   ` Bernd Schubert [this message]
2026-03-24 17:11     ` Darrick J. Wong
2026-03-24 17:40       ` Bernd Schubert
