public inbox for linux-fsdevel@vger.kernel.org
From: "Darrick J. Wong" <djwong@kernel.org>
To: Bernd Schubert <bernd@bsbernd.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
	jim.harris@nvidia.com, linux-fsdevel@vger.kernel.org,
	Miklos Szeredi <miklos@szeredi.hu>,
	Max Gurtovoy <mgurtovoy@nvidia.com>, Idan Zach <izach@nvidia.com>,
	Konrad Sztyber <ksztyber@nvidia.com>,
	German Maglione <gmaglione@redhat.com>,
	hreitz@redhat.com
Subject: Re: fuse: buffered reads limited to 256KB regardless of negotiated max_pages
Date: Tue, 24 Mar 2026 10:11:38 -0700	[thread overview]
Message-ID: <20260324171138.GS6202@frogsfrogsfrogs> (raw)
In-Reply-To: <30a271bd-eaba-4248-af66-df18a61864ca@bsbernd.com>

On Tue, Mar 24, 2026 at 05:05:45PM +0100, Bernd Schubert wrote:
> 
> 
> On 3/24/26 14:21, Stefan Hajnoczi wrote:
> > On Mon, Mar 16, 2026 at 07:54:35AM -0700, jim.harris@nvidia.com wrote:
> >> Hi all,
> >>
> >> We have a FUSE server that advertises max_write=1MB and max_pages=256
> >> in the FUSE_INIT response. Buffered sequential writes arrive at the
> >> server at the full 1MB as expected. However, buffered sequential reads
> >> are capped at 256KB per FUSE READ request.
> >>
> >> The cap comes from the BDI readahead window. bdi->ra_pages defaults to
> >> VM_READAHEAD_PAGES (32 pages / 128KB). For sequential access (e.g. cp),
> >> posix_fadvise(POSIX_FADV_SEQUENTIAL) doubles the per-file readahead
> >> window to 2 * bdi->ra_pages (256KB), producing the observed 256KB
> >> limit. A 1MB application read() results in four sequential 256KB
> >> round trips to the FUSE server instead of one.
> > 
> > Hi Jim,
> > Thanks for sharing this issue. I am CCing German Maglione and Hanna
> > Czenczek who work on virtiofsd and are also becoming more involved in
> > the virtiofs kernel driver.
> > 
> >> In process_init_reply(), the kernel processes the
> >> server's max_readahead response like this:
> >>
> >>     ra_pages = arg->max_readahead / PAGE_SIZE;
> >>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> >>
> >> Since bdi->ra_pages starts at VM_READAHEAD_PAGES (128KB), and the
> >> kernel sends this value as init_in->max_readahead, the server can only
> >> decrease readahead -- never increase it. Even if the server responds
> >> with max_readahead=1MB, the min() clamps it back to 128KB.
> >>
> >> Other filesystems set ra_pages or io_pages based on server/device
> >> capabilities:
> >>
> >>   - SMB/CIFS sets ra_pages directly (2 * rsize, or from mount option)
> >>   - Ceph sets ra_pages directly from mount option
> >>   - 9P sets both ra_pages and io_pages from maxdata
> >>   - NFS sets io_pages from rsize
> >>
> >> I see two possible approaches and would like feedback:
> >>
> >> Option A: Fix the max_readahead negotiation
> >>
> >> Replace the current:
> >>
> >>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> >>
> >> with:
> >>
> >>     fm->sb->s_bdi->ra_pages = min(ra_pages, fc->max_pages);
> >>
> >> This uses the server's max_readahead response directly, capped by
> >> fc->max_pages for safety. I think this is backward compatible:
> >> existing servers that echo the kernel's 128KB value get the same
> >> result. Servers that return a lower value still reduce it. Only
> >> servers that return a higher value see changed behavior.
> >>
> >> FUSE servers can opt in by advertising a larger max_readahead in the
> >> FUSE_INIT response.
> >>
> >> Option B: Set io_pages from max_pages
> >>
> >> Set bdi->io_pages after FUSE_INIT negotiation:
> >>
> >>     fm->sb->s_bdi->io_pages = fc->max_pages;
> >>
> >> This matches what NFS does (setting io_pages from rsize). The
> >> readahead code uses max(bdi->io_pages, ra->ra_pages) to determine
> >> the maximum readahead size, so a large io_pages would allow larger
> >> readahead submissions.
> >>
> >> This is simpler since no server-side change is needed. However, it
> >> bypasses the max_readahead protocol field, making max_readahead
> >> effectively meaningless for any device with large max_pages.
> >>
> >> In both cases, fc->max_pages is already clamped by
> >> fc->max_pages_limit, which for virtio-fs accounts for the virtqueue
> >> descriptor count.
> >>
> >> Thoughts?
> > 
> > I think this is a question for Miklos. You could also send a patch with
> > your preferred solution to expedite this.
> > 
> > Thank you for looking into this - it will be nice to remove this
> > performance limitation.
> 
> 
> We have this patch in our branch:
> 
> commit 763c96da4bd6d1bb95d8e6bb7fd352389f3a17b9
> Author: Bernd Schubert <bschubert@ddn.com>
> Date:   Wed May 7 23:39:00 2025 +0200
> 
>     fuse: Use fuse-server provided read-ahead for CAP_SYS_ADMIN
>     
>     read-ahead is currently limited to bdi->ra_pages. One can change
>     that after the mount with something like
>     
>     minor=$(stat -c "%d" /path/to/fuse)
>     echo 1024 > /sys/class/bdi/0:$minor/read_ahead_kb
>     
>     Issue is that the fuse server cannot do that from its ->init
>     method, as it would need to know the device minor, which is not
>     available before init completes.
>     
>     Fuse already sets the bdi value, but the upper limit is the
>     current bdi value. For CAP_SYS_ADMIN we can allow higher values.
>     
>     Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index af8404f539a3..417598bb3575 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1369,7 +1369,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>  			fc->no_flock = 1;
>  		}
>  
> -		fm->sb->s_bdi->ra_pages =
> +		if (CAP_SYS_ADMIN)

I really hope your branch has:

		if (has_capability_noaudit(current, CAP_SYS_ADMIN))

and not what's written there. ;)

--D

> +			fm->sb->s_bdi->ra_pages = ra_pages;
> +		else
> +			fm->sb->s_bdi->ra_pages =
>  				min(fm->sb->s_bdi->ra_pages, ra_pages);
>  		fc->minor = arg->minor;
>  		fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
> 
> 
> 
> 
> Maybe we can increase the limit for non-privileged servers, but
> not to fc->max_pages, because an admin might tune fuse_max_pages_limit
> to a rather large value. I was able to more or less crash the system
> with >16MB read-ahead in early 6.x kernels (I think 6.1).
> 
> 
> 
> 

Thread overview: 5+ messages
2026-03-16 14:54 fuse: buffered reads limited to 256KB regardless of negotiated max_pages jim.harris
2026-03-24 13:21 ` Stefan Hajnoczi
2026-03-24 16:05   ` Bernd Schubert
2026-03-24 17:11     ` Darrick J. Wong [this message]
2026-03-24 17:40       ` Bernd Schubert
