public inbox for linux-fsdevel@vger.kernel.org
From: Stefan Hajnoczi <stefanha@redhat.com>
To: jim.harris@nvidia.com
Cc: linux-fsdevel@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>,
	Max Gurtovoy <mgurtovoy@nvidia.com>, Idan Zach <izach@nvidia.com>,
	Konrad Sztyber <ksztyber@nvidia.com>,
	German Maglione <gmaglione@redhat.com>,
	hreitz@redhat.com
Subject: Re: fuse: buffered reads limited to 256KB regardless of negotiated max_pages
Date: Tue, 24 Mar 2026 09:21:49 -0400	[thread overview]
Message-ID: <20260324132149.GB668717@fedora> (raw)
In-Reply-To: <20260316145435.BA2542605C3@ubuntu.localdomain>


On Mon, Mar 16, 2026 at 07:54:35AM -0700, jim.harris@nvidia.com wrote:
> Hi all,
> 
> We have a FUSE server that advertises max_write=1MB and max_pages=256
> in the FUSE_INIT response. Buffered sequential writes arrive at the
> server at the full 1MB as expected. However, buffered sequential reads
> are capped at 256KB per FUSE READ request.
> 
> The cap comes from the BDI readahead window. bdi->ra_pages defaults to
> VM_READAHEAD_PAGES (32 pages / 128KB). For sequential access (e.g. cp),
> posix_fadvise(POSIX_FADV_SEQUENTIAL) doubles the per-file readahead
> window to 2 * bdi->ra_pages (256KB), producing the observed 256KB
> limit. A 1MB application read() results in four sequential 256KB
> round trips to the FUSE server instead of one.

Hi Jim,
Thanks for sharing this issue. I am CCing German Maglione and Hanna
Czenczek, who work on virtiofsd and are also becoming more involved in
the virtiofs kernel driver.

> In process_init_reply(), the kernel processes the
> server's max_readahead response like this:
> 
>     ra_pages = arg->max_readahead / PAGE_SIZE;
>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> 
> Since bdi->ra_pages starts at VM_READAHEAD_PAGES (128KB), and the
> kernel sends this value as init_in->max_readahead, the server can only
> decrease readahead -- never increase it. Even if the server responds
> with max_readahead=1MB, the min() clamps it back to 128KB.
> 
> Other filesystems set ra_pages or io_pages based on server/device
> capabilities:
> 
>   - SMB/CIFS sets ra_pages directly (2 * rsize, or from mount option)
>   - Ceph sets ra_pages directly from mount option
>   - 9P sets both ra_pages and io_pages from maxdata
>   - NFS sets io_pages from rsize
> 
> I see two possible approaches and would like feedback:
> 
> Option A: Fix the max_readahead negotiation
> 
> Replace the current:
> 
>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> 
> with:
> 
>     fm->sb->s_bdi->ra_pages = min(ra_pages, fc->max_pages);
> 
> This uses the server's max_readahead response directly, capped by
> fc->max_pages for safety. I think this is backward compatible:
> existing servers that echo the kernel's 128KB value get the same
> result. Servers that return a lower value still reduce it. Only
> servers that return a higher value see changed behavior.
> 
> FUSE servers can opt in by advertising a larger max_readahead in the
> FUSE_INIT response.
> 
> Option B: Set io_pages from max_pages
> 
> Set bdi->io_pages after FUSE_INIT negotiation:
> 
>     fm->sb->s_bdi->io_pages = fc->max_pages;
> 
> This matches what NFS does (setting io_pages from rsize). The
> readahead code uses max(bdi->io_pages, ra->ra_pages) to determine
> the maximum readahead size, so a large io_pages would allow larger
> readahead submissions.
> 
> This is simpler since no server-side change is needed. However, it
> bypasses the max_readahead protocol field, making max_readahead
> effectively meaningless for any device with large max_pages.
> 
> In both cases, fc->max_pages is already clamped by
> fc->max_pages_limit, which for virtio-fs accounts for the virtqueue
> descriptor count.
> 
> Thoughts?

I think this is a question for Miklos. You could also send a patch with
your preferred solution to expedite this.

Thank you for looking into this - it will be nice to remove this
performance limitation.

Stefan


  reply	other threads:[~2026-03-24 13:22 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-16 14:54 fuse: buffered reads limited to 256KB regardless of negotiated max_pages jim.harris
2026-03-24 13:21 ` Stefan Hajnoczi [this message]
2026-03-24 16:05   ` Bernd Schubert
2026-03-24 17:11     ` Darrick J. Wong
2026-03-24 17:40       ` Bernd Schubert
