From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 24 Mar 2026 10:11:38 -0700
From: "Darrick J. Wong"
To: Bernd Schubert
Cc: Stefan Hajnoczi, jim.harris@nvidia.com, linux-fsdevel@vger.kernel.org,
	Miklos Szeredi, Max Gurtovoy, Idan Zach, Konrad Sztyber,
	German Maglione, hreitz@redhat.com
Subject: Re: fuse: buffered reads limited to 256KB regardless of negotiated max_pages
Message-ID: <20260324171138.GS6202@frogsfrogsfrogs>
References: <20260316145435.BA2542605C3@ubuntu.localdomain>
 <20260324132149.GB668717@fedora>
 <30a271bd-eaba-4248-af66-df18a61864ca@bsbernd.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org
In-Reply-To: <30a271bd-eaba-4248-af66-df18a61864ca@bsbernd.com>

On Tue, Mar 24, 2026 at 05:05:45PM +0100, Bernd Schubert wrote:
> On 3/24/26 14:21, Stefan Hajnoczi wrote:
> > On Mon, Mar 16, 2026 at 07:54:35AM -0700, jim.harris@nvidia.com wrote:
> >> Hi all,
> >>
> >> We have a FUSE server that advertises max_write=1MB and max_pages=256
> >> in the FUSE_INIT response. Buffered sequential writes arrive at the
> >> server at the full 1MB as expected. However, buffered sequential reads
> >> are capped at 256KB per FUSE READ request.
> >>
> >> The cap comes from the BDI readahead window. bdi->ra_pages defaults to
> >> VM_READAHEAD_PAGES (32 pages / 128KB). For sequential access (e.g. cp),
> >> posix_fadvise(POSIX_FADV_SEQUENTIAL) doubles the per-file readahead
> >> window to 2 * bdi->ra_pages (256KB), producing the observed 256KB
> >> limit. A 1MB application read() results in four sequential 256KB
> >> round trips to the FUSE server instead of one.
> >
> > Hi Jim,
> > Thanks for sharing this issue.
> > I am CCing German Maglione and Hanna
> > Czenczek, who work on virtiofsd and are also becoming more involved in
> > the virtiofs kernel driver.
> >
> >> In process_init_reply(), the kernel processes the
> >> server's max_readahead response like this:
> >>
> >>     ra_pages = arg->max_readahead / PAGE_SIZE;
> >>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> >>
> >> Since bdi->ra_pages starts at VM_READAHEAD_PAGES (128KB), and the
> >> kernel sends this value as init_in->max_readahead, the server can only
> >> decrease readahead -- never increase it. Even if the server responds
> >> with max_readahead=1MB, the min() clamps it back to 128KB.
> >>
> >> Other filesystems set ra_pages or io_pages based on server/device
> >> capabilities:
> >>
> >> - SMB/CIFS sets ra_pages directly (2 * rsize, or from a mount option)
> >> - Ceph sets ra_pages directly from a mount option
> >> - 9P sets both ra_pages and io_pages from maxdata
> >> - NFS sets io_pages from rsize
> >>
> >> I see two possible approaches and would like feedback:
> >>
> >> Option A: Fix the max_readahead negotiation
> >>
> >> Replace the current:
> >>
> >>     fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
> >>
> >> with:
> >>
> >>     fm->sb->s_bdi->ra_pages = min(ra_pages, fc->max_pages);
> >>
> >> This uses the server's max_readahead response directly, capped by
> >> fc->max_pages for safety. I think this is backward compatible:
> >> existing servers that echo the kernel's 128KB value get the same
> >> result. Servers that return a lower value still reduce it. Only
> >> servers that return a higher value see changed behavior.
> >>
> >> FUSE servers can opt in by advertising a larger max_readahead in the
> >> FUSE_INIT response.
> >>
> >> Option B: Set io_pages from max_pages
> >>
> >> Set bdi->io_pages after FUSE_INIT negotiation:
> >>
> >>     fm->sb->s_bdi->io_pages = fc->max_pages;
> >>
> >> This matches what NFS does (setting io_pages from rsize).
> >> The readahead code uses max(bdi->io_pages, ra->ra_pages) to determine
> >> the maximum readahead size, so a large io_pages would allow larger
> >> readahead submissions.
> >>
> >> This is simpler since no server-side change is needed. However, it
> >> bypasses the max_readahead protocol field, making max_readahead
> >> effectively meaningless for any device with a large max_pages.
> >>
> >> In both cases, fc->max_pages is already clamped by
> >> fc->max_pages_limit, which for virtio-fs accounts for the virtqueue
> >> descriptor count.
> >>
> >> Thoughts?
> >
> > I think this is a question for Miklos. You could also send a patch with
> > your preferred solution to expedite this.
> >
> > Thank you for looking into this - it will be nice to remove this
> > performance limitation.
>
> We have this patch in our branch:
>
> commit 763c96da4bd6d1bb95d8e6bb7fd352389f3a17b9
> Author: Bernd Schubert
> Date:   Wed May 7 23:39:00 2025 +0200
>
>     fuse: Use fuse-server provided read-ahead for CAP_SYS_ADMIN
>
>     read-ahead is currently limited to bdi->ra_pages. One can change
>     that after the mount with something like
>
>         minor=$(stat -c "%d" /path/to/fuse)
>         echo 1024 > /sys/class/bdi/0:${minor}/read_ahead_kb
>
>     Issue is that the fuse server cannot do that from its ->init method,
>     as it has to know the device minor, which blocks before
>     init is complete.
>
>     Fuse already sets the bdi value, but the upper limit is the current
>     bdi value. For CAP_SYS_ADMIN we can allow higher values.
>
>     Signed-off-by: Bernd Schubert
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index af8404f539a3..417598bb3575 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1369,7 +1369,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>  			fc->no_flock = 1;
>  		}
>
> -		fm->sb->s_bdi->ra_pages =
> +		if (CAP_SYS_ADMIN)

I really hope your branch has:

	if (has_capability_noaudit(current, CAP_SYS_ADMIN))

and not what's written there.
;)

--D

> +			fm->sb->s_bdi->ra_pages = ra_pages;
> +		else
> +			fm->sb->s_bdi->ra_pages =
>  				min(fm->sb->s_bdi->ra_pages, ra_pages);
>  		fc->minor = arg->minor;
>  		fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
>
>
> Maybe we can increase the limit for non-privileged servers, but
> not to fc->max_pages, because an admin might tune fuse_max_pages_limit
> to a rather large value. I was able to kind of crash the system
> with >16MB read-ahead in early 6.x kernels (I think 6.1).