From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
Matthew Wilcox <willy@infradead.org>,
Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
Paulo Alcantara <pc@manguebit.com>, Jens Axboe <axboe@kernel.dk>,
Leon Romanovsky <leon@kernel.org>,
Steve French <sfrench@samba.org>,
ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
Marc Dionne <marc.dionne@auristor.com>,
Eric Van Hensbergen <ericvh@kernel.org>,
Dominique Martinet <asmadeus@codewreck.org>,
Ilya Dryomov <idryomov@gmail.com>,
Trond Myklebust <trondmy@kernel.org>,
netfs@lists.linux.dev, linux-afs@lists.infradead.org,
linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: [PATCH 00/26] netfs: Keep track of folios in a segmented bio_vec[] chain
Date: Thu, 26 Mar 2026 10:45:15 +0000
Message-ID: <20260326104544.509518-1-dhowells@redhat.com>
Hi Christian,
Could you add these patches to the VFS tree for next?
The patches get rid of folio_queue, rolling_buffer and ITER_FOLIOQ,
replacing the folio queue construct used to manage buffers in netfslib
with one based on a segmented chain of bio_vec arrays. There are three
main aims here:
(1) The kernel file I/O subsystem seems to be moving towards consolidating
on the use of bio_vec arrays, so embrace this by moving netfslib to
keep track of its buffers for buffered I/O in bio_vec[] form.
(2) Netfslib already uses a bio_vec[] to handle unbuffered/DIO, so the
number of different buffering schemes used can be reduced to just a
single one.
(3) Always send an entire filesystem RPC request message to a TCP socket
with a single kernel_sendmsg() call as this is faster and more efficient
and doesn't require the use of corking, since it puts the entire
transmission loop inside a single tcp_sendmsg().
For the replacement of folio_queue, a segmented chain of bio_vec arrays
rather than a single monolithic array is provided:
struct bvecq {
	struct bvecq		*next;
	struct bvecq		*prev;
	unsigned long long	fpos;
	refcount_t		ref;
	u32			priv;
	u16			nr_segs;
	u16			max_segs;
	bool			inline_bv:1;
	bool			free:1;
	bool			unpin:1;
	bool			discontig:1;
	struct bio_vec		*bv;
	struct bio_vec		__bv[];
};
The fields are:
(1) next, prev - Link segments together in a list. I want this to be
NULL-terminated linear rather than circular to make it possible to
arbitrarily glue bits on the front.
(2) fpos, discontig - Note the current file position of the first byte of
the segment; all the bio_vecs in ->bv[] must be contiguous in the file
space. The fpos can be used to find the folio by file position rather
than from the info in the bio_vec.
If there's a discontiguity, this should break over into a new bvecq
segment with the discontig flag set (though this is redundant if you
keep track of the file position). Note that the beginning and end
file positions in a segment need not be aligned to any filesystem
block size.
(3) ref - Refcount. Each bvecq keeps a ref on the next. I'm not sure
this is entirely necessary, but it makes sharing slices easier.
(4) priv - Private data for the owner. Dispensable; currently only used
for storing a debug ID for tracing in a patch not included here.
(5) max_segs, nr_segs. The size of bv[] and the number of elements used.
I've assumed a maximum of 65535 bio_vecs in the array (which would
represent a ~1MiB allocation).
(6) bv, __bv, inline_bv. bv points to the bio_vec[] array handled by
this segment. This may begin at __bv and if it does inline_bv should
be set (otherwise it's impossible to distinguish a separately
allocated bio_vec[] that follows immediately by coincidence).
(7) free, unpin. free is set if the memory pointed to by the bio_vecs
needs freeing in some way upon I/O completion. unpin is set if this
means using GUP unpinning rather than put_page().
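As an illustration only (not code from the series), the chain described by
these fields can be modeled in userspace C roughly as below, with struct
bio_vec reduced to a length/offset pair and the flexible __bv[] array
simplified to a fixed one; all names here other than the field names above
are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct bio_vec; the kernel version also carries a
 * page pointer.
 */
struct bio_vec_model {
	size_t bv_len;
	size_t bv_offset;
};

/* Simplified model of the bvecq segment described above. */
struct bvecq_model {
	struct bvecq_model *next;
	struct bvecq_model *prev;
	unsigned long long fpos;	/* file position of first byte */
	uint16_t nr_segs;		/* elements of bv[] in use */
	uint16_t max_segs;		/* capacity of bv[] */
	struct bio_vec_model bv[8];	/* fixed array for illustration */
};

/* Sum of bv_len over one segment: since the bio_vecs in a segment are
 * file-contiguous, the segment covers [fpos, fpos + seg_len).
 */
static size_t bvecq_seg_len(const struct bvecq_model *bq)
{
	size_t len = 0;

	for (unsigned int i = 0; i < bq->nr_segs; i++)
		len += bq->bv[i].bv_len;
	return len;
}

/* Find the segment covering file position pos by walking the chain,
 * using fpos rather than the page info in each bio_vec.
 */
static struct bvecq_model *bvecq_find(struct bvecq_model *bq,
				      unsigned long long pos)
{
	for (; bq; bq = bq->next)
		if (pos >= bq->fpos && pos < bq->fpos + bvecq_seg_len(bq))
			return bq;
	return NULL;
}
```

A lookup that falls into the gap between two discontiguous segments simply
returns NULL, matching the rule that a discontiguity breaks over into a new
segment.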
I've also defined an iov_iter iterator type ITER_BVECQ to walk this sort of
construct so that it can be passed directly to sendmsg() or block-based DIO
(as cachefiles does).
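The cover letter doesn't show the iterator internals, so purely as a hedged
sketch, an ITER_BVECQ-style position (segment, slot, bytes consumed within
the slot) and an advance operation that crosses bio_vec and segment
boundaries might look like this; every name here is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

struct bvm { size_t bv_len; };

struct bvq {
	struct bvq *next;
	unsigned int nr_segs;
	struct bvm bv[4];
};

/* Hypothetical iterator position over a bvq chain. */
struct bvq_iter {
	struct bvq *seg;	/* current segment */
	unsigned int slot;	/* current bio_vec within seg */
	size_t skip;		/* bytes consumed in that bio_vec */
};

/* Consume count bytes, stepping over bio_vec and segment boundaries
 * as needed; runs off the end of the chain by setting seg to NULL.
 */
static void bvq_iter_advance(struct bvq_iter *it, size_t count)
{
	while (count && it->seg) {
		size_t left = it->seg->bv[it->slot].bv_len - it->skip;

		if (count < left) {
			it->skip += count;
			return;
		}
		count -= left;
		it->skip = 0;
		if (++it->slot >= it->seg->nr_segs) {
			it->slot = 0;
			it->seg = it->seg->next;
		}
	}
}
```

The same walk is what would let such an iterator be handed straight to
sendmsg() or block-based DIO: each step yields one contiguous page fragment.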
This series makes the following changes to netfslib:
(1) The folio_queue chain used to hold folios for buffered I/O is replaced
with a bvecq chain. Each bio_vec then holds (a portion of) one folio.
Each bvecq holds a contiguous sequence of folios, but adjacent bvecqs
in a chain may be discontiguous.
(2) For unbuffered/DIO, the source iov_iter is extracted into a bvecq
chain.
(3) An abstract position representation ('bvecq_pos') is created that can
be used to hold a position in a bvecq chain. For the moment, this takes
a ref on the bvecq it points to, but that may be excessive.
(4) Buffer tracking is managed with three cursors: The load_cursor, at
which new folios are added as we go; the dispatch_cursor, at which new
subrequests' buffers start when they're created; and the
collect_cursor, the point at which folios are being unlocked.
Not all cursors are necessarily needed in all situations and during
buffered writeback, we need a dispatch cursor per stream (one for the
network filesystem and one for the cache).
(3) ->prepare_read(), buffer set-up and ->issue_read() are merged, as
are the write variants, with the filesystem calling back up to
netfslib to prepare its buffer. This simplifies the process of
setting up a subrequest. It may even make sense to have the
filesystem allocate the subrequest.
(6) Retry dispatch tracking is added to netfs_io_request so that the
buffer preparation functions can find it. Retry requires an
additional buffer cursor.
(7) Netfslib dispatches I/O by accumulating enough bufferage to dispatch
at least one subrequest, then looping to generate as many as the
filesystem wants to (they may be limited by other constraints,
e.g. max RDMA segment count or negotiated max size). This loop could
be moved down into the filesystem. A new method is provided by which
netfslib can ask the filesystem to provide an estimate of the data
that should be accumulated before dispatch begins.
(8) Reading from the cache is now managed by querying the cache to provide
a list of the next two data extents within the cache.
(9) AFS directories are switched to using a bvecq rather than a
folio_queue to hold their contents.
(10) CIFS is switched to using a bvecq rather than a folio_queue for holding
a temporary encryption buffer.
(11) CIFS RDMA is given the ability to extract ITER_BVECQ and support for
extracting ITER_FOLIOQ, ITER_BVEC and ITER_KVEC is removed.
(12) All the folio_queue and rolling_buffer code is removed.
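As a rough model of the three cursors described in point (4) above, treated
here as flat byte offsets for simplicity (in the series each cursor would be
a position within the bvecq chain, i.e. a 'bvecq_pos'); the function and
field names are illustrative, not from the patches:

```c
#include <assert.h>

/* Three cursors into the buffered range; the invariant is
 * collect <= dispatch <= load.
 */
struct rb_cursors {
	unsigned long long collect;	/* folios being unlocked */
	unsigned long long dispatch;	/* next subrequest starts here */
	unsigned long long load;	/* new folios are appended here */
};

/* Append len bytes of freshly loaded folios. */
static void rb_load(struct rb_cursors *c, unsigned long long len)
{
	c->load += len;
}

/* Carve the next subrequest out of [dispatch, load), capped at max
 * (e.g. a negotiated max size); returns the subrequest's length.
 */
static unsigned long long rb_dispatch(struct rb_cursors *c,
				      unsigned long long max)
{
	unsigned long long avail = c->load - c->dispatch;
	unsigned long long len = avail < max ? avail : max;

	c->dispatch += len;
	return len;
}

/* Unlock completed folios; collection may not overtake dispatch. */
static void rb_collect(struct rb_cursors *c, unsigned long long len)
{
	assert(c->collect + len <= c->dispatch);
	c->collect += len;
}
```

Buffered writeback would hold one such dispatch cursor per stream (network
filesystem and cache) over the same load and collect positions.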
Cachefiles is also modified:
(1) The object type in the cachefiles file xattr is now correctly set to
CACHEFILES_CONTENT_{SINGLE,ALL,BACKFS_MAP} rather than just being 0,
to indicate whether we have a single monolithic blob, all the data up
to cache i_size with no holes or a sparse file with the data mapped by
the backing file system (as currently upstream).
(2) For "ALL" type files, the cache's i_size is used to track how much
data is saved in the cache and no longer bears any relation to the
netfs i_size. The actual object size is stored in the xattr.
(3) For most typical files, which are contiguous and written progressively,
the object type is now set to "ALL". For anything else, cachefiles
uses SEEK_DATA/HOLE to find extent outlines as before (this is the
current behaviour and needs to be fixed, but in a separate set of
patches as it's not trivial).
Two further things that I'm working on (but not in this branch) are:
(1) Make it so that a filesystem can be given a copy of a subchain which
it can then tack header and trailer protocol elements upon to form a
single message (I have this working for cifs) and even join copies
together with intervening protocol elements to form compounds.
(2) Make it so that a filesystem can 'splice' out the contents of the TCP
receive queue into a bvecq chain. This allows the socket lock to be
dropped much more quickly and the copying of data read to the
destination buffers to happen without the lock. I have this working
for cifs too. Kernel recvmsg() doesn't then block kernel sendmsg()
for anywhere near as long.
There are also some things I want to consider for the future:
(1) Create one or more batched iteration functions to 'unlock' all the
folios in a bio_vec[], where 'unlock' is the appropriate action for
ending a read or a write. Batching should hopefully also improve the
efficiency of wrangling the marks on the xarray. Very often these
marks are going to be represented by contiguous bits, so there may be
a way to change them in bulk.
(2) Rather than walking the bvecq chain to get each individual folio out
via bv_page, use the file position stored on the bvecq and the sum of
bv_len to iterate over the appropriate range in i_pages.
(3) Change iov_iter to store the initial starting point and for
iov_iter_revert() to reset to that and advance. This would (a) help
prevent over-reversion and (b) dispense with the need for a prev
pointer.
(4) Use bvecq to replace scatterlist. One problem with replacing
scatterlist is that crypto drivers like to glue bits on the front of
the scatterlists they're given (something trivial with that API) - and
this is one way to achieve it.
The patches can also be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-next
Thanks,
David
David Howells (22):
netfs: Fix read abandonment during retry
netfs: Fix the handling of stream->front by removing it
cachefiles: Fix excess dput() after end_removing()
cachefiles: Don't rely on backing fs storage map for most use cases
mm: Make readahead store folio count in readahead_control
netfs: Bulk load the readahead-provided folios up front
Add a function to kmap one page of a multipage bio_vec
iov_iter: Add a segmented queue of bio_vec[]
netfs: Add some tools for managing bvecq chains
netfs: Add a function to extract from an iter into a bvecq
afs: Use a bvecq to hold dir content rather than folioq
cifs: Use a bvecq for buffering instead of a folioq
cifs: Support ITER_BVECQ in smb_extract_iter_to_rdma()
netfs: Switch to using bvecq rather than folio_queue and
rolling_buffer
cifs: Remove support for ITER_KVEC/BVEC/FOLIOQ from
smb_extract_iter_to_rdma()
netfs: Remove netfs_alloc/free_folioq_buffer()
netfs: Remove netfs_extract_user_iter()
iov_iter: Remove ITER_FOLIOQ
netfs: Remove folio_queue and rolling_buffer
netfs: Check for too much data being read
netfs: Limit the minimum trigger for progress reporting
netfs: Combine prepare and issue ops and grab the buffers on request
Deepanshu Kartikey (2):
netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on
retry
netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iterators
Paulo Alcantara (1):
netfs: fix error handling in netfs_extract_user_iter()
Viacheslav Dubeyko (1):
netfs: fix VM_BUG_ON_FOLIO() issue in netfs_write_begin() call
Documentation/core-api/folio_queue.rst | 209 ------
Documentation/core-api/index.rst | 1 -
fs/9p/vfs_addr.c | 49 +-
fs/afs/dir.c | 43 +-
fs/afs/dir_edit.c | 43 +-
fs/afs/dir_search.c | 33 +-
fs/afs/file.c | 27 +-
fs/afs/fsclient.c | 8 +-
fs/afs/inode.c | 20 +-
fs/afs/internal.h | 14 +-
fs/afs/write.c | 35 +-
fs/afs/yfsclient.c | 6 +-
fs/cachefiles/interface.c | 82 +-
fs/cachefiles/internal.h | 10 +-
fs/cachefiles/io.c | 486 ++++++++----
fs/cachefiles/namei.c | 55 +-
fs/cachefiles/xattr.c | 20 +-
fs/ceph/Kconfig | 1 +
fs/ceph/addr.c | 127 ++--
fs/netfs/Kconfig | 3 +
fs/netfs/Makefile | 4 +-
fs/netfs/buffered_read.c | 524 ++++++++-----
fs/netfs/buffered_write.c | 30 +-
fs/netfs/bvecq.c | 706 ++++++++++++++++++
fs/netfs/direct_read.c | 119 ++-
fs/netfs/direct_write.c | 174 +++--
fs/netfs/fscache_io.c | 6 -
fs/netfs/internal.h | 89 ++-
fs/netfs/iterator.c | 305 +++-----
fs/netfs/misc.c | 147 +---
fs/netfs/objects.c | 21 +-
fs/netfs/read_collect.c | 134 ++--
fs/netfs/read_pgpriv2.c | 168 +++--
fs/netfs/read_retry.c | 227 +++---
fs/netfs/read_single.c | 170 +++--
fs/netfs/rolling_buffer.c | 222 ------
fs/netfs/stats.c | 6 +-
fs/netfs/write_collect.c | 126 +++-
fs/netfs/write_issue.c | 987 ++++++++++++++-----------
fs/netfs/write_retry.c | 135 ++--
fs/nfs/Kconfig | 1 +
fs/nfs/fscache.c | 24 +-
fs/smb/client/cifsglob.h | 2 +-
fs/smb/client/cifssmb.c | 13 +-
fs/smb/client/file.c | 146 ++--
fs/smb/client/smb2ops.c | 78 +-
fs/smb/client/smb2pdu.c | 28 +-
fs/smb/client/smbdirect.c | 152 +---
fs/smb/client/transport.c | 15 +-
include/linux/bvec.h | 21 +
include/linux/bvecq.h | 205 +++++
include/linux/folio_queue.h | 282 -------
include/linux/fscache.h | 17 +
include/linux/iov_iter.h | 68 +-
include/linux/netfs.h | 145 ++--
include/linux/pagemap.h | 1 +
include/linux/rolling_buffer.h | 61 --
include/linux/uio.h | 17 +-
include/trace/events/cachefiles.h | 17 +-
include/trace/events/netfs.h | 123 ++-
lib/iov_iter.c | 395 +++++-----
lib/scatterlist.c | 57 +-
lib/tests/kunit_iov_iter.c | 185 ++---
mm/readahead.c | 4 +
net/9p/client.c | 8 +-
65 files changed, 4144 insertions(+), 3493 deletions(-)
delete mode 100644 Documentation/core-api/folio_queue.rst
create mode 100644 fs/netfs/bvecq.c
delete mode 100644 fs/netfs/rolling_buffer.c
create mode 100644 include/linux/bvecq.h
delete mode 100644 include/linux/folio_queue.h
delete mode 100644 include/linux/rolling_buffer.h