* [RFC PATCH v2 0/8] NFSD: support DIO
@ 2025-07-08 16:06 Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers Mike Snitzer
` (7 more replies)
0 siblings, 8 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
Hi,
The v1 thread had quite a bit of discussion, and here are some
highlights worth reading to provide background:
jeff's summary: https://lore.kernel.org/linux-nfs/b1accdad470f19614f9d3865bb3a4c69958e5800.camel@kernel.org/
performance: https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
Changes since v1:
- Rebased ontop of Chuck's proposed fh_getattr movement (Chuck, I like
it so went with it).
- NFSD's expanding of misaligned READ DIO to be DIO-aligned that
resulted in 'start_extra' blocks being read into associated pages
caused incorrect data to be returned from NFSD. This was caught
using the 'dt' utility. See patch 8 for more details.
- This v2 patchset has been tested pretty extensively now.
I haven't time to explore Christoph's pagecache idea like I hoped:
https://lore.kernel.org/linux-nfs/aEqEQLumUp8Y7JR5@infradead.org/
Any help with hardening against page cache invalidation problems due
to mixing O_DIRECT and buffered IO appreciated. ;)
Thanks,
Mike
ps. I have a LOCALIO patchset that I'll be posting next that benefits
from this NFSD patchset.
Chuck Lever (2):
NFSD: Relocate the fh_want_write() and fh_drop_write() helpers
NFSD: Move the fh_getattr() helper
Mike Snitzer (6):
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
NFSD: pass nfsd_file to nfsd_iter_read()
NFSD: add io_cache_read controls to debugfs interface
NFSD: add io_cache_write controls to debugfs interface
NFSD: issue READs using O_DIRECT even if IO is misaligned
fs/nfsd/debugfs.c | 94 ++++++++++++++++++++++
fs/nfsd/filecache.c | 32 ++++++++
fs/nfsd/filecache.h | 4 +
fs/nfsd/nfs4xdr.c | 8 +-
fs/nfsd/nfsd.h | 9 +++
fs/nfsd/nfsfh.c | 27 +++++++
fs/nfsd/nfsfh.h | 38 +++++++++
fs/nfsd/trace.h | 37 +++++++++
fs/nfsd/vfs.c | 154 ++++++++++++++++++++++++++++++++++---
fs/nfsd/vfs.h | 35 +--------
include/linux/sunrpc/svc.h | 5 +-
lib/iov_iter.c | 5 +-
12 files changed, 395 insertions(+), 53 deletions(-)
--
2.44.0
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-10 13:59 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 2/8] NFSD: Move the fh_getattr() helper Mike Snitzer
` (6 subsequent siblings)
7 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Clean up: These helpers are part of the NFSD file handle API.
Relocate them to fs/nfsd/nfsfh.h.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/nfsfh.h | 37 +++++++++++++++++++++++++++++++++++++
fs/nfsd/vfs.h | 20 --------------------
2 files changed, 37 insertions(+), 20 deletions(-)
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 1cf979722521..6f5255d1c190 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -14,6 +14,8 @@
#include <linux/exportfs.h>
#include <linux/nfs4.h>
+#include "export.h"
+
/*
* The file handle starts with a sequence of four-byte words.
* The first word contains a version number (1) and three descriptor bytes
@@ -271,6 +273,41 @@ static inline bool fh_fsid_match(const struct knfsd_fh *fh1,
return true;
}
+/**
+ * fh_want_write - Get write access to an export
+ * @fhp: File handle of file to be written
+ *
+ * Caller must invoke fh_drop_write() when its write operation
+ * is complete.
+ *
+ * Returns 0 if the file handle's export can be written to. Otherwise
+ * the export is not prepared for updates, and the returned negative
+ * errno value reflects the reason for the failure.
+ */
+static inline int fh_want_write(struct svc_fh *fhp)
+{
+ int ret;
+
+ if (fhp->fh_want_write)
+ return 0;
+ ret = mnt_want_write(fhp->fh_export->ex_path.mnt);
+ if (!ret)
+ fhp->fh_want_write = true;
+ return ret;
+}
+
+/**
+ * fh_drop_write - Release write access on an export
+ * @fhp: File handle of file on which fh_want_write() was previously called
+ */
+static inline void fh_drop_write(struct svc_fh *fhp)
+{
+ if (fhp->fh_want_write) {
+ fhp->fh_want_write = false;
+ mnt_drop_write(fhp->fh_export->ex_path.mnt);
+ }
+}
+
/**
* knfsd_fh_hash - calculate the crc32 hash for the filehandle
* @fh - pointer to filehandle
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index eff04959606f..4007dcbbbfef 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,26 +160,6 @@ __be32 nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
void nfsd_filp_close(struct file *fp);
-static inline int fh_want_write(struct svc_fh *fh)
-{
- int ret;
-
- if (fh->fh_want_write)
- return 0;
- ret = mnt_want_write(fh->fh_export->ex_path.mnt);
- if (!ret)
- fh->fh_want_write = true;
- return ret;
-}
-
-static inline void fh_drop_write(struct svc_fh *fh)
-{
- if (fh->fh_want_write) {
- fh->fh_want_write = false;
- mnt_drop_write(fh->fh_export->ex_path.mnt);
- }
-}
-
static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
{
u32 request_mask = STATX_BASIC_STATS;
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 2/8] NFSD: Move the fh_getattr() helper
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-10 13:59 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
` (5 subsequent siblings)
7 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
From: Chuck Lever <chuck.lever@oracle.com>
Clean up: The fh_getattr() function is part of NFSD's file handle
API, so relocate it.
I've made it an un-inlined function so that trace points and new
functionality can easily be introduced. That increases the size of
nfsd.ko by about a page on my x86_64 system (out of 26MB; compiled
with -O2).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/nfsfh.c | 23 +++++++++++++++++++++++
fs/nfsd/nfsfh.h | 1 +
fs/nfsd/vfs.h | 13 -------------
3 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 16e6b4428d55..f4a3cc9e31e0 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -663,6 +663,29 @@ fh_update(struct svc_fh *fhp)
return nfserr_serverfault;
}
+/**
+ * fh_getattr - Retrieve attributes on a local file
+ * @fhp: File handle of target file
+ * @stat: Caller-supplied kstat buffer to be filled in
+ *
+ * Returns nfs_ok on success, otherwise an NFS status code is
+ * returned.
+ */
+__be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
+{
+ struct path p = {
+ .mnt = fhp->fh_export->ex_path.mnt,
+ .dentry = fhp->fh_dentry,
+ };
+ u32 request_mask = STATX_BASIC_STATS;
+
+ if (fhp->fh_maxsize == NFS4_FHSIZE)
+ request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
+
+ return nfserrno(vfs_getattr(&p, stat, request_mask,
+ AT_STATX_SYNC_AS_STAT));
+}
+
/**
* fh_fill_pre_attrs - Fill in pre-op attributes
* @fhp: file handle to be updated
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 6f5255d1c190..5ef7191f8ad8 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -222,6 +222,7 @@ extern char * SVCFH_fmt(struct svc_fh *fhp);
__be32 fh_verify(struct svc_rqst *, struct svc_fh *, umode_t, int);
__be32 fh_verify_local(struct net *, struct svc_cred *, struct auth_domain *,
struct svc_fh *, umode_t, int);
+__be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat);
__be32 fh_compose(struct svc_fh *, struct svc_export *, struct dentry *, struct svc_fh *);
__be32 fh_update(struct svc_fh *);
void fh_put(struct svc_fh *);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 4007dcbbbfef..0c0292611c6d 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,17 +160,4 @@ __be32 nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
void nfsd_filp_close(struct file *fp);
-static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
-{
- u32 request_mask = STATX_BASIC_STATS;
- struct path p = {.mnt = fh->fh_export->ex_path.mnt,
- .dentry = fh->fh_dentry};
-
- if (fh->fh_maxsize == NFS4_FHSIZE)
- request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
-
- return nfserrno(vfs_getattr(&p, stat, request_mask,
- AT_STATX_SYNC_AS_STAT));
-}
-
#endif /* LINUX_NFSD_VFS_H */
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 2/8] NFSD: Move the fh_getattr() helper Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-10 7:45 ` Christoph Hellwig
2025-07-08 16:06 ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
` (4 subsequent siblings)
7 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
alignment attributes from underlying filesystem in associated
nfsd_file. This is done when the nfsd_file is first opened for
a regular file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/filecache.c | 32 ++++++++++++++++++++++++++++++++
fs/nfsd/filecache.h | 4 ++++
fs/nfsd/nfsfh.c | 4 ++++
3 files changed, 40 insertions(+)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 905034fd8c34..e8a9336b42f5 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -231,6 +231,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
refcount_set(&nf->nf_ref, 1);
nf->nf_may = need;
nf->nf_mark = NULL;
+ nf->nf_dio_mem_align = 0;
+ nf->nf_dio_offset_align = 0;
+ nf->nf_dio_read_offset_align = 0;
return nf;
}
@@ -1064,6 +1067,33 @@ nfsd_file_is_cached(struct inode *inode)
return ret;
}
+static __be32
+nfsd_file_getattr(const struct svc_fh *fhp, struct nfsd_file *nf)
+{
+ struct inode *inode = file_inode(nf->nf_file);
+ struct kstat stat;
+ __be32 status;
+
+ /* Currently only need to get DIO alignment info for regular files */
+ if (!S_ISREG(inode->i_mode))
+ return nfs_ok;
+
+ status = fh_getattr(fhp, &stat);
+ if (status != nfs_ok)
+ return status;
+
+ if (stat.result_mask & STATX_DIOALIGN) {
+ nf->nf_dio_mem_align = stat.dio_mem_align;
+ nf->nf_dio_offset_align = stat.dio_offset_align;
+ }
+ if (stat.result_mask & STATX_DIO_READ_ALIGN)
+ nf->nf_dio_read_offset_align = stat.dio_read_offset_align;
+ else
+ nf->nf_dio_read_offset_align = nf->nf_dio_offset_align;
+
+ return status;
+}
+
static __be32
nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct svc_cred *cred,
@@ -1182,6 +1212,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
}
status = nfserrno(ret);
trace_nfsd_file_open(nf, status);
+ if (status == nfs_ok)
+ status = nfsd_file_getattr(fhp, nf);
}
} else
status = nfserr_jukebox;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index cd02f91aaef1..d41428ce8a11 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -54,6 +54,10 @@ struct nfsd_file {
struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
+
+ u32 nf_dio_mem_align;
+ u32 nf_dio_offset_align;
+ u32 nf_dio_read_offset_align;
};
int nfsd_file_cache_init(void);
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index f4a3cc9e31e0..bdba2ba828a6 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -677,8 +677,12 @@ __be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
.mnt = fhp->fh_export->ex_path.mnt,
.dentry = fhp->fh_dentry,
};
+ struct inode *inode = d_inode(p.dentry);
u32 request_mask = STATX_BASIC_STATS;
+ if (S_ISREG(inode->i_mode))
+ request_mask |= (STATX_DIOALIGN | STATX_DIO_READ_ALIGN);
+
if (fhp->fh_maxsize == NFS4_FHSIZE)
request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
` (2 preceding siblings ...)
2025-07-08 16:06 ` [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-10 7:24 ` Christoph Hellwig
2025-07-10 13:52 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 5/8] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
` (3 subsequent siblings)
7 siblings, 2 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
iov_iter_aligned_bvec() is strictly checking alignment of each element
of the bvec to arrive at whether the bvec is aligned relative to
dma_alignment and on-disk alignment. Checking each element
individually results in disallowing a bvec that in aggregate is
perfectly aligned relative to the provided @len_mask.
Relax the on-disk alignment checking such that it is done on the full
extent described by the bvec but still do piecewise checking of the
dma_alignment for each bvec's bv_offset.
This allows for NFS's WRITE payload to be issued using O_DIRECT as
long as the bvec created with xdr_buf_to_bvec() is composed of pages
that respect the underlying device's dma_alignment (@addr_mask) and
the overall contiguous on-disk extent is aligned relative to the
logical_block_size (@len_mask).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
lib/iov_iter.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bdb37d572e97..b2ae482b8a1d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
unsigned skip = i->iov_offset;
size_t size = i->count;
+ if (size & len_mask)
+ return false;
+
do {
size_t len = bvec->bv_len;
if (len > size)
len = size;
- if (len & len_mask)
- return false;
if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
return false;
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 5/8] NFSD: pass nfsd_file to nfsd_iter_read()
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
` (3 preceding siblings ...)
2025-07-08 16:06 ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
` (2 subsequent siblings)
7 siblings, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
Prepares for nfsd_iter_read() to use DIO alignment stored in nfsd_file.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/nfs4xdr.c | 8 ++++----
fs/nfsd/vfs.c | 7 ++++---
fs/nfsd/vfs.h | 2 +-
3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 2ee218ff4958..4e29297c610a 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4476,7 +4476,7 @@ static __be32 nfsd4_encode_splice_read(
static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
struct nfsd4_read *read,
- struct file *file, unsigned long maxcount)
+ unsigned long maxcount)
{
struct xdr_stream *xdr = resp->xdr;
unsigned int base = xdr->buf->page_len & ~PAGE_MASK;
@@ -4487,7 +4487,7 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
if (xdr_reserve_space_vec(xdr, maxcount) < 0)
return nfserr_resource;
- nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, file,
+ nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
read->rd_offset, &maxcount, base,
&read->rd_eof);
read->rd_length = maxcount;
@@ -4534,7 +4534,7 @@ nfsd4_encode_read(struct nfsd4_compoundres *resp, __be32 nfserr,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr) {
xdr_truncate_encode(xdr, eof_offset);
return nfserr;
@@ -5430,7 +5430,7 @@ nfsd4_encode_read_plus_data(struct nfsd4_compoundres *resp,
if (file->f_op->splice_read && splice_ok)
nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
else
- nfserr = nfsd4_encode_readv(resp, read, file, maxcount);
+ nfserr = nfsd4_encode_readv(resp, read, maxcount);
if (nfserr)
return nfserr;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2edf76feaeb9..845c212ad10b 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1067,7 +1067,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
* @fhp: file handle of file to be read
- * @file: opened struct file of file to be read
+ * @nf: opened struct nfsd_file of file to be read
* @offset: starting byte offset
* @count: IN: requested number of bytes; OUT: number of bytes read
* @base: offset in first page of read buffer
@@ -1080,9 +1080,10 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
* returned.
*/
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset, unsigned long *count,
+ struct nfsd_file *nf, loff_t offset, unsigned long *count,
unsigned int base, u32 *eof)
{
+ struct file *file = nf->nf_file;
unsigned long v, total;
struct iov_iter iter;
struct kiocb kiocb;
@@ -1304,7 +1305,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (file->f_op->splice_read && nfsd_read_splice_ok(rqstp))
err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
else
- err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
+ err = nfsd_iter_read(rqstp, fhp, nf, offset, count, 0, eof);
nfsd_file_put(nf);
trace_nfsd_read_done(rqstp, fhp, offset, *count);
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 0c0292611c6d..fa46f8b5f132 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -121,7 +121,7 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned long *count,
u32 *eof);
__be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
- struct file *file, loff_t offset,
+ struct nfsd_file *nf, loff_t offset,
unsigned long *count, unsigned int base,
u32 *eof);
bool nfsd_read_splice_ok(struct svc_rqst *rqstp);
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
` (4 preceding siblings ...)
2025-07-08 16:06 ` [RFC PATCH v2 5/8] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-10 7:47 ` Christoph Hellwig
2025-07-10 14:06 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 7/8] NFSD: add io_cache_write " Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
7 siblings, 2 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
read by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=1).
- not cached (NFSD_IO_DIRECT=2)
io_cache_read is 0 by default. It may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_read
If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the read must be aligned relative to the
underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 8 +++++++
fs/nfsd/vfs.c | 15 ++++++++++++++
3 files changed, 76 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc..709646af797a 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -27,11 +27,61 @@ static int nfsd_dsr_get(void *data, u64 *val)
static int nfsd_dsr_set(void *data, u64 val)
{
nfsd_disable_splice_read = (val > 0) ? true : false;
+ if (!nfsd_disable_splice_read) {
+ /*
+ * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
+ * if splice_read is enabled.
+ */
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ }
return 0;
}
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_read
+ *
+ * Contents:
+ * %0: NFS READ will use buffered IO (default)
+ * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS READ will use direct IO
+ *
+ * The default value of this setting is zero (buffered IO is
+ * used). This setting takes immediate effect for all NFS
+ * versions, all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_read_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_read;
+ return 0;
+}
+
+static int nfsd_io_cache_read_set(void *data, u64 val)
+{
+ switch (val) {
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ /*
+ * Must disable splice_read when enabling
+ * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
+ */
+ nfsd_disable_splice_read = true;
+ nfsd_io_cache_read = val;
+ break;
+ case NFSD_IO_BUFFERED:
+ default:
+ nfsd_io_cache_read = NFSD_IO_BUFFERED;
+ break;
+ }
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
+ nfsd_io_cache_read_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -44,4 +94,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_dsr_fops);
+
+ debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2..4740567f4e7e 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -153,6 +153,14 @@ static inline void nfsd_debugfs_exit(void) {}
extern bool nfsd_disable_splice_read __read_mostly;
+enum {
+ NFSD_IO_BUFFERED = 0,
+ NFSD_IO_DONTCACHE,
+ NFSD_IO_DIRECT
+};
+
+extern u64 nfsd_io_cache_read __read_mostly;
+
extern int nfsd_max_blksize;
static inline int nfsd_v4client(struct svc_rqst *rq)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 845c212ad10b..632ce417f4ef 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
#define NFSDDBG_FACILITY NFSDDBG_FILEOP
bool nfsd_disable_splice_read __read_mostly;
+u64 nfsd_io_cache_read __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1107,6 +1108,20 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
trace_nfsd_read_vector(rqstp, fhp, offset, *count);
iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+
+ switch (nsfd_io_cache_read) {
+ case NFSD_IO_DIRECT:
+ if (iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
+ nf->nf_dio_read_offset_align - 1))
+ kiocb.ki_flags = IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ break;
+ case NFSD_IO_BUFFERED:
+ break;
+ }
+
host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 7/8] NFSD: add io_cache_write controls to debugfs interface
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
` (5 preceding siblings ...)
2025-07-08 16:06 ` [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
7 siblings, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
Add 'io_cache_write' to NFSD's debugfs interface so that: Any data
written by NFSD will either be:
- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from the page cache upon completion
(NFSD_IO_DONTCACHE=1).
- not cached (NFSD_IO_DIRECT=2)
io_cache_write is 0 by default. It may be set by writing to:
/sys/kernel/debug/nfsd/io_cache_write
If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
advertised as supported by the underlying filesystem (e.g. XFS),
otherwise all IO flagged with RWF_DONTCACHE will fail with
-EOPNOTSUPP.
If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
relative to the underlying block device's logical_block_size. Also the
memory buffer used to store the WRITE payload must be aligned relative
to the underlying block device's dma_alignment.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/debugfs.c | 41 +++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfsd.h | 1 +
fs/nfsd/vfs.c | 15 +++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 709646af797a..fadf1d88f640 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -82,6 +82,44 @@ static int nfsd_io_cache_read_set(void *data, u64 val)
DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
nfsd_io_cache_read_set, "%llu\n");
+/*
+ * /sys/kernel/debug/nfsd/io_cache_write
+ *
+ * Contents:
+ * %0: NFS WRITE will use buffered IO (default)
+ * %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
+ * %2: NFS WRITE will use direct IO
+ *
+ * The default value of this setting is zero (buffered IO is
+ * used). This setting takes immediate effect for all NFS
+ * versions, all exports, and in all NFSD net namespaces.
+ */
+
+static int nfsd_io_cache_write_get(void *data, u64 *val)
+{
+ *val = nfsd_io_cache_write;
+ return 0;
+}
+
+static int nfsd_io_cache_write_set(void *data, u64 val)
+{
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ case NFSD_IO_DONTCACHE:
+ case NFSD_IO_DIRECT:
+ nfsd_io_cache_write = val;
+ break;
+ default:
+ nfsd_io_cache_write = NFSD_IO_BUFFERED;
+ break;
+ }
+
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_write_fops, nfsd_io_cache_write_get,
+ nfsd_io_cache_write_set, "%llu\n");
+
void nfsd_debugfs_exit(void)
{
debugfs_remove_recursive(nfsd_top_dir);
@@ -97,4 +135,7 @@ void nfsd_debugfs_init(void)
debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
+
+ debugfs_create_file("io_cache_write", S_IWUSR | S_IRUGO,
+ nfsd_top_dir, NULL, &nfsd_io_cache_write_fops);
}
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 4740567f4e7e..1ae38c5557c4 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -160,6 +160,7 @@ enum {
};
extern u64 nfsd_io_cache_read __read_mostly;
+extern u64 nfsd_io_cache_write __read_mostly;
extern int nfsd_max_blksize;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 632ce417f4ef..05a7ba383334 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -50,6 +50,7 @@
bool nfsd_disable_splice_read __read_mostly;
u64 nfsd_io_cache_read __read_mostly;
+u64 nfsd_io_cache_write __read_mostly;
/**
* nfserrno - Map Linux errnos to NFS errnos
@@ -1228,6 +1229,20 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
+
+ switch (nfsd_io_cache_write) {
+ case NFSD_IO_DIRECT:
+ if (iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
+ nf->nf_dio_offset_align - 1))
+ kiocb.ki_flags = IOCB_DIRECT;
+ break;
+ case NFSD_IO_DONTCACHE:
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ break;
+ case NFSD_IO_BUFFERED:
+ break;
+ }
+
since = READ_ONCE(file->f_wb_err);
if (verf)
nfsd_copy_write_verifier(verf, nn);
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
` (6 preceding siblings ...)
2025-07-08 16:06 ` [RFC PATCH v2 7/8] NFSD: add io_cache_write " Mike Snitzer
@ 2025-07-08 16:06 ` Mike Snitzer
2025-07-08 21:22 ` Mike Snitzer
2025-07-10 7:51 ` Christoph Hellwig
7 siblings, 2 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 16:06 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, snitzer
If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
DIO-aligned block (on either end of the READ).
Must allocate and use a bounce-buffer page (called 'start_extra_page')
if/when expanding the misaligned READ requires reading extra partial
page at the start of the READ so that its DIO-aligned. Otherwise those
extra pages at the start will make their way back to the NFS client
and corruption will occur (as found, and then this fix of using an
extra page verified, with the 'dt' utility, using:
dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
iotype=sequential pattern=iot onerr=abort oncerr=abort
see: https://github.com/RobinTMiller/dt.git )
Reserve an extra page in svc_serv_maxpages() because nfsd_iter_read()
might need two extra pages when a READ payload is not DIO-aligned --
but nfsd_iter_read() and nfsd_splice_actor() are mutually exclusive
(so reuse page reserved for nfsd_splice_actor).
Also add nfsd_read_vector_dio trace event. This combination of
trace events is useful:
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector_dio/enable
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
Which for this dd command:
dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
Results in:
nfsd-16580 [001] ..... 5672.403130: nfsd_read_vector_dio: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008 start=0+0 end=47104-96
nfsd-16580 [001] ..... 5672.403131: nfsd_read_vector: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47104
nfsd-16580 [001] ..... 5672.403134: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0x0 bytecount 0xb800
nfsd-16580 [001] ..... 5672.404380: nfsd_read_io_done: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008
nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector_dio: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008 start=46592+416 end=94208-192
nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=46592 len=47616
nfsd-16580 [001] ..... 5672.404673: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0xb600 bytecount 0xba00
nfsd-16580 [001] ..... 5672.405771: nfsd_read_io_done: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008
Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
fs/nfsd/trace.h | 37 ++++++++++
fs/nfsd/vfs.c | 141 +++++++++++++++++++++++++++++++------
include/linux/sunrpc/svc.h | 5 +-
3 files changed, 161 insertions(+), 22 deletions(-)
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index a664fdf1161e..55055482f8a8 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -473,6 +473,43 @@ DEFINE_NFSD_IO_EVENT(write_done);
DEFINE_NFSD_IO_EVENT(commit_start);
DEFINE_NFSD_IO_EVENT(commit_done);
+TRACE_EVENT(nfsd_read_vector_dio,
+ TP_PROTO(struct svc_rqst *rqstp,
+ struct svc_fh *fhp,
+ u64 offset,
+ u32 len,
+ loff_t start,
+ loff_t start_extra,
+ loff_t end,
+ loff_t end_extra),
+ TP_ARGS(rqstp, fhp, offset, len, start, start_extra, end, end_extra),
+ TP_STRUCT__entry(
+ __field(u32, xid)
+ __field(u32, fh_hash)
+ __field(u64, offset)
+ __field(u32, len)
+ __field(loff_t, start)
+ __field(loff_t, start_extra)
+ __field(loff_t, end)
+ __field(loff_t, end_extra)
+ ),
+ TP_fast_assign(
+ __entry->xid = be32_to_cpu(rqstp->rq_xid);
+ __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
+ __entry->offset = offset;
+ __entry->len = len;
+ __entry->start = start;
+ __entry->start_extra = start_extra;
+ __entry->end = end;
+ __entry->end_extra = end_extra;
+ ),
+ TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%llu end=%llu-%llu",
+ __entry->xid, __entry->fh_hash,
+ __entry->offset, __entry->len,
+ __entry->start, __entry->start_extra,
+ __entry->end, __entry->end_extra)
+);
+
DECLARE_EVENT_CLASS(nfsd_err_class,
TP_PROTO(struct svc_rqst *rqstp,
struct svc_fh *fhp,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 05a7ba383334..2d21b8ec2d32 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -19,6 +19,7 @@
#include <linux/splice.h>
#include <linux/falloc.h>
#include <linux/fcntl.h>
+#include <linux/math.h>
#include <linux/namei.h>
#include <linux/delay.h>
#include <linux/fsnotify.h>
@@ -1065,6 +1066,74 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
+static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
+ loff_t offset, __u32 len,
+ const u32 dio_blocksize,
+ loff_t *start, loff_t *end,
+ loff_t *start_extra, loff_t *end_extra)
+{
+ loff_t orig_end = offset + len;
+
+ if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
+ "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
+ __func__, dio_blocksize, PAGE_SIZE))
+ return false;
+
+ *start = round_down(offset, dio_blocksize);
+ *end = round_up(orig_end, dio_blocksize);
+ *start_extra = offset - *start;
+ *end_extra = *end - orig_end;
+
+ /* Show original offset and count, and how it was expanded for DIO */
+ trace_nfsd_read_vector_dio(rqstp, fhp, offset, len,
+ *start, *start_extra, *end, *end_extra);
+
+ return (*start_extra || *end_extra);
+}
+
+static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
+ loff_t start_extra, loff_t end_extra,
+ ssize_t bytes_read,
+ unsigned long bytes_expected,
+ loff_t *offset,
+ unsigned long *rq_bvec_numpages,
+ struct page *start_extra_page)
+{
+ ssize_t host_err = bytes_read;
+ loff_t v;
+
+ /* Must remove first start_extra_page from rqstp->rq_bvec */
+ if (start_extra_page) {
+ __free_page(start_extra_page);
+ *rq_bvec_numpages -= 1;
+ v = *rq_bvec_numpages;
+ for (int i = 0; i < v; i++) {
+ struct bio_vec *bv = &rqstp->rq_bvec[i+1];
+ bvec_set_page(&rqstp->rq_bvec[i], bv->bv_page,
+ bv->bv_offset, bv->bv_len);
+ }
+ }
+ /* Then eliminate the end_extra bytes from the last page */
+ v = *rq_bvec_numpages;
+ rqstp->rq_bvec[v].bv_len -= end_extra;
+
+ if (host_err < 0)
+ return host_err;
+
+ /* Must adjust returned read size to reflect original extent */
+ *offset += start_extra;
+ if (likely(host_err >= start_extra)) {
+ host_err -= start_extra;
+ if (host_err > bytes_expected)
+ host_err = bytes_expected;
+ } else {
+ /* Short read that didn't read any of requested data */
+ host_err = 0;
+ }
+
+ return host_err;
+}
+
/**
* nfsd_iter_read - Perform a VFS read using an iterator
* @rqstp: RPC transaction context
@@ -1086,44 +1155,74 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
unsigned int base, u32 *eof)
{
struct file *file = nf->nf_file;
- unsigned long v, total;
+ unsigned long v, total, in_count = *count;
+ loff_t start_extra = 0, end_extra = 0;
+ struct page *start_extra_page = NULL;
struct iov_iter iter;
struct kiocb kiocb;
- ssize_t host_err;
+ ssize_t host_err = 0;
size_t len;
init_sync_kiocb(&kiocb, file);
+
+ /*
+ * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
+ * the next DIO-aligned block (on either end of the READ).
+ */
+ if ((nfsd_io_cache_read == NFSD_IO_DIRECT) &&
+ nf->nf_dio_mem_align && (base & (nf->nf_dio_mem_align-1)) == 0) {
+ loff_t start, end;
+ if (nfsd_analyze_read_dio(rqstp, fhp, offset, in_count,
+ nf->nf_dio_read_offset_align,
+ &start, &end, &start_extra, &end_extra)) {
+ /* trace_nfsd_read_vector() will reflect larger DIO-aligned READ */
+ offset = start;
+ in_count = end - start;
+ kiocb.ki_flags = IOCB_DIRECT;
+ }
+ } else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
+ kiocb.ki_flags = IOCB_DONTCACHE;
+
kiocb.ki_pos = offset;
v = 0;
- total = *count;
+ if (start_extra) {
+ start_extra_page = alloc_page(GFP_KERNEL);
+ if (WARN_ONCE(start_extra_page == NULL,
+ "%s: Unable to allocate start_extra_page\n",
+ __func__))
+ host_err = -ENOMEM;
+ else
+ bvec_set_page(&rqstp->rq_bvec[v++], start_extra_page,
+ start_extra, PAGE_SIZE - start_extra);
+ }
+ total = in_count - start_extra;
while (total) {
len = min_t(size_t, total, PAGE_SIZE - base);
- bvec_set_page(&rqstp->rq_bvec[v], *(rqstp->rq_next_page++),
+ bvec_set_page(&rqstp->rq_bvec[v++], *(rqstp->rq_next_page++),
len, base);
total -= len;
- ++v;
base = 0;
}
- WARN_ON_ONCE(v > rqstp->rq_maxpages);
-
- trace_nfsd_read_vector(rqstp, fhp, offset, *count);
- iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
+ /* even though start_extra_page is independent, rqstp->rq_bvec is used */
+ if (WARN_ONCE(v > rqstp->rq_maxpages,
+ "%s: v=%lu exceeds rqstp->rq_maxpages=%lu\n", __func__,
+ v, rqstp->rq_maxpages)) {
+ host_err = -EINVAL;
+ }
- switch (nsfd_io_cache_read) {
- case NFSD_IO_DIRECT:
- if (iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
- nf->nf_dio_read_offset_align - 1))
- kiocb.ki_flags = IOCB_DIRECT;
- break;
- case NFSD_IO_DONTCACHE:
- kiocb.ki_flags = IOCB_DONTCACHE;
- break;
- case NFSD_IO_BUFFERED:
- break;
+ if (!host_err) {
+ trace_nfsd_read_vector(rqstp, fhp, offset, in_count);
+ iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, in_count);
+ host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
}
- host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
+ if (start_extra || end_extra) {
+ host_err = nfsd_complete_misaligned_read_dio(rqstp, start_extra,
+ end_extra, host_err,
+ *count, &offset, &v,
+ start_extra_page);
+ }
return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
}
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e64ab444e0a7..190c2667500e 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -163,10 +163,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* pages, one for the request, and one for the reply.
* nfsd_splice_actor() might need an extra page when a READ payload
* is not page-aligned.
+ * nfsd_iter_read() might need two extra pages when a READ payload
+ * is not DIO-aligned -- but nfsd_iter_read() and nfsd_splice_actor()
+ * are mutually exclusive (so reuse page reserved for nfsd_splice_actor).
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
- return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+ return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
}
/*
--
2.44.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-07-08 16:06 ` [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
@ 2025-07-08 21:22 ` Mike Snitzer
2025-07-10 7:51 ` Christoph Hellwig
1 sibling, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-08 21:22 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm
On Tue, Jul 08, 2025 at 12:06:19PM -0400, Mike Snitzer wrote:
> If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
> DIO-aligned block (on either end of the READ).
>
> Must allocate and use a bounce-buffer page (called 'start_extra_page')
> if/when expanding the misaligned READ requires reading extra partial
> page at the start of the READ so that its DIO-aligned. Otherwise those
> extra pages at the start will make their way back to the NFS client
> and corruption will occur (as found, and then this fix of using an
> extra page verified, with the 'dt' utility, using:
> dt of=/mnt/share1/dt_a.test passes=1 bs=47008 count=2 \
> iotype=sequential pattern=iot onerr=abort oncerr=abort
> see: https://github.com/RobinTMiller/dt.git )
>
> Reserve an extra page in svc_serv_maxpages() because nfsd_iter_read()
> might need two extra pages when a READ payload is not DIO-aligned --
> but nfsd_iter_read() and nfsd_splice_actor() are mutually exclusive
> (so reuse page reserved for nfsd_splice_actor).
>
> Also add nfsd_read_vector_dio trace event. This combination of
> trace events is useful:
>
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector_dio/enable
> echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
> echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
>
> Which for this dd command:
>
> dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct
>
> Results in:
>
> nfsd-16580 [001] ..... 5672.403130: nfsd_read_vector_dio: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008 start=0+0 end=47104-96
> nfsd-16580 [001] ..... 5672.403131: nfsd_read_vector: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47104
> nfsd-16580 [001] ..... 5672.403134: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0x0 bytecount 0xb800
> nfsd-16580 [001] ..... 5672.404380: nfsd_read_io_done: xid=0x5ccf019c fh_hash=0xe4dadb60 offset=0 len=47008
>
> nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector_dio: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008 start=46592+416 end=94208-192
> nfsd-16580 [001] ..... 5672.404672: nfsd_read_vector: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=46592 len=47616
> nfsd-16580 [001] ..... 5672.404673: xfs_file_direct_read: dev 253:0 ino 0x1c2388c1 disize 0x16f40 pos 0xb600 bytecount 0xba00
> nfsd-16580 [001] ..... 5672.405771: nfsd_read_io_done: xid=0x5dcf019c fh_hash=0xe4dadb60 offset=47008 len=47008
>
> Suggested-by: Jeff Layton <jlayton@kernel.org>
> Suggested-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> fs/nfsd/trace.h | 37 ++++++++++
> fs/nfsd/vfs.c | 141 +++++++++++++++++++++++++++++++------
> include/linux/sunrpc/svc.h | 5 +-
> 3 files changed, 161 insertions(+), 22 deletions(-)
>
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index a664fdf1161e..55055482f8a8 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -473,6 +473,43 @@ DEFINE_NFSD_IO_EVENT(write_done);
> DEFINE_NFSD_IO_EVENT(commit_start);
> DEFINE_NFSD_IO_EVENT(commit_done);
>
> +TRACE_EVENT(nfsd_read_vector_dio,
> + TP_PROTO(struct svc_rqst *rqstp,
> + struct svc_fh *fhp,
> + u64 offset,
> + u32 len,
> + loff_t start,
> + loff_t start_extra,
> + loff_t end,
> + loff_t end_extra),
> + TP_ARGS(rqstp, fhp, offset, len, start, start_extra, end, end_extra),
> + TP_STRUCT__entry(
> + __field(u32, xid)
> + __field(u32, fh_hash)
> + __field(u64, offset)
> + __field(u32, len)
> + __field(loff_t, start)
> + __field(loff_t, start_extra)
> + __field(loff_t, end)
> + __field(loff_t, end_extra)
> + ),
> + TP_fast_assign(
> + __entry->xid = be32_to_cpu(rqstp->rq_xid);
> + __entry->fh_hash = knfsd_fh_hash(&fhp->fh_handle);
> + __entry->offset = offset;
> + __entry->len = len;
> + __entry->start = start;
> + __entry->start_extra = start_extra;
> + __entry->end = end;
> + __entry->end_extra = end_extra;
> + ),
> + TP_printk("xid=0x%08x fh_hash=0x%08x offset=%llu len=%u start=%llu+%llu end=%llu-%llu",
> + __entry->xid, __entry->fh_hash,
> + __entry->offset, __entry->len,
> + __entry->start, __entry->start_extra,
> + __entry->end, __entry->end_extra)
> +);
> +
> DECLARE_EVENT_CLASS(nfsd_err_class,
> TP_PROTO(struct svc_rqst *rqstp,
> struct svc_fh *fhp,
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 05a7ba383334..2d21b8ec2d32 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -19,6 +19,7 @@
> #include <linux/splice.h>
> #include <linux/falloc.h>
> #include <linux/fcntl.h>
> +#include <linux/math.h>
> #include <linux/namei.h>
> #include <linux/delay.h>
> #include <linux/fsnotify.h>
> @@ -1065,6 +1066,74 @@ __be32 nfsd_splice_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
>
> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + loff_t offset, __u32 len,
> + const u32 dio_blocksize,
> + loff_t *start, loff_t *end,
> + loff_t *start_extra, loff_t *end_extra)
> +{
> + loff_t orig_end = offset + len;
> +
> + if (WARN_ONCE(dio_blocksize > PAGE_SIZE,
> + "%s: underlying storage's dio_blocksize=%u > PAGE_SIZE=%lu\n",
> + __func__, dio_blocksize, PAGE_SIZE))
> + return false;
> +
> + *start = round_down(offset, dio_blocksize);
> + *end = round_up(orig_end, dio_blocksize);
> + *start_extra = offset - *start;
> + *end_extra = *end - orig_end;
> +
> + /* Show original offset and count, and how it was expanded for DIO */
> + trace_nfsd_read_vector_dio(rqstp, fhp, offset, len,
> + *start, *start_extra, *end, *end_extra);
> +
> + return (*start_extra || *end_extra);
> +}
Turns out there is a bug in the above final return, incremental fix
that will be in v3 (once others hopefully have a chance to review):
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 2d21b8ec2d32..6336a5806f4c 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1088,7 +1088,7 @@ static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
trace_nfsd_read_vector_dio(rqstp, fhp, offset, len,
*start, *start_extra, *end, *end_extra);
- return (*start_extra || *end_extra);
+ return true;
}
static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> +static ssize_t nfsd_complete_misaligned_read_dio(struct svc_rqst *rqstp,
> + loff_t start_extra, loff_t end_extra,
> + ssize_t bytes_read,
> + unsigned long bytes_expected,
> + loff_t *offset,
> + unsigned long *rq_bvec_numpages,
> + struct page *start_extra_page)
> +{
> + ssize_t host_err = bytes_read;
> + loff_t v;
> +
> + /* Must remove first start_extra_page from rqstp->rq_bvec */
> + if (start_extra_page) {
> + __free_page(start_extra_page);
> + *rq_bvec_numpages -= 1;
> + v = *rq_bvec_numpages;
> + for (int i = 0; i < v; i++) {
> + struct bio_vec *bv = &rqstp->rq_bvec[i+1];
> + bvec_set_page(&rqstp->rq_bvec[i], bv->bv_page,
> + bv->bv_offset, bv->bv_len);
> + }
> + }
> + /* Then eliminate the end_extra bytes from the last page */
> + v = *rq_bvec_numpages;
> + rqstp->rq_bvec[v].bv_len -= end_extra;
> +
> + if (host_err < 0)
> + return host_err;
> +
> + /* Must adjust returned read size to reflect original extent */
> + *offset += start_extra;
> + if (likely(host_err >= start_extra)) {
> + host_err -= start_extra;
> + if (host_err > bytes_expected)
> + host_err = bytes_expected;
> + } else {
> + /* Short read that didn't read any of requested data */
> + host_err = 0;
> + }
> +
> + return host_err;
> +}
> +
Also, I think its worth calling out this
nfsd_complete_misaligned_read_dio function for its remapping/shifting
of the READ payload reflected in rqstp->rq_bvec[].
Could easily be that a cleaner way exists to do this and I'm just
missing it.
> /**
> * nfsd_iter_read - Perform a VFS read using an iterator
> * @rqstp: RPC transaction context
> @@ -1086,44 +1155,74 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> unsigned int base, u32 *eof)
> {
> struct file *file = nf->nf_file;
> - unsigned long v, total;
> + unsigned long v, total, in_count = *count;
> + loff_t start_extra = 0, end_extra = 0;
> + struct page *start_extra_page = NULL;
> struct iov_iter iter;
> struct kiocb kiocb;
> - ssize_t host_err;
> + ssize_t host_err = 0;
> size_t len;
>
> init_sync_kiocb(&kiocb, file);
> +
> + /*
> + * If NFSD_IO_DIRECT enabled, expand any misaligned READ to
> + * the next DIO-aligned block (on either end of the READ).
> + */
> + if ((nfsd_io_cache_read == NFSD_IO_DIRECT) &&
> + nf->nf_dio_mem_align && (base & (nf->nf_dio_mem_align-1)) == 0) {
> + loff_t start, end;
> + if (nfsd_analyze_read_dio(rqstp, fhp, offset, in_count,
> + nf->nf_dio_read_offset_align,
> + &start, &end, &start_extra, &end_extra)) {
> + /* trace_nfsd_read_vector() will reflect larger DIO-aligned READ */
> + offset = start;
> + in_count = end - start;
> + kiocb.ki_flags = IOCB_DIRECT;
> + }
> + } else if (nfsd_io_cache_read == NFSD_IO_DONTCACHE)
> + kiocb.ki_flags = IOCB_DONTCACHE;
> +
> kiocb.ki_pos = offset;
<snip>
Just bringing the top fix home, nfsd_analyze_read_dio() was only
returning true if an READ IO requiring expanding the a misaligned IO
to be DIO-aligned (as reflected by start_extra or end_extra being
non-zero). But if the READ was already DIO_aligned it wouldn't set
IOCB_DIRECT (because neither start_extra or end_extra set).
I had been hyper-focused on handling misaligned IO and missed that the
basic case of IO already being DIO-aligned had regressed since my v1
patchset.
Mike
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-08 16:06 ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
@ 2025-07-10 7:24 ` Christoph Hellwig
2025-07-10 7:32 ` Mike Snitzer
2025-07-10 13:52 ` Jeff Layton
1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2025-07-10 7:24 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linus-fsdevel, linux-mm
You'll really need to send this to the VFS mainers and especially
Al. Also the best way to make things stand out is to either send
them separastely or the beginning of the series.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 7:24 ` Christoph Hellwig
@ 2025-07-10 7:32 ` Mike Snitzer
2025-07-10 7:44 ` Christoph Hellwig
0 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2025-07-10 7:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linux-fsdevel, linux-mm
On Thu, Jul 10, 2025 at 12:24:38AM -0700, Christoph Hellwig wrote:
> You'll really need to send this to the VFS mainers and especially
> Al. Also the best way to make things stand out is to either send
> them separastely or the beginning of the series.
>
Yeah, I had the linus-fsdevel typo which didn't help too.
The first time I posted this series I did a better job of sending this
patch to Andrew and Al iirc. In any case, I can pull this fix out to
front of series. But also iov_iter_aligned_iovec() appear to have the
same issue.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 7:32 ` Mike Snitzer
@ 2025-07-10 7:44 ` Christoph Hellwig
0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2025-07-10 7:44 UTC (permalink / raw)
To: Mike Snitzer
Cc: Christoph Hellwig, Chuck Lever, Jeff Layton, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm
On Thu, Jul 10, 2025 at 03:32:03AM -0400, Mike Snitzer wrote:
> The first time I posted this series I did a better job of sending this
> patch to Andrew and Al iirc. In any case, I can pull this fix out to
> front of series. But also iov_iter_aligned_iovec() appear to have the
> same issue.
Maybe send a series just addressing the two for now to kick off the
discussion.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-07-08 16:06 ` [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
@ 2025-07-10 7:45 ` Christoph Hellwig
2025-07-14 17:46 ` Mike Snitzer
0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2025-07-10 7:45 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linux-fsdevel, linux-mm
On Tue, Jul 08, 2025 at 12:06:14PM -0400, Mike Snitzer wrote:
> Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
> alignment attributes from underlying filesystem in associated
> nfsd_file. This is done when the nfsd_file is first opened for
> a regular file.
Just as a little warning: these are unfortunately only supported by
very few file systems right now. For someone with a little bit of
time on their hand it would be nice to do a pass to set them for
all remaining direct I/O supporting file systems, which is only about
a dozen anyway.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-08 16:06 ` [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
@ 2025-07-10 7:47 ` Christoph Hellwig
2025-07-14 17:33 ` Mike Snitzer
2025-07-10 14:06 ` Jeff Layton
1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2025-07-10 7:47 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linus-fsdevel, linux-mm
On Tue, Jul 08, 2025 at 12:06:17PM -0400, Mike Snitzer wrote:
> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> read by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=0)
> - cached but removed from the page cache upon completion
> (NFSD_IO_DONTCACHE=1).
> - not cached (NFSD_IO_DIRECT=2)
>
> io_cache_read is 0 by default. It may be set by writing to:
> /sys/kernel/debug/nfsd/io_cache_read
>
> If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
>
> If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the read must be aligned relative to the
> underlying block device's dma_alignment.
Does this also need some kind of check that direct I/O is supported
at all by the file system / fops instance?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned
2025-07-08 16:06 ` [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-07-08 21:22 ` Mike Snitzer
@ 2025-07-10 7:51 ` Christoph Hellwig
1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2025-07-10 7:51 UTC (permalink / raw)
To: Mike Snitzer
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linus-fsdevel, linux-mm
On Tue, Jul 08, 2025 at 12:06:19PM -0400, Mike Snitzer wrote:
> +static bool nfsd_analyze_read_dio(struct svc_rqst *rqstp, struct svc_fh *fhp,
> + loff_t offset, __u32 len,
> + const u32 dio_blocksize,
> + loff_t *start, loff_t *end,
> + loff_t *start_extra, loff_t *end_extra)
With this amount of arguments, especially out arguments that return
values the better choice is usually to have a struct.
If not using two-tab ndents at least make it a lot more readable..
> +{
> + ssize_t host_err = bytes_read;
> + loff_t v;
> +
> + /* Must remove first start_extra_page from rqstp->rq_bvec */
> + if (start_extra_page) {
> + __free_page(start_extra_page);
I can't really follow the logic here. Why must it be removed (and
freed)? Can you write a more detailed comment here as the logic
isn't very obvious.
> + *rq_bvec_numpages -= 1;
> + v = *rq_bvec_numpages;
> + for (int i = 0; i < v; i++) {
> + struct bio_vec *bv = &rqstp->rq_bvec[i+1];
> + bvec_set_page(&rqstp->rq_bvec[i], bv->bv_page,
> + bv->bv_offset, bv->bv_len);
> + }
This is basically shifting down the bvecs, right? Why not simply
use memmove?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-08 16:06 ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
2025-07-10 7:24 ` Christoph Hellwig
@ 2025-07-10 13:52 ` Jeff Layton
2025-07-10 14:48 ` Keith Busch
1 sibling, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2025-07-10 13:52 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm, kbusch, hch
On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> iov_iter_aligned_bvec() is strictly checking alignment of each element
> of the bvec to arrive at whether the bvec is aligned relative to
> dma_alignment and on-disk alignment. Checking each element
> individually results in disallowing a bvec that in aggregate is
> perfectly aligned relative to the provided @len_mask.
>
> Relax the on-disk alignment checking such that it is done on the full
> extent described by the bvec but still do piecewise checking of the
> dma_alignment for each bvec's bv_offset.
>
> This allows for NFS's WRITE payload to be issued using O_DIRECT as
> long as the bvec created with xdr_buf_to_bvec() is composed of pages
> that respect the underlying device's dma_alignment (@addr_mask) and
> the overall contiguous on-disk extent is aligned relative to the
> logical_block_size (@len_mask).
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> lib/iov_iter.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index bdb37d572e97..b2ae482b8a1d 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> unsigned skip = i->iov_offset;
> size_t size = i->count;
>
> + if (size & len_mask)
> + return false;
> +
> do {
> size_t len = bvec->bv_len;
>
> if (len > size)
> len = size;
> - if (len & len_mask)
> - return false;
> if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> return false;
>
cc'ing Keith too since he wrote this helper originally. Should this
also get a:
Fixes: cfa320f72882 ("iov: introduce iov_iter_aligned")
?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 2/8] NFSD: Move the fh_getattr() helper
2025-07-08 16:06 ` [RFC PATCH v2 2/8] NFSD: Move the fh_getattr() helper Mike Snitzer
@ 2025-07-10 13:59 ` Jeff Layton
0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2025-07-10 13:59 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm
On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Clean up: The fh_getattr() function is part of NFSD's file handle
> API, so relocate it.
>
> I've made it an un-inlined function so that trace points and new
> functionality can easily be introduced. That increases the size of
> nfsd.ko by about a page on my x86_64 system (out of 26MB; compiled
> with -O2).
>
Weird. I would have expected making it uninlined would decrease the
size of the binary not increase it. There are quite a few fh_getattr()
calls.
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> fs/nfsd/nfsfh.c | 23 +++++++++++++++++++++++
> fs/nfsd/nfsfh.h | 1 +
> fs/nfsd/vfs.h | 13 -------------
> 3 files changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 16e6b4428d55..f4a3cc9e31e0 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -663,6 +663,29 @@ fh_update(struct svc_fh *fhp)
> return nfserr_serverfault;
> }
>
> +/**
> + * fh_getattr - Retrieve attributes on a local file
> + * @fhp: File handle of target file
> + * @stat: Caller-supplied kstat buffer to be filled in
> + *
> + * Returns nfs_ok on success, otherwise an NFS status code is
> + * returned.
> + */
> +__be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat)
> +{
> + struct path p = {
> + .mnt = fhp->fh_export->ex_path.mnt,
> + .dentry = fhp->fh_dentry,
> + };
> + u32 request_mask = STATX_BASIC_STATS;
> +
> + if (fhp->fh_maxsize == NFS4_FHSIZE)
> + request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
> +
> + return nfserrno(vfs_getattr(&p, stat, request_mask,
> + AT_STATX_SYNC_AS_STAT));
> +}
> +
> /**
> * fh_fill_pre_attrs - Fill in pre-op attributes
> * @fhp: file handle to be updated
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 6f5255d1c190..5ef7191f8ad8 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -222,6 +222,7 @@ extern char * SVCFH_fmt(struct svc_fh *fhp);
> __be32 fh_verify(struct svc_rqst *, struct svc_fh *, umode_t, int);
> __be32 fh_verify_local(struct net *, struct svc_cred *, struct auth_domain *,
> struct svc_fh *, umode_t, int);
> +__be32 fh_getattr(const struct svc_fh *fhp, struct kstat *stat);
> __be32 fh_compose(struct svc_fh *, struct svc_export *, struct dentry *, struct svc_fh *);
> __be32 fh_update(struct svc_fh *);
> void fh_put(struct svc_fh *);
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index 4007dcbbbfef..0c0292611c6d 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,17 +160,4 @@ __be32 nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
>
> void nfsd_filp_close(struct file *fp);
>
> -static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> -{
> - u32 request_mask = STATX_BASIC_STATS;
> - struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> - .dentry = fh->fh_dentry};
> -
> - if (fh->fh_maxsize == NFS4_FHSIZE)
> - request_mask |= (STATX_BTIME | STATX_CHANGE_COOKIE);
> -
> - return nfserrno(vfs_getattr(&p, stat, request_mask,
> - AT_STATX_SYNC_AS_STAT));
> -}
> -
> #endif /* LINUX_NFSD_VFS_H */
In any case, I don't see a real benefit in keeping this as a static
inline.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers
2025-07-08 16:06 ` [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers Mike Snitzer
@ 2025-07-10 13:59 ` Jeff Layton
0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2025-07-10 13:59 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm
On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Clean up: These helpers are part of the NFSD file handle API.
> Relocate them to fs/nfsd/nfsfh.h.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> fs/nfsd/nfsfh.h | 37 +++++++++++++++++++++++++++++++++++++
> fs/nfsd/vfs.h | 20 --------------------
> 2 files changed, 37 insertions(+), 20 deletions(-)
>
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index 1cf979722521..6f5255d1c190 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -14,6 +14,8 @@
> #include <linux/exportfs.h>
> #include <linux/nfs4.h>
>
> +#include "export.h"
> +
> /*
> * The file handle starts with a sequence of four-byte words.
> * The first word contains a version number (1) and three descriptor bytes
> @@ -271,6 +273,41 @@ static inline bool fh_fsid_match(const struct knfsd_fh *fh1,
> return true;
> }
>
> +/**
> + * fh_want_write - Get write access to an export
> + * @fhp: File handle of file to be written
> + *
> + * Caller must invoke fh_drop_write() when its write operation
> + * is complete.
> + *
> + * Returns 0 if the file handle's export can be written to. Otherwise
> + * the export is not prepared for updates, and the returned negative
> + * errno value reflects the reason for the failure.
> + */
> +static inline int fh_want_write(struct svc_fh *fhp)
> +{
> + int ret;
> +
> + if (fhp->fh_want_write)
> + return 0;
> + ret = mnt_want_write(fhp->fh_export->ex_path.mnt);
> + if (!ret)
> + fhp->fh_want_write = true;
> + return ret;
> +}
> +
> +/**
> + * fh_drop_write - Release write access on an export
> + * @fhp: File handle of file on which fh_want_write() was previously called
> + */
> +static inline void fh_drop_write(struct svc_fh *fhp)
> +{
> + if (fhp->fh_want_write) {
> + fhp->fh_want_write = false;
> + mnt_drop_write(fhp->fh_export->ex_path.mnt);
> + }
> +}
> +
> /**
> * knfsd_fh_hash - calculate the crc32 hash for the filehandle
> * @fh - pointer to filehandle
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index eff04959606f..4007dcbbbfef 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,26 +160,6 @@ __be32 nfsd_permission(struct svc_cred *cred, struct svc_export *exp,
>
> void nfsd_filp_close(struct file *fp);
>
> -static inline int fh_want_write(struct svc_fh *fh)
> -{
> - int ret;
> -
> - if (fh->fh_want_write)
> - return 0;
> - ret = mnt_want_write(fh->fh_export->ex_path.mnt);
> - if (!ret)
> - fh->fh_want_write = true;
> - return ret;
> -}
> -
> -static inline void fh_drop_write(struct svc_fh *fh)
> -{
> - if (fh->fh_want_write) {
> - fh->fh_want_write = false;
> - mnt_drop_write(fh->fh_export->ex_path.mnt);
> - }
> -}
> -
> static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
> {
> u32 request_mask = STATX_BASIC_STATS;
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-08 16:06 ` [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-07-10 7:47 ` Christoph Hellwig
@ 2025-07-10 14:06 ` Jeff Layton
2025-07-10 22:46 ` Chuck Lever
1 sibling, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2025-07-10 14:06 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm
On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> read by NFSD will either be:
> - cached using page cache (NFSD_IO_BUFFERED=0)
> - cached but removed from the page cache upon completion
> (NFSD_IO_DONTCACHE=1).
> - not cached (NFSD_IO_DIRECT=2)
>
> io_cache_read is 0 by default. It may be set by writing to:
> /sys/kernel/debug/nfsd/io_cache_read
>
> If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
> advertised as supported by the underlying filesystem (e.g. XFS),
> otherwise all IO flagged with RWF_DONTCACHE will fail with
> -EOPNOTSUPP.
>
> If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
> relative to the underlying block device's logical_block_size. Also the
> memory buffer used to store the read must be aligned relative to the
> underlying block device's dma_alignment.
>
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
> fs/nfsd/debugfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/nfsd/nfsd.h | 8 +++++++
> fs/nfsd/vfs.c | 15 ++++++++++++++
> 3 files changed, 76 insertions(+)
>
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc..709646af797a 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -27,11 +27,61 @@ static int nfsd_dsr_get(void *data, u64 *val)
> static int nfsd_dsr_set(void *data, u64 val)
> {
> nfsd_disable_splice_read = (val > 0) ? true : false;
> + if (!nfsd_disable_splice_read) {
> + /*
> + * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> + * if splice_read is enabled.
> + */
> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> + }
> return 0;
> }
>
> DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>
> +/*
> + * /sys/kernel/debug/nfsd/io_cache_read
> + *
> + * Contents:
> + * %0: NFS READ will use buffered IO (default)
> + * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> + * %2: NFS READ will use direct IO
> + *
> + * The default value of this setting is zero (buffered IO is
> + * used). This setting takes immediate effect for all NFS
> + * versions, all exports, and in all NFSD net namespaces.
> + */
> +
Could we switch this to use a string instead? Maybe
buffered/dontcache/direct ?
> +static int nfsd_io_cache_read_get(void *data, u64 *val)
> +{
> + *val = nfsd_io_cache_read;
> + return 0;
> +}
> +
> +static int nfsd_io_cache_read_set(void *data, u64 val)
> +{
> + switch (val) {
> + case NFSD_IO_DONTCACHE:
> + case NFSD_IO_DIRECT:
> + /*
> + * Must disable splice_read when enabling
> + * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> + */
> + nfsd_disable_splice_read = true;
> + nfsd_io_cache_read = val;
> + break;
> + case NFSD_IO_BUFFERED:
> + default:
> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> + break;
I think the default case should leave nfsd_io_cache_read alone and
return an error. If we add new values later, and someone tries to use
them on an old kernel, it's better to make that attempt error out.
Ditto for the write side controls.
> + }
> +
> + return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
> + nfsd_io_cache_read_set, "%llu\n");
> +
> void nfsd_debugfs_exit(void)
> {
> debugfs_remove_recursive(nfsd_top_dir);
> @@ -44,4 +94,7 @@ void nfsd_debugfs_init(void)
>
> debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +
> + debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
> + nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
> }
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2..4740567f4e7e 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -153,6 +153,14 @@ static inline void nfsd_debugfs_exit(void) {}
>
> extern bool nfsd_disable_splice_read __read_mostly;
>
> +enum {
> + NFSD_IO_BUFFERED = 0,
> + NFSD_IO_DONTCACHE,
> + NFSD_IO_DIRECT
> +};
> +
> +extern u64 nfsd_io_cache_read __read_mostly;
> +
> extern int nfsd_max_blksize;
>
> static inline int nfsd_v4client(struct svc_rqst *rq)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 845c212ad10b..632ce417f4ef 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
> #define NFSDDBG_FACILITY NFSDDBG_FILEOP
>
> bool nfsd_disable_splice_read __read_mostly;
> +u64 nfsd_io_cache_read __read_mostly;
>
> /**
> * nfserrno - Map Linux errnos to NFS errnos
> @@ -1107,6 +1108,20 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
> +
> + switch (nsfd_io_cache_read) {
> + case NFSD_IO_DIRECT:
> + if (iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
> + nf->nf_dio_read_offset_align - 1))
> + kiocb.ki_flags = IOCB_DIRECT;
> + break;
> + case NFSD_IO_DONTCACHE:
> + kiocb.ki_flags = IOCB_DONTCACHE;
> + break;
> + case NFSD_IO_BUFFERED:
> + break;
> + }
> +
> host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> }
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 13:52 ` Jeff Layton
@ 2025-07-10 14:48 ` Keith Busch
2025-07-10 16:12 ` Mike Snitzer
0 siblings, 1 reply; 33+ messages in thread
From: Keith Busch @ 2025-07-10 14:48 UTC (permalink / raw)
To: Jeff Layton
Cc: Mike Snitzer, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linus-fsdevel, linux-mm, hch
On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > of the bvec to arrive at whether the bvec is aligned relative to
> > dma_alignment and on-disk alignment. Checking each element
> > individually results in disallowing a bvec that in aggregate is
> > perfectly aligned relative to the provided @len_mask.
> >
> > Relax the on-disk alignment checking such that it is done on the full
> > extent described by the bvec but still do piecewise checking of the
> > dma_alignment for each bvec's bv_offset.
> >
> > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > that respect the underlying device's dma_alignment (@addr_mask) and
> > the overall contiguous on-disk extent is aligned relative to the
> > logical_block_size (@len_mask).
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> > lib/iov_iter.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > index bdb37d572e97..b2ae482b8a1d 100644
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > unsigned skip = i->iov_offset;
> > size_t size = i->count;
> >
> > + if (size & len_mask)
> > + return false;
> > +
> > do {
> > size_t len = bvec->bv_len;
> >
> > if (len > size)
> > len = size;
> > - if (len & len_mask)
> > - return false;
> > if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > return false;
> >
>
> cc'ing Keith too since he wrote this helper originally.
Thanks.
There's a comment in __bio_iov_iter_get_pages that says it expects each
vector to be a multiple of the block size. That makes it easier to
slit when needed, and this patch would allow vectors that break the
current assumption when calculating the "trim" value.
But for nvme, you couldn't split such a bvec into a usable command
anyway. I think you'd have to introduce a different queue limit to check
against when validating iter alignment if you don't want to use the
logical block size.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 14:48 ` Keith Busch
@ 2025-07-10 16:12 ` Mike Snitzer
2025-07-10 16:29 ` Keith Busch
2025-08-01 15:23 ` Keith Busch
0 siblings, 2 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-10 16:12 UTC (permalink / raw)
To: Keith Busch, Ming Lei, Jens Axboe
Cc: Jeff Layton, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linux-fsdevel, linux-mm, hch, linux-block
On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > of the bvec to arrive at whether the bvec is aligned relative to
> > > dma_alignment and on-disk alignment. Checking each element
> > > individually results in disallowing a bvec that in aggregate is
> > > perfectly aligned relative to the provided @len_mask.
> > >
> > > Relax the on-disk alignment checking such that it is done on the full
> > > extent described by the bvec but still do piecewise checking of the
> > > dma_alignment for each bvec's bv_offset.
> > >
> > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > the overall contiguous on-disk extent is aligned relative to the
> > > logical_block_size (@len_mask).
> > >
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > ---
> > > lib/iov_iter.c | 5 +++--
> > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > index bdb37d572e97..b2ae482b8a1d 100644
> > > --- a/lib/iov_iter.c
> > > +++ b/lib/iov_iter.c
> > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > > unsigned skip = i->iov_offset;
> > > size_t size = i->count;
> > >
> > > + if (size & len_mask)
> > > + return false;
> > > +
> > > do {
> > > size_t len = bvec->bv_len;
> > >
> > > if (len > size)
> > > len = size;
> > > - if (len & len_mask)
> > > - return false;
> > > if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > return false;
> > >
> >
> > cc'ing Keith too since he wrote this helper originally.
>
> Thanks.
>
> There's a comment in __bio_iov_iter_get_pages that says it expects each
> vector to be a multiple of the block size. That makes it easier to
> slit when needed, and this patch would allow vectors that break the
> current assumption when calculating the "trim" value.
Thanks for the pointer, that high-level bio code is being too
restrictive.
But not seeing any issues with the trim calculation itself, 'trim' is
the number of bytes that are past the last logical_block_size aligned
boundary. And then iov_iter_revert() will rollback the iov such that
it doesn't include those. Then size is reduced by trim bytes.
Just restating my challenge:
Assuming that each vector is itself a multiple of logical_block_size
disallows valid usecases (like the one I have with NFSD needing to use
O_DIRECT for its WRITE payload, which can have the head and/or tail
vectors at _not_ logical_block_size aligned boundaries).
> But for nvme, you couldn't split such a bvec into a usable command
> anyway. I think you'd have to introduce a different queue limit to check
> against when validating iter alignment if you don't want to use the
> logical block size.
There isn't a new queue limit in play though, but I get your meaning
that we did in fact assume the logical_block_size applicable for each
vector.
With my patch, we still validate logical_block_size (of overall
bvec length, not of each vector) and implicitly validate that each
vector has dma_alignment on-disk (while verifying that pages are
aligned to dma_alignment).
All said, in practice I haven't had any issues with this patch. But
it could just be I don't have the stars aligned to test the case that
might have problems. If you know of such a case I'd welcome
suggestions.
Thanks,
Mike
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 16:12 ` Mike Snitzer
@ 2025-07-10 16:29 ` Keith Busch
2025-07-10 17:22 ` Mike Snitzer
2025-08-01 15:23 ` Keith Busch
1 sibling, 1 reply; 33+ messages in thread
From: Keith Busch @ 2025-07-10 16:29 UTC (permalink / raw)
To: Mike Snitzer
Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
linux-block
On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> > On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > > of the bvec to arrive at whether the bvec is aligned relative to
> > > > dma_alignment and on-disk alignment. Checking each element
> > > > individually results in disallowing a bvec that in aggregate is
> > > > perfectly aligned relative to the provided @len_mask.
> > > >
> > > > Relax the on-disk alignment checking such that it is done on the full
> > > > extent described by the bvec but still do piecewise checking of the
> > > > dma_alignment for each bvec's bv_offset.
> > > >
> > > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > > the overall contiguous on-disk extent is aligned relative to the
> > > > logical_block_size (@len_mask).
> > > >
> > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > ---
> > > > lib/iov_iter.c | 5 +++--
> > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > > index bdb37d572e97..b2ae482b8a1d 100644
> > > > --- a/lib/iov_iter.c
> > > > +++ b/lib/iov_iter.c
> > > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > > > unsigned skip = i->iov_offset;
> > > > size_t size = i->count;
> > > >
> > > > + if (size & len_mask)
> > > > + return false;
> > > > +
> > > > do {
> > > > size_t len = bvec->bv_len;
> > > >
> > > > if (len > size)
> > > > len = size;
> > > > - if (len & len_mask)
> > > > - return false;
> > > > if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > > return false;
> > > >
> > >
> > > cc'ing Keith too since he wrote this helper originally.
> >
> > Thanks.
> >
> > There's a comment in __bio_iov_iter_get_pages that says it expects each
> > vector to be a multiple of the block size. That makes it easier to
> > slit when needed, and this patch would allow vectors that break the
> > current assumption when calculating the "trim" value.
>
> Thanks for the pointer, that high-level bio code is being too
> restrictive.
>
> But not seeing any issues with the trim calculation itself, 'trim' is
> the number of bytes that are past the last logical_block_size aligned
> boundary. And then iov_iter_revert() will rollback the iov such that
> it doesn't include those. Then size is reduced by trim bytes.
The trim calculation assumes the current bi_size is already a block size
multiple, but it may not be with your propsal. So the trim bytes needs
to take into account the existing bi_size to know how much to trim off
to arrive at a proper total bi_size instead of assuming we can append a
block sized multiple carved out the current iov.
> All said, in practice I haven't had any issues with this patch. But
> it could just be I don't have the stars aligned to test the case that
> might have problems. If you know of such a case I'd welcome
> suggestions.
It might be a little harder with iter_bvec, but you also mentioned doing
the same for iter_iovec too, which I think should be pretty easy to
cause a problem for nvme: just submit an O_DIRECT read or write with
individual iovec sizes that are not block size granularities.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 16:29 ` Keith Busch
@ 2025-07-10 17:22 ` Mike Snitzer
2025-07-10 19:51 ` Keith Busch
2025-07-10 19:57 ` Keith Busch
0 siblings, 2 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-10 17:22 UTC (permalink / raw)
To: Keith Busch
Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
linux-block
On Thu, Jul 10, 2025 at 10:29:22AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> > On Thu, Jul 10, 2025 at 08:48:04AM -0600, Keith Busch wrote:
> > > On Thu, Jul 10, 2025 at 09:52:53AM -0400, Jeff Layton wrote:
> > > > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > > > iov_iter_aligned_bvec() is strictly checking alignment of each element
> > > > > of the bvec to arrive at whether the bvec is aligned relative to
> > > > > dma_alignment and on-disk alignment. Checking each element
> > > > > individually results in disallowing a bvec that in aggregate is
> > > > > perfectly aligned relative to the provided @len_mask.
> > > > >
> > > > > Relax the on-disk alignment checking such that it is done on the full
> > > > > extent described by the bvec but still do piecewise checking of the
> > > > > dma_alignment for each bvec's bv_offset.
> > > > >
> > > > > This allows for NFS's WRITE payload to be issued using O_DIRECT as
> > > > > long as the bvec created with xdr_buf_to_bvec() is composed of pages
> > > > > that respect the underlying device's dma_alignment (@addr_mask) and
> > > > > the overall contiguous on-disk extent is aligned relative to the
> > > > > logical_block_size (@len_mask).
> > > > >
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > > lib/iov_iter.c | 5 +++--
> > > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > > > > index bdb37d572e97..b2ae482b8a1d 100644
> > > > > --- a/lib/iov_iter.c
> > > > > +++ b/lib/iov_iter.c
> > > > > @@ -819,13 +819,14 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
> > > > > unsigned skip = i->iov_offset;
> > > > > size_t size = i->count;
> > > > >
> > > > > + if (size & len_mask)
> > > > > + return false;
> > > > > +
> > > > > do {
> > > > > size_t len = bvec->bv_len;
> > > > >
> > > > > if (len > size)
> > > > > len = size;
> > > > > - if (len & len_mask)
> > > > > - return false;
> > > > > if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> > > > > return false;
> > > > >
> > > >
> > > > cc'ing Keith too since he wrote this helper originally.
> > >
> > > Thanks.
> > >
> > > There's a comment in __bio_iov_iter_get_pages that says it expects each
> > > vector to be a multiple of the block size. That makes it easier to
> > > slit when needed, and this patch would allow vectors that break the
> > > current assumption when calculating the "trim" value.
> >
> > Thanks for the pointer, that high-level bio code is being too
> > restrictive.
> >
> > But not seeing any issues with the trim calculation itself, 'trim' is
> > the number of bytes that are past the last logical_block_size aligned
> > boundary. And then iov_iter_revert() will rollback the iov such that
> > it doesn't include those. Then size is reduced by trim bytes.
>
> The trim calculation assumes the current bi_size is already a block size
> multiple, but it may not be with your propsal. So the trim bytes needs
> to take into account the existing bi_size to know how much to trim off
> to arrive at a proper total bi_size instead of assuming we can append a
> block sized multiple carved out the current iov.
The trim "calculation" doesn't assume anything, it just lops off
whatever is past the end of the last logical_block_size aligned
boundary of the requested pages (which is meant to be bi_size). The
fact that the trim ever gets anything implies bi_size is *not* always
logical_block_size aligned. No?
But sure, with my change it opens the door for bvecs with vectors that
aren't all logical_block_size aligned.
I'll revisit this code, but if you see a way forward to fix
__bio_iov_iter_get_pages to cope with my desired iov_iter_aligned_bvec
change please don't be shy with a patch ;)
> > All said, in practice I haven't had any issues with this patch. But
> > it could just be I don't have the stars aligned to test the case that
> > might have problems. If you know of such a case I'd welcome
> > suggestions.
>
> It might be a little harder with iter_bvec, but you also mentioned doing
> the same for iter_iovec too, which I think should be pretty easy to
> cause a problem for nvme: just submit an O_DIRECT read or write with
> individual iovec sizes that are not block size granularities.
I made the iter_iovec change yesterday (before I realized I don't
actually need it for my NFSD case) and all was fine issuing O_DIRECT
IO (via NFSD, so needing the relaxed checking) through to 16
XFS-on-NVMe devices. SO I think the devil will be in the details if
NVMe actually cares.
Mike
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 17:22 ` Mike Snitzer
@ 2025-07-10 19:51 ` Keith Busch
2025-07-10 19:57 ` Keith Busch
1 sibling, 0 replies; 33+ messages in thread
From: Keith Busch @ 2025-07-10 19:51 UTC (permalink / raw)
To: Mike Snitzer
Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
linux-block
On Thu, Jul 10, 2025 at 01:22:59PM -0400, Mike Snitzer wrote:
> On Thu, Jul 10, 2025 at 10:29:22AM -0600, Keith Busch wrote:
> > The trim calculation assumes the current bi_size is already a block size
> > multiple, but it may not be with your propsal. So the trim bytes needs
> > to take into account the existing bi_size to know how much to trim off
> > to arrive at a proper total bi_size instead of assuming we can append a
> > block sized multiple carved out the current iov.
>
> The trim "calculation" doesn't assume anything, it just lops off
> whatever is past the end of the last logical_block_size aligned
> boundary of the requested pages (which is meant to be bi_size). The
> fact that the trim ever gets anything implies bi_size is *not* always
> logical_block_size aligned. No?
No. The iov must be a block size, but if it spans more pages than the
bio can hold (because of bi_max_vecs), the total size of the pages
gotten is only part of iov. That's the case that 'trim' is trying to
handle, as we only got part of the iov, so it's aligned down to make
sure the next iteration can consider perfectly block size aligned
iovecs. At every step of iovec iteration, the bio's bi_size is a block
size multiple.
Let's say we tried to allow smaller vecs. Assume block size of 512
bytes, and you send a direct IO with 4 vecs of 128 bytes each. That
would normally get rejected very early, but if you did send that to the
bio layer, the entirety of the first iov would get trimmed off and you
should get an EFAULT return.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 17:22 ` Mike Snitzer
2025-07-10 19:51 ` Keith Busch
@ 2025-07-10 19:57 ` Keith Busch
1 sibling, 0 replies; 33+ messages in thread
From: Keith Busch @ 2025-07-10 19:57 UTC (permalink / raw)
To: Mike Snitzer
Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
linux-block
On Thu, Jul 10, 2025 at 01:22:59PM -0400, Mike Snitzer wrote:
> I'll revisit this code, but if you see a way forward to fix
> __bio_iov_iter_get_pages to cope with my desired iov_iter_aligned_bvec
> change please don't be shy with a patch ;)
For the record, I do like the cause you're going for. I wanted to make
that work when I initially introduced dma_alignment to direct-io, but
the spliting, merging, and respecting all the queue contraints for
things like virt boundaries and segment limits was a bit more than I
expected.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-10 14:06 ` Jeff Layton
@ 2025-07-10 22:46 ` Chuck Lever
2025-07-14 16:47 ` Mike Snitzer
0 siblings, 1 reply; 33+ messages in thread
From: Chuck Lever @ 2025-07-10 22:46 UTC (permalink / raw)
To: Jeff Layton, Mike Snitzer, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker
Cc: linux-nfs, linus-fsdevel, linux-mm
On 7/10/25 10:06 AM, Jeff Layton wrote:
> On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
>> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
>> read by NFSD will either be:
>> - cached using page cache (NFSD_IO_BUFFERED=0)
>> - cached but removed from the page cache upon completion
>> (NFSD_IO_DONTCACHE=1).
>> - not cached (NFSD_IO_DIRECT=2)
>>
>> io_cache_read is 0 by default. It may be set by writing to:
>> /sys/kernel/debug/nfsd/io_cache_read
>>
>> If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
>> advertised as supported by the underlying filesystem (e.g. XFS),
>> otherwise all IO flagged with RWF_DONTCACHE will fail with
>> -EOPNOTSUPP.
>>
>> If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
>> relative to the underlying block device's logical_block_size. Also the
>> memory buffer used to store the read must be aligned relative to the
>> underlying block device's dma_alignment.
>>
>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>> ---
>> fs/nfsd/debugfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
>> fs/nfsd/nfsd.h | 8 +++++++
>> fs/nfsd/vfs.c | 15 ++++++++++++++
>> 3 files changed, 76 insertions(+)
>>
>> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
>> index 84b0c8b559dc..709646af797a 100644
>> --- a/fs/nfsd/debugfs.c
>> +++ b/fs/nfsd/debugfs.c
>> @@ -27,11 +27,61 @@ static int nfsd_dsr_get(void *data, u64 *val)
>> static int nfsd_dsr_set(void *data, u64 val)
>> {
>> nfsd_disable_splice_read = (val > 0) ? true : false;
>> + if (!nfsd_disable_splice_read) {
>> + /*
>> + * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
>> + * if splice_read is enabled.
>> + */
>> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
>> + }
>> return 0;
>> }
>>
>> DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
>>
>> +/*
>> + * /sys/kernel/debug/nfsd/io_cache_read
>> + *
>> + * Contents:
>> + * %0: NFS READ will use buffered IO (default)
>> + * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
>> + * %2: NFS READ will use direct IO
>> + *
>> + * The default value of this setting is zero (buffered IO is
>> + * used). This setting takes immediate effect for all NFS
>> + * versions, all exports, and in all NFSD net namespaces.
>> + */
>> +
>
> Could we switch this to use a string instead? Maybe
> buffered/dontcache/direct ?
That thought occurred to me too, since it would make the API a little
more self-documenting, and might be a harbinger of what a future
export option might look like.
>> +static int nfsd_io_cache_read_get(void *data, u64 *val)
>> +{
>> + *val = nfsd_io_cache_read;
>> + return 0;
>> +}
>> +
>> +static int nfsd_io_cache_read_set(void *data, u64 val)
>> +{
>> + switch (val) {
>> + case NFSD_IO_DONTCACHE:
>> + case NFSD_IO_DIRECT:
>> + /*
>> + * Must disable splice_read when enabling
>> + * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
>> + */
>> + nfsd_disable_splice_read = true;
>> + nfsd_io_cache_read = val;
>> + break;
>> + case NFSD_IO_BUFFERED:
>> + default:
>> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
>> + break;
>
> I think the default case should leave nfsd_io_cache_read alone and
> return an error. If we add new values later, and someone tries to use
> them on an old kernel, it's better to make that attempt error out.
>
> Ditto for the write side controls.
+1 on both accounts.
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
>> + nfsd_io_cache_read_set, "%llu\n");
>> +
>> void nfsd_debugfs_exit(void)
>> {
>> debugfs_remove_recursive(nfsd_top_dir);
>> @@ -44,4 +94,7 @@ void nfsd_debugfs_init(void)
>>
>> debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
>> nfsd_top_dir, NULL, &nfsd_dsr_fops);
>> +
>> + debugfs_create_file("io_cache_read", S_IWUSR | S_IRUGO,
>> + nfsd_top_dir, NULL, &nfsd_io_cache_read_fops);
>> }
>> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
>> index 1cd0bed57bc2..4740567f4e7e 100644
>> --- a/fs/nfsd/nfsd.h
>> +++ b/fs/nfsd/nfsd.h
>> @@ -153,6 +153,14 @@ static inline void nfsd_debugfs_exit(void) {}
>>
>> extern bool nfsd_disable_splice_read __read_mostly;
>>
>> +enum {
>> + NFSD_IO_BUFFERED = 0,
>> + NFSD_IO_DONTCACHE,
>> + NFSD_IO_DIRECT
>> +};
>> +
>> +extern u64 nfsd_io_cache_read __read_mostly;
>> +
>> extern int nfsd_max_blksize;
>>
>> static inline int nfsd_v4client(struct svc_rqst *rq)
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 845c212ad10b..632ce417f4ef 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -49,6 +49,7 @@
>> #define NFSDDBG_FACILITY NFSDDBG_FILEOP
>>
>> bool nfsd_disable_splice_read __read_mostly;
>> +u64 nfsd_io_cache_read __read_mostly;
>>
>> /**
>> * nfserrno - Map Linux errnos to NFS errnos
>> @@ -1107,6 +1108,20 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>
>> trace_nfsd_read_vector(rqstp, fhp, offset, *count);
>> iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, v, *count);
>> +
>> + switch (nsfd_io_cache_read) {
>> + case NFSD_IO_DIRECT:
>> + if (iov_iter_is_aligned(&iter, nf->nf_dio_mem_align - 1,
>> + nf->nf_dio_read_offset_align - 1))
>> + kiocb.ki_flags = IOCB_DIRECT;
>> + break;
>> + case NFSD_IO_DONTCACHE:
>> + kiocb.ki_flags = IOCB_DONTCACHE;
>> + break;
>> + case NFSD_IO_BUFFERED:
>> + break;
>> + }
>> +
>> host_err = vfs_iocb_iter_read(file, &kiocb, &iter);
>> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>> }
>
--
Chuck Lever
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-10 22:46 ` Chuck Lever
@ 2025-07-14 16:47 ` Mike Snitzer
2025-07-15 11:57 ` Jeff Layton
0 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2025-07-14 16:47 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, linux-nfs, linus-fsdevel,
linux-mm
On Thu, Jul 10, 2025 at 06:46:37PM -0400, Chuck Lever wrote:
> On 7/10/25 10:06 AM, Jeff Layton wrote:
> > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> >> Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> >> read by NFSD will either be:
> >> - cached using page cache (NFSD_IO_BUFFERED=0)
> >> - cached but removed from the page cache upon completion
> >> (NFSD_IO_DONTCACHE=1).
> >> - not cached (NFSD_IO_DIRECT=2)
> >>
> >> io_cache_read is 0 by default. It may be set by writing to:
> >> /sys/kernel/debug/nfsd/io_cache_read
> >>
> >> If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
> >> advertised as supported by the underlying filesystem (e.g. XFS),
> >> otherwise all IO flagged with RWF_DONTCACHE will fail with
> >> -EOPNOTSUPP.
> >>
> >> If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
> >> relative to the underlying block device's logical_block_size. Also the
> >> memory buffer used to store the read must be aligned relative to the
> >> underlying block device's dma_alignment.
> >>
> >> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> >> ---
> >> fs/nfsd/debugfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> >> fs/nfsd/nfsd.h | 8 +++++++
> >> fs/nfsd/vfs.c | 15 ++++++++++++++
> >> 3 files changed, 76 insertions(+)
> >>
> >> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> >> index 84b0c8b559dc..709646af797a 100644
> >> --- a/fs/nfsd/debugfs.c
> >> +++ b/fs/nfsd/debugfs.c
> >> @@ -27,11 +27,61 @@ static int nfsd_dsr_get(void *data, u64 *val)
> >> static int nfsd_dsr_set(void *data, u64 val)
> >> {
> >> nfsd_disable_splice_read = (val > 0) ? true : false;
> >> + if (!nfsd_disable_splice_read) {
> >> + /*
> >> + * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> >> + * if splice_read is enabled.
> >> + */
> >> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >> + }
> >> return 0;
> >> }
> >>
> >> DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> >>
> >> +/*
> >> + * /sys/kernel/debug/nfsd/io_cache_read
> >> + *
> >> + * Contents:
> >> + * %0: NFS READ will use buffered IO (default)
> >> + * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> >> + * %2: NFS READ will use direct IO
> >> + *
> >> + * The default value of this setting is zero (buffered IO is
> >> + * used). This setting takes immediate effect for all NFS
> >> + * versions, all exports, and in all NFSD net namespaces.
> >> + */
> >> +
> >
> > Could we switch this to use a string instead? Maybe
> > buffered/dontcache/direct ?
>
> That thought occurred to me too, since it would make the API a little
> more self-documenting, and might be a harbinger of what a future
> export option might look like.
>
>
> >> +static int nfsd_io_cache_read_get(void *data, u64 *val)
> >> +{
> >> + *val = nfsd_io_cache_read;
> >> + return 0;
> >> +}
> >> +
> >> +static int nfsd_io_cache_read_set(void *data, u64 val)
> >> +{
> >> + switch (val) {
> >> + case NFSD_IO_DONTCACHE:
> >> + case NFSD_IO_DIRECT:
> >> + /*
> >> + * Must disable splice_read when enabling
> >> + * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> >> + */
> >> + nfsd_disable_splice_read = true;
> >> + nfsd_io_cache_read = val;
> >> + break;
> >> + case NFSD_IO_BUFFERED:
> >> + default:
> >> + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> >> + break;
> >
> > I think the default case should leave nfsd_io_cache_read alone and
> > return an error. If we add new values later, and someone tries to use
> > them on an old kernel, it's better to make that attempt error out.
> >
> > Ditto for the write side controls.
>
> +1 on both accounts.
I started to implement this just now (so that I can kick v3 of this
patchset out of the nest today) but soon found that debugfs doesn't
provide string-based interface controls.
See simple_attr_open() (which is used by DEFINE_DEBUGFS_ATTRIBUTE).
It only allows u64 to be set/get.
I'll fix the default case to return an error for now though.
Once we graduate from debugfs to a proper per-export control we can
impose string controls/mapping, e.g.:
+static u64 nfsd_io_cache_string_to_mode(const char *nfsd_io_cache_string)
+{
+ u64 val = NFSD_IO_UNKNOWN;
+
+ if (!strncmp(nfsd_io_cache_string, NFSD_IO_BUFFERED_string,
+ strlen(NFSD_IO_BUFFERED_string)))
+ val = NFSD_IO_BUFFERED;
+ else if (!strncmp(nfsd_io_cache_string, NFSD_IO_DONTCACHE_string,
+ strlen(NFSD_IO_DONTCACHE_string)))
+ val = NFSD_IO_DONTCACHE;
+ else if (!strncmp(nfsd_io_cache_string, NFSD_IO_DIRECT_string,
+ strlen(NFSD_IO_DIRECT_string)))
+ val = NFSD_IO_DIRECT;
+
+ return val;
+}
+
+static const char *
+nfsd_io_cache_mode_to_string(const char *nfsd_io_cache_string)
+{
+ char *nfsd_io_cache_string;
+
+ switch (val) {
+ case NFSD_IO_BUFFERED:
+ nfsd_io_cache_string = NFSD_IO_BUFFERED_string;
+ break;
+ case NFSD_IO_DONTCACHE:
+ nfsd_io_cache_string = NFSD_IO_DONTCACHE_string;
+ break;
+ case NFSD_IO_DIRECT:
+ nfsd_io_cache_string = NFSD_IO_DIRECT_string;
+ break;
+ case NFSD_IO_UNKNOWN:
+ nfsd_io_cache_string = NFSD_IO_UNKNOWN_string;
+ break;
+ }
+
+ return nfsd_io_cache_string;
+}
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-10 7:47 ` Christoph Hellwig
@ 2025-07-14 17:33 ` Mike Snitzer
0 siblings, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-14 17:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linus-fsdevel, linux-mm
On Thu, Jul 10, 2025 at 12:47:43AM -0700, Christoph Hellwig wrote:
> On Tue, Jul 08, 2025 at 12:06:17PM -0400, Mike Snitzer wrote:
> > Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> > read by NFSD will either be:
> > - cached using page cache (NFSD_IO_BUFFERED=0)
> > - cached but removed from the page cache upon completion
> > (NFSD_IO_DONTCACHE=1).
> > - not cached (NFSD_IO_DIRECT=2)
> >
> > io_cache_read is 0 by default. It may be set by writing to:
> > /sys/kernel/debug/nfsd/io_cache_read
> >
> > If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
> > advertised as supported by the underlying filesystem (e.g. XFS),
> > otherwise all IO flagged with RWF_DONTCACHE will fail with
> > -EOPNOTSUPP.
> >
> > If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
> > relative to the underlying block device's logical_block_size. Also the
> > memory buffer used to store the read must be aligned relative to the
> > underlying block device's dma_alignment.
>
> Does this also need some kind of check that direct I/O is supported
> at all by the file system / fops instance?
Long-term: definitely. Near-term: not going to add it because it'd
have to be checked on a per-IO basis (because this debugfs-based
interface is a global setting so we don't have a specific export to
check at the time the request to use O_DIRECT is made via debugfs).
Once we graduate to having a per-export control interface in NFSD we
can then verify export's underlying filesystem supports O_DIRECT.
Hope this makes sense, thanks.
Mike
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
2025-07-10 7:45 ` Christoph Hellwig
@ 2025-07-14 17:46 ` Mike Snitzer
0 siblings, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-07-14 17:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
Tom Talpey, Trond Myklebust, Anna Schumaker, linux-nfs,
linux-fsdevel, linux-mm
On Thu, Jul 10, 2025 at 12:45:52AM -0700, Christoph Hellwig wrote:
> On Tue, Jul 08, 2025 at 12:06:14PM -0400, Mike Snitzer wrote:
> > Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get and store DIO
> > alignment attributes from underlying filesystem in associated
> > nfsd_file. This is done when the nfsd_file is first opened for
> > a regular file.
>
> Just as a little warning: these are unfortunately only supported by
> very few file systems right now. For someone with a little bit of
> time on their hand it would be nice to do a pass to set them for
> all remaining direct I/O supporting file systems, which is only about
> a dozen anyway.
That's a good point. Meanwhile, I'll be sure to have the NFSD code
(and NFS LOCALIO code later) be more defensive about the possibility
of these attributes being 0 (due to lack of underlying filesystem
support).
Mike
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface
2025-07-14 16:47 ` Mike Snitzer
@ 2025-07-15 11:57 ` Jeff Layton
0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2025-07-15 11:57 UTC (permalink / raw)
To: Mike Snitzer, Chuck Lever
Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, linux-nfs, linus-fsdevel,
linux-mm
On Mon, 2025-07-14 at 12:47 -0400, Mike Snitzer wrote:
> On Thu, Jul 10, 2025 at 06:46:37PM -0400, Chuck Lever wrote:
> > On 7/10/25 10:06 AM, Jeff Layton wrote:
> > > On Tue, 2025-07-08 at 12:06 -0400, Mike Snitzer wrote:
> > > > Add 'io_cache_read' to NFSD's debugfs interface so that: Any data
> > > > read by NFSD will either be:
> > > > - cached using page cache (NFSD_IO_BUFFERED=0)
> > > > - cached but removed from the page cache upon completion
> > > > (NFSD_IO_DONTCACHE=1).
> > > > - not cached (NFSD_IO_DIRECT=2)
> > > >
> > > > io_cache_read is 0 by default. It may be set by writing to:
> > > > /sys/kernel/debug/nfsd/io_cache_read
> > > >
> > > > If NFSD_IO_DONTCACHE is specified using 1, FOP_DONTCACHE must be
> > > > advertised as supported by the underlying filesystem (e.g. XFS),
> > > > otherwise all IO flagged with RWF_DONTCACHE will fail with
> > > > -EOPNOTSUPP.
> > > >
> > > > If NFSD_IO_DIRECT is specified using 2, the IO must be aligned
> > > > relative to the underlying block device's logical_block_size. Also the
> > > > memory buffer used to store the read must be aligned relative to the
> > > > underlying block device's dma_alignment.
> > > >
> > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > ---
> > > > fs/nfsd/debugfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> > > > fs/nfsd/nfsd.h | 8 +++++++
> > > > fs/nfsd/vfs.c | 15 ++++++++++++++
> > > > 3 files changed, 76 insertions(+)
> > > >
> > > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > > > index 84b0c8b559dc..709646af797a 100644
> > > > --- a/fs/nfsd/debugfs.c
> > > > +++ b/fs/nfsd/debugfs.c
> > > > @@ -27,11 +27,61 @@ static int nfsd_dsr_get(void *data, u64 *val)
> > > > static int nfsd_dsr_set(void *data, u64 val)
> > > > {
> > > > nfsd_disable_splice_read = (val > 0) ? true : false;
> > > > + if (!nfsd_disable_splice_read) {
> > > > + /*
> > > > + * Cannot use NFSD_IO_DONTCACHE or NFSD_IO_DIRECT
> > > > + * if splice_read is enabled.
> > > > + */
> > > > + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > > > + }
> > > > return 0;
> > > > }
> > > >
> > > > DEFINE_DEBUGFS_ATTRIBUTE(nfsd_dsr_fops, nfsd_dsr_get, nfsd_dsr_set, "%llu\n");
> > > >
> > > > +/*
> > > > + * /sys/kernel/debug/nfsd/io_cache_read
> > > > + *
> > > > + * Contents:
> > > > + * %0: NFS READ will use buffered IO (default)
> > > > + * %1: NFS READ will use dontcache (buffered IO w/ dropbehind)
> > > > + * %2: NFS READ will use direct IO
> > > > + *
> > > > + * The default value of this setting is zero (buffered IO is
> > > > + * used). This setting takes immediate effect for all NFS
> > > > + * versions, all exports, and in all NFSD net namespaces.
> > > > + */
> > > > +
> > >
> > > Could we switch this to use a string instead? Maybe
> > > buffered/dontcache/direct ?
> >
> > That thought occurred to me too, since it would make the API a little
> > more self-documenting, and might be a harbinger of what a future
> > export option might look like.
> >
> >
> > > > +static int nfsd_io_cache_read_get(void *data, u64 *val)
> > > > +{
> > > > + *val = nfsd_io_cache_read;
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int nfsd_io_cache_read_set(void *data, u64 val)
> > > > +{
> > > > + switch (val) {
> > > > + case NFSD_IO_DONTCACHE:
> > > > + case NFSD_IO_DIRECT:
> > > > + /*
> > > > + * Must disable splice_read when enabling
> > > > + * NFSD_IO_DONTCACHE or NFSD_IO_DIRECT.
> > > > + */
> > > > + nfsd_disable_splice_read = true;
> > > > + nfsd_io_cache_read = val;
> > > > + break;
> > > > + case NFSD_IO_BUFFERED:
> > > > + default:
> > > > + nfsd_io_cache_read = NFSD_IO_BUFFERED;
> > > > + break;
> > >
> > > I think the default case should leave nfsd_io_cache_read alone and
> > > return an error. If we add new values later, and someone tries to use
> > > them on an old kernel, it's better to make that attempt error out.
> > >
> > > Ditto for the write side controls.
> >
> > +1 on both accounts.
>
> I started to implement this just now (so that I can kick v3 of this
> patchset out of the nest today) but soon found that debugfs doesn't
> provide string-based interface controls.
>
> See simple_attr_open() (which is used by DEFINE_DEBUGFS_ATTRIBUTE).
> It only allows u64 to be set/get.
>
> I'll fix the default case to return an error for now though.
>
> Once we graduate from debugfs to a proper per-export control we can
> impose string controls/mapping, e.g.:
>
> +static u64 nfsd_io_cache_string_to_mode(const char *nfsd_io_cache_string)
> +{
> + u64 val = NFSD_IO_UNKNOWN;
> +
> + if (!strncmp(nfsd_io_cache_string, NFSD_IO_BUFFERED_string,
> + strlen(NFSD_IO_BUFFERED_string)))
> + val = NFSD_IO_BUFFERED;
> + else if (!strncmp(nfsd_io_cache_string, NFSD_IO_DONTCACHE_string,
> + strlen(NFSD_IO_DONTCACHE_string)))
> + val = NFSD_IO_DONTCACHE;
> + else if (!strncmp(nfsd_io_cache_string, NFSD_IO_DIRECT_string,
> + strlen(NFSD_IO_DIRECT_string)))
> + val = NFSD_IO_DIRECT;
> +
> + return val;
> +}
> +
> +static const char *
> +nfsd_io_cache_mode_to_string(const char *nfsd_io_cache_string)
> +{
> + char *nfsd_io_cache_string;
> +
> + switch (val) {
> + case NFSD_IO_BUFFERED:
> + nfsd_io_cache_string = NFSD_IO_BUFFERED_string;
> + break;
> + case NFSD_IO_DONTCACHE:
> + nfsd_io_cache_string = NFSD_IO_DONTCACHE_string;
> + break;
> + case NFSD_IO_DIRECT:
> + nfsd_io_cache_string = NFSD_IO_DIRECT_string;
> + break;
> + case NFSD_IO_UNKNOWN:
> + nfsd_io_cache_string = NFSD_IO_UNKNOWN_string;
> + break;
> + }
> +
> + return nfsd_io_cache_string;
> +}
Bummer.
I guess we could just roll our own using the seqfile interfaces and put
it in the same directory. I may take a stab at that before we ship
this. For now, we can stick with the integers.
Thanks for fixing up the default case!
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-07-10 16:12 ` Mike Snitzer
2025-07-10 16:29 ` Keith Busch
@ 2025-08-01 15:23 ` Keith Busch
2025-08-01 16:10 ` Mike Snitzer
1 sibling, 1 reply; 33+ messages in thread
From: Keith Busch @ 2025-08-01 15:23 UTC (permalink / raw)
To: Mike Snitzer
Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
linux-block
On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> All said, in practice I haven't had any issues with this patch. But
> it could just be I don't have the stars aligned to test the case that
> might have problems. If you know of such a case I'd welcome
> suggestions.
This is something I threw together that appears to be successful with
NVMe through raw block direct-io. This will defer catching an invalid io
vector to much later in the block stack, which should be okay, and
removes one of the vector walks in the fast path, so that's a bonus.
While this is testing okay with NVMe so far, I haven't tested any more
complicated setups yet, and I probably need to get filesystems using
this relaxed limit too.
---
diff --git a/block/bio.c b/block/bio.c
index 92c512e876c8d..634b2031c4829 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1227,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
extraction_flags |= ITER_ALLOW_P2PDMA;
- /*
- * Each segment in the iov is required to be a block size multiple.
- * However, we may not be able to get the entire segment if it spans
- * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
- * result to ensure the bio's total size is correct. The remainder of
- * the iov data will be picked up in the next bio iteration.
- */
size = iov_iter_extract_pages(iter, &pages,
UINT_MAX - bio->bi_iter.bi_size,
nr_pages, extraction_flags, &offset);
@@ -1241,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return size ? size : -EFAULT;
nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-
- if (bio->bi_bdev) {
- size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
- iov_iter_revert(iter, trim);
- size -= trim;
- }
-
- if (unlikely(!size)) {
- ret = -EFAULT;
- goto out;
- }
-
for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
struct page *page = pages[i];
struct folio *folio = page_folio(page);
@@ -1297,6 +1278,23 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return ret;
}
+static int bio_align_to_bs(struct bio *bio, struct iov_iter *iter)
+{
+ unsigned int mask = bdev_logical_block_size(bio->bi_bdev) - 1;
+ unsigned int total = bio->bi_iter.bi_size;
+ size_t trim = total & mask;
+
+ if (!trim)
+ return 0;
+
+ /* FIXME: might be leaking pages */
+ bio_revert(bio, trim);
+ iov_iter_revert(iter, trim);
+ if (total == trim)
+ return -EFAULT;
+ return 0;
+}
+
/**
* bio_iov_iter_get_pages - add user or kernel pages to a bio
* @bio: bio to add pages to
@@ -1327,7 +1325,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
if (iov_iter_is_bvec(iter)) {
bio_iov_bvec_set(bio, iter);
iov_iter_advance(iter, bio->bi_iter.bi_size);
- return 0;
+ return bio_align_to_bs(bio, iter);
}
if (iov_iter_extract_will_pin(iter))
@@ -1336,6 +1334,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
+ ret = bio_align_to_bs(bio, iter);
return bio->bi_vcnt ? 0 : ret;
}
EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be52..a3acfef8eb81d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -298,6 +298,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
unsigned nsegs = 0, bytes = 0;
bio_for_each_bvec(bv, bio, iter) {
+ if (bv.bv_offset & lim->dma_alignment)
+ return -EFAULT;
+
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
@@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
* we do not use the full hardware limits.
*/
bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
+ if (!bytes)
+ return -EFAULT;
/*
* Bio splitting may cause subtle trouble such as hang when doing sync
diff --git a/block/fops.c b/block/fops.c
index 82451ac8ff25d..820902cf10730 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
struct iov_iter *iter)
{
- return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
- !bdev_iter_is_aligned(bdev, iter);
+ return (iocb->ki_pos | iov_iter_count(iter)) &
+ (bdev_logical_block_size(bdev) - 1);
}
#define DIO_INLINE_BIO_VECS 4
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 46ffac5caab78..d3ddf78d1f35e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -169,6 +169,22 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
+static inline void bio_revert(struct bio *bio, unsigned int nbytes)
+{
+ bio->bi_iter.bi_size -= nbytes;
+
+ while (nbytes) {
+ struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ if (nbytes < bv->bv_len) {
+ bv->bv_len -= nbytes;
+ return;
+ }
+ bio->bi_vcnt--;
+ nbytes -= bv->bv_len;
+ }
+}
+
static inline unsigned bio_segments(struct bio *bio)
{
unsigned segs = 0;
--
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec
2025-08-01 15:23 ` Keith Busch
@ 2025-08-01 16:10 ` Mike Snitzer
0 siblings, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2025-08-01 16:10 UTC (permalink / raw)
To: Keith Busch
Cc: Ming Lei, Jens Axboe, Jeff Layton, Chuck Lever, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, linux-nfs, linux-fsdevel, linux-mm, hch,
linux-block
On Fri, Aug 01, 2025 at 09:23:57AM -0600, Keith Busch wrote:
> On Thu, Jul 10, 2025 at 12:12:29PM -0400, Mike Snitzer wrote:
> > All said, in practice I haven't had any issues with this patch. But
> > it could just be I don't have the stars aligned to test the case that
> > might have problems. If you know of such a case I'd welcome
> > suggestions.
>
> This is something I threw together that appears to be successful with
> NVMe through raw block direct-io. This will defer catching an invalid io
> vector to much later in the block stack, which should be okay, and
> removes one of the vector walks in the fast path, so that's a bonus.
>
> While this is testing okay with NVMe so far, I haven't tested any more
> complicated setups yet, and I probably need to get filesystems using
> this relaxed limit too.
Ship it! ;)
Many thanks for this, I'll review closely and circle back!
Mike
> ---
> diff --git a/block/bio.c b/block/bio.c
> index 92c512e876c8d..634b2031c4829 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1227,13 +1227,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
> extraction_flags |= ITER_ALLOW_P2PDMA;
>
> - /*
> - * Each segment in the iov is required to be a block size multiple.
> - * However, we may not be able to get the entire segment if it spans
> - * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
> - * result to ensure the bio's total size is correct. The remainder of
> - * the iov data will be picked up in the next bio iteration.
> - */
> size = iov_iter_extract_pages(iter, &pages,
> UINT_MAX - bio->bi_iter.bi_size,
> nr_pages, extraction_flags, &offset);
> @@ -1241,18 +1234,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> return size ? size : -EFAULT;
>
> nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> -
> - if (bio->bi_bdev) {
> - size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
> - iov_iter_revert(iter, trim);
> - size -= trim;
> - }
> -
> - if (unlikely(!size)) {
> - ret = -EFAULT;
> - goto out;
> - }
> -
> for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
> struct page *page = pages[i];
> struct folio *folio = page_folio(page);
> @@ -1297,6 +1278,23 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> return ret;
> }
>
> +static int bio_align_to_bs(struct bio *bio, struct iov_iter *iter)
> +{
> + unsigned int mask = bdev_logical_block_size(bio->bi_bdev) - 1;
> + unsigned int total = bio->bi_iter.bi_size;
> + size_t trim = total & mask;
> +
> + if (!trim)
> + return 0;
> +
> + /* FIXME: might be leaking pages */
> + bio_revert(bio, trim);
> + iov_iter_revert(iter, trim);
> + if (total == trim)
> + return -EFAULT;
> + return 0;
> +}
> +
> /**
> * bio_iov_iter_get_pages - add user or kernel pages to a bio
> * @bio: bio to add pages to
> @@ -1327,7 +1325,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> if (iov_iter_is_bvec(iter)) {
> bio_iov_bvec_set(bio, iter);
> iov_iter_advance(iter, bio->bi_iter.bi_size);
> - return 0;
> + return bio_align_to_bs(bio, iter);
> }
>
> if (iov_iter_extract_will_pin(iter))
> @@ -1336,6 +1334,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> ret = __bio_iov_iter_get_pages(bio, iter);
> } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
>
> + ret = bio_align_to_bs(bio, iter);
> return bio->bi_vcnt ? 0 : ret;
> }
> EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be52..a3acfef8eb81d 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -298,6 +298,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> unsigned nsegs = 0, bytes = 0;
>
> bio_for_each_bvec(bv, bio, iter) {
> + if (bv.bv_offset & lim->dma_alignment)
> + return -EFAULT;
> +
> /*
> * If the queue doesn't support SG gaps and adding this
> * offset would create a gap, disallow it.
> @@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> * we do not use the full hardware limits.
> */
> bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
> + if (!bytes)
> + return -EFAULT;
>
> /*
> * Bio splitting may cause subtle trouble such as hang when doing sync
> diff --git a/block/fops.c b/block/fops.c
> index 82451ac8ff25d..820902cf10730 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
> static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
> struct iov_iter *iter)
> {
> - return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
> - !bdev_iter_is_aligned(bdev, iter);
> + return (iocb->ki_pos | iov_iter_count(iter)) &
> + (bdev_logical_block_size(bdev) - 1);
> }
>
> #define DIO_INLINE_BIO_VECS 4
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 46ffac5caab78..d3ddf78d1f35e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -169,6 +169,22 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
>
> #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
>
> +static inline void bio_revert(struct bio *bio, unsigned int nbytes)
> +{
> + bio->bi_iter.bi_size -= nbytes;
> +
> + while (nbytes) {
> + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> + if (nbytes < bv->bv_len) {
> + bv->bv_len -= nbytes;
> + return;
> + }
> + bio->bi_vcnt--;
> + nbytes -= bv->bv_len;
> + }
> +}
> +
> static inline unsigned bio_segments(struct bio *bio)
> {
> unsigned segs = 0;
> --
^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2025-08-01 16:10 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-08 16:06 [RFC PATCH v2 0/8] NFSD: support DIO Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 1/8] NFSD: Relocate the fh_want_write() and fh_drop_write() helpers Mike Snitzer
2025-07-10 13:59 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 2/8] NFSD: Move the fh_getattr() helper Mike Snitzer
2025-07-10 13:59 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 3/8] NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
2025-07-10 7:45 ` Christoph Hellwig
2025-07-14 17:46 ` Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 4/8] lib/iov_iter: remove piecewise bvec length checking in iov_iter_aligned_bvec Mike Snitzer
2025-07-10 7:24 ` Christoph Hellwig
2025-07-10 7:32 ` Mike Snitzer
2025-07-10 7:44 ` Christoph Hellwig
2025-07-10 13:52 ` Jeff Layton
2025-07-10 14:48 ` Keith Busch
2025-07-10 16:12 ` Mike Snitzer
2025-07-10 16:29 ` Keith Busch
2025-07-10 17:22 ` Mike Snitzer
2025-07-10 19:51 ` Keith Busch
2025-07-10 19:57 ` Keith Busch
2025-08-01 15:23 ` Keith Busch
2025-08-01 16:10 ` Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 5/8] NFSD: pass nfsd_file to nfsd_iter_read() Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 6/8] NFSD: add io_cache_read controls to debugfs interface Mike Snitzer
2025-07-10 7:47 ` Christoph Hellwig
2025-07-14 17:33 ` Mike Snitzer
2025-07-10 14:06 ` Jeff Layton
2025-07-10 22:46 ` Chuck Lever
2025-07-14 16:47 ` Mike Snitzer
2025-07-15 11:57 ` Jeff Layton
2025-07-08 16:06 ` [RFC PATCH v2 7/8] NFSD: add io_cache_write " Mike Snitzer
2025-07-08 16:06 ` [RFC PATCH v2 8/8] NFSD: issue READs using O_DIRECT even if IO is misaligned Mike Snitzer
2025-07-08 21:22 ` Mike Snitzer
2025-07-10 7:51 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).