linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO
@ 2025-08-05 23:20 Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 01/11] NFS/localio: nfs_close_local_fh() fix check for file closed Mike Snitzer
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:20 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

Hi,

NFS and LOCALIO in particular benefit from avoiding the page cache for
workloads that have a working set that is significantly larger than
available system memory. Enter: NFS DIRECT, which makes it possible to
always enable LOCALIO to use O_DIRECT even if the IO is not
DIO-aligned.

Changes since v6:
- Include Trond's 3 LOCALIO fixes that seem to have been forgotten;
  not related to rest of this patchset other than LOCALIO related.
- Patch 4 is a LOCALIO fix that should be picked up for 6.17 too.
- Update LOCALIO to not use iov_iter_is_aligned() because it will
  soon be removed upstream.
- Add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support

Changes since v5:
- I split the NFS changes back out for v6 since the NFSD DIRECT
  changes have started to land in nfsd-testing
- With the benefit of having updated the NFSD trace points to use an
  EVENT_CLASS I have now updated NFS's equivalents to also use one.
- Updated patch headers.
- Patches 4 and 5, while not strictly needed, are "nice to have"
  because they evolve the NFS LOCALIO code to a better place.

All review appreciated, thanks.
Mike

Mike Snitzer (8):
  nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
  nfs/localio: make trace_nfs_local_open_fh more useful
  nfs/localio: add nfsd_file_dio_alignment
  nfs/localio: refactor iocb initialization
  nfs/localio: fallback to NFSD for misaligned O_DIRECT READs
  nfs/direct: add misaligned READ handling
  nfs/direct: add misaligned WRITE handling
  NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support

Trond Myklebust (3):
  NFS/localio: nfs_close_local_fh() fix check for file closed
  NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh()
  NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file

 fs/nfs/direct.c            | 258 ++++++++++++++++++++++++++++++++++---
 fs/nfs/inode.c             |  15 +++
 fs/nfs/internal.h          |  17 ++-
 fs/nfs/localio.c           | 232 +++++++++++++++++++++------------
 fs/nfs/nfstrace.h          |  64 ++++++++-
 fs/nfs/pagelist.c          |  22 +++-
 fs/nfs_common/nfslocalio.c |  28 ++--
 fs/nfsd/localio.c          |  11 ++
 include/linux/nfs_page.h   |   1 +
 include/linux/nfslocalio.h |   2 +
 10 files changed, 529 insertions(+), 121 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v7 01/11] NFS/localio: nfs_close_local_fh() fix check for file closed
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
@ 2025-08-05 23:20 ` Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 02/11] NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh() Mike Snitzer
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:20 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the struct nfs_file_localio is closed, its list entry will be empty,
but the nfs_uuid->files list might still contain other entries.

Acked-by: Mike Snitzer <snitzer@kernel.org>
Tested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Fixes: 21fb44034695 ("nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs_common/nfslocalio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
index 05c7c16e37ab4..64949c46c1741 100644
--- a/fs/nfs_common/nfslocalio.c
+++ b/fs/nfs_common/nfslocalio.c
@@ -314,7 +314,7 @@ void nfs_close_local_fh(struct nfs_file_localio *nfl)
 		rcu_read_unlock();
 		return;
 	}
-	if (list_empty(&nfs_uuid->files)) {
+	if (list_empty(&nfl->list)) {
 		/* nfs_uuid_put() has started closing files, wait for it
 		 * to finished
 		 */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 02/11] NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh()
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 01/11] NFS/localio: nfs_close_local_fh() fix check for file closed Mike Snitzer
@ 2025-08-05 23:20 ` Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 03/11] NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file Mike Snitzer
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:20 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

In order for the wait in nfs_uuid_put() to be safe, it is necessary to
ensure that nfs_uuid_add_file() doesn't add a new entry once the
nfs_uuid->net has been NULLed out.

Also fix up the wake_up_var_locked() / wait_var_event_spinlock() to both
use the nfs_uuid address, since nfl, and &nfl->uuid could be used elsewhere.

Acked-by: Mike Snitzer <snitzer@kernel.org>
Tested-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/all/175262893035.2234665.1735173020338594784@noble.neil.brown.name/
Fixes: 21fb44034695 ("nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs_common/nfslocalio.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
index 64949c46c1741..f1f1592ac1348 100644
--- a/fs/nfs_common/nfslocalio.c
+++ b/fs/nfs_common/nfslocalio.c
@@ -177,7 +177,7 @@ static bool nfs_uuid_put(nfs_uuid_t *nfs_uuid)
 			/* nfs_close_local_fh() is doing the
 			 * close and we must wait. until it unlinks
 			 */
-			wait_var_event_spinlock(nfl,
+			wait_var_event_spinlock(nfs_uuid,
 						list_first_entry_or_null(
 							&nfs_uuid->files,
 							struct nfs_file_localio,
@@ -243,15 +243,20 @@ void nfs_localio_invalidate_clients(struct list_head *nn_local_clients,
 }
 EXPORT_SYMBOL_GPL(nfs_localio_invalidate_clients);
 
-static void nfs_uuid_add_file(nfs_uuid_t *nfs_uuid, struct nfs_file_localio *nfl)
+static int nfs_uuid_add_file(nfs_uuid_t *nfs_uuid, struct nfs_file_localio *nfl)
 {
+	int ret = 0;
+
 	/* Add nfl to nfs_uuid->files if it isn't already */
 	spin_lock(&nfs_uuid->lock);
-	if (list_empty(&nfl->list)) {
+	if (rcu_access_pointer(nfs_uuid->net) == NULL) {
+		ret = -ENXIO;
+	} else if (list_empty(&nfl->list)) {
 		rcu_assign_pointer(nfl->nfs_uuid, nfs_uuid);
 		list_add_tail(&nfl->list, &nfs_uuid->files);
 	}
 	spin_unlock(&nfs_uuid->lock);
+	return ret;
 }
 
 /*
@@ -285,11 +290,13 @@ struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *uuid,
 	}
 	rcu_read_unlock();
 	/* We have an implied reference to net thanks to nfsd_net_try_get */
-	localio = nfs_to->nfsd_open_local_fh(net, uuid->dom, rpc_clnt,
-					     cred, nfs_fh, pnf, fmode);
+	localio = nfs_to->nfsd_open_local_fh(net, uuid->dom, rpc_clnt, cred,
+					     nfs_fh, pnf, fmode);
+	if (!IS_ERR(localio) && nfs_uuid_add_file(uuid, nfl) < 0) {
+		/* Delete the cached file when racing with nfs_uuid_put() */
+		nfs_to_nfsd_file_put_local(pnf);
+	}
 	nfs_to_nfsd_net_put(net);
-	if (!IS_ERR(localio))
-		nfs_uuid_add_file(uuid, nfl);
 
 	return localio;
 }
@@ -338,7 +345,7 @@ void nfs_close_local_fh(struct nfs_file_localio *nfl)
 	 */
 	spin_lock(&nfs_uuid->lock);
 	list_del_init(&nfl->list);
-	wake_up_var_locked(&nfl->nfs_uuid, &nfs_uuid->lock);
+	wake_up_var_locked(nfs_uuid, &nfs_uuid->lock);
 	spin_unlock(&nfs_uuid->lock);
 }
 EXPORT_SYMBOL_GPL(nfs_close_local_fh);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 03/11] NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 01/11] NFS/localio: nfs_close_local_fh() fix check for file closed Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 02/11] NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh() Mike Snitzer
@ 2025-08-05 23:20 ` Mike Snitzer
  2025-08-05 23:20 ` [PATCH v7 04/11] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local() Mike Snitzer
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:20 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Use store_release_wake_up() instead of wake_up_var_locked(), because the
waiter cannot retake the nfs_uuid->lock.

Acked-by: Mike Snitzer <snitzer@kernel.org>
Tested-by: Mike Snitzer <snitzer@kernel.org>
Suggested-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/all/175262948827.2234665.1891349021754495573@noble.neil.brown.name/
Fixes: 21fb44034695 ("nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs_common/nfslocalio.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
index f1f1592ac1348..dd715cdb6c043 100644
--- a/fs/nfs_common/nfslocalio.c
+++ b/fs/nfs_common/nfslocalio.c
@@ -198,8 +198,7 @@ static bool nfs_uuid_put(nfs_uuid_t *nfs_uuid)
 		/* Now we can allow racing nfs_close_local_fh() to
 		 * skip the locking.
 		 */
-		RCU_INIT_POINTER(nfl->nfs_uuid, NULL);
-		wake_up_var_locked(&nfl->nfs_uuid, &nfs_uuid->lock);
+		store_release_wake_up(&nfl->nfs_uuid, RCU_INITIALIZER(NULL));
 	}
 
 	/* Remove client from nn->local_clients */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 04/11] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (2 preceding siblings ...)
  2025-08-05 23:20 ` [PATCH v7 03/11] NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file Mike Snitzer
@ 2025-08-05 23:20 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 05/11] nfs/localio: make trace_nfs_local_open_fh more useful Mike Snitzer
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:20 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Mike Snitzer <snitzer@hammerspace.com>

Previously nfs_local_probe() was made to disable and then attempt to
re-enable LOCALIO (via LOCALIO protocol handshake) if/when it was
called and LOCALIO already enabled.

Vague memory for _why_ this was the case is that this was useful
if/when a local NFS server were to be restarted with a local NFS
client connected to it.

But as it happens this causes an absurd amount of LOCALIO flapping
which has a side-effect of too much IO being needlessly sent to NFSD
(using RPC over the loopback network interface).  This is the
definition of "serious performance loss" (that negates the point of
having LOCALIO).

So remove this mis-optimization for re-enabling LOCALIO if/when an NFS
server is restarted (which is an extremely rare thing to do).  Will
revisit testing that scenario again but in the meantime this patch
restores the full benefit of LOCALIO.

Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
---
 fs/nfs/localio.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index bd5fca2858998..9adcd1380bbba 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -180,10 +180,8 @@ static void nfs_local_probe(struct nfs_client *clp)
 		return;
 	}
 
-	if (nfs_client_is_local(clp)) {
-		/* If already enabled, disable and re-enable */
-		nfs_localio_disable_client(clp);
-	}
+	if (nfs_client_is_local(clp))
+		return;
 
 	if (!nfs_uuid_begin(&clp->cl_uuid))
 		return;
@@ -244,7 +242,8 @@ __nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
 		case -ENOMEM:
 		case -ENXIO:
 		case -ENOENT:
-			/* Revalidate localio, will disable if unsupported */
+			/* Revalidate localio */
+			nfs_localio_disable_client(clp);
 			nfs_local_probe(clp);
 		}
 	}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 05/11] nfs/localio: make trace_nfs_local_open_fh more useful
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (3 preceding siblings ...)
  2025-08-05 23:20 ` [PATCH v7 04/11] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local() Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 06/11] nfs/localio: add nfsd_file_dio_alignment Mike Snitzer
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Mike Snitzer <snitzer@hammerspace.com>

Always trigger trace event when LOCALIO opens a file.

Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
 fs/nfs/localio.c  | 5 +++--
 fs/nfs/nfstrace.h | 6 +++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 9adcd1380bbba..2a6e1d3a764ba 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -231,13 +231,13 @@ __nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
 		    struct nfsd_file __rcu **pnf,
 		    const fmode_t mode)
 {
+	int status = 0;
 	struct nfsd_file *localio;
 
 	localio = nfs_open_local_fh(&clp->cl_uuid, clp->cl_rpcclient,
 				    cred, fh, nfl, pnf, mode);
 	if (IS_ERR(localio)) {
-		int status = PTR_ERR(localio);
-		trace_nfs_local_open_fh(fh, mode, status);
+		status = PTR_ERR(localio);
 		switch (status) {
 		case -ENOMEM:
 		case -ENXIO:
@@ -247,6 +247,7 @@ __nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
 			nfs_local_probe(clp);
 		}
 	}
+	trace_nfs_local_open_fh(fh, mode, status);
 	return localio;
 }
 
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 96b1323318c2f..4ec66d5e9cc6c 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1712,10 +1712,10 @@ TRACE_EVENT(nfs_local_open_fh,
 		),
 
 		TP_printk(
-			"error=%d fhandle=0x%08x mode=%s",
-			__entry->error,
+			"fhandle=0x%08x mode=%s result=%d",
 			__entry->fhandle,
-			show_fs_fmode_flags(__entry->fmode)
+			show_fs_fmode_flags(__entry->fmode),
+			__entry->error
 		)
 );
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 06/11] nfs/localio: add nfsd_file_dio_alignment
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (4 preceding siblings ...)
  2025-08-05 23:21 ` [PATCH v7 05/11] nfs/localio: make trace_nfs_local_open_fh more useful Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 07/11] nfs/localio: refactor iocb initialization Mike Snitzer
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Mike Snitzer <snitzer@hammerspace.com>

And use it to avoid issuing misaligned IO using O_DIRECT.

Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
 fs/nfs/localio.c           | 25 +++++++++++++++++++++----
 fs/nfsd/localio.c          | 11 +++++++++++
 include/linux/nfslocalio.h |  2 ++
 3 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 2a6e1d3a764ba..932019c074a6e 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -322,12 +322,10 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
 		return NULL;
 	}
 
+	init_sync_kiocb(&iocb->kiocb, file);
 	if (localio_O_DIRECT_semantics &&
-	    test_bit(NFS_IOHDR_ODIRECT, &hdr->flags)) {
-		iocb->kiocb.ki_filp = file;
+	    test_bit(NFS_IOHDR_ODIRECT, &hdr->flags))
 		iocb->kiocb.ki_flags = IOCB_DIRECT;
-	} else
-		init_sync_kiocb(&iocb->kiocb, file);
 
 	iocb->kiocb.ki_pos = hdr->args.offset;
 	iocb->hdr = hdr;
@@ -346,6 +344,25 @@ nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
 		      hdr->args.count + hdr->args.pgbase);
 	if (hdr->args.pgbase != 0)
 		iov_iter_advance(i, hdr->args.pgbase);
+
+	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+		u32 nf_dio_mem_align, nf_dio_offset_align, nf_dio_read_offset_align;
+		/* Verify the IO is DIO-aligned as required */
+		nfs_to->nfsd_file_dio_alignment(iocb->localio, &nf_dio_mem_align,
+						&nf_dio_offset_align,
+						&nf_dio_read_offset_align);
+		if (dir == READ)
+			nf_dio_offset_align = nf_dio_read_offset_align;
+		/* direct I/O must be aligned to device logical sector size */
+		if (nf_dio_mem_align && nf_dio_offset_align &&
+		    (hdr->args.pgbase && (hdr->args.pgbase & (nf_dio_mem_align - 1)) == 0) &&
+		    (((hdr->args.offset | hdr->args.count) & (nf_dio_offset_align - 1)) == 0))
+			return 0;
+
+		/* Fallback to using buffered for this misaligned IO */
+		iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
+		iocb->kiocb.ki_filp->f_flags &= ~O_DIRECT;
+	}
 }
 
 static void
diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
index 269fa9391dc46..be710d809a3ba 100644
--- a/fs/nfsd/localio.c
+++ b/fs/nfsd/localio.c
@@ -117,12 +117,23 @@ nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
 	return localio;
 }
 
+static void nfsd_file_dio_alignment(struct nfsd_file *nf,
+				    u32 *nf_dio_mem_align,
+				    u32 *nf_dio_offset_align,
+				    u32 *nf_dio_read_offset_align)
+{
+	*nf_dio_mem_align = nf->nf_dio_mem_align;
+	*nf_dio_offset_align = nf->nf_dio_offset_align;
+	*nf_dio_read_offset_align = nf->nf_dio_read_offset_align;
+}
+
 static const struct nfsd_localio_operations nfsd_localio_ops = {
 	.nfsd_net_try_get  = nfsd_net_try_get,
 	.nfsd_net_put  = nfsd_net_put,
 	.nfsd_open_local_fh = nfsd_open_local_fh,
 	.nfsd_file_put_local = nfsd_file_put_local,
 	.nfsd_file_file = nfsd_file_file,
+	.nfsd_file_dio_alignment = nfsd_file_dio_alignment,
 };
 
 void nfsd_localio_ops_init(void)
diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
index 59ea90bd136b6..3d91043254e64 100644
--- a/include/linux/nfslocalio.h
+++ b/include/linux/nfslocalio.h
@@ -64,6 +64,8 @@ struct nfsd_localio_operations {
 						const fmode_t);
 	struct net *(*nfsd_file_put_local)(struct nfsd_file __rcu **);
 	struct file *(*nfsd_file_file)(struct nfsd_file *);
+	void (*nfsd_file_dio_alignment)(struct nfsd_file *,
+					u32 *, u32 *, u32 *);
 } ____cacheline_aligned;
 
 extern void nfsd_localio_ops_init(void);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 07/11] nfs/localio: refactor iocb initialization
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (5 preceding siblings ...)
  2025-08-05 23:21 ` [PATCH v7 06/11] nfs/localio: add nfsd_file_dio_alignment Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 08/11] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs Mike Snitzer
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Mike Snitzer <snitzer@hammerspace.com>

No functional change, but slighty cleaner.

Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
 fs/nfs/localio.c | 56 +++++++++++++++++++++---------------------------
 1 file changed, 25 insertions(+), 31 deletions(-)

diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index 932019c074a6e..ac5e0dc405564 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -282,23 +282,6 @@ nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
 }
 EXPORT_SYMBOL_GPL(nfs_local_open_fh);
 
-static struct bio_vec *
-nfs_bvec_alloc_and_import_pagevec(struct page **pagevec,
-		unsigned int npages, gfp_t flags)
-{
-	struct bio_vec *bvec, *p;
-
-	bvec = kmalloc_array(npages, sizeof(*bvec), flags);
-	if (bvec != NULL) {
-		for (p = bvec; npages > 0; p++, pagevec++, npages--) {
-			p->bv_page = *pagevec;
-			p->bv_len = PAGE_SIZE;
-			p->bv_offset = 0;
-		}
-	}
-	return bvec;
-}
-
 static void
 nfs_local_iocb_free(struct nfs_local_kiocb *iocb)
 {
@@ -315,8 +298,9 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
 	iocb = kmalloc(sizeof(*iocb), flags);
 	if (iocb == NULL)
 		return NULL;
-	iocb->bvec = nfs_bvec_alloc_and_import_pagevec(hdr->page_array.pagevec,
-			hdr->page_array.npages, flags);
+
+	iocb->bvec = kmalloc_array(hdr->page_array.npages,
+				   sizeof(struct bio_vec), flags);
 	if (iocb->bvec == NULL) {
 		kfree(iocb);
 		return NULL;
@@ -339,9 +323,21 @@ static void
 nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
 {
 	struct nfs_pgio_header *hdr = iocb->hdr;
+	struct page **pagevec = hdr->page_array.pagevec;
+	unsigned long v, total;
+	size_t len;
 
-	iov_iter_bvec(i, dir, iocb->bvec, hdr->page_array.npages,
-		      hdr->args.count + hdr->args.pgbase);
+	v = 0;
+	total = hdr->args.count + hdr->args.pgbase;
+	while (total) {
+		len = min_t(size_t, total, PAGE_SIZE);
+		bvec_set_page(&iocb->bvec[v], *(pagevec++), len, 0);
+		total -= len;
+		++v;
+	}
+	WARN_ON_ONCE(v != hdr->page_array.npages);
+
+	iov_iter_bvec(i, dir, iocb->bvec, v, hdr->args.count + hdr->args.pgbase);
 	if (hdr->args.pgbase != 0)
 		iov_iter_advance(i, hdr->args.pgbase);
 
@@ -468,6 +464,10 @@ static void nfs_local_call_read(struct work_struct *work)
 	save_cred = override_creds(filp->f_cred);
 
 	nfs_local_iter_init(&iter, iocb, READ);
+	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+		iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
+		iocb->aio_complete_work = nfs_local_read_aio_complete_work;
+	}
 
 	status = filp->f_op->read_iter(&iocb->kiocb, &iter);
 	if (status != -EIOCBQUEUED) {
@@ -501,11 +501,6 @@ nfs_do_local_read(struct nfs_pgio_header *hdr,
 	nfs_local_pgio_init(hdr, call_ops);
 	hdr->res.eof = false;
 
-	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
-		iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
-		iocb->aio_complete_work = nfs_local_read_aio_complete_work;
-	}
-
 	INIT_WORK(&iocb->work, nfs_local_call_read);
 	queue_work(nfslocaliod_workqueue, &iocb->work);
 
@@ -661,6 +656,10 @@ static void nfs_local_call_write(struct work_struct *work)
 	save_cred = override_creds(filp->f_cred);
 
 	nfs_local_iter_init(&iter, iocb, WRITE);
+	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+		iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
+		iocb->aio_complete_work = nfs_local_write_aio_complete_work;
+	}
 
 	file_start_write(filp);
 	status = filp->f_op->write_iter(&iocb->kiocb, &iter);
@@ -710,11 +709,6 @@ nfs_do_local_write(struct nfs_pgio_header *hdr,
 
 	nfs_set_local_verifier(hdr->inode, hdr->res.verf, hdr->args.stable);
 
-	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
-		iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
-		iocb->aio_complete_work = nfs_local_write_aio_complete_work;
-	}
-
 	INIT_WORK(&iocb->work, nfs_local_call_write);
 	queue_work(nfslocaliod_workqueue, &iocb->work);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 08/11] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (6 preceding siblings ...)
  2025-08-05 23:21 ` [PATCH v7 07/11] nfs/localio: refactor iocb initialization Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 09/11] nfs/direct: add misaligned READ handling Mike Snitzer
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

But this fallback is sub-optimal due to resorting to using RPC and
will only serve as a last resort if NFS client's O_DIRECT support
fails to align misaligned IO (support is added in subsequent patches).

Add 'localio_O_DIRECT_align_misaligned_IO' modparm, which depends on
localio_O_DIRECT_semantics=Y, to control if LOCALIO will make best
effort to transform misaligned IO to DIO-aligned (e.g. expanding
misaligned READ to DIO-aligned).

If LOCALIO determines that an O_DIRECT READ is misaligned, and larger
than 32K, then it makes sense to immediately issue the READ remotely
via NFSD (which has the ability to expand a misaligned O_DIRECT READ
to be DIO-aligned) if/when NFSD is configured to use O_DIRECT for READ
IO with: echo 3 > /sys/kernel/debug/nfsd/io_cache_read

This commit's various refactoring makes it possible for LOCALIO to
fallback to NFS pagelist code in process context to allow for
immediate retry over RPC. This refactoring alone makes this commit
worthwile even though it is highly unlikely that LOCALIO will ever
fallback to NFSD for misaligned READs (again, only a bug in the
subsequent patches would be cause for fallback).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/internal.h |   9 ++-
 fs/nfs/localio.c  | 167 ++++++++++++++++++++++++++++++----------------
 fs/nfs/pagelist.c |  15 +++--
 3 files changed, 127 insertions(+), 64 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 522011eea5f2f..f1015413b85cf 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -462,13 +462,14 @@ extern struct nfsd_file *nfs_local_open_fh(struct nfs_client *,
 					   struct nfs_file_localio *,
 					   const fmode_t);
 extern int nfs_local_doio(struct nfs_client *,
-			  struct nfsd_file *,
+			  struct nfsd_file **,
 			  struct nfs_pgio_header *,
 			  const struct rpc_call_ops *);
 extern int nfs_local_commit(struct nfsd_file *,
 			    struct nfs_commit_data *,
 			    const struct rpc_call_ops *, int);
 extern bool nfs_server_is_local(const struct nfs_client *clp);
+extern bool nfs_localio_O_DIRECT_align_misaligned_IO(void);
 
 #else /* CONFIG_NFS_LOCALIO */
 static inline void nfs_local_probe(struct nfs_client *clp) {}
@@ -481,7 +482,7 @@ nfs_local_open_fh(struct nfs_client *clp, const struct cred *cred,
 	return NULL;
 }
 static inline int nfs_local_doio(struct nfs_client *clp,
-				 struct nfsd_file *localio,
+				 struct nfsd_file **localio,
 				 struct nfs_pgio_header *hdr,
 				 const struct rpc_call_ops *call_ops)
 {
@@ -497,6 +498,10 @@ static inline bool nfs_server_is_local(const struct nfs_client *clp)
 {
 	return false;
 }
+static inline bool nfs_localio_O_DIRECT_align_misaligned_IO(void)
+{
+	return false;
+}
 #endif /* CONFIG_NFS_LOCALIO */
 
 /* super.c */
diff --git a/fs/nfs/localio.c b/fs/nfs/localio.c
index ac5e0dc405564..3e625947ad796 100644
--- a/fs/nfs/localio.c
+++ b/fs/nfs/localio.c
@@ -36,6 +36,7 @@ struct nfs_local_kiocb {
 	struct nfs_pgio_header	*hdr;
 	struct work_struct	work;
 	void (*aio_complete_work)(struct work_struct *);
+	struct iov_iter		iter ____cacheline_aligned;
 	struct nfsd_file	*localio;
 };
 
@@ -54,6 +55,11 @@ module_param(localio_O_DIRECT_semantics, bool, 0644);
 MODULE_PARM_DESC(localio_O_DIRECT_semantics,
 		 "LOCALIO will use O_DIRECT semantics to filesystem.");
 
+static bool localio_O_DIRECT_align_misaligned_IO __read_mostly = true;
+module_param(localio_O_DIRECT_align_misaligned_IO, bool, 0644);
+MODULE_PARM_DESC(localio_O_DIRECT_align_misaligned_IO,
+		 "If LOCALIO_O_DIRECT_semantics=Y make best effort to transform misaligned IO to DIO-aligned.");
+
 static inline bool nfs_client_is_local(const struct nfs_client *clp)
 {
 	return !!rcu_access_pointer(clp->cl_uuid.net);
@@ -65,6 +71,12 @@ bool nfs_server_is_local(const struct nfs_client *clp)
 }
 EXPORT_SYMBOL_GPL(nfs_server_is_local);
 
+bool nfs_localio_O_DIRECT_align_misaligned_IO(void)
+{
+	return localio_O_DIRECT_align_misaligned_IO;
+}
+EXPORT_SYMBOL_GPL(nfs_localio_O_DIRECT_align_misaligned_IO);
+
 /*
  * UUID_IS_LOCAL XDR functions
  */
@@ -319,46 +331,60 @@ nfs_local_iocb_alloc(struct nfs_pgio_header *hdr,
 	return iocb;
 }
 
-static void
-nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int dir)
+static int
+nfs_local_iter_init(struct iov_iter *i, struct nfs_local_kiocb *iocb, int rw)
 {
 	struct nfs_pgio_header *hdr = iocb->hdr;
 	struct page **pagevec = hdr->page_array.pagevec;
+	bool misaligned_DIO = false;
 	unsigned long v, total;
 	size_t len;
 
+	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
+		u32 nf_dio_mem_align, nf_dio_offset_align, nf_dio_read_offset_align;
+		/* Verify the IO is DIO-aligned as required */
+		nfs_to->nfsd_file_dio_alignment(iocb->localio, &nf_dio_mem_align,
+						&nf_dio_offset_align,
+						&nf_dio_read_offset_align);
+		if (rw == ITER_DEST)
+			nf_dio_offset_align = nf_dio_read_offset_align;
+		if (!nf_dio_mem_align || !nf_dio_offset_align ||
+		    (hdr->args.pgbase && (hdr->args.pgbase & (nf_dio_mem_align - 1))) ||
+		    ((hdr->args.offset | hdr->args.count) & (nf_dio_offset_align - 1)))
+			misaligned_DIO = true;
+	}
+
 	v = 0;
 	total = hdr->args.count + hdr->args.pgbase;
 	while (total) {
 		len = min_t(size_t, total, PAGE_SIZE);
 		bvec_set_page(&iocb->bvec[v], *(pagevec++), len, 0);
+		/* No need to verify memory is DIO-aligned since bv_offset is 0 */
 		total -= len;
 		++v;
 	}
 	WARN_ON_ONCE(v != hdr->page_array.npages);
 
-	iov_iter_bvec(i, dir, iocb->bvec, v, hdr->args.count + hdr->args.pgbase);
+	iov_iter_bvec(i, rw, iocb->bvec, v, hdr->args.count + hdr->args.pgbase);
 	if (hdr->args.pgbase != 0)
 		iov_iter_advance(i, hdr->args.pgbase);
 
-	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
-		u32 nf_dio_mem_align, nf_dio_offset_align, nf_dio_read_offset_align;
-		/* Verify the IO is DIO-aligned as required */
-		nfs_to->nfsd_file_dio_alignment(iocb->localio, &nf_dio_mem_align,
-						&nf_dio_offset_align,
-						&nf_dio_read_offset_align);
-		if (dir == READ)
-			nf_dio_offset_align = nf_dio_read_offset_align;
-		/* direct I/O must be aligned to device logical sector size */
-		if (nf_dio_mem_align && nf_dio_offset_align &&
-		    (hdr->args.pgbase && (hdr->args.pgbase & (nf_dio_mem_align - 1)) == 0) &&
-		    (((hdr->args.offset | hdr->args.count) & (nf_dio_offset_align - 1)) == 0))
-			return 0;
-
+	if (misaligned_DIO) {
+		/* Only send misaligned READ to NFSD if 32K or larger */
+		if (localio_O_DIRECT_align_misaligned_IO &&
+		    (rw == ITER_DEST) && (hdr->args.count >= (32 << 10))) {
+			/*
+			 * Fallback to sending this READ to NFSD since it
+			 * can expand misaligned READ IO to be DIO-aligned.
+			 */
+			return -ENOSYS;
+		}
 		/* Fallback to using buffered for this misaligned IO */
 		iocb->kiocb.ki_flags &= ~IOCB_DIRECT;
 		iocb->kiocb.ki_filp->f_flags &= ~O_DIRECT;
 	}
+
+	return 0;
 }
 
 static void
@@ -391,13 +417,18 @@ nfs_local_pgio_done(struct nfs_pgio_header *hdr, long status)
 	}
 }
 
-static void
-nfs_local_pgio_release(struct nfs_local_kiocb *iocb)
+static void nfs_local_iocb_release(struct nfs_local_kiocb *iocb)
 {
-	struct nfs_pgio_header *hdr = iocb->hdr;
-
 	nfs_local_file_put(iocb->localio);
 	nfs_local_iocb_free(iocb);
+}
+
+static void
+nfs_local_pgio_release(struct nfs_local_kiocb *iocb)
+{
+	struct nfs_pgio_header *hdr = iocb->hdr;
+
+	nfs_local_iocb_release(iocb);
 	nfs_local_hdr_release(hdr, hdr->task.tk_ops);
 }
 
@@ -458,18 +489,16 @@ static void nfs_local_call_read(struct work_struct *work)
 		container_of(work, struct nfs_local_kiocb, work);
 	struct file *filp = iocb->kiocb.ki_filp;
 	const struct cred *save_cred;
-	struct iov_iter iter;
 	ssize_t status;
 
 	save_cred = override_creds(filp->f_cred);
 
-	nfs_local_iter_init(&iter, iocb, READ);
 	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
 		iocb->kiocb.ki_complete = nfs_local_read_aio_complete;
 		iocb->aio_complete_work = nfs_local_read_aio_complete_work;
 	}
 
-	status = filp->f_op->read_iter(&iocb->kiocb, &iter);
+	status = filp->f_op->read_iter(&iocb->kiocb, &iocb->iter);
 	if (status != -EIOCBQUEUED) {
 		nfs_local_read_done(iocb, status);
 		nfs_local_pgio_release(iocb);
@@ -479,25 +508,14 @@ static void nfs_local_call_read(struct work_struct *work)
 }
 
 static int
-nfs_do_local_read(struct nfs_pgio_header *hdr,
-		  struct nfsd_file *localio,
+nfs_local_do_read(struct nfs_local_kiocb *iocb,
 		  const struct rpc_call_ops *call_ops)
 {
-	struct nfs_local_kiocb *iocb;
-	struct file *file = nfs_to->nfsd_file_file(localio);
-
-	/* Don't support filesystems without read_iter */
-	if (!file->f_op->read_iter)
-		return -EAGAIN;
+	struct nfs_pgio_header *hdr = iocb->hdr;
 
 	dprintk("%s: vfs_read count=%u pos=%llu\n",
 		__func__, hdr->args.count, hdr->args.offset);
 
-	iocb = nfs_local_iocb_alloc(hdr, file, GFP_KERNEL);
-	if (iocb == NULL)
-		return -ENOMEM;
-	iocb->localio = localio;
-
 	nfs_local_pgio_init(hdr, call_ops);
 	hdr->res.eof = false;
 
@@ -649,20 +667,18 @@ static void nfs_local_call_write(struct work_struct *work)
 	struct file *filp = iocb->kiocb.ki_filp;
 	unsigned long old_flags = current->flags;
 	const struct cred *save_cred;
-	struct iov_iter iter;
 	ssize_t status;
 
 	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
 	save_cred = override_creds(filp->f_cred);
 
-	nfs_local_iter_init(&iter, iocb, WRITE);
 	if (iocb->kiocb.ki_flags & IOCB_DIRECT) {
 		iocb->kiocb.ki_complete = nfs_local_write_aio_complete;
 		iocb->aio_complete_work = nfs_local_write_aio_complete_work;
 	}
 
 	file_start_write(filp);
-	status = filp->f_op->write_iter(&iocb->kiocb, &iter);
+	status = filp->f_op->write_iter(&iocb->kiocb, &iocb->iter);
 	file_end_write(filp);
 	if (status != -EIOCBQUEUED) {
 		nfs_local_write_done(iocb, status);
@@ -675,26 +691,15 @@ static void nfs_local_call_write(struct work_struct *work)
 }
 
 static int
-nfs_do_local_write(struct nfs_pgio_header *hdr,
-		   struct nfsd_file *localio,
+nfs_local_do_write(struct nfs_local_kiocb *iocb,
 		   const struct rpc_call_ops *call_ops)
 {
-	struct nfs_local_kiocb *iocb;
-	struct file *file = nfs_to->nfsd_file_file(localio);
-
-	/* Don't support filesystems without write_iter */
-	if (!file->f_op->write_iter)
-		return -EAGAIN;
+	struct nfs_pgio_header *hdr = iocb->hdr;
 
 	dprintk("%s: vfs_write count=%u pos=%llu %s\n",
 		__func__, hdr->args.count, hdr->args.offset,
 		(hdr->args.stable == NFS_UNSTABLE) ?  "unstable" : "stable");
 
-	iocb = nfs_local_iocb_alloc(hdr, file, GFP_NOIO);
-	if (iocb == NULL)
-		return -ENOMEM;
-	iocb->localio = localio;
-
 	switch (hdr->args.stable) {
 	default:
 		break;
@@ -715,32 +720,78 @@ nfs_do_local_write(struct nfs_pgio_header *hdr,
 	return 0;
 }
 
-int nfs_local_doio(struct nfs_client *clp, struct nfsd_file *localio,
+static struct nfs_local_kiocb *
+nfs_local_iocb_init(struct nfs_pgio_header *hdr, struct nfsd_file **localio)
+{
+	struct file *file = nfs_to->nfsd_file_file(*localio);
+	struct nfs_local_kiocb *iocb;
+	gfp_t gfp_mask;
+	int rw, status;
+
+	if (hdr->rw_mode & FMODE_READ) {
+		if (!file->f_op->read_iter)
+			return ERR_PTR(-EOPNOTSUPP);
+		gfp_mask = GFP_KERNEL;
+		rw = ITER_DEST;
+	} else {
+		if (!file->f_op->write_iter)
+			return ERR_PTR(-EOPNOTSUPP);
+		gfp_mask = GFP_NOIO;
+		rw = ITER_SOURCE;
+	}
+
+	iocb = nfs_local_iocb_alloc(hdr, file, gfp_mask);
+	if (iocb == NULL)
+		return ERR_PTR(-ENOMEM);
+	iocb->hdr = hdr;
+	iocb->localio = *localio;
+
+	status = nfs_local_iter_init(&iocb->iter, iocb, rw);
+	if (status == -ENOSYS) {
+		/* close nfsd_file and clear localio,
+		 * this informs callers that IO should
+		 * be serviced remotely.
+		 */
+		nfs_local_iocb_release(iocb);
+		*localio = NULL;
+		return ERR_PTR(status);
+	}
+	WARN_ON_ONCE(status != 0);
+
+	return iocb;
+}
+
+int nfs_local_doio(struct nfs_client *clp, struct nfsd_file **localio,
 		   struct nfs_pgio_header *hdr,
 		   const struct rpc_call_ops *call_ops)
 {
+	struct nfs_local_kiocb *iocb;
 	int status = 0;
 
 	if (!hdr->args.count)
 		return 0;
 
+	iocb = nfs_local_iocb_init(hdr, localio);
+	if (IS_ERR(iocb))
+		return PTR_ERR(iocb);
+
 	switch (hdr->rw_mode) {
 	case FMODE_READ:
-		status = nfs_do_local_read(hdr, localio, call_ops);
+		status = nfs_local_do_read(iocb, call_ops);
 		break;
 	case FMODE_WRITE:
-		status = nfs_do_local_write(hdr, localio, call_ops);
+		status = nfs_local_do_write(iocb, call_ops);
 		break;
 	default:
 		dprintk("%s: invalid mode: %d\n", __func__,
 			hdr->rw_mode);
-		status = -EINVAL;
+		status = -EOPNOTSUPP;
 	}
 
 	if (status != 0) {
 		if (status == -EAGAIN)
 			nfs_localio_disable_client(clp);
-		nfs_local_file_put(localio);
+		nfs_local_iocb_release(iocb);
 		hdr->task.tk_status = status;
 		nfs_local_hdr_release(hdr, call_ops);
 	}
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 11968dcb72431..9ddff27e96e9f 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -762,9 +762,17 @@ int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
 		hdr->args.count,
 		(unsigned long long)hdr->args.offset);
 
-	if (localio)
-		return nfs_local_doio(NFS_SERVER(hdr->inode)->nfs_client,
-				      localio, hdr, call_ops);
+	if (localio) {
+		int status = nfs_local_doio(NFS_SERVER(hdr->inode)->nfs_client,
+					    &localio, hdr, call_ops);
+		/* nfs_local_doio() will clear localio and return -ENOSYS if
+		 * it is prudent to immediately service this IO remotely.
+		 */
+		if (status != -ENOSYS)
+			return status;
+		WARN_ON_ONCE(localio != NULL);
+		/* fallthrough */
+	}
 
 	task = rpc_run_task(&task_setup_data);
 	if (IS_ERR(task))
@@ -959,7 +967,6 @@ static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
 	ret = nfs_generic_pgio(desc, hdr);
 	if (ret == 0) {
 		struct nfs_client *clp = NFS_SERVER(hdr->inode)->nfs_client;
-
 		struct nfsd_file *localio =
 			nfs_local_open_fh(clp, hdr->cred, hdr->args.fh,
 					  &hdr->args.context->nfl,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 09/11] nfs/direct: add misaligned READ handling
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (7 preceding siblings ...)
  2025-08-05 23:21 ` [PATCH v7 08/11] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 10/11] nfs/direct: add misaligned WRITE handling Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 11/11] NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

Because the NFS client will already happily handle misaligned O_DIRECT
IO (by sending it out to NFSD via RPC) this commit's new capabilities
are for the benefit of LOCALIO and require the nfs modparam:
  localio_O_DIRECT_align_misaligned_IO=Y

When enabled, misaligned READ IO is expanded to consist of a
DIO-aligned extent followed by a single misaligned tail page (due to
it being a partial page).

Also add nfs_analyze_dio_class and use it to create
nfs_analyze_read_dio and nfs_analyze_write_dio trace events.

The nfs_analyze_read_dio trace event's start and middle reflect the
combined DIO-aligned extent that is issued using O_DIRECT, and the end
reflects the misaligned tail page that is issued with buffered IO.

This combination of trace events is useful for LOCALIO READs:

  echo 1 > /sys/kernel/tracing/events/nfs/nfs_analyze_read_dio/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_read/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_readpage_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable

Which for this dd command:

  dd if=/mnt/share1/test of=/dev/null bs=47008 count=2 iflag=direct

Results in:
              dd-34643   [041] ..... 36754.259543: nfs_analyze_read_dio: fileid=00:2e:1739 fhandle=0x9b43f225 offset=0 count=47008 start=0+0 middle=0+45056 end=45056+1952
              dd-34643   [041] ..... 36754.259557: nfs_initiate_read: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=0 count=45056
  kworker/u194:3-33963   [042] ..... 36754.259559: xfs_file_direct_read: dev 259:16 ino 0x3e00008f disize 0x16f40 pos 0x0 bytecount 0xb000
              dd-34643   [041] ..... 36754.259560: nfs_initiate_read: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=45056 count=1952
  kworker/u194:8-33926   [035] ..... 36754.259563: nfs_readpage_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=45056 count=1952 res=1952
  kworker/u194:8-33926   [035] ..... 36754.259579: nfs_readpage_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=0 count=45056 res=45056

              dd-34643   [041] ..... 36754.259584: nfs_analyze_read_dio: fileid=00:2e:1739 fhandle=0x9b43f225 offset=47008 count=47008 start=45056+1952 middle=47008+43104 end=90112+3904
              dd-34643   [041] ..... 36754.259586: nfs_initiate_read: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=45056 count=45056
  kworker/u194:8-33926   [035] ..... 36754.259588: xfs_file_direct_read: dev 259:16 ino 0x3e00008f disize 0x16f40 pos 0xb000 bytecount 0xb000
              dd-34643   [041] ..... 36754.259588: nfs_initiate_read: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=90112 count=3904
  kworker/u194:3-33963   [042] ..... 36754.259591: nfs_readpage_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=90112 count=3904 res=3904 eof
  kworker/u194:3-33963   [042] ..... 36754.259606: nfs_readpage_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=45056 count=45056 res=45056

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/direct.c          | 176 ++++++++++++++++++++++++++++++++++++---
 fs/nfs/internal.h        |   8 ++
 fs/nfs/nfstrace.h        |  58 +++++++++++++
 fs/nfs/pagelist.c        |   7 ++
 include/linux/nfs_page.h |   1 +
 5 files changed, 239 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 48d89716193a7..118782070c5e8 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -210,6 +210,13 @@ static void nfs_direct_req_free(struct kref *kref)
 		nfs_put_lock_context(dreq->l_ctx);
 	if (dreq->ctx != NULL)
 		put_nfs_open_context(dreq->ctx);
+
+	if (dreq->start_extra_bvec != NULL) {
+		if (dreq->start_extra_bvec->bv_page != NULL)
+			__free_page(dreq->start_extra_bvec->bv_page);
+		kfree(dreq->start_extra_bvec);
+	}
+
 	kmem_cache_free(nfs_direct_cachep, dreq);
 }
 
@@ -264,6 +271,10 @@ static void nfs_direct_complete(struct nfs_direct_req *dreq)
 		if (dreq->count != 0) {
 			res = (long) dreq->count;
 			WARN_ON_ONCE(dreq->count < 0);
+			/* Reduce res by front_pad */
+			if ((dreq->start_extra_bvec != NULL) &&
+			    res >= dreq->start_extra_bvec->bv_len)
+				res -= dreq->start_extra_bvec->bv_len;
 		}
 		dreq->iocb->ki_complete(dreq->iocb, res);
 	}
@@ -285,6 +296,15 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr)
 	}
 
 	nfs_direct_count_bytes(dreq, hdr);
+
+	if (dreq->start_extra_bvec != NULL && (dreq->count == dreq->max_count)) {
+		unsigned front_pad = dreq->start_extra_bvec->bv_len;
+
+		hdr->res.count -= front_pad;
+		hdr->good_bytes -= front_pad;
+		hdr->args.count -= front_pad;
+		hdr->args.offset += front_pad;
+	}
 	spin_unlock(&dreq->lock);
 
 	nfs_update_delegated_atime(dreq->inode);
@@ -353,6 +373,30 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 	desc.pg_dreq = dreq;
 	inode_dio_begin(inode);
 
+	if (dreq->start_extra_bvec != NULL) {
+		struct nfs_page *req;
+		size_t pgbase = dreq->start_extra_bvec->bv_offset;
+		unsigned int front_pad = dreq->start_extra_bvec->bv_len;
+
+		/* Must force start pos to DIO-aligned start */
+		WARN_ON(pos != dreq->io_start);
+		req = nfs_page_create_from_page(dreq->ctx,
+						dreq->start_extra_bvec->bv_page,
+						pgbase, pos, front_pad);
+		if (IS_ERR(req)) {
+			result = PTR_ERR(req);
+			goto out;
+		}
+		if (!nfs_pageio_add_request(&desc, req)) {
+			result = desc.pg_error;
+			nfs_release_request(req);
+			goto out;
+		}
+
+		requested_bytes += front_pad;
+		pos += front_pad;
+	}
+
 	while (iov_iter_count(iter)) {
 		struct page **pagevec;
 		size_t bytes;
@@ -363,12 +407,19 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
-	
-		bytes = result;
-		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+
+		/* Limit the first batch of pages to DIO-aligned boundary? */
+		if (pos < dreq->end_offset && dreq->middle_len)
+			bytes = min_t(size_t, dreq->middle_len, result);
+		else
+			bytes = result;
+		npages = (bytes + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+
 		for (i = 0; i < npages; i++) {
 			struct nfs_page *req;
 			unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
+			bool issue_dio_now = false;
+
 			/* XXX do we need to do the eof zeroing found in async_filler? */
 			req = nfs_page_create_from_page(dreq->ctx, pagevec[i],
 							pgbase, pos, req_len);
@@ -376,15 +427,33 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 				result = PTR_ERR(req);
 				break;
 			}
+
+			pgbase = 0;
+			result -= req_len;
+			bytes -= req_len;
+			requested_bytes += req_len;
+			pos += req_len;
+
+			/* Looking ahead, is this req the end of the DIO-aligned middle? */
+			if (bytes == 0 && dreq->end_len &&
+			    pos == dreq->end_offset && result == dreq->end_len) {
+				desc.pg_doio_now = 1;
+				issue_dio_now = true;
+				/* Reset iter to the last page (known misaligned),
+				 * issue previous DIO-aligned page and then handle
+				 * the last partial page stored in iter
+				 */
+				iov_iter_revert(iter, result);
+			}
+
 			if (!nfs_pageio_add_request(&desc, req)) {
 				result = desc.pg_error;
 				nfs_release_request(req);
 				break;
 			}
-			pgbase = 0;
-			bytes -= req_len;
-			requested_bytes += req_len;
-			pos += req_len;
+
+			if (issue_dio_now)
+				break;
 		}
 		nfs_direct_release_pages(pagevec, npages);
 		kvfree(pagevec);
@@ -398,6 +467,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 	 * If no bytes were started, return the error, and let the
 	 * generic layer handle the completion.
 	 */
+out:
 	if (requested_bytes == 0) {
 		inode_dio_end(inode);
 		nfs_direct_req_release(dreq);
@@ -409,6 +479,68 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 	return requested_bytes;
 }
 
+/*
+ * If localio_O_DIRECT_align_misaligned_READ enabled, expand any
+ * misaligned READ to include the previous DIO-aligned block.
+ * - FIXME: expanding the end to also be DIO-aligned requires a
+ *   bounce page that must be copied to original partial end page.
+ */
+static bool nfs_analyze_read_dio(loff_t offset, size_t len,
+				 struct nfs_direct_req *dreq)
+{
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	/* Hardcoded to PAGE_SIZE (since don't have LOCALIO nfsd_file's
+	 * dio_alignment), works for smaller alignment too (e.g. 512b).
+	 */
+	u32 dio_blocksize = PAGE_SIZE;
+	loff_t start, front_pad, orig_end, middle_end;
+
+	/* Return early if feature disabled, if IO is irreparably
+	 * misaligned (len < PAGE_SIZE) or if IO is already DIO-aligned.
+	 */
+	if (!nfs_localio_O_DIRECT_align_misaligned_IO() ||
+	    unlikely(len < dio_blocksize) ||
+	    (((offset | len) & (dio_blocksize-1)) == 0))
+		return false;
+
+	start = round_down(offset, dio_blocksize);
+	front_pad = offset - start;
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	if (front_pad) {
+		gfp_t gfp_mask = nfs_io_gfp_mask();
+
+		dreq->start_extra_bvec = kmalloc(sizeof(struct bio_vec), gfp_mask);
+		if (dreq->start_extra_bvec == NULL)
+			return false;
+		dreq->start_extra_bvec->bv_page = alloc_page(gfp_mask);
+		if (dreq->start_extra_bvec->bv_page == NULL) {
+			kfree(dreq->start_extra_bvec);
+			dreq->start_extra_bvec = NULL;
+			return false;
+		}
+		bvec_set_page(dreq->start_extra_bvec,
+			      dreq->start_extra_bvec->bv_page,
+			      front_pad, PAGE_SIZE - front_pad);
+	}
+
+	dreq->io_start = start;
+	dreq->start_len = front_pad;
+	dreq->middle_offset = offset;
+	dreq->middle_len = middle_end - offset;
+	dreq->end_offset = middle_end;
+	dreq->end_len = orig_end - middle_end;
+
+	dreq->max_count = orig_end - start;
+
+	trace_nfs_analyze_read_dio(offset, len, dreq);
+	return true;
+#else
+	return false;
+#endif
+}
+
 /**
  * nfs_file_direct_read - file direct read operation for NFS files
  * @iocb: target I/O control block
@@ -439,6 +571,9 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
 	struct nfs_lock_context *l_ctx;
 	ssize_t result, requested;
 	size_t count = iov_iter_count(iter);
+	size_t in_count = count;
+	unsigned int front_pad = 0;
+
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);
 
 	dfprintk(FILE, "NFS: direct read(%pD2, %zd@%Ld)\n",
@@ -456,8 +591,19 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
 		goto out;
 
 	dreq->inode = inode;
-	dreq->max_count = count;
-	dreq->io_start = iocb->ki_pos;
+	if (!swap && nfs_analyze_read_dio(iocb->ki_pos, count, dreq)) {
+		/* note that dreq values do include front_pad
+		 * (dreq->io_start -> dreq->start_extra_bvec->bv_offset)
+		 */
+		iocb->ki_pos = dreq->io_start;
+		count = dreq->max_count;
+		if (dreq->start_extra_bvec)
+			front_pad = dreq->start_extra_bvec->bv_len;
+	} else {
+		dreq->io_start = iocb->ki_pos;
+		dreq->max_count = count;
+	}
+
 	dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
 	l_ctx = nfs_get_lock_context(dreq->ctx);
 	if (IS_ERR(l_ctx)) {
@@ -483,16 +629,24 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
 		}
 	}
 
-	NFS_I(inode)->read_io += count;
+	NFS_I(inode)->read_io += in_count;
 	requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
 
 	if (!swap)
 		nfs_end_io_direct(inode);
 
 	if (requested > 0) {
+		if (front_pad) {
+			/* given the iov_iter_revert below, must exclude the
+			 * front_pad (dreq->start_extra_bvec) from requested,
+			 */
+			requested -= front_pad;
+		}
+
 		result = nfs_direct_wait(dreq);
 		if (result > 0) {
-			requested -= result;
+			if (front_pad && result >= front_pad)
+				result -= front_pad;
 			iocb->ki_pos += result;
 		}
 		iov_iter_revert(iter, requested);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f1015413b85cf..9e1991026a338 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -992,4 +992,12 @@ struct nfs_direct_req {
 	/* for read */
 #define NFS_ODIRECT_SHOULD_DIRTY	(3)	/* dirty user-space page after read */
 #define NFS_ODIRECT_DONE		INT_MAX	/* write verification failed */
+
+	/* State for expanding/splitting misaligned IO to be DIO-aligned (for LOCALIO) */
+	struct bio_vec *        start_extra_bvec;
+	loff_t			middle_offset;
+	loff_t			end_offset;
+	ssize_t			start_len;
+	ssize_t			middle_len;
+	ssize_t			end_len;
 };
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 4ec66d5e9cc6c..ec4c0f073361a 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1598,6 +1598,64 @@ DEFINE_NFS_DIRECT_REQ_EVENT(nfs_direct_write_completion);
 DEFINE_NFS_DIRECT_REQ_EVENT(nfs_direct_write_schedule_iovec);
 DEFINE_NFS_DIRECT_REQ_EVENT(nfs_direct_write_reschedule_io);
 
+DECLARE_EVENT_CLASS(nfs_analyze_dio_class,
+	TP_PROTO(
+		loff_t offset,
+		ssize_t count,
+		const struct nfs_direct_req *dreq
+	),
+	TP_ARGS(offset, count, dreq),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, fileid)
+		__field(u32, fhandle)
+		__field(loff_t, offset)
+		__field(ssize_t, count)
+		__field(loff_t, start)
+		__field(ssize_t, start_len)
+		__field(loff_t, middle)
+		__field(ssize_t, middle_len)
+		__field(loff_t, end)
+		__field(ssize_t, end_len)
+	),
+	TP_fast_assign(
+		const struct inode *inode = dreq->inode;
+		const struct nfs_inode *nfsi = NFS_I(inode);
+		const struct nfs_fh *fh = &nfsi->fh;
+
+		__entry->dev = inode->i_sb->s_dev;
+		__entry->fileid = nfsi->fileid;
+		__entry->fhandle = nfs_fhandle_hash(fh);
+		__entry->offset = offset;
+		__entry->count = count;
+		__entry->start = dreq->io_start;
+		__entry->start_len = dreq->start_len;
+		__entry->middle = dreq->middle_offset;
+		__entry->middle_len = dreq->middle_len;
+		__entry->end = dreq->end_offset;
+		__entry->end_len = dreq->end_len;
+	),
+	TP_printk("fileid=%02x:%02x:%llu fhandle=0x%08x "
+		  "offset=%lld count=%zd "
+		  "start=%llu+%lu middle=%llu+%lu end=%llu+%lu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long long)__entry->fileid,
+		  __entry->fhandle, __entry->offset, __entry->count,
+		  __entry->start, __entry->start_len,
+		  __entry->middle, __entry->middle_len,
+		  __entry->end, __entry->end_len)
+)
+
+#define DEFINE_NFS_ANALYZE_DIO_EVENT(name)			\
+DEFINE_EVENT(nfs_analyze_dio_class, nfs_analyze_##name##_dio,	\
+	TP_PROTO(loff_t offset,					\
+		 ssize_t count,					\
+		 const struct nfs_direct_req *dreq),		\
+	TP_ARGS(offset, count, dreq))
+
+DEFINE_NFS_ANALYZE_DIO_EVENT(read);
+DEFINE_NFS_ANALYZE_DIO_EVENT(write);
+
 TRACE_EVENT(nfs_fh_to_dentry,
 		TP_PROTO(
 			const struct super_block *sb,
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 9ddff27e96e9f..8d877360042d4 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -832,6 +832,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
 		     int io_flags)
 {
 	desc->pg_moreio = 0;
+	desc->pg_doio_now = 0;
 	desc->pg_inode = inode;
 	desc->pg_ops = pg_ops;
 	desc->pg_completion_ops = compl_ops;
@@ -1141,6 +1142,8 @@ nfs_pageio_do_add_request(struct nfs_pageio_descriptor *desc,
 		return size;
 	nfs_list_move_request(req, &mirror->pg_list);
 	mirror->pg_count += req->wb_bytes;
+	if (desc->pg_doio_now)
+		return 0; /* trigger nfs_pageio_doio() in caller */
 	return req->wb_bytes;
 }
 
@@ -1220,6 +1223,10 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
 			nfs_pageio_doio(desc);
 			if (desc->pg_error < 0 || mirror->pg_recoalesce)
 				return 0;
+			if (desc->pg_doio_now) {
+				desc->pg_doio_now = 0;
+				return 1;
+			}
 			/* retry add_request for this subreq */
 			nfs_page_group_lock(req);
 			continue;
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 169b4ae30ff47..2e88dc2ff3fea 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -117,6 +117,7 @@ struct nfs_pageio_descriptor {
 	u32			pg_mirror_idx;	/* current mirror */
 	unsigned short		pg_maxretrans;
 	unsigned char		pg_moreio : 1;
+	unsigned char		pg_doio_now : 1;
 };
 
 /* arbitrarily selected limit to number of mirrors */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 10/11] nfs/direct: add misaligned WRITE handling
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (8 preceding siblings ...)
  2025-08-05 23:21 ` [PATCH v7 09/11] nfs/direct: add misaligned READ handling Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  2025-08-05 23:21 ` [PATCH v7 11/11] NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

Because the NFS client will already happily handle misaligned O_DIRECT
IO (by sending it out to NFSD via RPC) this commit's new capabilities
are for the benefit of LOCALIO and require the nfs modparam:
  localio_O_DIRECT_align_misaligned_IO=Y

When enabled, misaligned WRITE IO is split into a start, middle and
end as needed. The large middle extent is DIO-aligned and the start
and/or end are misaligned (due to each being a partial page).

Like the READ support that came before this WRITE support, the
nfs_analyze_write_dio trace event shows how the NFS client splits a
given misaligned IO into a mix of misaligned page(s) and a DIO-aligned
extent.

This combination of trace events is useful for LOCALIO WRITEs:

  echo 1 > /sys/kernel/tracing/events/nfs/nfs_analyze_write_dio/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_initiate_write/enable
  echo 1 > /sys/kernel/tracing/events/nfs/nfs_writeback_done/enable
  echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

Which for this dd command:

  dd if=/dev/zero of=/mnt/share1/test bs=47008 count=2 oflag=direct

Results in:
              dd-34927   [042] ..... 39070.896951: nfs_analyze_write_dio: fileid=00:2e:1739 fhandle=0x9b43f225 offset=0 count=47008 start=0+0 middle=0+45056 end=45056+1952
              dd-34927   [042] ..... 39070.896958: nfs_initiate_write: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=0 count=45056 stable=UNSTABLE
              dd-34927   [042] ..... 39070.896961: nfs_initiate_write: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=45056 count=1952 stable=UNSTABLE
  kworker/u194:3-33963   [029] ..... 39070.896963: xfs_file_direct_write: dev 259:16 ino 0x3e00008f disize 0x0 pos 0x0 bytecount 0xb000
  kworker/u194:3-33963   [029] ..... 39070.896999: nfs_writeback_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=0 count=45056 res=45056 stable=UNSTABLE verifier=73268c6866702410
  kworker/u194:0-34670   [041] ..... 39070.897005: nfs_writeback_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=45056 count=1952 res=1952 stable=UNSTABLE verifier=73268c6866702410

              dd-34927   [042] ..... 39070.897083: nfs_analyze_write_dio: fileid=00:2e:1739 fhandle=0x9b43f225 offset=47008 count=47008 start=47008+2144 middle=49152+40960 end=90112+3904
              dd-34927   [042] ..... 39070.897093: nfs_initiate_write: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=47008 count=2144 stable=UNSTABLE
              dd-34927   [042] ..... 39070.897096: nfs_initiate_write: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=49152 count=40960 stable=UNSTABLE
              dd-34927   [042] ..... 39070.897098: nfs_initiate_write: fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=90112 count=3904 stable=UNSTABLE
  kworker/u194:0-34670   [041] ..... 39070.897098: nfs_writeback_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=47008 count=2144 res=2144 stable=UNSTABLE verifier=73268c6866702410
  kworker/u194:3-33963   [029] ..... 39070.897099: xfs_file_direct_write: dev 259:16 ino 0x3e00008f disize 0xb7a0 pos 0xc000 bytecount 0xa000
  kworker/u194:3-33963   [029] ..... 39070.897120: nfs_writeback_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=49152 count=40960 res=40960 stable=UNSTABLE verifier=73268c6866702410
  kworker/u194:0-34670   [041] ..... 39070.897124: nfs_writeback_done: error=0 fileid=00:2e:1739 fhandle=0x4d34e6c1 offset=90112 count=3904 res=3904 stable=UNSTABLE verifier=73268c6866702410

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/direct.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 77 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 118782070c5e8..96999184453fb 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -1046,11 +1046,19 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		if (result < 0)
 			break;
 
-		bytes = result;
-		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+		/* Limit the amount of bytes serviced each iteration to aligned batches */
+		if (pos < dreq->middle_offset && dreq->start_len)
+			bytes = min_t(size_t, dreq->start_len, result);
+		else if (pos < dreq->end_offset && dreq->middle_len)
+			bytes = min_t(size_t, dreq->middle_len, result);
+		else
+			bytes = result;
+		npages = (bytes + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
+
 		for (i = 0; i < npages; i++) {
 			struct nfs_page *req;
 			unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
+			bool issue_dio_now = false;
 
 			req = nfs_page_create_from_page(dreq->ctx, pagevec[i],
 							pgbase, pos, req_len);
@@ -1066,6 +1074,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			}
 
 			pgbase = 0;
+			result -= req_len;
 			bytes -= req_len;
 			requested_bytes += req_len;
 			pos += req_len;
@@ -1075,9 +1084,27 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 				continue;
 			}
 
+			/* Looking ahead, is this req the end of the start or middle? */
+			if (bytes == 0) {
+				if ((dreq->start_len &&
+				     pos == dreq->middle_offset && result >= dreq->middle_len) ||
+				    (dreq->end_len &&
+				     pos == dreq->end_offset && result == dreq->end_len)) {
+					desc.pg_doio_now = 1;
+					issue_dio_now = true;
+					/* Reset iter to the last boundary, isse the current
+					 * req and then handle iter to next boundary or end.
+					 */
+					iov_iter_revert(iter, result);
+				}
+			}
+
 			nfs_lock_request(req);
-			if (nfs_pageio_add_request(&desc, req))
+			if (nfs_pageio_add_request(&desc, req)) {
+				if (issue_dio_now)
+					break;
 				continue;
+			}
 
 			/* Exit on hard errors */
 			if (desc.pg_error < 0 && desc.pg_error != -EAGAIN) {
@@ -1118,6 +1145,48 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	return requested_bytes;
 }
 
+/*
+ * If localio_O_DIRECT_align_misaligned_WRITE enabled, split misaligned
+ * WRITE to a DIO-aligned middle and misaligned head and/or tail.
+ */
+static bool nfs_analyze_write_dio(loff_t offset, ssize_t len,
+				  struct nfs_direct_req *dreq)
+{
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	/* Hardcoded to PAGE_SIZE (since don't have LOCALIO nfsd_file's
+	 * dio_alignment), works for smaller alignment too (e.g. 512b).
+	 */
+	u32 dio_blocksize = PAGE_SIZE;
+	loff_t start_end, orig_end, middle_end;
+
+	/* Return early if feature disabled, if IO is irreparably
+	 * misaligned (len < PAGE_SIZE) or if IO is already DIO-aligned.
+	 */
+	if (!nfs_localio_O_DIRECT_align_misaligned_IO() ||
+	    unlikely(len < dio_blocksize) ||
+	    (((offset | len) & (dio_blocksize-1)) == 0))
+		return false;
+
+	start_end = round_up(offset, dio_blocksize);
+	orig_end = offset + len;
+	middle_end = round_down(orig_end, dio_blocksize);
+
+	dreq->io_start = offset;
+	dreq->max_count = orig_end - offset;
+
+	dreq->start_len = start_end - offset;
+	dreq->middle_offset = start_end;
+	dreq->middle_len = middle_end - start_end;
+	dreq->end_offset = middle_end;
+	dreq->end_len = orig_end - middle_end;
+
+	trace_nfs_analyze_write_dio(offset, len, dreq);
+	return true;
+#else
+	return false;
+#endif
+}
+
 /**
  * nfs_file_direct_write - file direct write operation for NFS files
  * @iocb: target I/O control block
@@ -1175,8 +1244,11 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 		goto out;
 
 	dreq->inode = inode;
-	dreq->max_count = count;
-	dreq->io_start = pos;
+	if (swap || !nfs_analyze_write_dio(pos, count, dreq)) {
+		dreq->max_count = count;
+		dreq->io_start = pos;
+	}
+
 	dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
 	l_ctx = nfs_get_lock_context(dreq->ctx);
 	if (IS_ERR(l_ctx)) {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 11/11] NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
                   ` (9 preceding siblings ...)
  2025-08-05 23:21 ` [PATCH v7 10/11] nfs/direct: add misaligned WRITE handling Mike Snitzer
@ 2025-08-05 23:21 ` Mike Snitzer
  10 siblings, 0 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-08-05 23:21 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

From: Mike Snitzer <snitzer@hammerspace.com>

NFS doesn't have DIO alignment constraints, so have NFS respond with
accommodating DIO alignment attributes (rather than plumb in GETATTR
support for STATX_DIOALIGN and STATX_DIO_READ_ALIGN).

The most coarse-grained dio_offset_align is the most accommodating
(e.g. PAGE_SIZE, in future larger may be supported).

Now that NFS has support, NFS reexport will now handle unaligned DIO
(NFSD's NFSD_IO_DIRECT support requires the underlying filesystem
support STATX_DIOALIGN and/or STATX_DIO_READ_ALIGN).

Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
 fs/nfs/inode.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 338ef77ae4230..7866d60b18452 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1066,6 +1066,21 @@ int nfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 	if (S_ISDIR(inode->i_mode))
 		stat->blksize = NFS_SERVER(inode)->dtsize;
 	stat->btime = NFS_I(inode)->btime;
+
+	/* Special handling for STATX_DIOALIGN and STATX_DIO_READ_ALIGN
+	 * - NFS doesn't have DIO alignment constraints, avoid getting
+	 *   these DIO attrs from remote and just respond with most
+	 *   accommodating limits (so client will issue supported DIO).
+	 * - this is unintuitive, but the most coarse-grained
+	 *   dio_offset_align is the most accommodating.
+	 */
+	if ((request_mask & (STATX_DIOALIGN | STATX_DIO_READ_ALIGN)) &&
+	    S_ISREG(inode->i_mode)) {
+		stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;
+		stat->dio_mem_align = 4; /* 4-byte alignment */
+		stat->dio_offset_align = PAGE_SIZE;
+		stat->dio_read_offset_align = stat->dio_offset_align;
+	}
 out:
 	trace_nfs_getattr_exit(inode, err);
 	return err;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-05 23:21 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-05 23:20 [PATCH v7 00/11] NFS DIRECT: align misaligned DIO for LOCALIO Mike Snitzer
2025-08-05 23:20 ` [PATCH v7 01/11] NFS/localio: nfs_close_local_fh() fix check for file closed Mike Snitzer
2025-08-05 23:20 ` [PATCH v7 02/11] NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh() Mike Snitzer
2025-08-05 23:20 ` [PATCH v7 03/11] NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file Mike Snitzer
2025-08-05 23:20 ` [PATCH v7 04/11] nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local() Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 05/11] nfs/localio: make trace_nfs_local_open_fh more useful Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 06/11] nfs/localio: add nfsd_file_dio_alignment Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 07/11] nfs/localio: refactor iocb initialization Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 08/11] nfs/localio: fallback to NFSD for misaligned O_DIRECT READs Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 09/11] nfs/direct: add misaligned READ handling Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 10/11] nfs/direct: add misaligned WRITE handling Mike Snitzer
2025-08-05 23:21 ` [PATCH v7 11/11] NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).