public inbox for netdev@vger.kernel.org
* [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support
@ 2026-04-01 19:44 Pranjal Shrivastava
  2026-04-01 19:44 ` [RFC PATCH 1/4] sunrpc: add supports_p2pdma to rpc_xprt_ops Pranjal Shrivastava
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Pranjal Shrivastava @ 2026-04-01 19:44 UTC (permalink / raw)
  To: trond.myklebust, anna
  Cc: davem, kuba, edumazet, pabeni, chuck.lever, jlayton, tom,
	okorniev, neil, dai.ngo, linux-nfs, netdev, Pranjal Shrivastava

As high-performance storage environments increasingly rely on direct
data movement between PCIe endpoints (e.g., moving data directly between
an NVMe Controller Memory Buffer and a Network Interface), support for
Peer-to-Peer DMA (P2PDMA) in the network filesystem layer becomes
essential. This series introduces P2PDMA support for the NFS Direct I/O
path.

Currently, NFS O_DIRECT operations fail with -EREMOTEIO if the user
buffer resides in PCIe BAR memory. This is primarily due to the use of
the legacy `iov_iter_get_pages_alloc2()` API, which cannot pass the
required `FOLL_PCI_P2PDMA` flag, and to a request lifecycle that is
unaware of the pinning requirements for P2P memory.

Design
======
The proposed design centers around making the NFS request lifecycle
"pin-aware" and upgrading the infrastructure to support modern memory 
extraction APIs.

1. 64-bit Capability Infrastructure
The existing nfs_server->caps bitmask is limited to 32 bits and is
currently exhausted. This series expands the bitmask to 64 bits to
accommodate NFS_CAP_P2PDMA. Crucially, it also refactors the NFS_CAP_*
constants to use ULL definitions. This prevents a subtle 32-bit
truncation bug where bitwise negations (e.g., caps &= ~NFS_CAP_ACLS)
would accidentally clear the high bits of the 64-bit capability field.

2. Transport-Level Detection
P2PDMA support is a property of the local transport hardware. A new
supports_p2pdma operation is added to the SunRPC transport ops. For RDMA,
this is implemented by querying the underlying device via 
ib_dma_pci_p2p_dma_supported(). The NFS client queries this during mount
and sets the NFS_CAP_P2PDMA bit accordingly.

3. Pin-Aware Request Lifecycle
Standard NFS requests use get_page() and put_page() for memory 
management. However, memory extracted via iov_iter_extract_pages()
requires explicit pinning and unpinning (unpin_user_page()).

This series introduces a PG_PINNED flag in struct nfs_page. When set,
the request lifecycle skips the standard page referencing and calls
unpin_user_page() only once the I/O completes, ensuring that physical
memory remains pinned for the full duration of the DMA transfer.

4. API Migration
The Direct I/O path is migrated to the modern iov_iter_extract_pages()
API. The ITER_ALLOW_P2PDMA flag is passed to the iterator only when the
local mount has signaled P2P support via the capability bit. This ensures
that "normal" users on standard TCP/UDP transports see no change in
behavior or overhead.

Call for review
===============
Any insights on the proposed changes to the nfs_page lifecycle and the
64-bit capability expansion are appreciated. If this approach is deemed
incorrect, or if there is a more idiomatic way to achieve it, please
point me in the right direction.

Thanks,
Praan

Pranjal Shrivastava (4):
  sunrpc: add supports_p2pdma to rpc_xprt_ops
  nfs: add NFS_CAP_P2PDMA and detect transport support
  nfs: make nfs_page pin-aware
  nfs: allow P2PDMA in direct I/O path

 fs/nfs/client.c                 |  8 ++++
 fs/nfs/direct.c                 | 51 ++++++++++++++++++-------
 fs/nfs/nfs4_fs.h                |  2 +-
 fs/nfs/pagelist.c               | 18 ++++++---
 fs/nfs/super.c                  |  2 +-
 include/linux/nfs_fs_sb.h       | 67 +++++++++++++++++----------------
 include/linux/nfs_page.h        |  2 +
 include/linux/sunrpc/xprt.h     |  1 +
 net/sunrpc/xprtrdma/transport.c |  9 +++++
 9 files changed, 106 insertions(+), 54 deletions(-)

-- 
2.53.0.1185.g05d4b7b318-goog


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/4] sunrpc: add supports_p2pdma to rpc_xprt_ops
  2026-04-01 19:44 [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Pranjal Shrivastava
@ 2026-04-01 19:44 ` Pranjal Shrivastava
  2026-04-01 19:44 ` [RFC PATCH 2/4] nfs: add NFS_CAP_P2PDMA and detect transport support Pranjal Shrivastava
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Pranjal Shrivastava @ 2026-04-01 19:44 UTC (permalink / raw)
  To: trond.myklebust, anna
  Cc: davem, kuba, edumazet, pabeni, chuck.lever, jlayton, tom,
	okorniev, neil, dai.ngo, linux-nfs, netdev, Pranjal Shrivastava

Add a new transport op, supports_p2pdma, to allow upper layers (such as
NFS) to query whether the underlying RPC transport supports PCI
Peer-to-Peer DMA (P2PDMA).

The capability is hardware-dependent. For the RDMA transport, implement
it by querying the underlying InfiniBand device via
ib_dma_pci_p2p_dma_supported(), which ensures that P2PDMA is only
attempted when both the transport and the HCA driver support it.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 include/linux/sunrpc/xprt.h     | 1 +
 net/sunrpc/xprtrdma/transport.c | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index f46d1fb8f71a..e451acd2e047 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -187,6 +187,7 @@ struct rpc_xprt_ops {
 	void		(*bc_free_rqst)(struct rpc_rqst *rqst);
 	void		(*bc_destroy)(struct rpc_xprt *xprt,
 				      unsigned int max_reqs);
+	bool		(*supports_p2pdma)(struct rpc_xprt *xprt);
 };
 
 /*
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 9a8ce5df83ca..1c1714189a29 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -717,6 +717,14 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
 {
 }
 
+static bool
+xprt_rdma_supports_p2pdma(struct rpc_xprt *xprt)
+{
+	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+
+	return ib_dma_pci_p2p_dma_supported(r_xprt->rx_ep->re_id->device);
+}
+
 /*
  * Plumbing for rpc transport switch and kernel module
  */
@@ -742,6 +750,7 @@ static const struct rpc_xprt_ops xprt_rdma_procs = {
 	.enable_swap		= xprt_rdma_enable_swap,
 	.disable_swap		= xprt_rdma_disable_swap,
 	.inject_disconnect	= xprt_rdma_inject_disconnect,
+	.supports_p2pdma	= xprt_rdma_supports_p2pdma,
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
 	.bc_setup		= xprt_rdma_bc_setup,
 	.bc_maxpayload		= xprt_rdma_bc_maxpayload,
-- 
2.53.0.1185.g05d4b7b318-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/4] nfs: add NFS_CAP_P2PDMA and detect transport support
  2026-04-01 19:44 [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Pranjal Shrivastava
  2026-04-01 19:44 ` [RFC PATCH 1/4] sunrpc: add supports_p2pdma to rpc_xprt_ops Pranjal Shrivastava
@ 2026-04-01 19:44 ` Pranjal Shrivastava
  2026-04-02 13:11   ` Chuck Lever
  2026-04-01 19:44 ` [RFC PATCH 3/4] nfs: make nfs_page pin-aware Pranjal Shrivastava
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Pranjal Shrivastava @ 2026-04-01 19:44 UTC (permalink / raw)
  To: trond.myklebust, anna
  Cc: davem, kuba, edumazet, pabeni, chuck.lever, jlayton, tom,
	okorniev, neil, dai.ngo, linux-nfs, netdev, Pranjal Shrivastava

The NFS server capabilities bitmask (server->caps) is currently full,
utilizing all 32 bits of the existing unsigned int. Expand the bitmask
to 64 bits (u64) to allow for new feature flags.

Introduce a new capability bit, NFS_CAP_P2PDMA, to indicate that the
local mount is backed by hardware and a transport capable of PCI
Peer-to-Peer DMA.

Update nfs_server_set_init_caps() to query the underlying SunRPC
transport for P2PDMA support during the mount process. If the transport
(e.g., RDMA) signals support, set the NFS_CAP_P2PDMA bit in the mount's
capabilities. This allows the high-performance Direct I/O path to
efficiently determine if it should allow P2P memory buffers.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 fs/nfs/client.c           |  8 +++++
 fs/nfs/nfs4_fs.h          |  2 +-
 fs/nfs/super.c            |  2 +-
 include/linux/nfs_fs_sb.h | 67 ++++++++++++++++++++-------------------
 4 files changed, 44 insertions(+), 35 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index be02bb227741..f177cf098d44 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -712,6 +712,8 @@ static void nfs4_server_set_init_caps(struct nfs_server *server)
 
 void nfs_server_set_init_caps(struct nfs_server *server)
 {
+	struct rpc_xprt *xprt;
+
 	switch (server->nfs_client->rpc_ops->version) {
 	case 2:
 		server->caps = NFS_CAP_HARDLINKS | NFS_CAP_SYMLINKS;
@@ -725,6 +727,12 @@ void nfs_server_set_init_caps(struct nfs_server *server)
 		nfs4_server_set_init_caps(server);
 		break;
 	}
+
+	rcu_read_lock();
+	xprt = rcu_dereference(server->client->cl_xprt);
+	if (xprt->ops->supports_p2pdma && xprt->ops->supports_p2pdma(xprt))
+		server->caps |= NFS_CAP_P2PDMA;
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(nfs_server_set_init_caps);
 
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index b48e5b87cb2a..a309cc739fa3 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -60,7 +60,7 @@ enum nfs4_client_state {
 struct nfs_seqid_counter;
 struct nfs4_minor_version_ops {
 	u32	minor_version;
-	unsigned init_caps;
+	u64	init_caps;
 
 	int	(*init_client)(struct nfs_client *);
 	void	(*shutdown_client)(struct nfs_client *);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 7a318581f85b..b2de13a355df 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -672,7 +672,7 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root)
 	show_implementation_id(m, nfss);
 
 	seq_puts(m, "\n\tcaps:\t");
-	seq_printf(m, "caps=0x%x", nfss->caps);
+	seq_printf(m, "caps=0x%llx", nfss->caps);
 	seq_printf(m, ",wtmult=%u", nfss->wtmult);
 	seq_printf(m, ",dtsize=%u", nfss->dtsize);
 	seq_printf(m, ",bsize=%u", nfss->bsize);
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 4daee27fa5eb..e66818c7a0b8 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -175,7 +175,7 @@ struct nfs_server {
 #define NFS_AUTOMOUNT_INHERIT_RSIZE	0x0002
 #define NFS_AUTOMOUNT_INHERIT_WSIZE	0x0004
 
-	unsigned int		caps;		/* server capabilities */
+	__u64			caps;		/* server capabilities */
 	__u64			fattr_valid;	/* Valid attributes */
 	unsigned int		rsize;		/* read size */
 	unsigned int		rpages;		/* read size (in pages) */
@@ -299,36 +299,37 @@ struct nfs_server {
 };
 
 /* Server capabilities */
-#define NFS_CAP_READDIRPLUS	(1U << 0)
-#define NFS_CAP_HARDLINKS	(1U << 1)
-#define NFS_CAP_SYMLINKS	(1U << 2)
-#define NFS_CAP_ACLS		(1U << 3)
-#define NFS_CAP_ATOMIC_OPEN	(1U << 4)
-#define NFS_CAP_LGOPEN		(1U << 5)
-#define NFS_CAP_CASE_INSENSITIVE	(1U << 6)
-#define NFS_CAP_CASE_PRESERVING	(1U << 7)
-#define NFS_CAP_REBOOT_LAYOUTRETURN	(1U << 8)
-#define NFS_CAP_OFFLOAD_STATUS	(1U << 9)
-#define NFS_CAP_ZERO_RANGE	(1U << 10)
-#define NFS_CAP_DIR_DELEG	(1U << 11)
-#define NFS_CAP_OPEN_XOR	(1U << 12)
-#define NFS_CAP_DELEGTIME	(1U << 13)
-#define NFS_CAP_POSIX_LOCK	(1U << 14)
-#define NFS_CAP_UIDGID_NOMAP	(1U << 15)
-#define NFS_CAP_STATEID_NFSV41	(1U << 16)
-#define NFS_CAP_ATOMIC_OPEN_V1	(1U << 17)
-#define NFS_CAP_SECURITY_LABEL	(1U << 18)
-#define NFS_CAP_SEEK		(1U << 19)
-#define NFS_CAP_ALLOCATE	(1U << 20)
-#define NFS_CAP_DEALLOCATE	(1U << 21)
-#define NFS_CAP_LAYOUTSTATS	(1U << 22)
-#define NFS_CAP_CLONE		(1U << 23)
-#define NFS_CAP_COPY		(1U << 24)
-#define NFS_CAP_OFFLOAD_CANCEL	(1U << 25)
-#define NFS_CAP_LAYOUTERROR	(1U << 26)
-#define NFS_CAP_COPY_NOTIFY	(1U << 27)
-#define NFS_CAP_XATTR		(1U << 28)
-#define NFS_CAP_READ_PLUS	(1U << 29)
-#define NFS_CAP_FS_LOCATIONS	(1U << 30)
-#define NFS_CAP_MOVEABLE	(1U << 31)
+#define NFS_CAP_READDIRPLUS	(1ULL << 0)
+#define NFS_CAP_HARDLINKS	(1ULL << 1)
+#define NFS_CAP_SYMLINKS	(1ULL << 2)
+#define NFS_CAP_ACLS		(1ULL << 3)
+#define NFS_CAP_ATOMIC_OPEN	(1ULL << 4)
+#define NFS_CAP_LGOPEN		(1ULL << 5)
+#define NFS_CAP_CASE_INSENSITIVE	(1ULL << 6)
+#define NFS_CAP_CASE_PRESERVING	(1ULL << 7)
+#define NFS_CAP_REBOOT_LAYOUTRETURN	(1ULL << 8)
+#define NFS_CAP_OFFLOAD_STATUS	(1ULL << 9)
+#define NFS_CAP_ZERO_RANGE	(1ULL << 10)
+#define NFS_CAP_DIR_DELEG	(1ULL << 11)
+#define NFS_CAP_OPEN_XOR	(1ULL << 12)
+#define NFS_CAP_DELEGTIME	(1ULL << 13)
+#define NFS_CAP_POSIX_LOCK	(1ULL << 14)
+#define NFS_CAP_UIDGID_NOMAP	(1ULL << 15)
+#define NFS_CAP_STATEID_NFSV41	(1ULL << 16)
+#define NFS_CAP_ATOMIC_OPEN_V1	(1ULL << 17)
+#define NFS_CAP_SECURITY_LABEL	(1ULL << 18)
+#define NFS_CAP_SEEK		(1ULL << 19)
+#define NFS_CAP_ALLOCATE	(1ULL << 20)
+#define NFS_CAP_DEALLOCATE	(1ULL << 21)
+#define NFS_CAP_LAYOUTSTATS	(1ULL << 22)
+#define NFS_CAP_CLONE		(1ULL << 23)
+#define NFS_CAP_COPY		(1ULL << 24)
+#define NFS_CAP_OFFLOAD_CANCEL	(1ULL << 25)
+#define NFS_CAP_LAYOUTERROR	(1ULL << 26)
+#define NFS_CAP_COPY_NOTIFY	(1ULL << 27)
+#define NFS_CAP_XATTR		(1ULL << 28)
+#define NFS_CAP_READ_PLUS	(1ULL << 29)
+#define NFS_CAP_FS_LOCATIONS	(1ULL << 30)
+#define NFS_CAP_MOVEABLE	(1ULL << 31)
+#define NFS_CAP_P2PDMA		(1ULL << 32)
 #endif
-- 
2.53.0.1185.g05d4b7b318-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/4] nfs: make nfs_page pin-aware
  2026-04-01 19:44 [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Pranjal Shrivastava
  2026-04-01 19:44 ` [RFC PATCH 1/4] sunrpc: add supports_p2pdma to rpc_xprt_ops Pranjal Shrivastava
  2026-04-01 19:44 ` [RFC PATCH 2/4] nfs: add NFS_CAP_P2PDMA and detect transport support Pranjal Shrivastava
@ 2026-04-01 19:44 ` Pranjal Shrivastava
  2026-04-02  5:04   ` Christoph Hellwig
  2026-04-01 19:45 ` [RFC PATCH 4/4] nfs: allow P2PDMA in direct I/O path Pranjal Shrivastava
  2026-04-02  5:07 ` [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Christoph Hellwig
  4 siblings, 1 reply; 9+ messages in thread
From: Pranjal Shrivastava @ 2026-04-01 19:44 UTC (permalink / raw)
  To: trond.myklebust, anna
  Cc: davem, kuba, edumazet, pabeni, chuck.lever, jlayton, tom,
	okorniev, neil, dai.ngo, linux-nfs, netdev, Pranjal Shrivastava

The migration to iov_iter_extract_pages() for Direct I/O introduces
page pinning (GUP) instead of standard page referencing. To handle this
correctly, nfs_page must track whether it holds a pin or a standard
reference.

Add a new flag, PG_PINNED, to struct nfs_page. Update the creation
path (nfs_page_create_from_page) to accept a 'pinned' boolean and
set the flag accordingly. If the page is pinned, we skip the standard
get_page() as the pin itself acts as a reference.

Update nfs_clear_request() to use unpin_user_page() instead of
put_page() when the PG_PINNED flag is set. This ensures that memory
remains safely locked for DMA and that kernel page accounting stays
consistent. Ensure subrequests inherit the pin status from their
parent request.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 fs/nfs/direct.c          |  4 ++--
 fs/nfs/pagelist.c        | 18 +++++++++++++-----
 include/linux/nfs_page.h |  2 ++
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 48d89716193a..c8429b430181 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -370,7 +370,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			struct nfs_page *req;
 			unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
 			/* XXX do we need to do the eof zeroing found in async_filler? */
-			req = nfs_page_create_from_page(dreq->ctx, pagevec[i],
+			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], false,
 							pgbase, pos, req_len);
 			if (IS_ERR(req)) {
 				result = PTR_ERR(req);
@@ -898,7 +898,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			struct nfs_page *req;
 			unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
 
-			req = nfs_page_create_from_page(dreq->ctx, pagevec[i],
+			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], false,
 							pgbase, pos, req_len);
 			if (IS_ERR(req)) {
 				result = PTR_ERR(req);
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index a9373de891c9..72d3da0fb654 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -413,11 +413,14 @@ static void nfs_page_assign_folio(struct nfs_page *req, struct folio *folio)
 	}
 }
 
-static void nfs_page_assign_page(struct nfs_page *req, struct page *page)
+static void nfs_page_assign_page(struct nfs_page *req, struct page *page, bool pinned)
 {
 	if (page != NULL) {
 		req->wb_page = page;
-		get_page(page);
+		if (pinned)
+			set_bit(PG_PINNED, &req->wb_flags);
+		else
+			get_page(page);
 	}
 }
 
@@ -435,6 +438,7 @@ static void nfs_page_assign_page(struct nfs_page *req, struct page *page)
  */
 struct nfs_page *nfs_page_create_from_page(struct nfs_open_context *ctx,
 					   struct page *page,
+					   bool pinned,
 					   unsigned int pgbase, loff_t offset,
 					   unsigned int count)
 {
@@ -446,7 +450,7 @@ struct nfs_page *nfs_page_create_from_page(struct nfs_open_context *ctx,
 	ret = nfs_page_create(l_ctx, pgbase, offset >> PAGE_SHIFT,
 			      offset_in_page(offset), count);
 	if (!IS_ERR(ret)) {
-		nfs_page_assign_page(ret, page);
+		nfs_page_assign_page(ret, page, pinned);
 		nfs_page_group_init(ret, NULL);
 	}
 	nfs_put_lock_context(l_ctx);
@@ -500,7 +504,8 @@ nfs_create_subreq(struct nfs_page *req,
 		if (folio)
 			nfs_page_assign_folio(ret, folio);
 		else
-			nfs_page_assign_page(ret, page);
+			nfs_page_assign_page(ret, page,
+					     test_bit(PG_PINNED, &req->wb_flags));
 		/* find the last request */
 		for (last = req->wb_head;
 		     last->wb_this_page != req->wb_head;
@@ -556,7 +561,10 @@ static void nfs_clear_request(struct nfs_page *req)
 		req->wb_folio = NULL;
 		clear_bit(PG_FOLIO, &req->wb_flags);
 	} else if (page != NULL) {
-		put_page(page);
+		if (test_and_clear_bit(PG_PINNED, &req->wb_flags))
+			unpin_user_page(page);
+		else
+			put_page(page);
 		req->wb_page = NULL;
 	}
 	if (l_ctx != NULL) {
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index afe1d8f09d89..cae67540a2ae 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -37,6 +37,7 @@ enum {
 	PG_REMOVE,		/* page group sync bit in write path */
 	PG_CONTENDED1,		/* Is someone waiting for a lock? */
 	PG_CONTENDED2,		/* Is someone waiting for a lock? */
+	PG_PINNED,		/* page is pinned by GUP */
 };
 
 struct nfs_inode;
@@ -124,6 +125,7 @@ struct nfs_pageio_descriptor {
 
 extern struct nfs_page *nfs_page_create_from_page(struct nfs_open_context *ctx,
 						  struct page *page,
+						  bool pinned,
 						  unsigned int pgbase,
 						  loff_t offset,
 						  unsigned int count);
-- 
2.53.0.1185.g05d4b7b318-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/4] nfs: allow P2PDMA in direct I/O path
  2026-04-01 19:44 [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Pranjal Shrivastava
                   ` (2 preceding siblings ...)
  2026-04-01 19:44 ` [RFC PATCH 3/4] nfs: make nfs_page pin-aware Pranjal Shrivastava
@ 2026-04-01 19:45 ` Pranjal Shrivastava
  2026-04-02  5:05   ` Christoph Hellwig
  2026-04-02  5:07 ` [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Christoph Hellwig
  4 siblings, 1 reply; 9+ messages in thread
From: Pranjal Shrivastava @ 2026-04-01 19:45 UTC (permalink / raw)
  To: trond.myklebust, anna
  Cc: davem, kuba, edumazet, pabeni, chuck.lever, jlayton, tom,
	okorniev, neil, dai.ngo, linux-nfs, netdev, Pranjal Shrivastava

Migrate the NFS Direct I/O path from the legacy iov_iter_get_pages_alloc2()
API to the modern iov_iter_extract_pages() API. This migration enables
support for PCI Peer-to-Peer DMA (P2PDMA) by allowing the
ITER_ALLOW_P2PDMA flag to be set.

Pass ITER_ALLOW_P2PDMA to iov_iter_extract_pages() only if the local
mount indicates support via the NFS_CAP_P2PDMA capability bit (detected
at mount time for RDMA transports).

Fix a memory-safety bug in the Direct I/O loop where pages were being
unpinned immediately after request creation. Instead, leverage the
pin-aware nfs_page structures to hold the pins until the I/O completes.
The manual release in the loop is updated to clean up only those pages
that were never handed over to an nfs_page request.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 fs/nfs/direct.c | 51 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index c8429b430181..6916541af9db 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -165,11 +165,17 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	return 0;
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
+static void nfs_direct_release_pages(struct page **pages, unsigned int npages,
+				     bool pinned)
 {
 	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
+
+	if (pinned) {
+		unpin_user_pages(pages, npages);
+	} else {
+		for (i = 0; i < npages; i++)
+			put_page(pages[i]);
+	}
 }
 
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
@@ -354,23 +360,30 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 	inode_dio_begin(inode);
 
 	while (iov_iter_count(iter)) {
-		struct page **pagevec;
+		/* Tell extract pages to allocate the page array */
+		struct page **pagevec = NULL;
 		size_t bytes;
 		size_t pgbase;
 		unsigned npages, i;
+		bool pinned = iov_iter_extract_will_pin(iter);
+		iov_iter_extraction_t extraction_flags = 0;
+
+		if (NFS_SERVER(inode)->caps & NFS_CAP_P2PDMA)
+			extraction_flags |= ITER_ALLOW_P2PDMA;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
-						  rsize, &pgbase);
+		result = iov_iter_extract_pages(iter, &pagevec,
+						rsize, ~0U,
+						extraction_flags, &pgbase);
 		if (result < 0)
 			break;
-	
+
 		bytes = result;
 		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
 		for (i = 0; i < npages; i++) {
 			struct nfs_page *req;
 			unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
 			/* XXX do we need to do the eof zeroing found in async_filler? */
-			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], false,
+			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], pinned,
 							pgbase, pos, req_len);
 			if (IS_ERR(req)) {
 				result = PTR_ERR(req);
@@ -386,7 +399,8 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			requested_bytes += req_len;
 			pos += req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		if (i < npages)
+			nfs_direct_release_pages(pagevec + i, npages - i, pinned);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -882,13 +896,21 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 
 	NFS_I(inode)->write_io += iov_iter_count(iter);
 	while (iov_iter_count(iter)) {
-		struct page **pagevec;
+
+		/* Tell extract pages to allocate the page array */
+		struct page **pagevec = NULL;
 		size_t bytes;
 		size_t pgbase;
 		unsigned npages, i;
+		bool pinned = iov_iter_extract_will_pin(iter);
+		iov_iter_extraction_t extraction_flags = 0;
+
+		if (NFS_SERVER(inode)->caps & NFS_CAP_P2PDMA)
+			extraction_flags |= ITER_ALLOW_P2PDMA;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
-						  wsize, &pgbase);
+		result = iov_iter_extract_pages(iter, &pagevec,
+						wsize, ~0U,
+						extraction_flags, &pgbase);
 		if (result < 0)
 			break;
 
@@ -898,7 +920,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			struct nfs_page *req;
 			unsigned int req_len = min_t(size_t, bytes, PAGE_SIZE - pgbase);
 
-			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], false,
+			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], pinned,
 							pgbase, pos, req_len);
 			if (IS_ERR(req)) {
 				result = PTR_ERR(req);
@@ -942,7 +964,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			desc.pg_error = 0;
 			defer = true;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		if (i < npages)
+			nfs_direct_release_pages(pagevec + i, npages - i, pinned);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
-- 
2.53.0.1185.g05d4b7b318-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 3/4] nfs: make nfs_page pin-aware
  2026-04-01 19:44 ` [RFC PATCH 3/4] nfs: make nfs_page pin-aware Pranjal Shrivastava
@ 2026-04-02  5:04   ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-04-02  5:04 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: trond.myklebust, anna, davem, kuba, edumazet, pabeni, chuck.lever,
	jlayton, tom, okorniev, neil, dai.ngo, linux-nfs, netdev

This conversion really should go first as it is badly needed independent
of any P2P support.  And I wonder if it should go further - currently
the NFS I/O code is using folios for buffered I/O, but pages for direct
I/O, which makes larger I/O very inefficient.

The iov_iter_extract_bvecs wrapper allows to extract bvecs instead, which
might be a good choice here either by passing down the bvecs or
converting to an nfs_page inline.  Or just open coding a variant of
iov_iter_extract_bvecs that converts to nfs_page structures instead of
bvecs.  This would pair with a helper similar to __bio_release_pages on
the unlock side.

> +			req = nfs_page_create_from_page(dreq->ctx, pagevec[i], false,
>  							pgbase, pos, req_len);
>

A lot of this code reads pretty odd as it's overflowing the lines.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 4/4] nfs: allow P2PDMA in direct I/O path
  2026-04-01 19:45 ` [RFC PATCH 4/4] nfs: allow P2PDMA in direct I/O path Pranjal Shrivastava
@ 2026-04-02  5:05   ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-04-02  5:05 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: trond.myklebust, anna, davem, kuba, edumazet, pabeni, chuck.lever,
	jlayton, tom, okorniev, neil, dai.ngo, linux-nfs, netdev

On Wed, Apr 01, 2026 at 07:45:00PM +0000, Pranjal Shrivastava wrote:
> Migrate the NFS Direct I/O path from the legacy iov_iter_get_pages_alloc2()
> API to the modern iov_iter_extract_pages() API. This migration enables
> support for PCI Peer-to-Peer DMA (P2PDMA) by allowing the setting the
> ITER_ALLOW_P2PDMA flag.
> 
> Pass ITER_ALLOW_P2PDMA to iov_iter_extract_pages() only if the local
> mount indicates support via the NFS_CAP_P2PDMA capability bit (detected
> at mount time for RDMA transports).

Please split the conversion to iov_iter_extract_pages into a separate
preparation patch, and even series.  That is a long overdue change
that fixes potential data corruption in XFS.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support
  2026-04-01 19:44 [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Pranjal Shrivastava
                   ` (3 preceding siblings ...)
  2026-04-01 19:45 ` [RFC PATCH 4/4] nfs: allow P2PDMA in direct I/O path Pranjal Shrivastava
@ 2026-04-02  5:07 ` Christoph Hellwig
  4 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2026-04-02  5:07 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: trond.myklebust, anna, davem, kuba, edumazet, pabeni, chuck.lever,
	jlayton, tom, okorniev, neil, dai.ngo, linux-nfs, netdev

Besides the point of splitting out the iov_iter_extract* conversion this
seems to ignore pNFS.  You also need to check the layout driver for the
current file range and propagate P2P support or lack of through that.

Note that the block-style layouts can also trivially support P2P and not
just RPC-based ones.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 2/4] nfs: add NFS_CAP_P2PDMA and detect transport support
  2026-04-01 19:44 ` [RFC PATCH 2/4] nfs: add NFS_CAP_P2PDMA and detect transport support Pranjal Shrivastava
@ 2026-04-02 13:11   ` Chuck Lever
  0 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-04-02 13:11 UTC (permalink / raw)
  To: Pranjal Shrivastava, Trond Myklebust, Anna Schumaker
  Cc: davem, Jakub Kicinski, edumazet, Paolo Abeni, Chuck Lever,
	Jeff Layton, Tom Talpey, Olga Kornievskaia, NeilBrown, Dai Ngo,
	linux-nfs, netdev


On Wed, Apr 1, 2026, at 3:44 PM, Pranjal Shrivastava wrote:
> The NFS server capabilities bitmask (server->caps) is currently full,
> utilizing all 32 bits of the existing unsigned int. Expand the bitmask
> to 64 bits (u64) to allow for new feature flags.
>
> Introduce a new capability bit, NFS_CAP_P2PDMA, to indicate that the
> local mount is backed by hardware and a transport capable of PCI
> Peer-to-Peer DMA.
>
> Update nfs_server_set_init_caps() to query the underlying SunRPC
> transport for P2PDMA support during the mount process. If the transport
> (e.g., RDMA) signals support, set the NFS_CAP_P2PDMA bit in the mount's
> capabilities. This allows the high-performance Direct I/O path to
> efficiently determine if it should allow P2P memory buffers.

> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index be02bb227741..f177cf098d44 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c

> @@ -725,6 +727,12 @@ void nfs_server_set_init_caps(struct nfs_server *server)
>  		nfs4_server_set_init_caps(server);
>  		break;
>  	}
> +
> +	rcu_read_lock();
> +	xprt = rcu_dereference(server->client->cl_xprt);
> +	if (xprt->ops->supports_p2pdma && xprt->ops->supports_p2pdma(xprt))
> +		server->caps |= NFS_CAP_P2PDMA;
> +	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL_GPL(nfs_server_set_init_caps);

Is the transport even connected when the NFS client does this
test? If it isn't, xprtrdma and the RDMA core have not chosen
an underlying device yet.

Note that, even if this logic /is/ correct, if the transport
connection is lost the transport will reconnect automatically,
doing the RDMA CM dance again and possibly resolving to a
different device. The NFS client layer will be none-the-wiser
and the NFS_CAP_P2PDMA flag setting will be stale at that point,
and quite possibly incorrect if the new connection's device is
not P2P-enabled.

(Basically this is what happens when an RDMA device is removed).

So this detection has to be done as part of xprtrdma's connection
flow, and it needs to set a flag somewhere in the rpc_xprt. The
NFS direct I/O code path then has to look for that flag before
choosing the mechanism/flags it uses for each iov iter.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-04-02 13:11 UTC | newest]

Thread overview: 9+ messages
2026-04-01 19:44 [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Pranjal Shrivastava
2026-04-01 19:44 ` [RFC PATCH 1/4] sunrpc: add supports_p2pdma to rpc_xprt_ops Pranjal Shrivastava
2026-04-01 19:44 ` [RFC PATCH 2/4] nfs: add NFS_CAP_P2PDMA and detect transport support Pranjal Shrivastava
2026-04-02 13:11   ` Chuck Lever
2026-04-01 19:44 ` [RFC PATCH 3/4] nfs: make nfs_page pin-aware Pranjal Shrivastava
2026-04-02  5:04   ` Christoph Hellwig
2026-04-01 19:45 ` [RFC PATCH 4/4] nfs: allow P2PDMA in direct I/O path Pranjal Shrivastava
2026-04-02  5:05   ` Christoph Hellwig
2026-04-02  5:07 ` [RFC PATCH 0/4] nfs: Enable PCI Peer-to-Peer DMA (P2PDMA) support Christoph Hellwig
