public inbox for linux-rdma@vger.kernel.org
* [PATCH v1 0/4] Add a bio_vec based API to core/rw.c
@ 2026-01-14 14:39 Chuck Lever
  2026-01-14 14:39 ` [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Chuck Lever @ 2026-01-14 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: linux-rdma, linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This series introduces a bio_vec based API for RDMA read and write
operations in the RDMA core, eliminating unnecessary scatterlist
conversions for callers that already work with bvecs.

Current users of rdma_rw_ctx_init() must convert their native data
structures into scatterlists. For subsystems like svcrdma that
maintain data in bvec format, this conversion adds overhead both in
CPU cycles and memory footprint. The new API accepts bvec arrays
directly.

For hardware RDMA devices, the implementation uses the IOVA-based
DMA mapping API, reducing IOTLB synchronization overhead from one
sync per page to a single sync after all mappings complete.
Software RDMA devices (rxe, siw) continue using virtual addressing.

The series includes MR registration support for bvec arrays,
enabling iWARP devices and the force_mr debug parameter. The MR
path reuses existing ib_map_mr_sg() infrastructure by constructing
a synthetic scatterlist from the bvec DMA addresses.

The final patch adds the first consumer for the new API: svcrdma.
It replaces its scatterlist conversion code, significantly reducing
the svc_rdma_rw_ctxt structure size. The previous implementation
embedded a scatterlist array of RPCSVC_MAXPAGES entries (4KB or
more per context); the new implementation uses a pointer to a
dynamically allocated bvec array.

Based on v6.19-rc5.

Chuck Lever (4):
  RDMA/core: add bio_vec based RDMA read/write API
  RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  RDMA/core: add MR support for bvec-based RDMA operations
  svcrdma: use bvec-based RDMA read/write API

 drivers/infiniband/core/rw.c      | 492 ++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h           |  35 +++
 include/rdma/rw.h                 |  26 ++
 net/sunrpc/xprtrdma/svc_rdma_rw.c | 115 ++++---
 4 files changed, 608 insertions(+), 60 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
@ 2026-01-14 14:39 ` Chuck Lever
  2026-01-15 15:53   ` Christoph Hellwig
  2026-01-14 14:39 ` [PATCH v1 2/4] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2026-01-14 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: linux-rdma, linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The existing rdma_rw_ctx_init() API requires callers to construct a
scatterlist, which is then DMA-mapped page by page. Callers that
already have data in bio_vec form (such as svcrdma) must first
convert to a scatterlist, adding overhead and complexity.

Introduce rdma_rw_ctx_init_bvec() and rdma_rw_ctx_destroy_bvec() to
accept bio_vec arrays directly. The new helpers use dma_map_phys()
for hardware RDMA devices and virtual addressing for software RDMA
devices (rxe, siw), avoiding intermediate scatterlist construction.

Memory registration (MR) path support is added by a subsequent patch
in this series; until then, callers requiring MR-based transfers
(iWARP devices or force_mr=1) receive -EOPNOTSUPP and should fall
back to the scatterlist API.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/core/rw.c | 194 +++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h      |  35 +++++++
 include/rdma/rw.h            |  10 ++
 3 files changed, 239 insertions(+)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 6354ddf2a274..42215c2ff42b 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,6 +274,124 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return 1;
 }
 
+static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
+		struct ib_qp *qp, const struct bio_vec *bvec, u32 offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
+	struct bio_vec adjusted = *bvec;
+	u64 dma_addr;
+
+	ctx->nr_ops = 1;
+
+	if (offset) {
+		adjusted.bv_offset += offset;
+		adjusted.bv_len -= offset;
+	}
+
+	dma_addr = ib_dma_map_bvec(dev, &adjusted, dir);
+	if (ib_dma_mapping_error(dev, dma_addr))
+		return -ENOMEM;
+
+	ctx->single.sge.lkey = qp->pd->local_dma_lkey;
+	ctx->single.sge.addr = dma_addr;
+	ctx->single.sge.length = adjusted.bv_len;
+
+	memset(rdma_wr, 0, sizeof(*rdma_wr));
+	if (dir == DMA_TO_DEVICE)
+		rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+	else
+		rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+	rdma_wr->wr.sg_list = &ctx->single.sge;
+	rdma_wr->wr.num_sge = 1;
+	rdma_wr->remote_addr = remote_addr;
+	rdma_wr->rkey = rkey;
+
+	ctx->type = RDMA_RW_SINGLE_WR;
+	return 1;
+}
+
+static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		const struct bio_vec *bvec, u32 nr_bvec, u32 offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	u32 max_sge = dir == DMA_TO_DEVICE ? qp->max_write_sge :
+		      qp->max_read_sge;
+	struct ib_sge *sge;
+	u32 total_len = 0, i, j, bvec_idx = 0;
+	u32 mapped_bvecs = 0;
+	u64 dma_addr;
+
+	ctx->nr_ops = DIV_ROUND_UP(nr_bvec, max_sge);
+
+	ctx->map.sges = sge = kcalloc(nr_bvec, sizeof(*sge), GFP_KERNEL);
+	if (!ctx->map.sges)
+		return -ENOMEM;
+
+	ctx->map.wrs = kcalloc(ctx->nr_ops, sizeof(*ctx->map.wrs), GFP_KERNEL);
+	if (!ctx->map.wrs)
+		goto out_free_sges;
+
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct ib_rdma_wr *rdma_wr = &ctx->map.wrs[i];
+		u32 nr_sge = min(nr_bvec - bvec_idx, max_sge);
+
+		if (dir == DMA_TO_DEVICE)
+			rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+		else
+			rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+		rdma_wr->remote_addr = remote_addr + total_len;
+		rdma_wr->rkey = rkey;
+		rdma_wr->wr.num_sge = nr_sge;
+		rdma_wr->wr.sg_list = sge;
+
+		for (j = 0; j < nr_sge; j++, bvec_idx++) {
+			const struct bio_vec *bv = &bvec[bvec_idx];
+			u32 len = bv->bv_len;
+
+			/* Handle offset into first bvec */
+			if (bvec_idx == 0 && offset) {
+				struct bio_vec adjusted = *bv;
+
+				adjusted.bv_offset += offset;
+				adjusted.bv_len -= offset;
+				dma_addr = ib_dma_map_bvec(dev, &adjusted, dir);
+				len = adjusted.bv_len;
+			} else {
+				dma_addr = ib_dma_map_bvec(dev, bv, dir);
+			}
+
+			if (ib_dma_mapping_error(dev, dma_addr))
+				goto out_unmap;
+
+			mapped_bvecs++;
+			sge->addr = dma_addr;
+			sge->length = len;
+			sge->lkey = qp->pd->local_dma_lkey;
+
+			total_len += len;
+			sge++;
+		}
+
+		rdma_wr->wr.next = i + 1 < ctx->nr_ops ?
+			&ctx->map.wrs[i + 1].wr : NULL;
+	}
+
+	ctx->type = RDMA_RW_MULTI_WR;
+	return ctx->nr_ops;
+
+out_unmap:
+	for (i = 0; i < mapped_bvecs; i++)
+		ib_dma_unmap_bvec(dev, ctx->map.sges[i].addr,
+				  ctx->map.sges[i].length, dir);
+	kfree(ctx->map.wrs);
+out_free_sges:
+	kfree(ctx->map.sges);
+	return -ENOMEM;
+}
+
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:	context to initialize
@@ -344,6 +462,46 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
 
+/**
+ * rdma_rw_ctx_init_bvec - initialize a RDMA READ/WRITE context from bio_vec
+ * @ctx:	context to initialize
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @bvec:	bio_vec array to READ/WRITE from/to
+ * @nr_bvec:	number of entries in @bvec
+ * @offset:	byte offset into first bvec
+ * @remote_addr:remote address to read/write (relative to @rkey)
+ * @rkey:	remote key to operate on
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ *
+ * Maps the bio_vec array directly using dma_map_phys(), avoiding the
+ * intermediate scatterlist conversion. Does not support the MR registration
+ * path (iWARP devices or force_mr=1).
+ *
+ * Returns the number of WQEs that will be needed on the workqueue if
+ * successful, or a negative error code.
+ */
+int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvec, u32 nr_bvec,
+		u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	if (nr_bvec == 0 || offset > bvec[0].bv_len)
+		return -EINVAL;
+
+	/* MR path not supported for bvec - reject iWARP and force_mr */
+	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, nr_bvec))
+		return -EOPNOTSUPP;
+
+	if (nr_bvec == 1)
+		return rdma_rw_init_single_wr_bvec(ctx, qp, bvec, offset,
+				remote_addr, rkey, dir);
+
+	return rdma_rw_init_map_wrs_bvec(ctx, qp, bvec, nr_bvec, offset,
+			remote_addr, rkey, dir);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_init_bvec);
+
 /**
  * rdma_rw_ctx_signature_init - initialize a RW context with signature offload
  * @ctx:	context to initialize
@@ -598,6 +756,42 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
+/**
+ * rdma_rw_ctx_destroy_bvec - release resources from rdma_rw_ctx_init_bvec
+ * @ctx:	context to release
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @bvec:	bio_vec array that was used for the READ/WRITE
+ * @nr_bvec:	number of entries in @bvec
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ */
+void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 __maybe_unused port_num,
+		const struct bio_vec __maybe_unused *bvec,
+		u32 nr_bvec, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	u32 i;
+
+	switch (ctx->type) {
+	case RDMA_RW_MULTI_WR:
+		for (i = 0; i < nr_bvec; i++)
+			ib_dma_unmap_bvec(dev, ctx->map.sges[i].addr,
+					  ctx->map.sges[i].length, dir);
+		kfree(ctx->map.wrs);
+		kfree(ctx->map.sges);
+		break;
+	case RDMA_RW_SINGLE_WR:
+		ib_dma_unmap_bvec(dev, ctx->single.sge.addr,
+				  ctx->single.sge.length, dir);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+EXPORT_SYMBOL(rdma_rw_ctx_destroy_bvec);
+
 /**
  * rdma_rw_ctx_destroy_signature - release all resources allocated by
  *	rdma_rw_ctx_signature_init
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6aad66bc5dd7..035593b2692d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -15,6 +15,7 @@
 #include <linux/ethtool.h>
 #include <linux/types.h>
 #include <linux/device.h>
+#include <linux/bvec.h>
 #include <linux/dma-mapping.h>
 #include <linux/kref.h>
 #include <linux/list.h>
@@ -4249,6 +4250,40 @@ static inline void ib_dma_unmap_page(struct ib_device *dev,
 		dma_unmap_page(dev->dma_device, addr, size, direction);
 }
 
+/**
+ * ib_dma_map_bvec - Map a bio_vec to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @bvec: The bio_vec to map
+ * @direction: The direction of the DMA
+ *
+ * Uses dma_map_phys() for real hardware devices and virtual
+ * address for software RDMA devices (rxe, siw).
+ */
+static inline u64 ib_dma_map_bvec(struct ib_device *dev,
+				  const struct bio_vec *bvec,
+				  enum dma_data_direction direction)
+{
+	if (ib_uses_virt_dma(dev))
+		return (uintptr_t)(page_address(bvec->bv_page) + bvec->bv_offset);
+	return dma_map_phys(dev->dma_device, bvec_phys(bvec),
+			    bvec->bv_len, direction, 0);
+}
+
+/**
+ * ib_dma_unmap_bvec - Unmap a bio_vec DMA mapping
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static inline void ib_dma_unmap_bvec(struct ib_device *dev,
+				     u64 addr, size_t size,
+				     enum dma_data_direction direction)
+{
+	if (!ib_uses_virt_dma(dev))
+		dma_unmap_phys(dev->dma_device, addr, size, direction, 0);
+}
+
 int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents);
 static inline int ib_dma_map_sg_attrs(struct ib_device *dev,
 				      struct scatterlist *sg, int nents,
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index d606cac48233..046a8eb57125 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -49,6 +49,16 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 			 u32 port_num, struct scatterlist *sg, u32 sg_cnt,
 			 enum dma_data_direction dir);
 
+struct bio_vec;
+
+int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvec, u32 nr_bvec,
+		u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir);
+void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvec, u32 nr_bvec,
+		enum dma_data_direction dir);
+
 int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u32 port_num, struct scatterlist *sg, u32 sg_cnt,
 		struct scatterlist *prot_sg, u32 prot_sg_cnt,
-- 
2.52.0



* [PATCH v1 2/4] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
  2026-01-14 14:39 ` [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
@ 2026-01-14 14:39 ` Chuck Lever
  2026-01-15 15:58   ` Christoph Hellwig
  2026-01-14 14:39 ` [PATCH v1 3/4] RDMA/core: add MR support for bvec-based " Chuck Lever
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2026-01-14 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: linux-rdma, linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The bvec RDMA API maps each bvec individually via dma_map_phys(),
requiring an IOTLB sync for each mapping. For large I/O operations
with many bvecs, this overhead becomes significant.

The two-step IOVA API (dma_iova_try_alloc/dma_iova_link/
dma_iova_sync) allocates a contiguous IOVA range upfront, links
all physical pages without IOTLB syncs, then performs a single
sync at the end. This reduces IOTLB flushes from O(n) to O(1).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/core/rw.c | 153 +++++++++++++++++++++++++++++++++++
 include/rdma/rw.h            |   8 ++
 2 files changed, 161 insertions(+)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 42215c2ff42b..36038e5f9197 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -14,6 +14,7 @@ enum {
 	RDMA_RW_MULTI_WR,
 	RDMA_RW_MR,
 	RDMA_RW_SIG_MR,
+	RDMA_RW_IOVA,
 };
 
 static bool rdma_rw_force_mr;
@@ -392,6 +393,137 @@ static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return -ENOMEM;
 }
 
+/*
+ * Try to use the two-step IOVA API to map bvecs into a contiguous DMA range.
+ * This reduces IOTLB sync overhead by doing one sync at the end instead of
+ * one per bvec, and produces a contiguous DMA address range.
+ *
+ * Returns the number of WQEs on success, -EOPNOTSUPP if IOVA mapping is not
+ * available, or another negative error code on failure.
+ */
+static int rdma_rw_init_iova_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		const struct bio_vec *bvec, u32 nr_bvec, u32 offset,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct device *dma_dev = dev->dma_device;
+	u32 max_sge = dir == DMA_TO_DEVICE ? qp->max_write_sge :
+		      qp->max_read_sge;
+	struct ib_sge *sge;
+	size_t total_len = 0, mapped_len = 0;
+	u32 i, j, bvec_idx = 0;
+	int ret;
+
+	/* Virtual DMA devices don't support IOVA mapping */
+	if (ib_uses_virt_dma(dev))
+		return -EOPNOTSUPP;
+
+	if (!max_sge)
+		return -EINVAL;
+
+	/* Calculate total transfer length */
+	for (i = 0; i < nr_bvec; i++) {
+		size_t len = (i == 0 && offset) ?
+			     bvec[i].bv_len - offset : bvec[i].bv_len;
+
+		if (check_add_overflow(total_len, len, &total_len))
+			return -EINVAL;
+	}
+
+	/* Try to allocate contiguous IOVA space */
+	if (!dma_iova_try_alloc(dma_dev, &ctx->iova.state,
+				bvec_phys(&bvec[0]) + offset, total_len))
+		return -EOPNOTSUPP;
+
+	ctx->nr_ops = DIV_ROUND_UP(nr_bvec, max_sge);
+
+	ctx->iova.sges = sge = kcalloc(nr_bvec, sizeof(*sge), GFP_KERNEL);
+	if (!ctx->iova.sges) {
+		ret = -ENOMEM;
+		goto out_free_iova;
+	}
+
+	ctx->iova.wrs = kcalloc(ctx->nr_ops, sizeof(*ctx->iova.wrs), GFP_KERNEL);
+	if (!ctx->iova.wrs) {
+		ret = -ENOMEM;
+		goto out_free_sges;
+	}
+
+	/* Link all bvecs into the IOVA space */
+	for (i = 0; i < nr_bvec; i++) {
+		const struct bio_vec *bv = &bvec[i];
+		phys_addr_t phys = bvec_phys(bv);
+		size_t len = bv->bv_len;
+
+		if (i == 0 && offset) {
+			phys += offset;
+			len -= offset;
+		}
+
+		ret = dma_iova_link(dma_dev, &ctx->iova.state, phys,
+				    mapped_len, len, dir, 0);
+		if (ret)
+			goto out_destroy;
+
+		mapped_len += len;
+	}
+
+	/* Sync the IOTLB once for all linked pages */
+	ret = dma_iova_sync(dma_dev, &ctx->iova.state, 0, mapped_len);
+	if (ret)
+		goto out_destroy;
+
+	ctx->iova.mapped_len = mapped_len;
+
+	/* Build SGEs using offsets into the contiguous IOVA range */
+	mapped_len = 0;
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct ib_rdma_wr *rdma_wr = &ctx->iova.wrs[i];
+		u32 nr_sge = min(nr_bvec - bvec_idx, max_sge);
+
+		if (dir == DMA_TO_DEVICE)
+			rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+		else
+			rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+		rdma_wr->remote_addr = remote_addr + mapped_len;
+		rdma_wr->rkey = rkey;
+		rdma_wr->wr.num_sge = nr_sge;
+		rdma_wr->wr.sg_list = sge;
+
+		for (j = 0; j < nr_sge; j++, bvec_idx++) {
+			const struct bio_vec *bv = &bvec[bvec_idx];
+			size_t len = bv->bv_len;
+
+			if (bvec_idx == 0 && offset)
+				len -= offset;
+
+			sge->addr = ctx->iova.state.addr + mapped_len;
+			sge->length = len;
+			sge->lkey = qp->pd->local_dma_lkey;
+
+			mapped_len += len;
+			sge++;
+		}
+
+		rdma_wr->wr.next = i + 1 < ctx->nr_ops ?
+			&ctx->iova.wrs[i + 1].wr : NULL;
+	}
+
+	ctx->type = RDMA_RW_IOVA;
+	return ctx->nr_ops;
+
+out_destroy:
+	dma_iova_destroy(dma_dev, &ctx->iova.state, mapped_len, dir, 0);
+	kfree(ctx->iova.wrs);
+	kfree(ctx->iova.sges);
+	return ret;
+out_free_sges:
+	kfree(ctx->iova.sges);
+out_free_iova:
+	dma_iova_free(dma_dev, &ctx->iova.state);
+	return ret;
+}
+
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:	context to initialize
@@ -486,6 +618,8 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u32 offset, u64 remote_addr, u32 rkey,
 		enum dma_data_direction dir)
 {
+	int ret;
+
 	if (nr_bvec == 0 || offset > bvec[0].bv_len)
 		return -EINVAL;
 
@@ -497,6 +631,15 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		return rdma_rw_init_single_wr_bvec(ctx, qp, bvec, offset,
 				remote_addr, rkey, dir);
 
+	/*
+	 * Try IOVA-based mapping first for multi-bvec transfers.
+	 * This reduces IOTLB sync overhead by batching all mappings.
+	 */
+	ret = rdma_rw_init_iova_wrs_bvec(ctx, qp, bvec, nr_bvec, offset,
+			remote_addr, rkey, dir);
+	if (ret != -EOPNOTSUPP)
+		return ret;
+
 	return rdma_rw_init_map_wrs_bvec(ctx, qp, bvec, nr_bvec, offset,
 			remote_addr, rkey, dir);
 }
@@ -673,6 +816,10 @@ struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 			first_wr = &ctx->reg[0].reg_wr.wr;
 		last_wr = &ctx->reg[ctx->nr_ops - 1].wr.wr;
 		break;
+	case RDMA_RW_IOVA:
+		first_wr = &ctx->iova.wrs[0].wr;
+		last_wr = &ctx->iova.wrs[ctx->nr_ops - 1].wr;
+		break;
 	case RDMA_RW_MULTI_WR:
 		first_wr = &ctx->map.wrs[0].wr;
 		last_wr = &ctx->map.wrs[ctx->nr_ops - 1].wr;
@@ -774,6 +921,12 @@ void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	u32 i;
 
 	switch (ctx->type) {
+	case RDMA_RW_IOVA:
+		dma_iova_destroy(dev->dma_device, &ctx->iova.state,
+				 ctx->iova.mapped_len, dir, 0);
+		kfree(ctx->iova.wrs);
+		kfree(ctx->iova.sges);
+		break;
 	case RDMA_RW_MULTI_WR:
 		for (i = 0; i < nr_bvec; i++)
 			ib_dma_unmap_bvec(dev, ctx->map.sges[i].addr,
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index 046a8eb57125..8a2012f03667 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -31,6 +31,14 @@ struct rdma_rw_ctx {
 			struct ib_rdma_wr	*wrs;
 		} map;
 
+		/* for IOVA-based mapping of multiple bvecs: */
+		struct {
+			struct dma_iova_state	state;
+			struct ib_sge		*sges;
+			struct ib_rdma_wr	*wrs;
+			size_t			mapped_len;
+		} iova;
+
 		/* for registering multiple WRs: */
 		struct rdma_rw_reg_ctx {
 			struct ib_sge		sge;
-- 
2.52.0



* [PATCH v1 3/4] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
  2026-01-14 14:39 ` [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
  2026-01-14 14:39 ` [PATCH v1 2/4] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
@ 2026-01-14 14:39 ` Chuck Lever
  2026-01-15 15:58   ` Christoph Hellwig
  2026-01-16 11:42   ` Leon Romanovsky
  2026-01-14 14:39 ` [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API Chuck Lever
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Chuck Lever @ 2026-01-14 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: linux-rdma, linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
Region registration is required. This prevents iWARP devices from
using the bvec path, since iWARP requires MR registration for RDMA
READ operations. The force_mr debug parameter is also unusable with
bvec input.

Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec
arrays. The approach creates a synthetic scatterlist populated with
DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg()
infrastructure. This avoids driver changes while keeping the
implementation small.

The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup.
On destroy, the MRs are returned to the pool and the bvec DMA
mappings are released using the stored addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/core/rw.c | 157 +++++++++++++++++++++++++++++++++--
 include/rdma/rw.h            |   8 ++
 2 files changed, 159 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 36038e5f9197..610f5c946567 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -193,6 +193,140 @@ static int rdma_rw_init_mr_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return ret;
 }
 
+static int rdma_rw_init_mr_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvec, u32 nr_bvec,
+		u32 offset, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct rdma_rw_reg_ctx *prev = NULL;
+	u32 pages_per_mr = rdma_rw_fr_page_list_len(dev, qp->integrity_en);
+	struct scatterlist *sgl;
+	int i, j, ret = 0, count = 0;
+	u32 sg_idx = 0;
+
+	ctx->nr_ops = DIV_ROUND_UP(nr_bvec, pages_per_mr);
+	ctx->reg = kcalloc(ctx->nr_ops, sizeof(*ctx->reg), GFP_KERNEL);
+	if (!ctx->reg)
+		return -ENOMEM;
+
+	/*
+	 * Allocate synthetic scatterlist to hold DMA addresses.
+	 * ib_map_mr_sg() extracts sg_dma_address/len, so the page
+	 * pointer is unused.
+	 */
+	sgl = kmalloc_array(nr_bvec, sizeof(*sgl), GFP_KERNEL);
+	if (!sgl) {
+		ret = -ENOMEM;
+		goto out_free_reg;
+	}
+	sg_init_table(sgl, nr_bvec);
+
+	/*
+	 * DMA map all bvecs and populate the synthetic scatterlist.
+	 */
+	for (i = 0; i < nr_bvec; i++) {
+		const struct bio_vec *bv = &bvec[i];
+		struct bio_vec adjusted;
+		u64 dma_addr;
+		u32 len;
+
+		if (i == 0 && offset) {
+			adjusted = *bv;
+			adjusted.bv_offset += offset;
+			adjusted.bv_len -= offset;
+			bv = &adjusted;
+		}
+		len = bv->bv_len;
+
+		dma_addr = ib_dma_map_bvec(dev, bv, dir);
+		if (ib_dma_mapping_error(dev, dma_addr)) {
+			ret = -ENOMEM;
+			goto out_unmap;
+		}
+
+		/*
+		 * Populate sg entry with DMA address. sg_set_page() is
+		 * called to initialize the entry, but the page pointer
+		 * is unused by ib_map_mr_sg().
+		 */
+		sg_set_page(&sgl[i], bv->bv_page, len, bv->bv_offset);
+		sg_dma_address(&sgl[i]) = dma_addr;
+		sg_dma_len(&sgl[i]) = len;
+	}
+
+	/*
+	 * Build MR chain using the synthetic scatterlist.
+	 */
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
+		u32 nents = min(nr_bvec - sg_idx, pages_per_mr);
+
+		ret = rdma_rw_init_one_mr(qp, port_num, reg, &sgl[sg_idx],
+					  nents, 0);
+		if (ret < 0)
+			goto out_free_mrs;
+		count += ret;
+
+		if (prev) {
+			if (reg->mr->need_inval)
+				prev->wr.wr.next = &reg->inv_wr;
+			else
+				prev->wr.wr.next = &reg->reg_wr.wr;
+		}
+
+		reg->reg_wr.wr.next = &reg->wr.wr;
+
+		reg->wr.wr.sg_list = &reg->sge;
+		reg->wr.wr.num_sge = 1;
+		reg->wr.remote_addr = remote_addr;
+		reg->wr.rkey = rkey;
+
+		if (dir == DMA_TO_DEVICE) {
+			reg->wr.wr.opcode = IB_WR_RDMA_WRITE;
+		} else if (!rdma_cap_read_inv(qp->device, port_num)) {
+			reg->wr.wr.opcode = IB_WR_RDMA_READ;
+		} else {
+			reg->wr.wr.opcode = IB_WR_RDMA_READ_WITH_INV;
+			reg->wr.wr.ex.invalidate_rkey = reg->mr->lkey;
+		}
+		count++;
+
+		remote_addr += reg->sge.length;
+		sg_idx += nents;
+		prev = reg;
+	}
+
+	if (prev)
+		prev->wr.wr.next = NULL;
+
+	ctx->type = RDMA_RW_MR;
+	ctx->mr_sgl = sgl;
+	ctx->mr_sg_cnt = nr_bvec;
+	return count;
+
+out_free_mrs:
+	while (--i >= 0)
+		ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
+	/* All bvecs were mapped successfully, unmap them all */
+	for (j = 0; j < nr_bvec; j++)
+		ib_dma_unmap_bvec(dev, sg_dma_address(&sgl[j]),
+				  sg_dma_len(&sgl[j]), dir);
+	kfree(sgl);
+	kfree(ctx->reg);
+	return ret;
+
+out_unmap:
+	/* Unmap bvecs that were successfully mapped (0 through i-1) */
+	for (j = 0; j < i; j++)
+		ib_dma_unmap_bvec(dev, sg_dma_address(&sgl[j]),
+				  sg_dma_len(&sgl[j]), dir);
+	kfree(sgl);
+out_free_reg:
+	kfree(ctx->reg);
+	return ret;
+}
+
 static int rdma_rw_init_map_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		struct scatterlist *sg, u32 sg_cnt, u32 offset,
 		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
@@ -606,9 +740,8 @@ EXPORT_SYMBOL(rdma_rw_ctx_init);
  * @rkey:	remote key to operate on
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
  *
- * Maps the bio_vec array directly using dma_map_phys(), avoiding the
- * intermediate scatterlist conversion. Does not support the MR registration
- * path (iWARP devices or force_mr=1).
+ * Maps the bio_vec array directly, avoiding intermediate scatterlist
+ * conversion. Supports MR registration for iWARP devices and force_mr mode.
  *
  * Returns the number of WQEs that will be needed on the workqueue if
  * successful, or a negative error code.
@@ -618,14 +751,16 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u32 offset, u64 remote_addr, u32 rkey,
 		enum dma_data_direction dir)
 {
+	struct ib_device *dev = qp->pd->device;
 	int ret;
 
 	if (nr_bvec == 0 || offset > bvec[0].bv_len)
 		return -EINVAL;
 
-	/* MR path not supported for bvec - reject iWARP and force_mr */
-	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, nr_bvec))
-		return -EOPNOTSUPP;
+	if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
+		return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvec,
+						nr_bvec, offset, remote_addr,
+						rkey, dir);
 
 	if (nr_bvec == 1)
 		return rdma_rw_init_single_wr_bvec(ctx, qp, bvec, offset,
@@ -921,6 +1056,16 @@ void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	u32 i;
 
 	switch (ctx->type) {
+	case RDMA_RW_MR:
+		for (i = 0; i < ctx->nr_ops; i++)
+			ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
+		kfree(ctx->reg);
+		/* Unmap bvecs using stored DMA addresses */
+		for (i = 0; i < ctx->mr_sg_cnt; i++)
+			ib_dma_unmap_bvec(dev, sg_dma_address(&ctx->mr_sgl[i]),
+					  sg_dma_len(&ctx->mr_sgl[i]), dir);
+		kfree(ctx->mr_sgl);
+		break;
 	case RDMA_RW_IOVA:
 		dma_iova_destroy(dev->dma_device, &ctx->iova.state,
 				 ctx->iova.mapped_len, dir, 0);
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index 8a2012f03667..c73dc6955e07 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -48,6 +48,14 @@ struct rdma_rw_ctx {
 			struct ib_mr		*mr;
 		} *reg;
 	};
+
+	/*
+	 * For bvec MR path: store synthetic scatterlist with DMA addresses
+	 * for cleanup. Only valid when type == RDMA_RW_MR and initialized
+	 * via rdma_rw_ctx_init_bvec().
+	 */
+	struct scatterlist	*mr_sgl;
+	u32			mr_sg_cnt;
 };
 
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
-- 
2.52.0



* [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API
  2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
                   ` (2 preceding siblings ...)
  2026-01-14 14:39 ` [PATCH v1 3/4] RDMA/core: add MR support for bvec-based " Chuck Lever
@ 2026-01-14 14:39 ` Chuck Lever
  2026-01-15  9:51   ` Leon Romanovsky
  2026-01-15 16:29   ` Christoph Hellwig
  2026-01-15  9:50 ` [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Leon Romanovsky
  2026-01-15 15:46 ` Christoph Hellwig
  5 siblings, 2 replies; 30+ messages in thread
From: Chuck Lever @ 2026-01-14 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: linux-rdma, linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Convert svcrdma to the bvec-based RDMA API introduced earlier in
this series.

The bvec-based RDMA API eliminates the intermediate scatterlist
conversion step, allowing direct DMA mapping from bio_vec arrays.
This simplifies the svc_rdma_rw_ctxt structure by removing the
inline scatterlist and chained SG table management.

The structure size reduction is significant: the previous inline
scatterlist array of RPCSVC_MAXPAGES entries (4KB or more) is
replaced with a pointer to a dynamically allocated bvec array,
bringing the fixed structure size down to approximately 100 bytes.

The bvec API handles all device types internally, including iWARP
devices which require memory registration. No explicit fallback
path is needed.

Signed-off-by: Chuck Lever <cel@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_rw.c | 115 ++++++++++++++----------------
 1 file changed, 55 insertions(+), 60 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 310de7a80be5..fac83a78282b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -5,6 +5,7 @@
  * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
  */
 
+#include <linux/bvec.h>
 #include <rdma/rw.h>
 
 #include <linux/sunrpc/xdr.h>
@@ -21,29 +22,27 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc);
  * Write Work Requests.
  *
  * Each WR chain handles a single contiguous server-side buffer,
- * because scatterlist entries after the first have to start on
+ * because bio_vec entries after the first have to start on
  * page alignment. xdr_buf iovecs cannot guarantee alignment.
  *
  * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
  * from a client may contain a unique R_key, so each WR chain moves
  * up to one segment at a time.
  *
- * The scatterlist makes this data structure over 4KB in size. To
- * make it less likely to fail, and to handle the allocation for
- * smaller I/O requests without disabling bottom-halves, these
- * contexts are created on demand, but cached and reused until the
- * controlling svcxprt_rdma is destroyed.
+ * These contexts are created on demand, but cached and reused until
+ * the controlling svcxprt_rdma is destroyed.
  */
 struct svc_rdma_rw_ctxt {
 	struct llist_node	rw_node;
 	struct list_head	rw_list;
 	struct rdma_rw_ctx	rw_ctx;
 	unsigned int		rw_nents;
-	unsigned int		rw_first_sgl_nents;
-	struct sg_table		rw_sg_table;
-	struct scatterlist	rw_first_sgl[];
+	struct bio_vec		*rw_bvec;
 };
 
+static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
+				 struct svc_rdma_rw_ctxt *ctxt);
+
 static inline struct svc_rdma_rw_ctxt *
 svc_rdma_next_ctxt(struct list_head *list)
 {
@@ -52,10 +51,9 @@ svc_rdma_next_ctxt(struct list_head *list)
 }
 
 static struct svc_rdma_rw_ctxt *
-svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
+svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int nr_bvec)
 {
 	struct ib_device *dev = rdma->sc_cm_id->device;
-	unsigned int first_sgl_nents = dev->attrs.max_send_sge;
 	struct svc_rdma_rw_ctxt *ctxt;
 	struct llist_node *node;
 
@@ -65,33 +63,35 @@ svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
 	if (node) {
 		ctxt = llist_entry(node, struct svc_rdma_rw_ctxt, rw_node);
 	} else {
-		ctxt = kmalloc_node(struct_size(ctxt, rw_first_sgl, first_sgl_nents),
-				    GFP_KERNEL, ibdev_to_node(dev));
+		ctxt = kmalloc_node(sizeof(*ctxt), GFP_KERNEL,
+				    ibdev_to_node(dev));
 		if (!ctxt)
 			goto out_noctx;
 
 		INIT_LIST_HEAD(&ctxt->rw_list);
-		ctxt->rw_first_sgl_nents = first_sgl_nents;
 	}
 
-	ctxt->rw_sg_table.sgl = ctxt->rw_first_sgl;
-	if (sg_alloc_table_chained(&ctxt->rw_sg_table, sges,
-				   ctxt->rw_sg_table.sgl,
-				   first_sgl_nents))
+	ctxt->rw_bvec = kmalloc_array_node(nr_bvec, sizeof(*ctxt->rw_bvec),
+					   GFP_KERNEL, ibdev_to_node(dev));
+	if (!ctxt->rw_bvec)
 		goto out_free;
 	return ctxt;
 
 out_free:
-	kfree(ctxt);
+	if (node)
+		svc_rdma_put_rw_ctxt(rdma, ctxt);
+	else
+		kfree(ctxt);
 out_noctx:
-	trace_svcrdma_rwctx_empty(rdma, sges);
+	trace_svcrdma_rwctx_empty(rdma, nr_bvec);
 	return NULL;
 }
 
 static void __svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt,
 				   struct llist_head *list)
 {
-	sg_free_table_chained(&ctxt->rw_sg_table, ctxt->rw_first_sgl_nents);
+	kfree(ctxt->rw_bvec);
+	ctxt->rw_bvec = NULL;
 	llist_add(&ctxt->rw_node, list);
 }
 
@@ -105,6 +105,8 @@ static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
  * svc_rdma_destroy_rw_ctxts - Free accumulated R/W contexts
  * @rdma: transport about to be destroyed
  *
+ * Cached contexts have rw_bvec set to NULL because
+ * __svc_rdma_put_rw_ctxt() frees the bvec array before caching.
  */
 void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
 {
@@ -135,9 +137,10 @@ static int svc_rdma_rw_ctx_init(struct svcxprt_rdma *rdma,
 {
 	int ret;
 
-	ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp, rdma->sc_port_num,
-			       ctxt->rw_sg_table.sgl, ctxt->rw_nents,
-			       0, offset, handle, direction);
+	ret = rdma_rw_ctx_init_bvec(&ctxt->rw_ctx, rdma->sc_qp,
+				    rdma->sc_port_num,
+				    ctxt->rw_bvec, ctxt->rw_nents,
+				    0, offset, handle, direction);
 	if (unlikely(ret < 0)) {
 		trace_svcrdma_dma_map_rw_err(rdma, offset, handle,
 					     ctxt->rw_nents, ret);
@@ -183,9 +186,9 @@ void svc_rdma_cc_release(struct svcxprt_rdma *rdma,
 	while ((ctxt = svc_rdma_next_ctxt(&cc->cc_rwctxts)) != NULL) {
 		list_del(&ctxt->rw_list);
 
-		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
-				    rdma->sc_port_num, ctxt->rw_sg_table.sgl,
-				    ctxt->rw_nents, dir);
+		rdma_rw_ctx_destroy_bvec(&ctxt->rw_ctx, rdma->sc_qp,
+					 rdma->sc_port_num,
+					 ctxt->rw_bvec, ctxt->rw_nents, dir);
 		__svc_rdma_put_rw_ctxt(ctxt, &free);
 
 		ctxt->rw_node.next = first;
@@ -414,29 +417,25 @@ static int svc_rdma_post_chunk_ctxt(struct svcxprt_rdma *rdma,
 	return -ENOTCONN;
 }
 
-/* Build and DMA-map an SGL that covers one kvec in an xdr_buf
+/* Build a bvec that covers one kvec in an xdr_buf.
  */
-static void svc_rdma_vec_to_sg(struct svc_rdma_write_info *info,
-			       unsigned int len,
-			       struct svc_rdma_rw_ctxt *ctxt)
+static void svc_rdma_vec_to_bvec(struct svc_rdma_write_info *info,
+				 unsigned int len,
+				 struct svc_rdma_rw_ctxt *ctxt)
 {
-	struct scatterlist *sg = ctxt->rw_sg_table.sgl;
-
-	sg_set_buf(&sg[0], info->wi_base, len);
+	bvec_set_virt(&ctxt->rw_bvec[0], info->wi_base, len);
 	info->wi_base += len;
-
 	ctxt->rw_nents = 1;
 }
 
-/* Build and DMA-map an SGL that covers part of an xdr_buf's pagelist.
+/* Build a bvec array that covers part of an xdr_buf's pagelist.
  */
-static void svc_rdma_pagelist_to_sg(struct svc_rdma_write_info *info,
-				    unsigned int remaining,
-				    struct svc_rdma_rw_ctxt *ctxt)
+static void svc_rdma_pagelist_to_bvec(struct svc_rdma_write_info *info,
+				      unsigned int remaining,
+				      struct svc_rdma_rw_ctxt *ctxt)
 {
-	unsigned int sge_no, sge_bytes, page_off, page_no;
+	unsigned int bvec_idx, sge_bytes, page_off, page_no;
 	const struct xdr_buf *xdr = info->wi_xdr;
-	struct scatterlist *sg;
 	struct page **page;
 
 	page_off = info->wi_next_off + xdr->page_base;
@@ -444,21 +443,19 @@ static void svc_rdma_pagelist_to_sg(struct svc_rdma_write_info *info,
 	page_off = offset_in_page(page_off);
 	page = xdr->pages + page_no;
 	info->wi_next_off += remaining;
-	sg = ctxt->rw_sg_table.sgl;
-	sge_no = 0;
+	bvec_idx = 0;
 	do {
 		sge_bytes = min_t(unsigned int, remaining,
 				  PAGE_SIZE - page_off);
-		sg_set_page(sg, *page, sge_bytes, page_off);
-
+		bvec_set_page(&ctxt->rw_bvec[bvec_idx], *page, sge_bytes,
+			      page_off);
 		remaining -= sge_bytes;
-		sg = sg_next(sg);
 		page_off = 0;
-		sge_no++;
+		bvec_idx++;
 		page++;
 	} while (remaining);
 
-	ctxt->rw_nents = sge_no;
+	ctxt->rw_nents = bvec_idx;
 }
 
 /* Construct RDMA Write WRs to send a portion of an xdr_buf containing
@@ -535,7 +532,7 @@ static int svc_rdma_iov_write(struct svc_rdma_write_info *info,
 			      const struct kvec *iov)
 {
 	info->wi_base = iov->iov_base;
-	return svc_rdma_build_writes(info, svc_rdma_vec_to_sg,
+	return svc_rdma_build_writes(info, svc_rdma_vec_to_bvec,
 				     iov->iov_len);
 }
 
@@ -559,7 +556,7 @@ static int svc_rdma_pages_write(struct svc_rdma_write_info *info,
 {
 	info->wi_xdr = xdr;
 	info->wi_next_off = offset - xdr->head[0].iov_len;
-	return svc_rdma_build_writes(info, svc_rdma_pagelist_to_sg,
+	return svc_rdma_build_writes(info, svc_rdma_pagelist_to_bvec,
 				     length);
 }
 
@@ -734,29 +731,27 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
 {
 	struct svcxprt_rdma *rdma = svc_rdma_rqst_rdma(rqstp);
 	struct svc_rdma_chunk_ctxt *cc = &head->rc_cc;
-	unsigned int sge_no, seg_len, len;
+	unsigned int bvec_idx, nr_bvec, seg_len, len;
 	struct svc_rdma_rw_ctxt *ctxt;
-	struct scatterlist *sg;
 	int ret;
 
 	len = segment->rs_length;
-	sge_no = PAGE_ALIGN(head->rc_pageoff + len) >> PAGE_SHIFT;
-	ctxt = svc_rdma_get_rw_ctxt(rdma, sge_no);
+	nr_bvec = PAGE_ALIGN(head->rc_pageoff + len) >> PAGE_SHIFT;
+	ctxt = svc_rdma_get_rw_ctxt(rdma, nr_bvec);
 	if (!ctxt)
 		return -ENOMEM;
-	ctxt->rw_nents = sge_no;
+	ctxt->rw_nents = nr_bvec;
 
-	sg = ctxt->rw_sg_table.sgl;
-	for (sge_no = 0; sge_no < ctxt->rw_nents; sge_no++) {
+	for (bvec_idx = 0; bvec_idx < ctxt->rw_nents; bvec_idx++) {
 		seg_len = min_t(unsigned int, len,
 				PAGE_SIZE - head->rc_pageoff);
 
 		if (!head->rc_pageoff)
 			head->rc_page_count++;
 
-		sg_set_page(sg, rqstp->rq_pages[head->rc_curpage],
-			    seg_len, head->rc_pageoff);
-		sg = sg_next(sg);
+		bvec_set_page(&ctxt->rw_bvec[bvec_idx],
+			      rqstp->rq_pages[head->rc_curpage],
+			      seg_len, head->rc_pageoff);
 
 		head->rc_pageoff += seg_len;
 		if (head->rc_pageoff == PAGE_SIZE) {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 0/4] Add a bio_vec based API to core/rw.c
  2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
                   ` (3 preceding siblings ...)
  2026-01-14 14:39 ` [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API Chuck Lever
@ 2026-01-15  9:50 ` Leon Romanovsky
  2026-01-15 15:46 ` Christoph Hellwig
  5 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-15  9:50 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever

On Wed, Jan 14, 2026 at 09:39:44AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> This series introduces a bio_vec based API for RDMA read and write
> operations in the RDMA core, eliminating unnecessary scatterlist
> conversions for callers that already work with bvecs.
> 
> Current users of rdma_rw_ctx_init() must convert their native data
> structures into scatterlists. For subsystems like svcrdma that
> maintain data in bvec format, this conversion adds overhead both in
> CPU cycles and memory footprint. The new API accepts bvec arrays
> directly.
> 
> For hardware RDMA devices, the implementation uses the IOVA-based
> DMA mapping API to reduce IOTLB synchronization overhead from O(n)
> per-page syncs to a single O(1) sync after all mappings complete.
> Software RDMA devices (rxe, siw) continue using virtual addressing.
> 
> The series includes MR registration support for bvec arrays,
> enabling iWARP devices and the force_mr debug parameter. The MR
> path reuses existing ib_map_mr_sg() infrastructure by constructing
> a synthetic scatterlist from the bvec DMA addresses.
> 
> The final patch adds the first consumer for the new API: svcrdma.
> It replaces its scatterlist conversion code, significantly reducing
> the svc_rdma_rw_ctxt structure size. The previous implementation
> embedded a scatterlist array of RPCSVC_MAXPAGES entries (4KB or
> more per context); the new implementation uses a pointer to a
> dynamically allocated bvec array.
> 
> Based on v6.19-rc5.
> 
> Chuck Lever (4):
>   RDMA/core: add bio_vec based RDMA read/write API
>   RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
>   RDMA/core: add MR support for bvec-based RDMA operations
>   svcrdma: use bvec-based RDMA read/write API

Amazing, thanks a lot.

I haven't done a deep-dive review yet, but from what I've seen so far
it looks solid and well put together.

Thanks.

> 
>  drivers/infiniband/core/rw.c      | 492 ++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h           |  35 +++
>  include/rdma/rw.h                 |  26 ++
>  net/sunrpc/xprtrdma/svc_rdma_rw.c | 115 ++++---
>  4 files changed, 608 insertions(+), 60 deletions(-)
> 
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API
  2026-01-14 14:39 ` [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API Chuck Lever
@ 2026-01-15  9:51   ` Leon Romanovsky
  2026-01-15 16:29   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-15  9:51 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever

On Wed, Jan 14, 2026 at 09:39:48AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Convert svcrdma to the bvec-based RDMA API introduced earlier in
> this series.
> 
> The bvec-based RDMA API eliminates the intermediate scatterlist
> conversion step, allowing direct DMA mapping from bio_vec arrays.
> This simplifies the svc_rdma_rw_ctxt structure by removing the
> inline scatterlist and chained SG table management.
> 
> The structure size reduction is significant: the previous inline
> scatterlist array of RPCSVC_MAXPAGES entries (4KB or more) is
> replaced with a pointer to a dynamically allocated bvec array,
> bringing the fixed structure size down to approximately 100 bytes.
> 
> The bvec API handles all device types internally, including iWARP
> devices which require memory registration. No explicit fallback
> path is needed.
> 
> Signed-off-by: cel@kernel.org

Something went wrong here.

Thanks

> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_rw.c | 115 ++++++++++++++----------------
>  1 file changed, 55 insertions(+), 60 deletions(-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 0/4] Add a bio_vec based API to core/rw.c
  2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
                   ` (4 preceding siblings ...)
  2026-01-15  9:50 ` [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Leon Romanovsky
@ 2026-01-15 15:46 ` Christoph Hellwig
  5 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-15 15:46 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Wed, Jan 14, 2026 at 09:39:44AM -0500, Chuck Lever wrote:
> The final patch adds the first consumer for the new API: svcrdma.
> It replaces its scatterlist conversion code, significantly reducing
> the svc_rdma_rw_ctxt structure size. The previous implementation
> embedded a scatterlist array of RPCSVC_MAXPAGES entries (4KB or
> more per context); the new implementation uses a pointer to a
> dynamically allocated bvec array.

But isn't that comparison a little unfair when a preallocated/embedded
data structure is replaced with a dynamic allocation?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-14 14:39 ` [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
@ 2026-01-15 15:53   ` Christoph Hellwig
  2026-01-16 11:33     ` Leon Romanovsky
  2026-01-16 21:24     ` Leon Romanovsky
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-15 15:53 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

> +static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
> +		struct ib_qp *qp, const struct bio_vec *bvec, u32 offset,
> +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> +{
> +	struct ib_device *dev = qp->pd->device;
> +	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
> +	struct bio_vec adjusted = *bvec;
> +	u64 dma_addr;
> +
> +	ctx->nr_ops = 1;
> +
> +	if (offset) {
> +		adjusted.bv_offset += offset;
> +		adjusted.bv_len -= offset;
> +	}

Hmm, if we need to split/adjust bvecs, it might be better to
pass a bvec_iter and let the iter handle the iteration?
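[For illustration, the front-trimming being discussed amounts to the
following. This is a userspace sketch with a simplified stand-in struct,
not the kernel's struct bio_vec or bvec_iter API; a real bvec_iter would
fold the adjustment into the iteration instead of copying the entry.]

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct bio_vec fields. */
struct sketch_bvec {
	unsigned int bv_offset;	/* byte offset into the page */
	unsigned int bv_len;	/* number of bytes in this entry */
};

/*
 * Trim 'offset' bytes from the front of a bvec, as the quoted
 * rdma_rw_init_single_wr_bvec() hunk does before DMA-mapping it.
 */
static struct sketch_bvec bvec_advance(struct sketch_bvec bv,
				       unsigned int offset)
{
	bv.bv_offset += offset;
	bv.bv_len -= offset;
	return bv;
}
```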

> +static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> +		const struct bio_vec *bvec, u32 nr_bvec, u32 offset,
> +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)

Much of this seems to duplicate rdma_rw_init_map_wrs.  I wonder if
having a low-level helper that gets either a scatterlist or bvec array
(or bvec_iter) and just has different inner loops for them would
make more sense?  If not we'll just need to migrate everyone off
the scatterlist version ASAP :)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 2/4] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  2026-01-14 14:39 ` [PATCH v1 2/4] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
@ 2026-01-15 15:58   ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-15 15:58 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

> +	/* Calculate total transfer length */
> +	for (i = 0; i < nr_bvec; i++) {
> +		size_t len = (i == 0 && offset) ?
> +			     bvec[i].bv_len - offset : bvec[i].bv_len;
> +
> +		if (check_add_overflow(total_len, len, &total_len))
> +			return -EINVAL;
> +	}

The caller should usually have that value, so maybe pass it in?
(also using a bvec_iter would fix that)
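[The total-length loop under discussion can be modeled in userspace like
this; check_add_overflow() in the kernel is a wrapper around the compiler
builtin used here, and the struct is a simplified stand-in, not the real
bio_vec.]

```c
#include <assert.h>
#include <stddef.h>

struct sketch_bvec {
	unsigned int bv_offset;
	unsigned int bv_len;
};

/*
 * Sum the transfer length across a bvec array, with the first entry
 * shortened by 'offset'.  Returns -1 if the sum overflows, mirroring
 * the check_add_overflow() test in the quoted hunk.
 */
static long long bvec_total_len(const struct sketch_bvec *bvec,
				unsigned int nr_bvec, unsigned int offset)
{
	size_t total = 0;
	unsigned int i;

	for (i = 0; i < nr_bvec; i++) {
		size_t len = bvec[i].bv_len;

		if (i == 0)
			len -= offset;
		if (__builtin_add_overflow(total, len, &total))
			return -1;
	}
	return (long long)total;
}
```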

> +	ctx->nr_ops = DIV_ROUND_UP(nr_bvec, max_sge);

I don't think this part is correct, but see more below.

> +	/* Link all bvecs into the IOVA space */
> +	for (i = 0; i < nr_bvec; i++) {
> +		const struct bio_vec *bv = &bvec[i];
> +		phys_addr_t phys = bvec_phys(bv);
> +		size_t len = bv->bv_len;
> +
> +		if (i == 0 && offset) {
> +			phys += offset;
> +			len -= offset;
> +		}
> +
> +		ret = dma_iova_link(dma_dev, &ctx->iova.state, phys,
> +				    mapped_len, len, dir, 0);
> +		if (ret)
> +			goto out_destroy;
> +
> +		mapped_len += len;

This creates a single contiguous IOVA for all the passed in bvecs,
even if they had non-contiguous host physical addresses.

> +	/* Build SGEs using offsets into the contiguous IOVA range */
> +	mapped_len = 0;
> +	for (i = 0; i < ctx->nr_ops; i++) {
> +		struct ib_rdma_wr *rdma_wr = &ctx->iova.wrs[i];
> +		u32 nr_sge = min(nr_bvec - bvec_idx, max_sge);
> +
> +		if (dir == DMA_TO_DEVICE)
> +			rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
> +		else
> +			rdma_wr->wr.opcode = IB_WR_RDMA_READ;
> +		rdma_wr->remote_addr = remote_addr + mapped_len;
> +		rdma_wr->rkey = rkey;
> +		rdma_wr->wr.num_sge = nr_sge;
> +		rdma_wr->wr.sg_list = sge;

... which means that here you just need a single WR and SGE to register
all of them.  No need to split the IOVA space up again.
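[A rough userspace model of that observation, not kernel code: once
dma_iova_link() has placed every bvec back-to-back in one IOVA range,
the whole transfer is describable by a single scatter/gather element,
no matter how many bvecs fed into it.]

```c
#include <assert.h>

/* Simplified stand-in for an RDMA scatter/gather element. */
struct sketch_sge {
	unsigned long long addr;
	unsigned int length;
};

/*
 * All linked segments share one contiguous IOVA range starting at
 * 'iova_base', so one SGE covers the whole mapping; there is no need
 * to split the range back up per max_send_sge.
 */
static struct sketch_sge coalesce_iova(unsigned long long iova_base,
				       const unsigned int *seg_lens,
				       unsigned int nr_segs)
{
	struct sketch_sge sge = { .addr = iova_base, .length = 0 };
	unsigned int i;

	for (i = 0; i < nr_segs; i++)
		sge.length += seg_lens[i];
	return sge;
}
```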


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 3/4] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-14 14:39 ` [PATCH v1 3/4] RDMA/core: add MR support for bvec-based " Chuck Lever
@ 2026-01-15 15:58   ` Christoph Hellwig
  2026-01-16 11:42   ` Leon Romanovsky
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-15 15:58 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Wed, Jan 14, 2026 at 09:39:47AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
> Region registration is required. This prevents iWARP devices from
> using the bvec path, since iWARP requires MR registration for RDMA
> READ operations. The force_mr debug parameter is also unusable with
> bvec input.
> 
> Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec
> arrays. The approach creates a synthetic scatterlist populated with
> DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg()
> infrastructure. This avoids driver changes while keeping the
> implementation small.
> 
> The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup.
> On destroy, the MRs are returned to the pool and the bvec DMA
> mappings are released using the stored addresses.

I wish we'd just have a bvec based MR API, and could use that.
But I don't want to hold this work back, because of that.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API
  2026-01-14 14:39 ` [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API Chuck Lever
  2026-01-15  9:51   ` Leon Romanovsky
@ 2026-01-15 16:29   ` Christoph Hellwig
  2026-01-15 18:29     ` Chuck Lever
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-15 16:29 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Wed, Jan 14, 2026 at 09:39:48AM -0500, Chuck Lever wrote:
> The structure size reduction is significant: the previous inline
> scatterlist array of RPCSVC_MAXPAGES entries (4KB or more) is
> replaced with a pointer to a dynamically allocated bvec array,
> bringing the fixed structure size down to approximately 100 bytes.

Can you explain why this switches to the dynamic allocation?
To me that seems like a separate trade-off to bvec vs scatterlist.

>   * Each WR chain handles a single contiguous server-side buffer,
> - * because scatterlist entries after the first have to start on
> + * because bio_vec entries after the first have to start on
>   * page alignment. xdr_buf iovecs cannot guarantee alignment.

For both the old and new version, can you explain they have to
start on a page boundary?  Because that's not how scatterlists or
bvecs work in general.  I guess this just documents the sunrpc
limits, but somehow projects it to these structures?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API
  2026-01-15 16:29   ` Christoph Hellwig
@ 2026-01-15 18:29     ` Chuck Lever
  2026-01-15 21:53       ` Chuck Lever
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2026-01-15 18:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Leon Romanovsky, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever



On Thu, Jan 15, 2026, at 11:29 AM, Christoph Hellwig wrote:
> On Wed, Jan 14, 2026 at 09:39:48AM -0500, Chuck Lever wrote:
>> The structure size reduction is significant: the previous inline
>> scatterlist array of RPCSVC_MAXPAGES entries (4KB or more) is
>> replaced with a pointer to a dynamically allocated bvec array,
>> bringing the fixed structure size down to approximately 100 bytes.
>
> Can you explain why this switches to the dynamic allocation?
> To me that seems like a separate trade-off to bvec vs scatterlist.

The current implementation keeps a "default size" SGL in the
context, and chains more on if a larger SGL size is needed.
This keeps the size of the context reasonable while still
enabling large requests.

For bvec support, there's no concept of bvec array chaining.
We always have to allocate the exact size of the bvec array
that is needed for the request, otherwise we'd have to keep
a maximum-sized biovec array in every context.

Now, I suppose that later on we will be able to adopt the use of
the rqstp->rq_bvec, when the full NFSD stack supports biovecs,
and this allocation could be replaced, at least in some cases.
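[The "exact size" computation referred to above is the one in
svc_rdma_build_read_segment(); a userspace sketch, assuming a 4 KB
page size:]

```c
#include <assert.h>

#define SK_PAGE_SIZE 4096u
#define SK_PAGE_ALIGN(x) (((x) + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1))

/*
 * Number of bvec entries needed for one segment: enough page-sized
 * entries to cover 'len' bytes starting 'pageoff' bytes into the
 * first page.
 */
static unsigned int segment_nr_bvec(unsigned int pageoff, unsigned int len)
{
	return SK_PAGE_ALIGN(pageoff + len) / SK_PAGE_SIZE;
}
```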


>>   * Each WR chain handles a single contiguous server-side buffer,
>> - * because scatterlist entries after the first have to start on
>> + * because bio_vec entries after the first have to start on
>>   * page alignment. xdr_buf iovecs cannot guarantee alignment.
>
> For both the old and new version, can you explain they have to
> start on a page boundary?  Because that's not how scatterlists or
> bvecs work in general.  I guess this just documents the sunrpc
> limits, but somehow projects it to these structures?

It's historic, and probably related to the sunrpc implementation.
I didn't question it when doing the conversion, so I'll have to
try to remember exactly why.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API
  2026-01-15 18:29     ` Chuck Lever
@ 2026-01-15 21:53       ` Chuck Lever
  2026-01-16  9:38         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2026-01-15 21:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Leon Romanovsky, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever



On Thu, Jan 15, 2026, at 1:29 PM, Chuck Lever wrote:
> On Thu, Jan 15, 2026, at 11:29 AM, Christoph Hellwig wrote:
>> On Wed, Jan 14, 2026 at 09:39:48AM -0500, Chuck Lever wrote:
>>>   * Each WR chain handles a single contiguous server-side buffer,
>>> - * because scatterlist entries after the first have to start on
>>> + * because bio_vec entries after the first have to start on
>>>   * page alignment. xdr_buf iovecs cannot guarantee alignment.
>>
>> For both the old and new version, can you explain they have to
>> start on a page boundary?  Because that's not how scatterlists or
>> bvecs work in general.  I guess this just documents the sunrpc
>> limits, but somehow projects it to these structures?
>
> It's historic, and probably related to the sunrpc implementation.
> I didn't question it when doing the conversion, so I'll have to
> try to remember exactly why.

These are contiguous because the xdr_buf "pages" field is an array
of "struct page *" pointers. So these don't have per-entry offsets.
There is one "page_offset" field in the xdr_buf that applies only
to the first entry in that array.

Therefore the payload buffer starts at an offset of zero or greater
in the first page in that array, but after that, the buffer continues
across the boundaries of each page from offset 4095 on page N to
byte 0 of page N+1, for all N.

The comment is a little misleading -- it documents an assumption
that is due to each entry of the xdr_buf pages array being "struct
page *" and there not being an offset field for each entry.

We can certainly clarify that as part of this bvec conversion series.
And (much) later on, when the head, tail, and pages fields in "struct
xdr_buf" are replaced with a single bio_vec array, this issue goes
away completely.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API
  2026-01-15 21:53       ` Chuck Lever
@ 2026-01-16  9:38         ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-16  9:38 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jason Gunthorpe, Leon Romanovsky, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Thu, Jan 15, 2026 at 04:53:23PM -0500, Chuck Lever wrote:
> These are contiguous because the xdr_buf "pages" field is an array
> of "struct page *" pointers. So these don't have per-entry offsets.
> There is one "page_offset" field in the xdr_buf that applies only
> to the first entry in that array.
> 
> Therefore the payload buffer starts at an offset of zero or greater
> in the first page in that array, but after that, the buffer continues
> across the boundaries of each page from offset 4095 on page N to
> byte 0 of page N+1, for all N.
> 
> The comment is a little misleading -- it documents an assumption
> that is due to each entry of the xdr_buf pages array being "struct
> page *" and there not being an offset field for each entry.

Yeah.  I also realized both the classic RDMA memory registrations,
and the IOVA based DMA mapping requires the subsequent entries to
be at least page aligned.  There is a Mellanox-specific MR type that
doesn't require that, but that obviously doesn't help iWarp, and the
IOVA coalescing is nice to have as well.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-15 15:53   ` Christoph Hellwig
@ 2026-01-16 11:33     ` Leon Romanovsky
  2026-01-16 14:52       ` Christoph Hellwig
  2026-01-16 21:24     ` Leon Romanovsky
  1 sibling, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-16 11:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever

On Thu, Jan 15, 2026 at 04:53:34PM +0100, Christoph Hellwig wrote:
> > +static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
> > +		struct ib_qp *qp, const struct bio_vec *bvec, u32 offset,
> > +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> > +{
> > +	struct ib_device *dev = qp->pd->device;
> > +	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
> > +	struct bio_vec adjusted = *bvec;
> > +	u64 dma_addr;
> > +
> > +	ctx->nr_ops = 1;
> > +
> > +	if (offset) {
> > +		adjusted.bv_offset += offset;
> > +		adjusted.bv_len -= offset;
> > +	}
> 
> Hmm, if we need to split/adjust bvecs, it might be better to
> pass a bvec_iter and let the iter handle the iteration?
> 
> > +static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> > +		const struct bio_vec *bvec, u32 nr_bvec, u32 offset,
> > +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> 
> Much of this seems to duplicate rdma_rw_init_map_wrs.  I wonder if
> having a low-level helper that gets either a scatterlist or bvec array
> (or bvec_iter) and just has different inner loops for them would
> make more sense?  If not we'll just need to migrate everyone off
> the scatterlist version ASAP :)

I had short offline discussion with Jason about this series and both of
us would be more than happy to get rid of "scatterlist version".

Thanks

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 3/4] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-14 14:39 ` [PATCH v1 3/4] RDMA/core: add MR support for bvec-based " Chuck Lever
  2026-01-15 15:58   ` Christoph Hellwig
@ 2026-01-16 11:42   ` Leon Romanovsky
  2026-01-16 14:50     ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-16 11:42 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever

On Wed, Jan 14, 2026 at 09:39:47AM -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
> Region registration is required. This prevents iWARP devices from
> using the bvec path, since iWARP requires MR registration for RDMA
> READ operations. The force_mr debug parameter is also unusable with
> bvec input.

I am not very familiar with iWARP. Do you know why we need a special
case here? Is there a reason we cannot avoid using scatterlists for
iWARP as well, now or in the future?

Thanks

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1 3/4] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-16 11:42   ` Leon Romanovsky
@ 2026-01-16 14:50     ` Christoph Hellwig
  2026-01-16 21:16       ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-16 14:50 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Chuck Lever, Jason Gunthorpe, Christoph Hellwig, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Fri, Jan 16, 2026 at 01:42:36PM +0200, Leon Romanovsky wrote:
> On Wed, Jan 14, 2026 at 09:39:47AM -0500, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
> > Region registration is required. This prevents iWARP devices from
> > using the bvec path, since iWARP requires MR registration for RDMA
> > READ operations. The force_mr debug parameter is also unusable with
> > bvec input.
> 
> I am not very familiar with iWARP. Do you know why we need a special
> case here? Is there a reason we cannot avoid using scatterlists for
> iWARP as well, now or in the future?

iWARP must use MRs for the destination of RDMA READ operations, but the
core RW code can also optionally use MRs for other things.  So to support
that natively here we'd need a bvec-based version of ib_map_mr_sg, which
would be really nice to have for the storage host drivers anyway; until
then the scatterlist emulation here will do.  Implementing it might take
a while, though, as ib_map_mr_sg is a very thin wrapper around a call
into the low-level driver.

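The scatterlist emulation mentioned above — the cover letter describes it as constructing a synthetic scatterlist from the bvec DMA addresses — can be illustrated with a small userspace sketch. The struct names and the helper below are mock stand-ins for illustration only; the kernel's struct bio_vec, struct scatterlist, and ib_map_mr_sg() are considerably more involved:

```c
#include <assert.h>
#include <stdint.h>

/* Mock stand-ins for the kernel's mapped bio_vec and scatterlist entries. */
struct mock_bvec {
	uint64_t dma_addr;	/* DMA address of this mapped segment */
	uint32_t len;		/* length of the segment in bytes */
};

struct mock_sg {
	uint64_t dma_address;
	uint32_t dma_len;
};

/*
 * Build one synthetic scatterlist entry per mapped bvec segment so that
 * an ib_map_mr_sg()-style consumer can walk it unchanged.  Returns the
 * number of entries filled, or -1 if the destination array is too small.
 */
static int bvec_to_synthetic_sg(const struct mock_bvec *bv, int nr_bv,
				struct mock_sg *sg, int nr_sg)
{
	if (nr_bv > nr_sg)
		return -1;
	for (int i = 0; i < nr_bv; i++) {
		sg[i].dma_address = bv[i].dma_addr;
		sg[i].dma_len = bv[i].len;
	}
	return nr_bv;
}
```

A native bvec-based registration path would presumably skip this intermediate array and hand the segments to the low-level driver directly, which is the future work discussed here.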

* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-16 11:33     ` Leon Romanovsky
@ 2026-01-16 14:52       ` Christoph Hellwig
  2026-01-16 14:57         ` Chuck Lever
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-16 14:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Chuck Lever, Jason Gunthorpe, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Fri, Jan 16, 2026 at 01:33:10PM +0200, Leon Romanovsky wrote:
> > Much of this seems to duplicate rdma_rw_init_map_wrs.  I wonder if
> > having a low-level helper that gets either a scatterlist or bvec array
> > (or bvec_iter) and just has different inner loops for them would
> > make more sense?  If not we'll just need to migrate everyone off
> > the scatterlist version ASAP :)
> 
> I had a short offline discussion with Jason about this series, and both
> of us would be more than happy to get rid of the scatterlist version.

rdma_rw_ctx_init is used by isert, srpt, nvmet, ksmbd and
the sunrpc server side.  I don't think any of them should be
super complicated to convert, but it will need a fair amount of
testing resources.


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-16 14:52       ` Christoph Hellwig
@ 2026-01-16 14:57         ` Chuck Lever
  2026-01-16 21:14           ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Chuck Lever @ 2026-01-16 14:57 UTC (permalink / raw)
  To: Christoph Hellwig, Leon Romanovsky
  Cc: Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever



On Fri, Jan 16, 2026, at 9:52 AM, Christoph Hellwig wrote:
> On Fri, Jan 16, 2026 at 01:33:10PM +0200, Leon Romanovsky wrote:
>> > Much of this seems to duplicate rdma_rw_init_map_wrs.  I wonder if
>> > having a low-level helper that gets either a scatterlist or bvec array
>> > (or bvec_iter) and just has different inner loops for them would
>> > make more sense?  If not we'll just need to migrate everyone off
>> > the scatterlist version ASAP :)
>> 
>> I had a short offline discussion with Jason about this series, and both
>> of us would be more than happy to get rid of the scatterlist version.
>
> rdma_rw_ctx_init is used by isert, srpt, nvmet, ksmbd and
> the sunrpc server side.  I don't think any of them should be
> super complicated to convert, but it will need a fair amount of
> testing resources.

My preference is to keep the scope of this series narrow --
introduce the new API, and add one consumer for it. The other
conversions can then each be done by domain experts as they
have time.

I have no strong feeling for or against eventually removing
SGL support entirely from rw.c.

-- 
Chuck Lever


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-16 14:57         ` Chuck Lever
@ 2026-01-16 21:14           ` Leon Romanovsky
  0 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-16 21:14 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jason Gunthorpe, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever

On Fri, Jan 16, 2026 at 09:57:57AM -0500, Chuck Lever wrote:
> 
> 
> On Fri, Jan 16, 2026, at 9:52 AM, Christoph Hellwig wrote:
> > On Fri, Jan 16, 2026 at 01:33:10PM +0200, Leon Romanovsky wrote:
> >> > Much of this seems to duplicate rdma_rw_init_map_wrs.  I wonder if
> >> > having a low-level helper that gets either a scatterlist or bvec array
> >> > (or bvec_iter) and just has different inner loops for them would
> >> > make more sense?  If not we'll just need to migrate everyone off
> >> > the scatterlist version ASAP :)
> >> 
> >> I had a short offline discussion with Jason about this series, and both
> >> of us would be more than happy to get rid of the scatterlist version.
> >
> > rdma_rw_ctx_init is used by isert, srpt, nvmet, ksmbd and
> > the sunrpc server side.  I don't think any of them should be
> > super complicated to convert, but it will need a fair amount of
> > testing resources.
> 
> My preference is to keep the scope of this series narrow --
> introduce the new API, and add one consumer for it. The other
> conversions can then each be done by domain experts as they
> have time.

Of course, I'm only outlining the direction in which the RDMA subsystem
is expected to evolve.

> 
> I have no strong feeling for or against eventually removing
> SGL support entirely from rw.c.
> 
> -- 
> Chuck Lever


* Re: [PATCH v1 3/4] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-16 14:50     ` Christoph Hellwig
@ 2026-01-16 21:16       ` Leon Romanovsky
  0 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-16 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever

On Fri, Jan 16, 2026 at 03:50:27PM +0100, Christoph Hellwig wrote:
> On Fri, Jan 16, 2026 at 01:42:36PM +0200, Leon Romanovsky wrote:
> > On Wed, Jan 14, 2026 at 09:39:47AM -0500, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
> > > Region registration is required. This prevents iWARP devices from
> > > using the bvec path, since iWARP requires MR registration for RDMA
> > > READ operations. The force_mr debug parameter is also unusable with
> > > bvec input.
> > 
> > I am not very familiar with iWARP. Do you know why we need a special
> > case here? Is there a reason we cannot avoid using scatterlists for
> > iWARP as well, now or in the future?
> 
> iWARP must use MRs for the destination of RDMA READ operations, but the
> core RW code can also optionally use MRs for other things.  So to support
> that natively here we'd need a bvec-based version of ib_map_mr_sg, which
> would be really nice to have for the storage host drivers anyway; until
> then the scatterlist emulation here will do.  Implementing it might take
> a while, though, as ib_map_mr_sg is a very thin wrapper around a call
> into the low-level driver.

It is on my roadmap, but as you said, it will take time :(.

Thanks


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-15 15:53   ` Christoph Hellwig
  2026-01-16 11:33     ` Leon Romanovsky
@ 2026-01-16 21:24     ` Leon Romanovsky
  2026-01-16 21:49       ` Chuck Lever
  1 sibling, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-16 21:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever

On Thu, Jan 15, 2026 at 04:53:34PM +0100, Christoph Hellwig wrote:
> > +static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
> > +		struct ib_qp *qp, const struct bio_vec *bvec, u32 offset,
> > +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> > +{
> > +	struct ib_device *dev = qp->pd->device;
> > +	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
> > +	struct bio_vec adjusted = *bvec;
> > +	u64 dma_addr;
> > +
> > +	ctx->nr_ops = 1;
> > +
> > +	if (offset) {
> > +		adjusted.bv_offset += offset;
> > +		adjusted.bv_len -= offset;
> > +	}
> 
> Hmm, if we need to split/adjust bvecs, it might be better to
> pass a bvec_iter and let the iter handle the iteration?

It would also be worthwhile to support P2P scenarios in this flow.

Thanks

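For reference, the offset adjustment in the quoted hunk does by hand, for a single segment, what a bvec_iter-style advance generalizes across an array of segments. A hedged userspace sketch of that generalization follows; the struct and helper are mock illustrations (the kernel's bvec_iter tracks the equivalent state in bi_idx/bi_bvec_done without modifying the array):

```c
#include <assert.h>
#include <stdint.h>

/* Mock stand-in for the kernel's bio_vec; illustration only. */
struct mock_bvec {
	uint32_t bv_offset;
	uint32_t bv_len;
};

/*
 * Advance an array of segments past 'offset' bytes: skip fully consumed
 * segments and trim the first partially consumed one.  Returns the index
 * of the first segment still carrying data and leaves the trimmed view
 * in *first, or returns -1 if the offset runs past the end of the array.
 */
static int mock_bvec_advance(const struct mock_bvec *bv, int nr,
			     uint32_t offset, struct mock_bvec *first)
{
	int i = 0;

	while (i < nr && offset >= bv[i].bv_len)
		offset -= bv[i++].bv_len;
	if (i == nr)
		return -1;
	*first = bv[i];
	first->bv_offset += offset;
	first->bv_len -= offset;
	return i;
}
```

The single-segment case in the patch is the i == 0 instance of this; passing an iterator would let the same logic cover multi-segment offsets without open-coding the adjustment at each call site.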

* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-16 21:24     ` Leon Romanovsky
@ 2026-01-16 21:49       ` Chuck Lever
  2026-01-17 16:20         ` Leon Romanovsky
  2026-01-19  6:52         ` Christoph Hellwig
  0 siblings, 2 replies; 30+ messages in thread
From: Chuck Lever @ 2026-01-16 21:49 UTC (permalink / raw)
  To: Leon Romanovsky, Christoph Hellwig
  Cc: Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever



On Fri, Jan 16, 2026, at 4:24 PM, Leon Romanovsky wrote:
> On Thu, Jan 15, 2026 at 04:53:34PM +0100, Christoph Hellwig wrote:
>> > +static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
>> > +		struct ib_qp *qp, const struct bio_vec *bvec, u32 offset,
>> > +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
>> > +{
>> > +	struct ib_device *dev = qp->pd->device;
>> > +	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
>> > +	struct bio_vec adjusted = *bvec;
>> > +	u64 dma_addr;
>> > +
>> > +	ctx->nr_ops = 1;
>> > +
>> > +	if (offset) {
>> > +		adjusted.bv_offset += offset;
>> > +		adjusted.bv_len -= offset;
>> > +	}
>> 
>> Hmm, if we need to split/adjust bvecs, it might be better to
>> pass a bvec_iter and let the iter handle the iteration?
>
> It would also be worthwhile to support P2P scenarios in this flow.

I can add some code to this series to do that, but I don't believe
I have facilities to test it.

-- 
Chuck Lever


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-16 21:49       ` Chuck Lever
@ 2026-01-17 16:20         ` Leon Romanovsky
  2026-01-19  6:52         ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-17 16:20 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jason Gunthorpe, linux-rdma, linux-nfs,
	NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Chuck Lever

On Fri, Jan 16, 2026 at 04:49:06PM -0500, Chuck Lever wrote:
> 
> 
> On Fri, Jan 16, 2026, at 4:24 PM, Leon Romanovsky wrote:
> > On Thu, Jan 15, 2026 at 04:53:34PM +0100, Christoph Hellwig wrote:
> >> > +static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
> >> > +		struct ib_qp *qp, const struct bio_vec *bvec, u32 offset,
> >> > +		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> >> > +{
> >> > +	struct ib_device *dev = qp->pd->device;
> >> > +	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
> >> > +	struct bio_vec adjusted = *bvec;
> >> > +	u64 dma_addr;
> >> > +
> >> > +	ctx->nr_ops = 1;
> >> > +
> >> > +	if (offset) {
> >> > +		adjusted.bv_offset += offset;
> >> > +		adjusted.bv_len -= offset;
> >> > +	}
> >> 
> >> Hmm, if we need to split/adjust bvecs, it might be better to
> >> pass a bvec_iter and let the iter handle the iteration?
> >
> > It would also be worthwhile to support P2P scenarios in this flow.
> 
> I can add some code to this series to do that, but I don't believe
> I have facilities to test it.

If it is possible, let's add it.

Thanks

> 
> -- 
> Chuck Lever


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-16 21:49       ` Chuck Lever
  2026-01-17 16:20         ` Leon Romanovsky
@ 2026-01-19  6:52         ` Christoph Hellwig
  2026-01-19 10:28           ` Leon Romanovsky
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-19  6:52 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Fri, Jan 16, 2026 at 04:49:06PM -0500, Chuck Lever wrote:
> >> Hmm, if we need to split/adjust bvecs, it might be better to
> >> pass a bvec_iter and let the iter handle the iteration?
> >
> > It would also be worthwhile to support P2P scenarios in this flow.
> 
> I can add some code to this series to do that, but I don't believe
> I have facilities to test it.

Please don't add untested code.  If Leon wants the P2P support and
volunteers to test it, sure.  But let's not merge it without being
tested.  And at least for NFS I don't really see how P2P would easily
fit in anyway.


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-19  6:52         ` Christoph Hellwig
@ 2026-01-19 10:28           ` Leon Romanovsky
  2026-01-19 12:03             ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-19 10:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever

On Mon, Jan 19, 2026 at 07:52:12AM +0100, Christoph Hellwig wrote:
> On Fri, Jan 16, 2026 at 04:49:06PM -0500, Chuck Lever wrote:
> > >> Hmm, if we need to split/adjust bvecs, it might be better to
> > >> pass a bvec_iter and let the iter handle the iteration?
> > >
> > > It would also be worthwhile to support P2P scenarios in this flow.
> > 
> > I can add some code to this series to do that, but I don't believe
> > I have facilities to test it.
> 
> Please don't add untested code.  If Leon wants the P2P support and
> volunteers to test it, sure.

I can do it with some guidance on how to set up the system.

> But let's not merge it without being tested.  And at least for NFS I don't
> really see how P2P would easily fit in anyway.

Chuck is proposing a new IB/core API that will also be used by NVMe.
Wouldn't P2P be useful in the general case?

Thanks



* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-19 10:28           ` Leon Romanovsky
@ 2026-01-19 12:03             ` Christoph Hellwig
  2026-01-19 14:37               ` Chuck Lever
  2026-01-19 18:34               ` Leon Romanovsky
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2026-01-19 12:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Chuck Lever, Jason Gunthorpe, linux-rdma,
	linux-nfs, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Chuck Lever

On Mon, Jan 19, 2026 at 12:28:57PM +0200, Leon Romanovsky wrote:
> > > I can add some code to this series to do that, but I don't believe
> > > I have facilities to test it.
> > 
> > Please don't add untested code.  If Leon wants the P2P support and
> > volunteers to test it, sure.
> 
> I can do it with some guidance on how to set up the system.
> 
> > But let's not merge it without being tested.  And at least for NFS I don't
> > really see how P2P would easily fit in anyway.
> 
> Chuck is proposing a new IB/core API that will also be used by NVMe.

Hopefully eventually, yes.  Not in this series, though.

> Wouldn't P2P be useful in the general case?

Well, P2P into a CMB might work in nfsd in theory now that there is
direct I/O support, but it'll require a lot of work.

So if you want to help to convert nvmet, the series to do that would
be the right place to add P2P support, as with that we can actually
test it.



* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-19 12:03             ` Christoph Hellwig
@ 2026-01-19 14:37               ` Chuck Lever
  2026-01-19 18:34               ` Leon Romanovsky
  1 sibling, 0 replies; 30+ messages in thread
From: Chuck Lever @ 2026-01-19 14:37 UTC (permalink / raw)
  To: Christoph Hellwig, Leon Romanovsky
  Cc: Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever



On Mon, Jan 19, 2026, at 7:03 AM, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 12:28:57PM +0200, Leon Romanovsky wrote:
>> > > I can add some code to this series to do that, but I don't believe
>> > > I have facilities to test it.
>> > 
>> > Please don't add untested code.  If Leon wants the P2P support and
>> > volunteers to test it, sure.
>> 
>> I can do it with some guidance on how to set up the system.
>> 
>> > But let's not merge it without being tested.  And at least for NFS I don't
>> > really see how P2P would easily fit in anyway.
>> 
>> Chuck is proposing a new IB/core API that will also be used by NVMe.
>
> Hopefully eventually, yes.  Not in this series, though.

I can understand that P2P is not in the narrow scope of the
existing series. I have a patch now, but I'll postpone it
until later. Obviously it's not something I would feel
comfortable merging without testing.


-- 
Chuck Lever


* Re: [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-19 12:03             ` Christoph Hellwig
  2026-01-19 14:37               ` Chuck Lever
@ 2026-01-19 18:34               ` Leon Romanovsky
  1 sibling, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2026-01-19 18:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Jason Gunthorpe, linux-rdma, linux-nfs, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, Chuck Lever

On Mon, Jan 19, 2026 at 01:03:11PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 12:28:57PM +0200, Leon Romanovsky wrote:
> > > > I can add some code to this series to do that, but I don't believe
> > > > I have facilities to test it.
> > > 
> > > Please don't add untested code.  If Leon wants the P2P support and
> > > volunteers to test it, sure.
> > 
> > I can do it with some guidance on how to set up the system.
> > 
> > > But let's not merge it without being tested.  And at least for NFS I don't
> > > really see how P2P would easily fit in anyway.
> > 
> > Chuck is proposing a new IB/core API that will also be used by NVMe.
> 
> Hopefully eventually, yes.  Not in this series, though.

Fair enough.

> 
> > Wouldn't P2P be useful in the general case?
> 
> Well, P2P into a CMB might work in nfsd in theory now that there is
> direct I/O support, but it'll require a lot of work.
> 
> So if you want to help to convert nvmet, the series to do that would
> be the right place to add P2P support, as with that we can actually
> test it.

If both of you plan to attend LSF/MM this year, and I receive an
invitation as well, we can discuss the future p2p roadmap in person
and how we want to move forward.

Most of the items from our discussion last year [1] have already been
completed or are on track for this or the next development cycle. The
remaining big item is removing the scatterlist from RDMA.

Thanks

[1] https://lore.kernel.org/all/20250122071600.GC10702@unreal/




end of thread, other threads:[~2026-01-19 18:34 UTC | newest]

Thread overview: 30+ messages
2026-01-14 14:39 [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Chuck Lever
2026-01-14 14:39 ` [PATCH v1 1/4] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
2026-01-15 15:53   ` Christoph Hellwig
2026-01-16 11:33     ` Leon Romanovsky
2026-01-16 14:52       ` Christoph Hellwig
2026-01-16 14:57         ` Chuck Lever
2026-01-16 21:14           ` Leon Romanovsky
2026-01-16 21:24     ` Leon Romanovsky
2026-01-16 21:49       ` Chuck Lever
2026-01-17 16:20         ` Leon Romanovsky
2026-01-19  6:52         ` Christoph Hellwig
2026-01-19 10:28           ` Leon Romanovsky
2026-01-19 12:03             ` Christoph Hellwig
2026-01-19 14:37               ` Chuck Lever
2026-01-19 18:34               ` Leon Romanovsky
2026-01-14 14:39 ` [PATCH v1 2/4] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
2026-01-15 15:58   ` Christoph Hellwig
2026-01-14 14:39 ` [PATCH v1 3/4] RDMA/core: add MR support for bvec-based " Chuck Lever
2026-01-15 15:58   ` Christoph Hellwig
2026-01-16 11:42   ` Leon Romanovsky
2026-01-16 14:50     ` Christoph Hellwig
2026-01-16 21:16       ` Leon Romanovsky
2026-01-14 14:39 ` [PATCH v1 4/4] svcrdma: use bvec-based RDMA read/write API Chuck Lever
2026-01-15  9:51   ` Leon Romanovsky
2026-01-15 16:29   ` Christoph Hellwig
2026-01-15 18:29     ` Chuck Lever
2026-01-15 21:53       ` Chuck Lever
2026-01-16  9:38         ` Christoph Hellwig
2026-01-15  9:50 ` [PATCH v1 0/4] Add a bio_vec based API to core/rw.c Leon Romanovsky
2026-01-15 15:46 ` Christoph Hellwig
