public inbox for linux-rdma@vger.kernel.org
* [PATCH v3 0/5] Add a bio_vec based API to core/rw.c
@ 2026-01-22 22:03 Chuck Lever
  2026-01-22 22:03 ` [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Chuck Lever @ 2026-01-22 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This series introduces a bio_vec based API for RDMA read and write
operations in the RDMA core, eliminating unnecessary scatterlist
conversions for callers that already work with bvecs.

Current users of rdma_rw_ctx_init() must convert their native data
structures into scatterlists. For subsystems like svcrdma that
maintain data in bvec format, this conversion adds overhead both in
CPU cycles and memory footprint. The new API accepts bvec arrays
directly.

For hardware RDMA devices, the implementation uses the IOVA-based
DMA mapping API to reduce IOTLB synchronization overhead from O(n)
per-page syncs to a single O(1) sync after all mappings complete.
Software RDMA devices (rxe, siw) continue using virtual addressing.

The series includes MR registration support for bvec arrays,
enabling the bvec path on iWARP devices and with the force_mr
debug parameter. The MR
path reuses existing ib_map_mr_sg() infrastructure by constructing
a synthetic scatterlist from the bvec DMA addresses.

The final patch adds the first consumer of the new API: svcrdma.

Based on v6.19-rc6.
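
As background for reviewers unfamiliar with bvec iterators: the new
API walks a bio_vec array with a struct bvec_iter rather than a
(page, offset) cursor. The traversal can be modeled in plain
userspace C. This is a simplified sketch with stand-in types; the
real definitions live in <linux/bvec.h>.

```c
#include <assert.h>

/* Simplified stand-ins for struct bio_vec and struct bvec_iter. */
struct bv { unsigned int len; };	/* only bv_len matters here */
struct bv_iter {
	unsigned int size;	/* bytes remaining (bi_size) */
	unsigned int idx;	/* current array index (bi_idx) */
	unsigned int done;	/* bytes consumed of current entry (bi_bvec_done) */
};

/* Models bvec_iter_advance_single(): one step within the current entry. */
static void advance_single(const struct bv *bvecs, struct bv_iter *it,
			   unsigned int step)
{
	it->done += step;
	it->size -= step;
	if (it->done == bvecs[it->idx].len) {
		it->idx++;
		it->done = 0;
	}
}

/* Walks the remaining bytes, counting how many segments (SGEs) result. */
static unsigned int count_segments(const struct bv *bvecs, struct bv_iter it)
{
	unsigned int n = 0;

	while (it.size) {
		unsigned int left = bvecs[it.idx].len - it.done;
		unsigned int step = left < it.size ? left : it.size;

		advance_single(bvecs, &it, step);
		n++;
	}
	return n;
}
```

The second case in the checks below stops mid-entry, which is one
reason the new API takes the iterator by value: the caller's view of
the array is never disturbed.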

---

Changes since v2:
- Add bvec iter arguments to the new API
- Add a synthetic SGL in the MR mapping function
- Try IOVA coalescing before max_sgl_rd triggers MR in bvec path
- Attempt once again to address SQ/CQ/max_rdma_ctxs sizing issues

Changes since v1:
- Simplify rw.c by using bvec iters internally
- IOVA mapping produces a contiguous DMA address range
- Clarify the comment that documents struct svc_rdma_rw_ctxt
- svcrdma now uses pre-allocated bio_vec arrays

Chuck Lever (5):
  RDMA/core: add bio_vec based RDMA read/write API
  RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  RDMA/core: add MR support for bvec-based RDMA operations
  RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
  svcrdma: use bvec-based RDMA read/write API

 drivers/infiniband/core/rw.c             | 591 ++++++++++++++++++++---
 drivers/infiniband/ulp/isert/ib_isert.c  |   4 +-
 drivers/nvme/target/rdma.c               |   4 +-
 include/rdma/ib_verbs.h                  |  42 ++
 include/rdma/rw.h                        |  36 +-
 net/sunrpc/xprtrdma/svc_rdma_rw.c        | 155 +++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   8 +-
 7 files changed, 699 insertions(+), 141 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
@ 2026-01-22 22:03 ` Chuck Lever
  2026-01-23  6:26   ` Christoph Hellwig
  2026-01-22 22:03 ` [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-22 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The existing rdma_rw_ctx_init() API requires callers to construct a
scatterlist, which is then DMA-mapped page by page. Callers that
already have data in bio_vec form (such as svcrdma) must
first convert to scatterlist, adding overhead and complexity.

Introduce rdma_rw_ctx_init_bvec() and rdma_rw_ctx_destroy_bvec() to
accept bio_vec arrays directly. The new helpers use dma_map_phys()
for hardware RDMA devices and virtual addressing for software RDMA
devices (rxe, siw), avoiding intermediate scatterlist construction.

Memory registration (MR) path support is deferred to a later patch
in this series; until then, callers that require MR-based transfers
(iWARP devices or force_mr=1) receive -EOPNOTSUPP and should use the
scatterlist API.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
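[ Note for reviewers: the SGE and WR arrays in
rdma_rw_init_map_wrs_bvec() share a single allocation, with the WR
array placed at an aligned offset past the SGEs. The sizing
arithmetic can be checked with this standalone userspace model;
MODEL_ALIGN, model_array_size, and the struct definitions are
stand-ins, not the kernel versions. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ALIGN() and array_size(). */
#define MODEL_ALIGN(x, a)	(((x) + (a) - 1) & ~((size_t)(a) - 1))

static size_t model_array_size(size_t n, size_t elem)
{
	size_t bytes = n * elem;

	/* array_size() saturates to SIZE_MAX on multiplication overflow */
	return (n && bytes / n != elem) ? SIZE_MAX : bytes;
}

/* Illustrative struct shapes; exact sizes do not matter to the layout. */
struct model_sge { uint64_t addr; uint32_t length, lkey; };
struct model_wr  { void *next; uint64_t remote_addr; uint32_t rkey; };

/*
 * Total allocation size for nr_bvec SGEs followed by nr_ops WRs, with
 * the WR array starting at an alignment-correct offset. Returns 0 on
 * overflow, mirroring the -ENOMEM rejection in the patch.
 */
static size_t layout_size(size_t nr_bvec, size_t nr_ops)
{
	size_t sges_size = model_array_size(nr_bvec, sizeof(struct model_sge));
	size_t wrs_offset = MODEL_ALIGN(sges_size, __alignof__(struct model_wr));
	size_t wrs_size = model_array_size(nr_ops, sizeof(struct model_wr));

	if (sges_size == SIZE_MAX || wrs_size == SIZE_MAX ||
	    wrs_offset > SIZE_MAX - wrs_size)
		return 0;
	return wrs_offset + wrs_size;
}
```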
 drivers/infiniband/core/rw.c | 205 +++++++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h      |  42 +++++++
 include/rdma/rw.h            |  11 ++
 3 files changed, 258 insertions(+)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 6354ddf2a274..991006de4a43 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,6 +274,123 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return 1;
 }
 
+static int rdma_rw_init_single_wr_bvec(struct rdma_rw_ctx *ctx,
+		struct ib_qp *qp, const struct bio_vec *bvecs,
+		struct bvec_iter *iter, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct ib_rdma_wr *rdma_wr = &ctx->single.wr;
+	struct bio_vec bv = mp_bvec_iter_bvec(bvecs, *iter);
+	u64 dma_addr;
+
+	ctx->nr_ops = 1;
+
+	dma_addr = ib_dma_map_bvec(dev, &bv, dir);
+	if (ib_dma_mapping_error(dev, dma_addr))
+		return -ENOMEM;
+
+	ctx->single.sge.lkey = qp->pd->local_dma_lkey;
+	ctx->single.sge.addr = dma_addr;
+	ctx->single.sge.length = bv.bv_len;
+
+	memset(rdma_wr, 0, sizeof(*rdma_wr));
+	if (dir == DMA_TO_DEVICE)
+		rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+	else
+		rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+	rdma_wr->wr.sg_list = &ctx->single.sge;
+	rdma_wr->wr.num_sge = 1;
+	rdma_wr->remote_addr = remote_addr;
+	rdma_wr->rkey = rkey;
+
+	ctx->type = RDMA_RW_SINGLE_WR;
+	return 1;
+}
+
+static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		const struct bio_vec *bvecs, u32 nr_bvec, struct bvec_iter *iter,
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	u32 max_sge = dir == DMA_TO_DEVICE ? qp->max_write_sge :
+		      qp->max_read_sge;
+	struct ib_sge *sge;
+	u32 total_len = 0, i, j;
+	u32 mapped_bvecs = 0;
+	u32 nr_ops = DIV_ROUND_UP(nr_bvec, max_sge);
+	size_t sges_size = array_size(nr_bvec, sizeof(*ctx->map.sges));
+	size_t wrs_offset = ALIGN(sges_size, __alignof__(*ctx->map.wrs));
+	size_t wrs_size = array_size(nr_ops, sizeof(*ctx->map.wrs));
+	void *mem;
+
+	if (sges_size == SIZE_MAX || wrs_size == SIZE_MAX ||
+	    check_add_overflow(wrs_offset, wrs_size, &wrs_size))
+		return -ENOMEM;
+
+	mem = kzalloc(wrs_size, GFP_KERNEL);
+	if (!mem)
+		return -ENOMEM;
+
+	ctx->map.sges = sge = mem;
+	ctx->map.wrs = mem + wrs_offset;
+
+	for (i = 0; i < nr_ops; i++) {
+		struct ib_rdma_wr *rdma_wr = &ctx->map.wrs[i];
+		u32 nr_sge = min(nr_bvec - mapped_bvecs, max_sge);
+
+		if (dir == DMA_TO_DEVICE)
+			rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
+		else
+			rdma_wr->wr.opcode = IB_WR_RDMA_READ;
+		rdma_wr->remote_addr = remote_addr + total_len;
+		rdma_wr->rkey = rkey;
+		rdma_wr->wr.num_sge = nr_sge;
+		rdma_wr->wr.sg_list = sge;
+
+		for (j = 0; j < nr_sge; j++) {
+			const struct bio_vec *base = __bvec_iter_bvec(bvecs, *iter);
+			unsigned int offset = iter->bi_bvec_done;
+			unsigned int len = min(iter->bi_size,
+					       base->bv_len - offset);
+			struct bio_vec bv = {
+				.bv_page = base->bv_page,
+				.bv_len = len,
+				.bv_offset = base->bv_offset + offset,
+			};
+			u64 dma_addr;
+
+			dma_addr = ib_dma_map_bvec(dev, &bv, dir);
+			if (ib_dma_mapping_error(dev, dma_addr))
+				goto out_unmap;
+
+			mapped_bvecs++;
+			sge->addr = dma_addr;
+			sge->length = len;
+			sge->lkey = qp->pd->local_dma_lkey;
+
+			total_len += len;
+			sge++;
+
+			bvec_iter_advance_single(bvecs, iter, len);
+		}
+
+		rdma_wr->wr.next = i + 1 < nr_ops ?
+			&ctx->map.wrs[i + 1].wr : NULL;
+	}
+
+	ctx->nr_ops = nr_ops;
+	ctx->type = RDMA_RW_MULTI_WR;
+	return nr_ops;
+
+out_unmap:
+	for (i = 0; i < mapped_bvecs; i++)
+		ib_dma_unmap_bvec(dev, ctx->map.sges[i].addr,
+				  ctx->map.sges[i].length, dir);
+	kfree(ctx->map.sges);
+	return -ENOMEM;
+}
+
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:	context to initialize
@@ -344,6 +461,53 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
 
+/**
+ * rdma_rw_ctx_init_bvec - initialize a RDMA READ/WRITE context from bio_vec
+ * @ctx:	context to initialize
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound
+ * @bvecs:	bio_vec array to READ/WRITE from/to
+ * @nr_bvec:	number of entries in @bvecs
+ * @iter:	bvec iterator describing offset and length
+ * @remote_addr: remote address to read/write (relative to @rkey)
+ * @rkey:	remote key to operate on
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ *
+ * Accepts bio_vec arrays directly, avoiding scatterlist conversion for
+ * callers that already have data in bio_vec form. Prefer this over
+ * rdma_rw_ctx_init() when the source data is a bio_vec array.
+ *
+ * This function does not support devices requiring memory registration.
+ * iWARP devices and configurations with force_mr=1 should use
+ * rdma_rw_ctx_init() with a scatterlist instead.
+ *
+ * Returns the number of WQEs that will be needed on the workqueue if
+ * successful, or a negative error code:
+ *
+ *   * -EINVAL  - @nr_bvec is zero or @iter.bi_size is zero
+ *   * -EOPNOTSUPP - device requires MR path (iWARP or force_mr=1)
+ *   * -ENOMEM - DMA mapping or memory allocation failed
+ */
+int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvecs, u32 nr_bvec,
+		struct bvec_iter iter, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	if (nr_bvec == 0 || iter.bi_size == 0)
+		return -EINVAL;
+
+	/* MR path not supported for bvec - reject iWARP and force_mr */
+	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, nr_bvec))
+		return -EOPNOTSUPP;
+
+	if (nr_bvec == 1)
+		return rdma_rw_init_single_wr_bvec(ctx, qp, bvecs, &iter,
+				remote_addr, rkey, dir);
+	return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
+			remote_addr, rkey, dir);
+}
+EXPORT_SYMBOL(rdma_rw_ctx_init_bvec);
+
 /**
  * rdma_rw_ctx_signature_init - initialize a RW context with signature offload
  * @ctx:	context to initialize
@@ -598,6 +762,47 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
+/**
+ * rdma_rw_ctx_destroy_bvec - release resources from rdma_rw_ctx_init_bvec
+ * @ctx:	context to release
+ * @qp:		queue pair to operate on
+ * @port_num:	port num to which the connection is bound (unused)
+ * @bvecs:	bio_vec array that was used for the READ/WRITE (unused)
+ * @nr_bvec:	number of entries in @bvecs
+ * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ *
+ * Releases all resources allocated by a successful rdma_rw_ctx_init_bvec()
+ * call. Must not be called if rdma_rw_ctx_init_bvec() returned an error.
+ *
+ * The @port_num and @bvecs parameters are unused but present for API
+ * symmetry with rdma_rw_ctx_destroy().
+ */
+void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 __maybe_unused port_num,
+		const struct bio_vec __maybe_unused *bvecs,
+		u32 nr_bvec, enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	u32 i;
+
+	switch (ctx->type) {
+	case RDMA_RW_MULTI_WR:
+		for (i = 0; i < nr_bvec; i++)
+			ib_dma_unmap_bvec(dev, ctx->map.sges[i].addr,
+					  ctx->map.sges[i].length, dir);
+		kfree(ctx->map.sges);
+		break;
+	case RDMA_RW_SINGLE_WR:
+		ib_dma_unmap_bvec(dev, ctx->single.sge.addr,
+				  ctx->single.sge.length, dir);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+EXPORT_SYMBOL(rdma_rw_ctx_destroy_bvec);
+
 /**
  * rdma_rw_ctx_destroy_signature - release all resources allocated by
  *	rdma_rw_ctx_signature_init
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6aad66bc5dd7..82958f5117c3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -15,6 +15,7 @@
 #include <linux/ethtool.h>
 #include <linux/types.h>
 #include <linux/device.h>
+#include <linux/bvec.h>
 #include <linux/dma-mapping.h>
 #include <linux/kref.h>
 #include <linux/list.h>
@@ -4249,6 +4250,47 @@ static inline void ib_dma_unmap_page(struct ib_device *dev,
 		dma_unmap_page(dev->dma_device, addr, size, direction);
 }
 
+/**
+ * ib_dma_map_bvec - Map a bio_vec to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @bvec: The bio_vec to map
+ * @direction: The direction of the DMA
+ *
+ * Returns a DMA address for the bio_vec. The caller must check the
+ * result with ib_dma_mapping_error() before use; a failed mapping
+ * must not be passed to ib_dma_unmap_bvec().
+ *
+ * For software RDMA devices (rxe, siw), returns a virtual address
+ * and no actual DMA mapping occurs.
+ */
+static inline u64 ib_dma_map_bvec(struct ib_device *dev,
+				  const struct bio_vec *bvec,
+				  enum dma_data_direction direction)
+{
+	if (ib_uses_virt_dma(dev))
+		return (uintptr_t)(page_address(bvec->bv_page) + bvec->bv_offset);
+	return dma_map_phys(dev->dma_device, bvec_phys(bvec),
+			    bvec->bv_len, direction, 0);
+}
+
+/**
+ * ib_dma_unmap_bvec - Unmap a bio_vec DMA mapping
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address returned by ib_dma_map_bvec()
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ *
+ * Releases a DMA mapping created by ib_dma_map_bvec(). For software
+ * RDMA devices this is a no-op since no actual mapping occurred.
+ */
+static inline void ib_dma_unmap_bvec(struct ib_device *dev,
+				     u64 addr, size_t size,
+				     enum dma_data_direction direction)
+{
+	if (!ib_uses_virt_dma(dev))
+		dma_unmap_phys(dev->dma_device, addr, size, direction, 0);
+}
+
 int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents);
 static inline int ib_dma_map_sg_attrs(struct ib_device *dev,
 				      struct scatterlist *sg, int nents,
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index d606cac48233..b2fc3e2373d7 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -5,6 +5,7 @@
 #ifndef _RDMA_RW_H
 #define _RDMA_RW_H
 
+#include <linux/bvec.h>
 #include <linux/dma-mapping.h>
 #include <linux/scatterlist.h>
 #include <rdma/ib_verbs.h>
@@ -49,6 +50,16 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 			 u32 port_num, struct scatterlist *sg, u32 sg_cnt,
 			 enum dma_data_direction dir);
 
+struct bio_vec;
+
+int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvecs, u32 nr_bvec,
+		struct bvec_iter iter, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir);
+void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvecs, u32 nr_bvec,
+		enum dma_data_direction dir);
+
 int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u32 port_num, struct scatterlist *sg, u32 sg_cnt,
 		struct scatterlist *prot_sg, u32 prot_sg_cnt,
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
  2026-01-22 22:03 ` [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
@ 2026-01-22 22:03 ` Chuck Lever
  2026-01-23  6:28   ` Christoph Hellwig
  2026-01-22 22:03 ` [PATCH v3 3/5] RDMA/core: add MR support for bvec-based " Chuck Lever
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-22 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The bvec RDMA API maps each bvec individually via dma_map_phys(),
requiring an IOTLB sync for each mapping. For large I/O operations
with many bvecs, this overhead becomes significant.

The two-step IOVA API (dma_iova_try_alloc / dma_iova_link /
dma_iova_sync) allocates a contiguous IOVA range upfront, links
all physical pages without IOTLB syncs, then performs a single
sync at the end. This reduces IOTLB flushes from O(n) to O(1).

It also needs only a single output dma_addr_t, rather than the
per-element DMA address storage that struct scatterlist carries.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
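[ Note for reviewers: a toy userspace model of the two-step pattern,
counting IOTLB syncs. iova_link() and iova_sync() here are counting
stand-ins, not the real dma_iova_* calls; the point is that linking
n segments into one IOVA range costs a single sync. ]

```c
#include <assert.h>

/*
 * Toy model of the two-step IOVA mapping used by
 * rdma_rw_init_iova_wrs_bvec(): each link places a segment at the
 * next offset within a single IOVA range, so the result is
 * DMA-contiguous by construction and needs one sync at the end.
 */
struct iova_model {
	unsigned long long base;	/* allocated IOVA base */
	unsigned int mapped_len;	/* bytes linked so far */
	unsigned int nr_syncs;		/* IOTLB flushes issued */
};

static void iova_link(struct iova_model *m, unsigned int seg_len)
{
	/* segment lands at base + mapped_len: no per-segment sync */
	m->mapped_len += seg_len;
}

static void iova_sync(struct iova_model *m)
{
	m->nr_syncs++;			/* one flush for the whole range */
}

static unsigned int map_all(struct iova_model *m, const unsigned int *lens,
			    unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		iova_link(m, lens[i]);
	iova_sync(m);
	return m->mapped_len;
}
```

Mapping the same segments one at a time with dma_map_phys() would
instead cost one sync per segment.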
 drivers/infiniband/core/rw.c | 109 +++++++++++++++++++++++++++++++++++
 include/rdma/rw.h            |   8 +++
 2 files changed, 117 insertions(+)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 991006de4a43..393a9a4d551c 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -14,6 +14,7 @@ enum {
 	RDMA_RW_MULTI_WR,
 	RDMA_RW_MR,
 	RDMA_RW_SIG_MR,
+	RDMA_RW_IOVA,
 };
 
 static bool rdma_rw_force_mr;
@@ -391,6 +392,89 @@ static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return -ENOMEM;
 }
 
+/*
+ * Try to use the two-step IOVA API to map bvecs into a contiguous DMA range.
+ * This reduces IOTLB sync overhead by doing one sync at the end instead of
+ * one per bvec, and produces a contiguous DMA address range that can be
+ * described by a single SGE.
+ *
+ * Returns the number of WQEs (always 1) on success, -EOPNOTSUPP if IOVA
+ * mapping is not available, or another negative error code on failure.
+ */
+static int rdma_rw_init_iova_wrs_bvec(struct rdma_rw_ctx *ctx,
+		struct ib_qp *qp, const struct bio_vec *bvec,
+		struct bvec_iter *iter, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct device *dma_dev = dev->dma_device;
+	size_t total_len = iter->bi_size;
+	struct bvec_iter link_iter;
+	struct bio_vec first_bv;
+	size_t mapped_len = 0;
+	int ret;
+
+	/* Virtual DMA devices cannot support IOVA allocators */
+	if (ib_uses_virt_dma(dev))
+		return -EOPNOTSUPP;
+
+	/* Try to allocate contiguous IOVA space */
+	first_bv = mp_bvec_iter_bvec(bvec, *iter);
+	if (!dma_iova_try_alloc(dma_dev, &ctx->iova.state,
+				bvec_phys(&first_bv), total_len))
+		return -EOPNOTSUPP;
+
+	/* Link all bvecs into the IOVA space */
+	link_iter = *iter;
+	while (link_iter.bi_size) {
+		struct bio_vec bv = mp_bvec_iter_bvec(bvec, link_iter);
+
+		ret = dma_iova_link(dma_dev, &ctx->iova.state, bvec_phys(&bv),
+				    mapped_len, bv.bv_len, dir, 0);
+		if (ret)
+			goto out_destroy;
+
+		mapped_len += bv.bv_len;
+		bvec_iter_advance(bvec, &link_iter, bv.bv_len);
+	}
+
+	/* Sync the IOTLB once for all linked pages */
+	ret = dma_iova_sync(dma_dev, &ctx->iova.state, 0, mapped_len);
+	if (ret)
+		goto out_destroy;
+
+	ctx->iova.mapped_len = mapped_len;
+
+	/* Single SGE covers the entire contiguous IOVA range */
+	ctx->iova.sge.addr = ctx->iova.state.addr;
+	ctx->iova.sge.length = mapped_len;
+	ctx->iova.sge.lkey = qp->pd->local_dma_lkey;
+
+	/* Single WR for the whole transfer */
+	memset(&ctx->iova.wr, 0, sizeof(ctx->iova.wr));
+	if (dir == DMA_TO_DEVICE)
+		ctx->iova.wr.wr.opcode = IB_WR_RDMA_WRITE;
+	else
+		ctx->iova.wr.wr.opcode = IB_WR_RDMA_READ;
+	ctx->iova.wr.wr.num_sge = 1;
+	ctx->iova.wr.wr.sg_list = &ctx->iova.sge;
+	ctx->iova.wr.remote_addr = remote_addr;
+	ctx->iova.wr.rkey = rkey;
+
+	ctx->type = RDMA_RW_IOVA;
+	ctx->nr_ops = 1;
+	return 1;
+
+out_destroy:
+	/*
+	 * dma_iova_destroy() expects the actual mapped length, not the
+	 * total allocation size. It unlinks only the successfully linked
+	 * range and frees the entire IOVA allocation.
+	 */
+	dma_iova_destroy(dma_dev, &ctx->iova.state, mapped_len, dir, 0);
+	return ret;
+}
+
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:	context to initialize
@@ -493,6 +577,8 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		struct bvec_iter iter, u64 remote_addr, u32 rkey,
 		enum dma_data_direction dir)
 {
+	int ret;
+
 	if (nr_bvec == 0 || iter.bi_size == 0)
 		return -EINVAL;
 
@@ -503,6 +589,17 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	if (nr_bvec == 1)
 		return rdma_rw_init_single_wr_bvec(ctx, qp, bvecs, &iter,
 				remote_addr, rkey, dir);
+
+	/*
+	 * Try IOVA-based mapping first for multi-bvec transfers.
+	 * This reduces IOTLB sync overhead by batching all mappings.
+	 * rdma_rw_init_iova_wrs_bvec() does not modify iter on -EOPNOTSUPP.
+	 */
+	ret = rdma_rw_init_iova_wrs_bvec(ctx, qp, bvecs, &iter, remote_addr,
+			rkey, dir);
+	if (ret != -EOPNOTSUPP)
+		return ret;
+
 	return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
 			remote_addr, rkey, dir);
 }
@@ -679,6 +776,10 @@ struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 			first_wr = &ctx->reg[0].reg_wr.wr;
 		last_wr = &ctx->reg[ctx->nr_ops - 1].wr.wr;
 		break;
+	case RDMA_RW_IOVA:
+		first_wr = &ctx->iova.wr.wr;
+		last_wr = &ctx->iova.wr.wr;
+		break;
 	case RDMA_RW_MULTI_WR:
 		first_wr = &ctx->map.wrs[0].wr;
 		last_wr = &ctx->map.wrs[ctx->nr_ops - 1].wr;
@@ -753,6 +854,10 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		break;
 	case RDMA_RW_SINGLE_WR:
 		break;
+	case RDMA_RW_IOVA:
+		/* IOVA contexts must use rdma_rw_ctx_destroy_bvec() */
+		WARN_ON_ONCE(1);
+		return;
 	default:
 		BUG();
 		break;
@@ -786,6 +891,10 @@ void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	u32 i;
 
 	switch (ctx->type) {
+	case RDMA_RW_IOVA:
+		dma_iova_destroy(dev->dma_device, &ctx->iova.state,
+				 ctx->iova.mapped_len, dir, 0);
+		break;
 	case RDMA_RW_MULTI_WR:
 		for (i = 0; i < nr_bvec; i++)
 			ib_dma_unmap_bvec(dev, ctx->map.sges[i].addr,
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index b2fc3e2373d7..205e16ed6cd8 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -32,6 +32,14 @@ struct rdma_rw_ctx {
 			struct ib_rdma_wr	*wrs;
 		} map;
 
+		/* for IOVA-based mapping of bvecs into contiguous DMA range: */
+		struct {
+			struct dma_iova_state	state;
+			struct ib_sge		sge;
+			struct ib_rdma_wr	wr;
+			size_t			mapped_len;
+		} iova;
+
 		/* for registering multiple WRs: */
 		struct rdma_rw_reg_ctx {
 			struct ib_sge		sge;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
  2026-01-22 22:03 ` [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
  2026-01-22 22:03 ` [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
@ 2026-01-22 22:03 ` Chuck Lever
  2026-01-23  6:36   ` Christoph Hellwig
  2026-01-22 22:04 ` [PATCH v3 4/5] RDMA/core: add rdma_rw_max_sge() helper for SQ sizing Chuck Lever
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-22 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The bvec-based RDMA API currently returns -EOPNOTSUPP when memory
region (MR) registration is required. This prevents iWARP devices from
using the bvec path, since iWARP requires MR registration for RDMA
READ operations. The force_mr debug parameter is also unusable with
bvec input.

Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec
arrays. The approach creates a synthetic scatterlist populated with
DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg()
infrastructure. This avoids driver changes while keeping the
implementation small.

The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup.
On destroy, the MRs are returned to the pool and the bvec DMA
mappings are released using the stored addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
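[ Note for reviewers: how the DMA-mapped entries are split across MRs
can be modeled standalone. DIV_ROUND_UP is re-defined here for
userspace, and mr_split() is a hypothetical helper that mirrors the
registration loop in rdma_rw_init_mr_wrs_bvec(), not kernel code. ]

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Model of splitting a DMA-mapped SG table across MRs: each MR
 * registers at most pages_per_mr entries and the final MR takes the
 * remainder. Returns the number of MR operations; *last_mr_nents is
 * set to the entry count of the final MR.
 */
static unsigned int mr_split(unsigned int nents, unsigned int pages_per_mr,
			     unsigned int *last_mr_nents)
{
	unsigned int nr_ops = DIV_ROUND_UP(nents, pages_per_mr);
	unsigned int i, remaining = nents, sge_cnt = 0;

	for (i = 0; i < nr_ops; i++) {
		sge_cnt = remaining < pages_per_mr ? remaining : pages_per_mr;
		remaining -= sge_cnt;
	}
	*last_mr_nents = sge_cnt;
	return nr_ops;
}
```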
 drivers/infiniband/core/rw.c            | 250 ++++++++++++++++++------
 drivers/infiniband/ulp/isert/ib_isert.c |   4 +-
 drivers/nvme/target/rdma.c              |   4 +-
 include/rdma/rw.h                       |  17 +-
 4 files changed, 206 insertions(+), 69 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 393a9a4d551c..3a00b788417d 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -38,6 +38,20 @@ static inline bool rdma_rw_can_use_mr(struct ib_device *dev, u32 port_num)
 	return false;
 }
 
+/*
+ * Check if the device requires memory registration for RDMA READs.
+ * iWARP always requires MR for RDMA READ due to protocol limitations.
+ */
+static inline bool rdma_rw_io_requires_mr(struct ib_device *dev, u32 port_num,
+		enum dma_data_direction dir)
+{
+	if (dir == DMA_FROM_DEVICE && rdma_protocol_iwarp(dev, port_num))
+		return true;
+	if (unlikely(rdma_rw_force_mr))
+		return true;
+	return false;
+}
+
 /*
  * Check if the device will use memory registration for this RW operation.
  * For RDMA READs we must use MRs on iWarp and can optionally use them as an
@@ -47,13 +61,10 @@ static inline bool rdma_rw_can_use_mr(struct ib_device *dev, u32 port_num)
 static inline bool rdma_rw_io_needs_mr(struct ib_device *dev, u32 port_num,
 		enum dma_data_direction dir, int dma_nents)
 {
-	if (dir == DMA_FROM_DEVICE) {
-		if (rdma_protocol_iwarp(dev, port_num))
-			return true;
-		if (dev->attrs.max_sgl_rd && dma_nents > dev->attrs.max_sgl_rd)
-			return true;
-	}
-	if (unlikely(rdma_rw_force_mr))
+	if (rdma_rw_io_requires_mr(dev, port_num, dir))
+		return true;
+	if (dir == DMA_FROM_DEVICE &&
+	    dev->attrs.max_sgl_rd && dma_nents > dev->attrs.max_sgl_rd)
 		return true;
 	return false;
 }
@@ -132,14 +143,14 @@ static int rdma_rw_init_mr_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	int i, j, ret = 0, count = 0;
 
 	ctx->nr_ops = DIV_ROUND_UP(sg_cnt, pages_per_mr);
-	ctx->reg = kcalloc(ctx->nr_ops, sizeof(*ctx->reg), GFP_KERNEL);
-	if (!ctx->reg) {
+	ctx->reg.ctx = kcalloc(ctx->nr_ops, sizeof(*ctx->reg.ctx), GFP_KERNEL);
+	if (!ctx->reg.ctx) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
 	for (i = 0; i < ctx->nr_ops; i++) {
-		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
+		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
 		u32 nents = min(sg_cnt, pages_per_mr);
 
 		ret = rdma_rw_init_one_mr(qp, port_num, reg, sg, sg_cnt,
@@ -187,12 +198,118 @@ static int rdma_rw_init_mr_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 
 out_free:
 	while (--i >= 0)
-		ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
-	kfree(ctx->reg);
+		ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg.ctx[i].mr);
+	kfree(ctx->reg.ctx);
 out:
 	return ret;
 }
 
+static int rdma_rw_init_mr_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
+		u32 port_num, const struct bio_vec *bvecs, u32 nr_bvec,
+		struct bvec_iter *iter, u64 remote_addr, u32 rkey,
+		enum dma_data_direction dir)
+{
+	struct ib_device *dev = qp->pd->device;
+	struct rdma_rw_reg_ctx *prev = NULL;
+	u32 pages_per_mr = rdma_rw_fr_page_list_len(dev, qp->integrity_en);
+	struct scatterlist *sg;
+	int i, ret, count = 0;
+	u32 nents = 0;
+
+	/*
+	 * Build scatterlist from bvecs using the iterator. This follows
+	 * the pattern from __blk_rq_map_sg.
+	 */
+	ctx->reg.sgt.sgl = kmalloc_array(nr_bvec, sizeof(*ctx->reg.sgt.sgl),
+					 GFP_KERNEL);
+	if (!ctx->reg.sgt.sgl)
+		return -ENOMEM;
+	sg_init_table(ctx->reg.sgt.sgl, nr_bvec);
+
+	for (sg = ctx->reg.sgt.sgl; iter->bi_size; sg = sg_next(sg)) {
+		struct bio_vec bv = mp_bvec_iter_bvec(bvecs, *iter);
+
+		if (nents >= nr_bvec) {
+			ret = -EINVAL;
+			goto out_free_sgl;
+		}
+		sg_set_page(sg, bv.bv_page, bv.bv_len, bv.bv_offset);
+		bvec_iter_advance(bvecs, iter, bv.bv_len);
+		nents++;
+	}
+	sg_mark_end(sg_last(ctx->reg.sgt.sgl, nents));
+	ctx->reg.sgt.orig_nents = nents;
+
+	/* DMA map the scatterlist */
+	ret = ib_dma_map_sgtable_attrs(dev, &ctx->reg.sgt, dir, 0);
+	if (ret)
+		goto out_free_sgl;
+
+	ctx->nr_ops = DIV_ROUND_UP(ctx->reg.sgt.nents, pages_per_mr);
+	ctx->reg.ctx = kcalloc(ctx->nr_ops, sizeof(*ctx->reg.ctx), GFP_KERNEL);
+	if (!ctx->reg.ctx) {
+		ret = -ENOMEM;
+		goto out_unmap_sgt;
+	}
+
+	sg = ctx->reg.sgt.sgl;
+	nents = ctx->reg.sgt.nents;
+	for (i = 0; i < ctx->nr_ops; i++) {
+		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
+		u32 sge_cnt = min(nents, pages_per_mr);
+
+		ret = rdma_rw_init_one_mr(qp, port_num, reg, sg, sge_cnt, 0);
+		if (ret < 0)
+			goto out_free_mrs;
+		count += ret;
+
+		if (prev) {
+			if (reg->mr->need_inval)
+				prev->wr.wr.next = &reg->inv_wr;
+			else
+				prev->wr.wr.next = &reg->reg_wr.wr;
+		}
+
+		reg->reg_wr.wr.next = &reg->wr.wr;
+
+		reg->wr.wr.sg_list = &reg->sge;
+		reg->wr.wr.num_sge = 1;
+		reg->wr.remote_addr = remote_addr;
+		reg->wr.rkey = rkey;
+
+		if (dir == DMA_TO_DEVICE) {
+			reg->wr.wr.opcode = IB_WR_RDMA_WRITE;
+		} else if (!rdma_cap_read_inv(qp->device, port_num)) {
+			reg->wr.wr.opcode = IB_WR_RDMA_READ;
+		} else {
+			reg->wr.wr.opcode = IB_WR_RDMA_READ_WITH_INV;
+			reg->wr.wr.ex.invalidate_rkey = reg->mr->lkey;
+		}
+		count++;
+
+		remote_addr += reg->sge.length;
+		nents -= sge_cnt;
+		sg += sge_cnt;
+		prev = reg;
+	}
+
+	if (prev)
+		prev->wr.wr.next = NULL;
+
+	ctx->type = RDMA_RW_MR;
+	return count;
+
+out_free_mrs:
+	while (--i >= 0)
+		ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg.ctx[i].mr);
+	kfree(ctx->reg.ctx);
+out_unmap_sgt:
+	ib_dma_unmap_sgtable_attrs(dev, &ctx->reg.sgt, dir, 0);
+out_free_sgl:
+	kfree(ctx->reg.sgt.sgl);
+	return ret;
+}
+
 static int rdma_rw_init_map_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		struct scatterlist *sg, u32 sg_cnt, u32 offset,
 		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
@@ -557,19 +674,13 @@ EXPORT_SYMBOL(rdma_rw_ctx_init);
  * @rkey:	remote key to operate on
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
  *
- * Accepts bio_vec arrays directly, avoiding scatterlist conversion for
- * callers that already have data in bio_vec form. Prefer this over
- * rdma_rw_ctx_init() when the source data is a bio_vec array.
- *
- * This function does not support devices requiring memory registration.
- * iWARP devices and configurations with force_mr=1 should use
- * rdma_rw_ctx_init() with a scatterlist instead.
+ * Maps the bio_vec array directly, avoiding intermediate scatterlist
+ * conversion. Supports MR registration for iWARP devices and force_mr mode.
  *
  * Returns the number of WQEs that will be needed on the workqueue if
  * successful, or a negative error code:
  *
  *   * -EINVAL  - @nr_bvec is zero or @iter.bi_size is zero
- *   * -EOPNOTSUPP - device requires MR path (iWARP or force_mr=1)
  *   * -ENOMEM - DMA mapping or memory allocation failed
  */
 int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
@@ -577,14 +688,19 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		struct bvec_iter iter, u64 remote_addr, u32 rkey,
 		enum dma_data_direction dir)
 {
+	struct ib_device *dev = qp->pd->device;
 	int ret;
 
 	if (nr_bvec == 0 || iter.bi_size == 0)
 		return -EINVAL;
 
-	/* MR path not supported for bvec - reject iWARP and force_mr */
-	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, nr_bvec))
-		return -EOPNOTSUPP;
+	/*
+	 * iWARP requires MR registration for all RDMA READs.
+	 */
+	if (rdma_rw_io_requires_mr(dev, port_num, dir))
+		return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
+						nr_bvec, &iter, remote_addr,
+						rkey, dir);
 
 	if (nr_bvec == 1)
 		return rdma_rw_init_single_wr_bvec(ctx, qp, bvecs, &iter,
@@ -592,14 +708,23 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 
 	/*
 	 * Try IOVA-based mapping first for multi-bvec transfers.
-	 * This reduces IOTLB sync overhead by batching all mappings.
-	 * rdma_rw_init_iova_wrs_bvec() does not modify iter on -EOPNOTSUPP.
+	 * IOVA coalesces bvecs into a single DMA-contiguous region,
+	 * reducing the number of WRs needed and avoiding MR overhead.
 	 */
 	ret = rdma_rw_init_iova_wrs_bvec(ctx, qp, bvecs, &iter, remote_addr,
 			rkey, dir);
 	if (ret != -EOPNOTSUPP)
 		return ret;
 
+	/*
+	 * IOVA mapping not available. Check if MR registration provides
+	 * better performance than multiple SGE entries.
+	 */
+	if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
+		return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
+						nr_bvec, &iter, remote_addr,
+						rkey, dir);
+
 	return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
 			remote_addr, rkey, dir);
 }
@@ -660,23 +785,23 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 
 	ctx->type = RDMA_RW_SIG_MR;
 	ctx->nr_ops = 1;
-	ctx->reg = kzalloc(sizeof(*ctx->reg), GFP_KERNEL);
-	if (!ctx->reg) {
+	ctx->reg.ctx = kzalloc(sizeof(*ctx->reg.ctx), GFP_KERNEL);
+	if (!ctx->reg.ctx) {
 		ret = -ENOMEM;
 		goto out_unmap_prot_sg;
 	}
 
-	ctx->reg->mr = ib_mr_pool_get(qp, &qp->sig_mrs);
-	if (!ctx->reg->mr) {
+	ctx->reg.ctx->mr = ib_mr_pool_get(qp, &qp->sig_mrs);
+	if (!ctx->reg.ctx->mr) {
 		ret = -EAGAIN;
 		goto out_free_ctx;
 	}
 
-	count += rdma_rw_inv_key(ctx->reg);
+	count += rdma_rw_inv_key(ctx->reg.ctx);
 
-	memcpy(ctx->reg->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));
+	memcpy(ctx->reg.ctx->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));
 
-	ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sgt.nents, NULL, prot_sg,
+	ret = ib_map_mr_sg_pi(ctx->reg.ctx->mr, sg, sgt.nents, NULL, prot_sg,
 			      prot_sgt.nents, NULL, SZ_4K);
 	if (unlikely(ret)) {
 		pr_err("failed to map PI sg (%u)\n",
@@ -684,24 +809,24 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		goto out_destroy_sig_mr;
 	}
 
-	ctx->reg->reg_wr.wr.opcode = IB_WR_REG_MR_INTEGRITY;
-	ctx->reg->reg_wr.wr.wr_cqe = NULL;
-	ctx->reg->reg_wr.wr.num_sge = 0;
-	ctx->reg->reg_wr.wr.send_flags = 0;
-	ctx->reg->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
+	ctx->reg.ctx->reg_wr.wr.opcode = IB_WR_REG_MR_INTEGRITY;
+	ctx->reg.ctx->reg_wr.wr.wr_cqe = NULL;
+	ctx->reg.ctx->reg_wr.wr.num_sge = 0;
+	ctx->reg.ctx->reg_wr.wr.send_flags = 0;
+	ctx->reg.ctx->reg_wr.access = IB_ACCESS_LOCAL_WRITE;
 	if (rdma_protocol_iwarp(qp->device, port_num))
-		ctx->reg->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
-	ctx->reg->reg_wr.mr = ctx->reg->mr;
-	ctx->reg->reg_wr.key = ctx->reg->mr->lkey;
+		ctx->reg.ctx->reg_wr.access |= IB_ACCESS_REMOTE_WRITE;
+	ctx->reg.ctx->reg_wr.mr = ctx->reg.ctx->mr;
+	ctx->reg.ctx->reg_wr.key = ctx->reg.ctx->mr->lkey;
 	count++;
 
-	ctx->reg->sge.addr = ctx->reg->mr->iova;
-	ctx->reg->sge.length = ctx->reg->mr->length;
+	ctx->reg.ctx->sge.addr = ctx->reg.ctx->mr->iova;
+	ctx->reg.ctx->sge.length = ctx->reg.ctx->mr->length;
 	if (sig_attrs->wire.sig_type == IB_SIG_TYPE_NONE)
-		ctx->reg->sge.length -= ctx->reg->mr->sig_attrs->meta_length;
+		ctx->reg.ctx->sge.length -= ctx->reg.ctx->mr->sig_attrs->meta_length;
 
-	rdma_wr = &ctx->reg->wr;
-	rdma_wr->wr.sg_list = &ctx->reg->sge;
+	rdma_wr = &ctx->reg.ctx->wr;
+	rdma_wr->wr.sg_list = &ctx->reg.ctx->sge;
 	rdma_wr->wr.num_sge = 1;
 	rdma_wr->remote_addr = remote_addr;
 	rdma_wr->rkey = rkey;
@@ -709,15 +834,15 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		rdma_wr->wr.opcode = IB_WR_RDMA_WRITE;
 	else
 		rdma_wr->wr.opcode = IB_WR_RDMA_READ;
-	ctx->reg->reg_wr.wr.next = &rdma_wr->wr;
+	ctx->reg.ctx->reg_wr.wr.next = &rdma_wr->wr;
 	count++;
 
 	return count;
 
 out_destroy_sig_mr:
-	ib_mr_pool_put(qp, &qp->sig_mrs, ctx->reg->mr);
+	ib_mr_pool_put(qp, &qp->sig_mrs, ctx->reg.ctx->mr);
 out_free_ctx:
-	kfree(ctx->reg);
+	kfree(ctx->reg.ctx);
 out_unmap_prot_sg:
 	if (prot_sgt.nents)
 		ib_dma_unmap_sgtable_attrs(dev, &prot_sgt, dir, 0);
@@ -765,16 +890,16 @@ struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	case RDMA_RW_SIG_MR:
 	case RDMA_RW_MR:
 		for (i = 0; i < ctx->nr_ops; i++) {
-			rdma_rw_update_lkey(&ctx->reg[i],
-				ctx->reg[i].wr.wr.opcode !=
+			rdma_rw_update_lkey(&ctx->reg.ctx[i],
+				ctx->reg.ctx[i].wr.wr.opcode !=
 					IB_WR_RDMA_READ_WITH_INV);
 		}
 
-		if (ctx->reg[0].inv_wr.next)
-			first_wr = &ctx->reg[0].inv_wr;
+		if (ctx->reg.ctx[0].inv_wr.next)
+			first_wr = &ctx->reg.ctx[0].inv_wr;
 		else
-			first_wr = &ctx->reg[0].reg_wr.wr;
-		last_wr = &ctx->reg[ctx->nr_ops - 1].wr.wr;
+			first_wr = &ctx->reg.ctx[0].reg_wr.wr;
+		last_wr = &ctx->reg.ctx[ctx->nr_ops - 1].wr.wr;
 		break;
 	case RDMA_RW_IOVA:
 		first_wr = &ctx->iova.wr.wr;
@@ -844,9 +969,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 
 	switch (ctx->type) {
 	case RDMA_RW_MR:
+		/* Bvec MR contexts must use rdma_rw_ctx_destroy_bvec() */
+		WARN_ON_ONCE(ctx->reg.sgt.sgl);
 		for (i = 0; i < ctx->nr_ops; i++)
-			ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg[i].mr);
-		kfree(ctx->reg);
+			ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg.ctx[i].mr);
+		kfree(ctx->reg.ctx);
 		break;
 	case RDMA_RW_MULTI_WR:
 		kfree(ctx->map.wrs);
@@ -891,6 +1018,13 @@ void rdma_rw_ctx_destroy_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	u32 i;
 
 	switch (ctx->type) {
+	case RDMA_RW_MR:
+		for (i = 0; i < ctx->nr_ops; i++)
+			ib_mr_pool_put(qp, &qp->rdma_mrs, ctx->reg.ctx[i].mr);
+		kfree(ctx->reg.ctx);
+		ib_dma_unmap_sgtable_attrs(dev, &ctx->reg.sgt, dir, 0);
+		kfree(ctx->reg.sgt.sgl);
+		break;
 	case RDMA_RW_IOVA:
 		dma_iova_destroy(dev->dma_device, &ctx->iova.state,
 				 ctx->iova.mapped_len, dir, 0);
@@ -932,8 +1066,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	if (WARN_ON_ONCE(ctx->type != RDMA_RW_SIG_MR))
 		return;
 
-	ib_mr_pool_put(qp, &qp->sig_mrs, ctx->reg->mr);
-	kfree(ctx->reg);
+	ib_mr_pool_put(qp, &qp->sig_mrs, ctx->reg.ctx->mr);
+	kfree(ctx->reg.ctx);
 
 	if (prot_sg_cnt)
 		ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index af811d060cc8..0c6152b7660e 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1589,7 +1589,7 @@ isert_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)
 
 	isert_dbg("Cmd %p\n", isert_cmd);
 
-	ret = isert_check_pi_status(cmd, isert_cmd->rw.reg->mr);
+	ret = isert_check_pi_status(cmd, isert_cmd->rw.reg.ctx->mr);
 	isert_rdma_rw_ctx_destroy(isert_cmd, isert_conn);
 
 	if (ret) {
@@ -1635,7 +1635,7 @@ isert_rdma_read_done(struct ib_cq *cq, struct ib_wc *wc)
 	iscsit_stop_dataout_timer(cmd);
 
 	if (isert_prot_cmd(isert_conn, se_cmd))
-		ret = isert_check_pi_status(se_cmd, isert_cmd->rw.reg->mr);
+		ret = isert_check_pi_status(se_cmd, isert_cmd->rw.reg.ctx->mr);
 	isert_rdma_rw_ctx_destroy(isert_cmd, isert_conn);
 	cmd->write_data_done = 0;
 
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 9c12b2361a6d..a4aa6719a86e 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -767,7 +767,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	}
 
 	if (rsp->req.metadata_len)
-		status = nvmet_rdma_check_pi_status(rsp->rw.reg->mr);
+		status = nvmet_rdma_check_pi_status(rsp->rw.reg.ctx->mr);
 	nvmet_rdma_rw_ctx_destroy(rsp);
 
 	if (unlikely(status))
@@ -808,7 +808,7 @@ static void nvmet_rdma_write_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	 * - if succeeded send good NVMe response
 	 * - if failed send bad NVMe response with appropriate error
 	 */
-	status = nvmet_rdma_check_pi_status(rsp->rw.reg->mr);
+	status = nvmet_rdma_check_pi_status(rsp->rw.reg.ctx->mr);
 	if (unlikely(status))
 		rsp->req.cqe->status = cpu_to_le16(status << 1);
 	nvmet_rdma_rw_ctx_destroy(rsp);
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index 205e16ed6cd8..53ed0f05fa25 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -41,13 +41,16 @@ struct rdma_rw_ctx {
 		} iova;
 
 		/* for registering multiple WRs: */
-		struct rdma_rw_reg_ctx {
-			struct ib_sge		sge;
-			struct ib_rdma_wr	wr;
-			struct ib_reg_wr	reg_wr;
-			struct ib_send_wr	inv_wr;
-			struct ib_mr		*mr;
-		} *reg;
+		struct {
+			struct rdma_rw_reg_ctx {
+				struct ib_sge		sge;
+				struct ib_rdma_wr	wr;
+				struct ib_reg_wr	reg_wr;
+				struct ib_send_wr	inv_wr;
+				struct ib_mr		*mr;
+			}			*ctx;
+			struct sg_table		sgt;
+		} reg;
 	};
 };
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 4/5] RDMA/core: add rdma_rw_max_send_wr() helper for SQ sizing
  2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
                   ` (2 preceding siblings ...)
  2026-01-22 22:03 ` [PATCH v3 3/5] RDMA/core: add MR support for bvec-based " Chuck Lever
@ 2026-01-22 22:04 ` Chuck Lever
  2026-01-23  6:36   ` Christoph Hellwig
  2026-01-22 22:04 ` [PATCH v3 5/5] svcrdma: use bvec-based RDMA read/write API Chuck Lever
  2026-01-23  6:04 ` [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Zhu Yanjun
  5 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-22 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the
number of rdma_rw contexts (ctxts). This value is used to allocate the
Send CQ and to initialize the sc_sq_avail credit pool.

However, when the device uses memory registration for RDMA operations,
rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three
per context to account for REG and INV work requests. The Send CQ and
credit pool remain sized for only one work request per context,
causing Send Queue exhaustion under heavy NFS WRITE workloads.

Introduce rdma_rw_max_send_wr() to compute the actual number of Send Queue
entries required for a given number of rdma_rw contexts. Upper layer
protocols call this helper before creating a Queue Pair so that their
Send CQs and credit accounting match the QP's true capacity.
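
The sizing rule is easier to see in isolation. The following is a
userspace model of the arithmetic, not part of the patch;
model_max_send_wr and its parameters are illustrative only (the
in-kernel helper also folds in the IB_QP_CREATE_INTEGRITY_EN create
flag):

```c
#include <limits.h>
#include <stdbool.h>

/* Userspace model of the Send Queue sizing arithmetic: one WR per
 * rdma_rw context, plus a registration and an invalidation WR when
 * the device performs RDMA R/W through memory registration. */
static unsigned int model_max_send_wr(bool needs_mr,
				      unsigned int max_rdma_ctxs)
{
	unsigned int factor = needs_mr ? 3 : 1;

	if (max_rdma_ctxs > UINT_MAX / factor)	/* saturate on overflow */
		return UINT_MAX;
	return factor * max_rdma_ctxs;
}
```

With 128 contexts, a device that scatters to local pages needs 128
Send WRs, while a device that registers MRs needs 384; sizing the
Send CQ and credit pool for only 128 in the latter case is the
exhaustion described above.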

Update svc_rdma_accept() to use rdma_rw_max_send_wr() when computing
sc_sq_depth, ensuring the credit pool reflects the work requests
that rdma_rw_init_qp() will reserve.

Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 drivers/infiniband/core/rw.c             | 53 +++++++++++++++++-------
 include/rdma/rw.h                        |  2 +
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  8 +++-
 3 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 3a00b788417d..4bca8a8ab695 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -1099,34 +1099,57 @@ unsigned int rdma_rw_mr_factor(struct ib_device *device, u32 port_num,
 }
 EXPORT_SYMBOL(rdma_rw_mr_factor);
 
+/**
+ * rdma_rw_max_send_wr - compute max Send WRs needed for RDMA R/W contexts
+ * @dev: RDMA device
+ * @port_num: port number
+ * @max_rdma_ctxs: number of rdma_rw_ctx structures
+ * @create_flags: QP create flags (pass IB_QP_CREATE_INTEGRITY_EN if
+ *                data integrity will be enabled on the QP)
+ *
+ * Returns the total number of Send Queue entries needed for
+ * @max_rdma_ctxs. The result accounts for memory registration and
+ * invalidation work requests when the device requires them.
+ *
+ * ULPs use this to size Send Queues and Send CQs before creating a
+ * Queue Pair.
+ */
+unsigned int rdma_rw_max_send_wr(struct ib_device *dev, u32 port_num,
+				 unsigned int max_rdma_ctxs, u32 create_flags)
+{
+	unsigned int factor = 1;
+	unsigned int result;
+
+	if (create_flags & IB_QP_CREATE_INTEGRITY_EN ||
+	    rdma_rw_can_use_mr(dev, port_num))
+		factor += 2;	/* reg + inv */
+
+	if (check_mul_overflow(factor, max_rdma_ctxs, &result))
+		return UINT_MAX;
+	return result;
+}
+EXPORT_SYMBOL(rdma_rw_max_send_wr);
+
 void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr)
 {
-	u32 factor;
+	unsigned int factor = 1;
 
 	WARN_ON_ONCE(attr->port_num == 0);
 
 	/*
-	 * Each context needs at least one RDMA READ or WRITE WR.
-	 *
-	 * For some hardware we might need more, eventually we should ask the
-	 * HCA driver for a multiplier here.
-	 */
-	factor = 1;
-
-	/*
-	 * If the device needs MRs to perform RDMA READ or WRITE operations,
-	 * we'll need two additional MRs for the registrations and the
-	 * invalidation.
+	 * If the device uses MRs to perform RDMA READ or WRITE operations,
+	 * or if data integrity is enabled, account for registration and
+	 * invalidation work requests.
 	 */
 	if (attr->create_flags & IB_QP_CREATE_INTEGRITY_EN ||
 	    rdma_rw_can_use_mr(dev, attr->port_num))
-		factor += 2;	/* inv + reg */
+		factor += 2;	/* reg + inv */
 
 	attr->cap.max_send_wr += factor * attr->cap.max_rdma_ctxs;
 
 	/*
-	 * But maybe we were just too high in the sky and the device doesn't
-	 * even support all we need, and we'll have to live with what we get..
+	 * The device might not support all we need, and we'll have to
+	 * live with what we get.
 	 */
 	attr->cap.max_send_wr =
 		min_t(u32, attr->cap.max_send_wr, dev->attrs.max_qp_wr);
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index 53ed0f05fa25..5f96ff754be7 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -88,6 +88,8 @@ int rdma_rw_ctx_post(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 
 unsigned int rdma_rw_mr_factor(struct ib_device *device, u32 port_num,
 		unsigned int maxpages);
+unsigned int rdma_rw_max_send_wr(struct ib_device *dev, u32 port_num,
+		unsigned int max_rdma_ctxs, u32 create_flags);
 void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr);
 int rdma_rw_init_mrs(struct ib_qp *qp, struct ib_qp_init_attr *attr);
 void rdma_rw_cleanup_mrs(struct ib_qp *qp);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b7b318ad25c4..9b623849723e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -462,7 +462,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		newxprt->sc_max_bc_requests = 2;
 	}
 
-	/* Arbitrary estimate of the needed number of rdma_rw contexts.
+	/* Estimate the needed number of rdma_rw contexts. Assume the
+	 * Read and Write chunks comprise one segment each. Each request
+	 * can involve one Read chunk and either a Write chunk or Reply
+	 * chunk; thus a factor of three.
 	 */
 	maxpayload = min(xprt->xpt_server->sv_max_payload,
 			 RPCSVC_MAXPAYLOAD_RDMA);
@@ -470,7 +473,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		rdma_rw_mr_factor(dev, newxprt->sc_port_num,
 				  maxpayload >> PAGE_SHIFT);
 
-	newxprt->sc_sq_depth = rq_depth + ctxts;
+	newxprt->sc_sq_depth = rq_depth +
+		rdma_rw_max_send_wr(dev, newxprt->sc_port_num, ctxts, 0);
 	if (newxprt->sc_sq_depth > dev->attrs.max_qp_wr)
 		newxprt->sc_sq_depth = dev->attrs.max_qp_wr;
 	atomic_set(&newxprt->sc_sq_avail, newxprt->sc_sq_depth);
-- 
2.52.0



* [PATCH v3 5/5] svcrdma: use bvec-based RDMA read/write API
  2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
                   ` (3 preceding siblings ...)
  2026-01-22 22:04 ` [PATCH v3 4/5] RDMA/core: add rdma_rw_max_sge() helper for SQ sizing Chuck Lever
@ 2026-01-22 22:04 ` Chuck Lever
  2026-01-23  6:04 ` [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Zhu Yanjun
  5 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2026-01-22 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Convert svcrdma to the bvec-based RDMA API introduced earlier in
this series.

The bvec-based RDMA API eliminates the intermediate scatterlist
conversion step, allowing direct DMA mapping from bio_vec arrays.
This simplifies the svc_rdma_rw_ctxt structure by removing the
chained SG table management.

The structure retains an inline array approach similar to the
previous scatterlist implementation: an inline bvec array sized
to max_send_sge handles most I/O operations without additional
allocation. Larger requests fall back to dynamic allocation.
This preserves the allocation-free fast path for typical NFS
operations while supporting arbitrarily large transfers.
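
The inline-plus-fallback shape is a common kernel pattern; here is a
minimal userspace sketch of it (all names below are illustrative, not
the patch's; the real context embeds rw_first_bvec[] sized to the
device's max_send_sge):

```c
#include <stdlib.h>

/* Userspace sketch of the inline-array-with-fallback pattern: small
 * requests use storage embedded in the context, large ones allocate. */
struct rw_ctxt_model {
	size_t first_nents;	/* capacity of the inline array */
	int *vec;		/* points at inline_vec[] or a heap buffer */
	int inline_vec[8];	/* stands in for rw_first_bvec[] */
};

static int ctxt_reserve(struct rw_ctxt_model *c, size_t nr)
{
	if (nr <= c->first_nents) {
		c->vec = c->inline_vec;	/* fast path: no allocation */
	} else {
		c->vec = malloc(nr * sizeof(*c->vec));
		if (!c->vec)
			return -1;
	}
	return 0;
}

static void ctxt_release(struct rw_ctxt_model *c)
{
	if (c->vec != c->inline_vec)	/* only free the heap fallback */
		free(c->vec);
	c->vec = NULL;
}
```

The release helper distinguishes the two cases by comparing against
the embedded array rather than tracking a flag, which is the same
test the patch performs when it frees rw_bvec.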

The bvec API handles all device types internally, including iWARP
devices which require memory registration. No explicit fallback
path is needed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_rw.c | 155 +++++++++++++++++-------------
 1 file changed, 86 insertions(+), 69 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 310de7a80be5..4ec2f9ae06aa 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -5,6 +5,8 @@
  * Use the core R/W API to move RPC-over-RDMA Read and Write chunks.
  */
 
+#include <linux/bvec.h>
+#include <linux/overflow.h>
 #include <rdma/rw.h>
 
 #include <linux/sunrpc/xdr.h>
@@ -20,30 +22,33 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc);
 /* Each R/W context contains state for one chain of RDMA Read or
  * Write Work Requests.
  *
- * Each WR chain handles a single contiguous server-side buffer,
- * because scatterlist entries after the first have to start on
- * page alignment. xdr_buf iovecs cannot guarantee alignment.
+ * Each WR chain handles a single contiguous server-side buffer.
+ * - each xdr_buf iovec is a single contiguous buffer
+ * - the xdr_buf pages array is a single contiguous buffer because the
+ *   second through the last element always start on a page boundary
  *
  * Each WR chain handles only one R_key. Each RPC-over-RDMA segment
  * from a client may contain a unique R_key, so each WR chain moves
  * up to one segment at a time.
  *
- * The scatterlist makes this data structure over 4KB in size. To
- * make it less likely to fail, and to handle the allocation for
- * smaller I/O requests without disabling bottom-halves, these
- * contexts are created on demand, but cached and reused until the
- * controlling svcxprt_rdma is destroyed.
+ * The inline bvec array is sized to handle most I/O requests without
+ * additional allocation. Larger requests fall back to dynamic allocation.
+ * These contexts are created on demand, but cached and reused until
+ * the controlling svcxprt_rdma is destroyed.
  */
 struct svc_rdma_rw_ctxt {
 	struct llist_node	rw_node;
 	struct list_head	rw_list;
 	struct rdma_rw_ctx	rw_ctx;
 	unsigned int		rw_nents;
-	unsigned int		rw_first_sgl_nents;
-	struct sg_table		rw_sg_table;
-	struct scatterlist	rw_first_sgl[];
+	unsigned int		rw_first_bvec_nents;
+	struct bio_vec		*rw_bvec;
+	struct bio_vec		rw_first_bvec[];
 };
 
+static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
+				 struct svc_rdma_rw_ctxt *ctxt);
+
 static inline struct svc_rdma_rw_ctxt *
 svc_rdma_next_ctxt(struct list_head *list)
 {
@@ -52,10 +57,10 @@ svc_rdma_next_ctxt(struct list_head *list)
 }
 
 static struct svc_rdma_rw_ctxt *
-svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
+svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int nr_bvec)
 {
 	struct ib_device *dev = rdma->sc_cm_id->device;
-	unsigned int first_sgl_nents = dev->attrs.max_send_sge;
+	unsigned int first_bvec_nents = dev->attrs.max_send_sge;
 	struct svc_rdma_rw_ctxt *ctxt;
 	struct llist_node *node;
 
@@ -65,33 +70,44 @@ svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
 	if (node) {
 		ctxt = llist_entry(node, struct svc_rdma_rw_ctxt, rw_node);
 	} else {
-		ctxt = kmalloc_node(struct_size(ctxt, rw_first_sgl, first_sgl_nents),
+		ctxt = kmalloc_node(struct_size(ctxt, rw_first_bvec,
+						first_bvec_nents),
 				    GFP_KERNEL, ibdev_to_node(dev));
 		if (!ctxt)
 			goto out_noctx;
 
 		INIT_LIST_HEAD(&ctxt->rw_list);
-		ctxt->rw_first_sgl_nents = first_sgl_nents;
+		ctxt->rw_first_bvec_nents = first_bvec_nents;
 	}
 
-	ctxt->rw_sg_table.sgl = ctxt->rw_first_sgl;
-	if (sg_alloc_table_chained(&ctxt->rw_sg_table, sges,
-				   ctxt->rw_sg_table.sgl,
-				   first_sgl_nents))
-		goto out_free;
+	if (nr_bvec <= ctxt->rw_first_bvec_nents) {
+		ctxt->rw_bvec = ctxt->rw_first_bvec;
+	} else {
+		ctxt->rw_bvec = kmalloc_array_node(nr_bvec,
+						   sizeof(*ctxt->rw_bvec),
+						   GFP_KERNEL,
+						   ibdev_to_node(dev));
+		if (!ctxt->rw_bvec)
+			goto out_free;
+	}
 	return ctxt;
 
 out_free:
-	kfree(ctxt);
+	/* Return recycled contexts to the cache; free fresh allocations */
+	if (node)
+		svc_rdma_put_rw_ctxt(rdma, ctxt);
+	else
+		kfree(ctxt);
 out_noctx:
-	trace_svcrdma_rwctx_empty(rdma, sges);
+	trace_svcrdma_rwctx_empty(rdma, nr_bvec);
 	return NULL;
 }
 
 static void __svc_rdma_put_rw_ctxt(struct svc_rdma_rw_ctxt *ctxt,
 				   struct llist_head *list)
 {
-	sg_free_table_chained(&ctxt->rw_sg_table, ctxt->rw_first_sgl_nents);
+	if (ctxt->rw_bvec != ctxt->rw_first_bvec)
+		kfree(ctxt->rw_bvec);
 	llist_add(&ctxt->rw_node, list);
 }
 
@@ -123,6 +139,7 @@ void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
  * @ctxt: R/W context to prepare
  * @offset: RDMA offset
  * @handle: RDMA tag/handle
+ * @length: total number of bytes in the bvec array
  * @direction: I/O direction
  *
  * Returns on success, the number of WQEs that will be needed
@@ -130,14 +147,18 @@ void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
  */
 static int svc_rdma_rw_ctx_init(struct svcxprt_rdma *rdma,
 				struct svc_rdma_rw_ctxt *ctxt,
-				u64 offset, u32 handle,
+				u64 offset, u32 handle, unsigned int length,
 				enum dma_data_direction direction)
 {
+	struct bvec_iter iter = {
+		.bi_size = length,
+	};
 	int ret;
 
-	ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp, rdma->sc_port_num,
-			       ctxt->rw_sg_table.sgl, ctxt->rw_nents,
-			       0, offset, handle, direction);
+	ret = rdma_rw_ctx_init_bvec(&ctxt->rw_ctx, rdma->sc_qp,
+				    rdma->sc_port_num,
+				    ctxt->rw_bvec, ctxt->rw_nents,
+				    iter, offset, handle, direction);
 	if (unlikely(ret < 0)) {
 		trace_svcrdma_dma_map_rw_err(rdma, offset, handle,
 					     ctxt->rw_nents, ret);
@@ -175,7 +196,6 @@ void svc_rdma_cc_release(struct svcxprt_rdma *rdma,
 {
 	struct llist_node *first, *last;
 	struct svc_rdma_rw_ctxt *ctxt;
-	LLIST_HEAD(free);
 
 	trace_svcrdma_cc_release(&cc->cc_cid, cc->cc_sqecount);
 
@@ -183,10 +203,11 @@ void svc_rdma_cc_release(struct svcxprt_rdma *rdma,
 	while ((ctxt = svc_rdma_next_ctxt(&cc->cc_rwctxts)) != NULL) {
 		list_del(&ctxt->rw_list);
 
-		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
-				    rdma->sc_port_num, ctxt->rw_sg_table.sgl,
-				    ctxt->rw_nents, dir);
-		__svc_rdma_put_rw_ctxt(ctxt, &free);
+		rdma_rw_ctx_destroy_bvec(&ctxt->rw_ctx, rdma->sc_qp,
+					 rdma->sc_port_num,
+					 ctxt->rw_bvec, ctxt->rw_nents, dir);
+		if (ctxt->rw_bvec != ctxt->rw_first_bvec)
+			kfree(ctxt->rw_bvec);
 
 		ctxt->rw_node.next = first;
 		first = &ctxt->rw_node;
@@ -414,29 +435,26 @@ static int svc_rdma_post_chunk_ctxt(struct svcxprt_rdma *rdma,
 	return -ENOTCONN;
 }
 
-/* Build and DMA-map an SGL that covers one kvec in an xdr_buf
+/* Build a bvec that covers one kvec in an xdr_buf.
  */
-static void svc_rdma_vec_to_sg(struct svc_rdma_write_info *info,
-			       unsigned int len,
-			       struct svc_rdma_rw_ctxt *ctxt)
+static void svc_rdma_vec_to_bvec(struct svc_rdma_write_info *info,
+				 unsigned int len,
+				 struct svc_rdma_rw_ctxt *ctxt)
 {
-	struct scatterlist *sg = ctxt->rw_sg_table.sgl;
-
-	sg_set_buf(&sg[0], info->wi_base, len);
+	bvec_set_virt(&ctxt->rw_bvec[0], info->wi_base, len);
 	info->wi_base += len;
 
 	ctxt->rw_nents = 1;
 }
 
-/* Build and DMA-map an SGL that covers part of an xdr_buf's pagelist.
+/* Build a bvec array that covers part of an xdr_buf's pagelist.
  */
-static void svc_rdma_pagelist_to_sg(struct svc_rdma_write_info *info,
-				    unsigned int remaining,
-				    struct svc_rdma_rw_ctxt *ctxt)
+static void svc_rdma_pagelist_to_bvec(struct svc_rdma_write_info *info,
+				      unsigned int remaining,
+				      struct svc_rdma_rw_ctxt *ctxt)
 {
-	unsigned int sge_no, sge_bytes, page_off, page_no;
+	unsigned int bvec_idx, bvec_len, page_off, page_no;
 	const struct xdr_buf *xdr = info->wi_xdr;
-	struct scatterlist *sg;
 	struct page **page;
 
 	page_off = info->wi_next_off + xdr->page_base;
@@ -444,21 +462,19 @@ static void svc_rdma_pagelist_to_sg(struct svc_rdma_write_info *info,
 	page_off = offset_in_page(page_off);
 	page = xdr->pages + page_no;
 	info->wi_next_off += remaining;
-	sg = ctxt->rw_sg_table.sgl;
-	sge_no = 0;
+	bvec_idx = 0;
 	do {
-		sge_bytes = min_t(unsigned int, remaining,
-				  PAGE_SIZE - page_off);
-		sg_set_page(sg, *page, sge_bytes, page_off);
-
-		remaining -= sge_bytes;
-		sg = sg_next(sg);
+		bvec_len = min_t(unsigned int, remaining,
+				 PAGE_SIZE - page_off);
+		bvec_set_page(&ctxt->rw_bvec[bvec_idx], *page, bvec_len,
+			      page_off);
+		remaining -= bvec_len;
 		page_off = 0;
-		sge_no++;
+		bvec_idx++;
 		page++;
 	} while (remaining);
 
-	ctxt->rw_nents = sge_no;
+	ctxt->rw_nents = bvec_idx;
 }
 
 /* Construct RDMA Write WRs to send a portion of an xdr_buf containing
@@ -496,7 +512,7 @@ svc_rdma_build_writes(struct svc_rdma_write_info *info,
 		constructor(info, write_len, ctxt);
 		offset = seg->rs_offset + info->wi_seg_off;
 		ret = svc_rdma_rw_ctx_init(rdma, ctxt, offset, seg->rs_handle,
-					   DMA_TO_DEVICE);
+					   write_len, DMA_TO_DEVICE);
 		if (ret < 0)
 			return -EIO;
 		percpu_counter_inc(&svcrdma_stat_write);
@@ -535,7 +551,7 @@ static int svc_rdma_iov_write(struct svc_rdma_write_info *info,
 			      const struct kvec *iov)
 {
 	info->wi_base = iov->iov_base;
-	return svc_rdma_build_writes(info, svc_rdma_vec_to_sg,
+	return svc_rdma_build_writes(info, svc_rdma_vec_to_bvec,
 				     iov->iov_len);
 }
 
@@ -559,7 +575,7 @@ static int svc_rdma_pages_write(struct svc_rdma_write_info *info,
 {
 	info->wi_xdr = xdr;
 	info->wi_next_off = offset - xdr->head[0].iov_len;
-	return svc_rdma_build_writes(info, svc_rdma_pagelist_to_sg,
+	return svc_rdma_build_writes(info, svc_rdma_pagelist_to_bvec,
 				     length);
 }
 
@@ -734,29 +750,29 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
 {
 	struct svcxprt_rdma *rdma = svc_rdma_rqst_rdma(rqstp);
 	struct svc_rdma_chunk_ctxt *cc = &head->rc_cc;
-	unsigned int sge_no, seg_len, len;
+	unsigned int bvec_idx, nr_bvec, seg_len, len, total;
 	struct svc_rdma_rw_ctxt *ctxt;
-	struct scatterlist *sg;
 	int ret;
 
 	len = segment->rs_length;
-	sge_no = PAGE_ALIGN(head->rc_pageoff + len) >> PAGE_SHIFT;
-	ctxt = svc_rdma_get_rw_ctxt(rdma, sge_no);
+	if (check_add_overflow(head->rc_pageoff, len, &total))
+		return -EINVAL;
+	nr_bvec = PAGE_ALIGN(total) >> PAGE_SHIFT;
+	ctxt = svc_rdma_get_rw_ctxt(rdma, nr_bvec);
 	if (!ctxt)
 		return -ENOMEM;
-	ctxt->rw_nents = sge_no;
+	ctxt->rw_nents = nr_bvec;
 
-	sg = ctxt->rw_sg_table.sgl;
-	for (sge_no = 0; sge_no < ctxt->rw_nents; sge_no++) {
+	for (bvec_idx = 0; bvec_idx < ctxt->rw_nents; bvec_idx++) {
 		seg_len = min_t(unsigned int, len,
 				PAGE_SIZE - head->rc_pageoff);
 
 		if (!head->rc_pageoff)
 			head->rc_page_count++;
 
-		sg_set_page(sg, rqstp->rq_pages[head->rc_curpage],
-			    seg_len, head->rc_pageoff);
-		sg = sg_next(sg);
+		bvec_set_page(&ctxt->rw_bvec[bvec_idx],
+			      rqstp->rq_pages[head->rc_curpage],
+			      seg_len, head->rc_pageoff);
 
 		head->rc_pageoff += seg_len;
 		if (head->rc_pageoff == PAGE_SIZE) {
@@ -770,7 +786,8 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
 	}
 
 	ret = svc_rdma_rw_ctx_init(rdma, ctxt, segment->rs_offset,
-				   segment->rs_handle, DMA_FROM_DEVICE);
+				   segment->rs_handle, segment->rs_length,
+				   DMA_FROM_DEVICE);
 	if (ret < 0)
 		return -EIO;
 	percpu_counter_inc(&svcrdma_stat_read);
-- 
2.52.0



* Re: [PATCH v3 0/5] Add a bio_vec based API to core/rw.c
  2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
                   ` (4 preceding siblings ...)
  2026-01-22 22:04 ` [PATCH v3 5/5] svcrdma: use bvec-based RDMA read/write API Chuck Lever
@ 2026-01-23  6:04 ` Zhu Yanjun
  2026-01-23 14:13   ` Chuck Lever
  5 siblings, 1 reply; 21+ messages in thread
From: Zhu Yanjun @ 2026-01-23  6:04 UTC (permalink / raw)
  To: Chuck Lever, Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

On 2026/1/22 14:03, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> This series introduces a bio_vec based API for RDMA read and write
> operations in the RDMA core, eliminating unnecessary scatterlist
> conversions for callers that already work with bvecs.
> 
> Current users of rdma_rw_ctx_init() must convert their native data
> structures into scatterlists. For subsystems like svcrdma that
> maintain data in bvec format, this conversion adds overhead both in
> CPU cycles and memory footprint. The new API accepts bvec arrays
> directly.
> 
> For hardware RDMA devices, the implementation uses the IOVA-based
> DMA mapping API to reduce IOTLB synchronization overhead from O(n)
> per-page syncs to a single O(1) sync after all mappings complete.
> Software RDMA devices (rxe, siw) continue using virtual addressing.
> 
> The series includes MR registration support for bvec arrays,
> enabling iWARP devices and the force_mr debug parameter. The MR
> path reuses existing ib_map_mr_sg() infrastructure by constructing
> a synthetic scatterlist from the bvec DMA addresses.

Hi, Chuck Lever

I’ve read through the patch series. As I understand it, the new 
bio_vec–based RDMA read/write API allows callers that already operate on 
bvecs (for example, svcrdma and potentially NVMe-oF) to avoid converting 
their data into scatterlists, which should reduce CPU overhead and 
memory usage in the data path.

For hardware RDMA devices, the use of the IOVA-based DMA mapping API 
also seems likely to reduce IOTLB synchronization overhead compared to 
the existing per-page approach, while software devices (rxe, siw) retain 
the current virtual-addressing model.

Do you happen to have any performance or functional test results you
could share for this series? In particular:

- Hardware RDMA devices (e.g., latency, bandwidth, or CPU utilization
  changes), and/or

- Software RDMA devices such as rxe or siw?

Any data points or qualitative observations would be very helpful for 
evaluating the impact of the new API.

Zhu Yanjun

> 
> The final patch adds the first consumer for the new API: svcrdma.
> 
> Based on v6.19-rc6.
> 
> ---
> 
> Changes since v2:
> - Add bvec iter arguments to the new API
> - Add a synthetic SGL in the MR mapping function
> - Try IOVA coalescing before max_sgl_rd triggers MR in bvec path
> - Attempt once again to address SQ/CQ/max_rdma_ctxs sizing issues
> 
> Changes since v1:
> - Simplify rw.c by using bvec iters internally
> - IOVA mapping produces a contiguous DMA address range
> - Clarify the comment that documents struct svc_rdma_rw_ctxt
> - svcrdma now uses pre-allocated bio_vec arrays
> 
> Chuck Lever (5):
>    RDMA/core: add bio_vec based RDMA read/write API
>    RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
>    RDMA/core: add MR support for bvec-based RDMA operations
>    RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
>    svcrdma: use bvec-based RDMA read/write API
> 
>   drivers/infiniband/core/rw.c             | 591 ++++++++++++++++++++---
>   drivers/infiniband/ulp/isert/ib_isert.c  |   4 +-
>   drivers/nvme/target/rdma.c               |   4 +-
>   include/rdma/ib_verbs.h                  |  42 ++
>   include/rdma/rw.h                        |  36 +-
>   net/sunrpc/xprtrdma/svc_rdma_rw.c        | 155 +++---
>   net/sunrpc/xprtrdma/svc_rdma_transport.c |   8 +-
>   7 files changed, 699 insertions(+), 141 deletions(-)
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API
  2026-01-22 22:03 ` [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
@ 2026-01-23  6:26   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-23  6:26 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

> +static int rdma_rw_init_map_wrs_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
> +		const struct bio_vec *bvecs, u32 nr_bvec, struct bvec_iter *iter,

Overly long line here.

> +		for (j = 0; j < nr_sge; j++) {
> +			const struct bio_vec *base = __bvec_iter_bvec(bvecs, *iter);

Overly long line.

> +			unsigned int offset = iter->bi_bvec_done;
> +			unsigned int len = min(iter->bi_size,
> +					       base->bv_len - offset);
> +			struct bio_vec bv = {
> +				.bv_page = base->bv_page,
> +				.bv_len = len,
> +				.bv_offset = base->bv_offset + offset,
> +			};

Why is this open coding mp_bvec_iter_bvec?

> +static inline u64 ib_dma_map_bvec(struct ib_device *dev,
> +				  const struct bio_vec *bvec,
> +				  enum dma_data_direction direction)
> +{
> +	if (ib_uses_virt_dma(dev))
> +		return (uintptr_t)(page_address(bvec->bv_page) + bvec->bv_offset);

Overly long line here, which could be fixed by just using bvec_virt().



* Re: [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  2026-01-22 22:03 ` [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
@ 2026-01-23  6:28   ` Christoph Hellwig
  2026-01-23 15:04     ` Chuck Lever
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-23  6:28 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

> +	/* Link all bvecs into the IOVA space */
> +	link_iter = *iter;
> +	while (link_iter.bi_size) {
> +		struct bio_vec bv = mp_bvec_iter_bvec(bvec, link_iter);
> +
> +		ret = dma_iova_link(dma_dev, &ctx->iova.state, bvec_phys(&bv),
> +				    mapped_len, bv.bv_len, dir, 0);
> +		if (ret)
> +			goto out_destroy;
> +
> +		mapped_len += bv.bv_len;
> +		bvec_iter_advance(bvec, &link_iter, bv.bv_len);
> +	}

Why is this using a local link_iter?  We're not using iter later.



* Re: [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-22 22:03 ` [PATCH v3 3/5] RDMA/core: add MR support for bvec-based " Chuck Lever
@ 2026-01-23  6:36   ` Christoph Hellwig
  2026-01-23 15:06     ` Chuck Lever
  2026-01-23 16:47     ` Chuck Lever
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-23  6:36 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

> +/*
> + * Check if the device requires memory registration for RDMA READs.
> + * iWARP always requires MR for RDMA READ due to protocol limitations.
> + */
> +static inline bool rdma_rw_io_requires_mr(struct ib_device *dev, u32 port_num,

>  static inline bool rdma_rw_io_needs_mr(struct ib_device *dev, u32 port_num,
>  		enum dma_data_direction dir, int dma_nents)

I find the naming really confusing here.  I guess "requires" means the
protocol (iWARP) doesn't work without an MR, while "needs" means we
need one for the number of entries.

And the new API requires the ULP to size the mapping request to never
hit the latter case?

Maybe just kill off the old rdma_rw_io_needs_mr and open code the
latter case in the only user?


>  	for (i = 0; i < ctx->nr_ops; i++) {
> -		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];

Jumping ahead here - why can't the sgtable be stored in ->reg
without renaming?  Is there a case where we need it, but not the
rest of reg?

> +	ctx->nr_ops = DIV_ROUND_UP(ctx->reg.sgt.nents, pages_per_mr);
> +	ctx->reg.ctx = kcalloc(ctx->nr_ops, sizeof(*ctx->reg.ctx), GFP_KERNEL);
> +	if (!ctx->reg.ctx) {
> +		ret = -ENOMEM;
> +		goto out_unmap_sgt;
> +	}
> +
> +	sg = ctx->reg.sgt.sgl;
> +	nents = ctx->reg.sgt.nents;
> +	for (i = 0; i < ctx->nr_ops; i++) {
> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
> +		u32 sge_cnt = min(nents, pages_per_mr);
> +
> +		ret = rdma_rw_init_one_mr(qp, port_num, reg, sg, sge_cnt, 0);

I guess you looked into that, but never replied, but this still
looks like it duplicates most of rdma_rw_init_mr_wrs.  Is there something
that prevents reusing that directly or with minor refactoring?

> +	memcpy(ctx->reg.ctx->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));

Overly long line.  But this also shows an issue that the details of the
rw context leak for the later added signature MR support :P


* Re: [PATCH v3 4/5] RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
  2026-01-22 22:04 ` [PATCH v3 4/5] RDMA/core: add rdma_rw_max_sge() helper for SQ sizing Chuck Lever
@ 2026-01-23  6:36   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-23  6:36 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 0/5] Add a bio_vec based API to core/rw.c
  2026-01-23  6:04 ` [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Zhu Yanjun
@ 2026-01-23 14:13   ` Chuck Lever
  2026-01-24 18:19     ` Zhu Yanjun
  2026-01-26 17:13     ` Jason Gunthorpe
  0 siblings, 2 replies; 21+ messages in thread
From: Chuck Lever @ 2026-01-23 14:13 UTC (permalink / raw)
  To: Zhu Yanjun, Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

On 1/23/26 1:04 AM, Zhu Yanjun wrote:
> 在 2026/1/22 14:03, Chuck Lever 写道:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> This series introduces a bio_vec based API for RDMA read and write
>> operations in the RDMA core, eliminating unnecessary scatterlist
>> conversions for callers that already work with bvecs.
>>
>> Current users of rdma_rw_ctx_init() must convert their native data
>> structures into scatterlists. For subsystems like svcrdma that
>> maintain data in bvec format, this conversion adds overhead both in
>> CPU cycles and memory footprint. The new API accepts bvec arrays
>> directly.
>>
>> For hardware RDMA devices, the implementation uses the IOVA-based
>> DMA mapping API to reduce IOTLB synchronization overhead from O(n)
>> per-page syncs to a single O(1) sync after all mappings complete.
>> Software RDMA devices (rxe, siw) continue using virtual addressing.
>>
>> The series includes MR registration support for bvec arrays,
>> enabling iWARP devices and the force_mr debug parameter. The MR
>> path reuses existing ib_map_mr_sg() infrastructure by constructing
>> a synthetic scatterlist from the bvec DMA addresses.
> 
> Hi, Chuck Lever
> 
> I’ve read through the patch series. As I understand it, the new bio_vec–
> based RDMA read/write API allows callers that already operate on bvecs
> (for example, svcrdma and potentially NVMe-oF) to avoid converting their
> data into scatterlists, which should reduce CPU overhead and memory
> usage in the data path.
> 
> For hardware RDMA devices, the use of the IOVA-based DMA mapping API
> also seems likely to reduce IOTLB synchronization overhead compared to
> the existing per-page approach, while software devices (rxe, siw) retain
> the current virtual-addressing model.
> 
> Do you happen to have any performance or functional test results you
> could share for this series, in particular:
> 
> Hardware RDMA devices (e.g., latency, bandwidth, or CPU utilization
> changes), and/or

Functional tests with CX-5 Infiniband and NFS/RDMA show no regression.

Performance tests are difficult to evaluate because I don't have a
multi-client set-up here to drive a heavy workload, plus filesystems
bottleneck long before the network transport does. The changes are
designed to improve scalability (e.g. lower CPU utilization for the same
workload and less interaction between host and RNIC) more than improve
raw throughput. So far I have seen no throughput regression and perhaps
a bit of improvement for tail latencies.

The main purpose of the series, however, is to further an effort to
enable kernel-wide replacement of scatter-gather lists, which are
technical debt. Socket APIs already support struct bio_vec.


> Software RDMA devices such as rxe or siw?

Software providers are not likely to see much change. However, you will
need to test the series with your own preferred configuration and
workload to assess performance and scalability delta.


-- 
Chuck Lever


* Re: [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  2026-01-23  6:28   ` Christoph Hellwig
@ 2026-01-23 15:04     ` Chuck Lever
  2026-01-26  6:14       ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-23 15:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Leon Romanovsky, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma, linux-nfs,
	Chuck Lever



On Fri, Jan 23, 2026, at 1:28 AM, Christoph Hellwig wrote:
>> +	/* Link all bvecs into the IOVA space */
>> +	link_iter = *iter;
>> +	while (link_iter.bi_size) {
>> +		struct bio_vec bv = mp_bvec_iter_bvec(bvec, link_iter);
>> +
>> +		ret = dma_iova_link(dma_dev, &ctx->iova.state, bvec_phys(&bv),
>> +				    mapped_len, bv.bv_len, dir, 0);
>> +		if (ret)
>> +			goto out_destroy;
>> +
>> +		mapped_len += bv.bv_len;
>> +		bvec_iter_advance(bvec, &link_iter, bv.bv_len);
>> +	}
>
> Why is this using a local link_iter?  We're not using iter later.

I think we don't want to leak a partially-updated iter if the
API call returns an error.


-- 
Chuck Lever


* Re: [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-23  6:36   ` Christoph Hellwig
@ 2026-01-23 15:06     ` Chuck Lever
  2026-01-26  6:17       ` Christoph Hellwig
  2026-01-23 16:47     ` Chuck Lever
  1 sibling, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-23 15:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Leon Romanovsky, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma, linux-nfs,
	Chuck Lever



On Fri, Jan 23, 2026, at 1:36 AM, Christoph Hellwig wrote:

>> +	ctx->nr_ops = DIV_ROUND_UP(ctx->reg.sgt.nents, pages_per_mr);
>> +	ctx->reg.ctx = kcalloc(ctx->nr_ops, sizeof(*ctx->reg.ctx), GFP_KERNEL);
>> +	if (!ctx->reg.ctx) {
>> +		ret = -ENOMEM;
>> +		goto out_unmap_sgt;
>> +	}
>> +
>> +	sg = ctx->reg.sgt.sgl;
>> +	nents = ctx->reg.sgt.nents;
>> +	for (i = 0; i < ctx->nr_ops; i++) {
>> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
>> +		u32 sge_cnt = min(nents, pages_per_mr);
>> +
>> +		ret = rdma_rw_init_one_mr(qp, port_num, reg, sg, sge_cnt, 0);
>
> I guess you looked into that, but never replied, but this still
> looks like it duplicates most of rdma_rw_init_mr_wrs.  Is there something
> that prevents reusing that directly or with minor refactoring?

IIRC I interpreted your earlier review comment as "let's defer that clean-up".
I'll look at it again.


-- 
Chuck Lever


* Re: [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-23  6:36   ` Christoph Hellwig
  2026-01-23 15:06     ` Chuck Lever
@ 2026-01-23 16:47     ` Chuck Lever
  2026-01-26  6:16       ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Chuck Lever @ 2026-01-23 16:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Leon Romanovsky, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma, linux-nfs,
	Chuck Lever

On Fri, Jan 23, 2026, at 1:36 AM, Christoph Hellwig wrote:
>>  	for (i = 0; i < ctx->nr_ops; i++) {
>> -		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
>> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
>
> Jumping ahead here - why can't the sgtable be stored in ->reg
> without renaming?  Is there a case where we need it, but not the
> rest of reg?

I think the answer is yes, with bvec, both fields are needed at
the same time. My preference is to go back to the early form of
the structure without a union, since there are API consumers who
access the reg field directly. Let me know your thoughts.


-- 
Chuck Lever


* Re: [PATCH v3 0/5] Add a bio_vec based API to core/rw.c
  2026-01-23 14:13   ` Chuck Lever
@ 2026-01-24 18:19     ` Zhu Yanjun
  2026-01-26 17:13     ` Jason Gunthorpe
  1 sibling, 0 replies; 21+ messages in thread
From: Zhu Yanjun @ 2026-01-24 18:19 UTC (permalink / raw)
  To: Chuck Lever, Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-rdma, linux-nfs, Chuck Lever

在 2026/1/23 6:13, Chuck Lever 写道:
> On 1/23/26 1:04 AM, Zhu Yanjun wrote:
>> 在 2026/1/22 14:03, Chuck Lever 写道:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> This series introduces a bio_vec based API for RDMA read and write
>>> operations in the RDMA core, eliminating unnecessary scatterlist
>>> conversions for callers that already work with bvecs.
>>>
>>> Current users of rdma_rw_ctx_init() must convert their native data
>>> structures into scatterlists. For subsystems like svcrdma that
>>> maintain data in bvec format, this conversion adds overhead both in
>>> CPU cycles and memory footprint. The new API accepts bvec arrays
>>> directly.
>>>
>>> For hardware RDMA devices, the implementation uses the IOVA-based
>>> DMA mapping API to reduce IOTLB synchronization overhead from O(n)
>>> per-page syncs to a single O(1) sync after all mappings complete.
>>> Software RDMA devices (rxe, siw) continue using virtual addressing.
>>>
>>> The series includes MR registration support for bvec arrays,
>>> enabling iWARP devices and the force_mr debug parameter. The MR
>>> path reuses existing ib_map_mr_sg() infrastructure by constructing
>>> a synthetic scatterlist from the bvec DMA addresses.
>>
>> Hi, Chuck Lever
>>
>> I’ve read through the patch series. As I understand it, the new bio_vec–
>> based RDMA read/write API allows callers that already operate on bvecs
>> (for example, svcrdma and potentially NVMe-oF) to avoid converting their
>> data into scatterlists, which should reduce CPU overhead and memory
>> usage in the data path.
>>
>> For hardware RDMA devices, the use of the IOVA-based DMA mapping API
>> also seems likely to reduce IOTLB synchronization overhead compared to
>> the existing per-page approach, while software devices (rxe, siw) retain
>> the current virtual-addressing model.
>>
>> Do you happen to have any performance or functional test results you
>> could share for this series, in particular:
>>
>> Hardware RDMA devices (e.g., latency, bandwidth, or CPU utilization
>> changes), and/or
> 
> Functional tests with CX-5 Infiniband and NFS/RDMA show no regression.
> 
> Performance tests are difficult to evaluate because I don't have a
> multi-client set-up here to drive a heavy workload, plus filesystems
> bottleneck long before the network transport does. The changes are
> designed to improve scalability (e.g. lower CPU utilization for the same
> workload and less interaction between host and RNIC) more than improve
> raw throughput. So far I have seen no throughput regression and perhaps
> a bit of improvement for tail latencies.

Thanks a lot. Based on the code changes, this patch series should 
improve performance. Unfortunately, due to various limitations, we are 
unable to provide performance test results.

Best Regards,
Zhu Yanjun

> 
> The main purpose of the series, however, is to further an effort to
> enable kernel-wide replacement of scatter-gather lists, which are
> technical debt. Socket APIs already support struct bio_vec.
> 
> 
>> Software RDMA devices such as rxe or siw?
> 
> Software providers are not likely to see much change. However, you will
> need to test the series with your own preferred configuration and
> workload to assess performance and scalability delta.
> 
> 



* Re: [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
  2026-01-23 15:04     ` Chuck Lever
@ 2026-01-26  6:14       ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-26  6:14 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jason Gunthorpe, Leon Romanovsky, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

On Fri, Jan 23, 2026 at 10:04:07AM -0500, Chuck Lever wrote:
> On Fri, Jan 23, 2026, at 1:28 AM, Christoph Hellwig wrote:
> >> +	/* Link all bvecs into the IOVA space */
> >> +	link_iter = *iter;
> >> +	while (link_iter.bi_size) {
> >> +		struct bio_vec bv = mp_bvec_iter_bvec(bvec, link_iter);
> >> +
> >> +		ret = dma_iova_link(dma_dev, &ctx->iova.state, bvec_phys(&bv),
> >> +				    mapped_len, bv.bv_len, dir, 0);
> >> +		if (ret)
> >> +			goto out_destroy;
> >> +
> >> +		mapped_len += bv.bv_len;
> >> +		bvec_iter_advance(bvec, &link_iter, bv.bv_len);
> >> +	}
> >
> > Why is this using a local link_iter?  We're not using iter later.
> 
> I think we don't want to leak a partially-updated iter if the
> API call returns an error.

That's how all the block layer bvec_iter-based API work.  The
functions consume the iter.  If a caller needs to save it for
some reason, it stashes away a copy.



* Re: [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-23 16:47     ` Chuck Lever
@ 2026-01-26  6:16       ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-26  6:16 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jason Gunthorpe, Leon Romanovsky, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

On Fri, Jan 23, 2026 at 11:47:46AM -0500, Chuck Lever wrote:
> On Fri, Jan 23, 2026, at 1:36 AM, Christoph Hellwig wrote:
> >>  	for (i = 0; i < ctx->nr_ops; i++) {
> >> -		struct rdma_rw_reg_ctx *reg = &ctx->reg[i];
> >> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
> >
> > Jumping ahead here - why can't the sgtable be stored in ->reg
> > without renaming?  Is there a case where we need it, but not the
> > rest of reg?
> 
> I think the answer is yes, with bvec, both fields are needed at
> the same time. My preference is to go back to the early form of
> the structure without a union, since there are API consumers who
> access the reg field directly. Let me know your thoughts.

What I don't understand is why it can't be added to
struct rdma_rw_reg_ctx.  Are there any uses of the new fields that
don't have that allocated?  If yes, just adding the new fields outside
the union seems to cause the least churn for now, although I'd still
want to clean it up later eventually.



* Re: [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-23 15:06     ` Chuck Lever
@ 2026-01-26  6:17       ` Christoph Hellwig
  2026-01-26 16:48         ` Chuck Lever
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2026-01-26  6:17 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Jason Gunthorpe, Leon Romanovsky, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

On Fri, Jan 23, 2026 at 10:06:36AM -0500, Chuck Lever wrote:
> >> +	nents = ctx->reg.sgt.nents;
> >> +	for (i = 0; i < ctx->nr_ops; i++) {
> >> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
> >> +		u32 sge_cnt = min(nents, pages_per_mr);
> >> +
> >> +		ret = rdma_rw_init_one_mr(qp, port_num, reg, sg, sge_cnt, 0);
> >
> > I guess you looked into that, but never replied, but this still
> > looks like it duplicates most of rdma_rw_init_mr_wrs.  Is there something
> > that prevents reusing that directly or with minor refactoring?
> 
> IIRC I interpreted your earlier review comment as "let's defer that clean-up".
> I'll look at it again.

I was hoping we could just reuse the code instead of duplicating it.
The one earlier comment was about sharing code where we actually
pass in the bvec and scatterlist, and that might or might not be useful.
But for this case where we have to actually create a scatterlist first,
not reusing the existing scatterlist code to then work on it feels
wrong.



* Re: [PATCH v3 3/5] RDMA/core: add MR support for bvec-based RDMA operations
  2026-01-26  6:17       ` Christoph Hellwig
@ 2026-01-26 16:48         ` Chuck Lever
  0 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2026-01-26 16:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Leon Romanovsky, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma, linux-nfs,
	Chuck Lever



On Mon, Jan 26, 2026, at 1:17 AM, Christoph Hellwig wrote:
> On Fri, Jan 23, 2026 at 10:06:36AM -0500, Chuck Lever wrote:
>> >> +	nents = ctx->reg.sgt.nents;
>> >> +	for (i = 0; i < ctx->nr_ops; i++) {
>> >> +		struct rdma_rw_reg_ctx *reg = &ctx->reg.ctx[i];
>> >> +		u32 sge_cnt = min(nents, pages_per_mr);
>> >> +
>> >> +		ret = rdma_rw_init_one_mr(qp, port_num, reg, sg, sge_cnt, 0);
>> >
>> > I guess you looked into that, but never replied, but this still
>> > looks like it duplicates most of rdma_rw_init_mr_wrs.  Is there something
>> > that prevents reusing that directly or with minor refactoring?
>> 
>> IIRC I interpreted your earlier review comment as "let's defer that clean-up".
>> I'll look at it again.
>
> I was hoping we could just reuse the code instead of duplicating it.
> The one earlier comment was about sharing code where we actually
> pass in the bvec and scatterlist, and that might or might not be useful.
> But for this case where we have to actually create a scatterlist first,
> not reusing the existing scatterlist code to then work on it feels
> wrong.

I added a helper that refactors out the common logic. But I did
that a few days ago and all of the context has dribbled out of
my brain. I'll post a v4 soon, and we can continue from there.


-- 
Chuck Lever


* Re: [PATCH v3 0/5] Add a bio_vec based API to core/rw.c
  2026-01-23 14:13   ` Chuck Lever
  2026-01-24 18:19     ` Zhu Yanjun
@ 2026-01-26 17:13     ` Jason Gunthorpe
  1 sibling, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 17:13 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Zhu Yanjun, Leon Romanovsky, Christoph Hellwig, NeilBrown,
	Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-rdma,
	linux-nfs, Chuck Lever

On Fri, Jan 23, 2026 at 09:13:47AM -0500, Chuck Lever wrote:

> The main purpose of the series, however, is part of an effort to enable
> kernel-wide replacement of the use of scatter-gather lists, which are
> technical debt. Socket APIs already support struct bio_vec.

I haven't imagined tree wide, but I am envisioning a world where a
modern server runs all its primary IO paths without using
scatterlist..

Jason


end of thread, other threads:[~2026-01-26 17:13 UTC | newest]

Thread overview: 21+ messages
2026-01-22 22:03 [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Chuck Lever
2026-01-22 22:03 ` [PATCH v3 1/5] RDMA/core: add bio_vec based RDMA read/write API Chuck Lever
2026-01-23  6:26   ` Christoph Hellwig
2026-01-22 22:03 ` [PATCH v3 2/5] RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations Chuck Lever
2026-01-23  6:28   ` Christoph Hellwig
2026-01-23 15:04     ` Chuck Lever
2026-01-26  6:14       ` Christoph Hellwig
2026-01-22 22:03 ` [PATCH v3 3/5] RDMA/core: add MR support for bvec-based " Chuck Lever
2026-01-23  6:36   ` Christoph Hellwig
2026-01-23 15:06     ` Chuck Lever
2026-01-26  6:17       ` Christoph Hellwig
2026-01-26 16:48         ` Chuck Lever
2026-01-23 16:47     ` Chuck Lever
2026-01-26  6:16       ` Christoph Hellwig
2026-01-22 22:04 ` [PATCH v3 4/5] RDMA/core: add rdma_rw_max_sge() helper for SQ sizing Chuck Lever
2026-01-23  6:36   ` Christoph Hellwig
2026-01-22 22:04 ` [PATCH v3 5/5] svcrdma: use bvec-based RDMA read/write API Chuck Lever
2026-01-23  6:04 ` [PATCH v3 0/5] Add a bio_vec based API to core/rw.c Zhu Yanjun
2026-01-23 14:13   ` Chuck Lever
2026-01-24 18:19     ` Zhu Yanjun
2026-01-26 17:13     ` Jason Gunthorpe
