* [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
@ 2026-03-13 19:41 Chuck Lever
2026-03-13 19:41 ` [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion Chuck Lever
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-13 19:41 UTC (permalink / raw)
To: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
This series now carries two MR exhaustion fixes and a proposal for
using contiguous pages for RDMA Read sink buffers in svcrdma.
Fixes for the MR exhaustion issues should go into 7.0-rc and stable,
and the contiguous page patches can wait for the next merge window.
Base commit: v7.0-rc3
---
Changes since v2:
- Fix similar exhaustion issue for SGL
- Add patch that introduces svc_rqst_page_release
Changes since v1:
- Clarify code comments
- Allocate contiguous pages for RDMA Read sink buffers
Chuck Lever (4):
RDMA/rw: Fall back to direct SGE on MR pool exhaustion
RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
SUNRPC: Add svc_rqst_page_release() helper
svcrdma: Use contiguous pages for RDMA Read sink buffers
drivers/infiniband/core/rw.c | 43 ++++--
include/linux/sunrpc/svc.h | 15 ++
net/sunrpc/svc.c | 7 +-
net/sunrpc/svcsock.c | 2 +-
net/sunrpc/xprtrdma/svc_rdma_rw.c | 220 ++++++++++++++++++++++++++++++
5 files changed, 268 insertions(+), 19 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion
2026-03-13 19:41 [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
@ 2026-03-13 19:41 ` Chuck Lever
2026-03-17 14:24 ` Christoph Hellwig
2026-03-13 19:41 ` [PATCH v3 2/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
` (3 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2026-03-13 19:41 UTC (permalink / raw)
To: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs()
produces no coalescing: each scatterlist page maps 1:1 to a DMA
entry, so sgt.nents equals the raw page count. A 1 MB transfer
yields 256 DMA entries. If that count exceeds the device's
max_sgl_rd threshold (an optimization hint from mlx5 firmware),
rdma_rw_io_needs_mr() steers the operation into the MR
registration path. Each such operation consumes one or more MRs
from a pool sized at max_rdma_ctxs -- roughly one MR per
concurrent context. Under write-intensive workloads that issue
many concurrent RDMA READs, the pool is rapidly exhausted,
ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns
-EAGAIN. Upper layer protocols treat this as a fatal DMA mapping
failure and tear down the connection.
The max_sgl_rd check is a performance optimization, not a
correctness requirement: the device can handle large SGE counts
via direct posting, just less efficiently than with MR
registration. When the MR pool cannot satisfy a request, falling
back to the direct SGE (map_wrs) path avoids the connection
reset while preserving the MR optimization for the common case
where pool resources are available.
Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from
rdma_rw_init_mr_wrs() triggers direct SGE posting instead of
propagating the error. iWARP devices, which mandate MR
registration for RDMA READs, and force_mr debug mode continue
to treat -EAGAIN as terminal.
Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
drivers/infiniband/core/rw.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index fc45c384833f..c01d5e605053 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -608,14 +608,29 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
if (rdma_rw_io_needs_mr(qp->device, port_num, dir, sg_cnt)) {
ret = rdma_rw_init_mr_wrs(ctx, qp, port_num, sg, sg_cnt,
sg_offset, remote_addr, rkey, dir);
- } else if (sg_cnt > 1) {
- ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset,
- remote_addr, rkey, dir);
- } else {
- ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
- remote_addr, rkey, dir);
+ /*
+ * If MR init succeeded or failed for a reason other
+ * than pool exhaustion, that result is final.
+ *
+	 * Pool exhaustion (-EAGAIN) is recoverable when MRs
+	 * are merely the max_sgl_rd optimization: fall back
+	 * to direct SGE posting. iWARP and force_mr need MRs
+	 * unconditionally, so for them -EAGAIN is terminal.
+ */
+ if (ret != -EAGAIN ||
+ rdma_protocol_iwarp(qp->device, port_num) ||
+ unlikely(rdma_rw_force_mr))
+ goto out;
}
+ if (sg_cnt > 1)
+ ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset,
+ remote_addr, rkey, dir);
+ else
+ ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
+ remote_addr, rkey, dir);
+
+out:
if (ret < 0)
goto out_unmap_sg;
return ret;
--
2.53.0
* [PATCH v3 2/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-13 19:41 [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-13 19:41 ` [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion Chuck Lever
@ 2026-03-13 19:41 ` Chuck Lever
2026-03-17 14:25 ` Christoph Hellwig
2026-03-13 19:42 ` [PATCH v3 3/4] SUNRPC: Add svc_rqst_page_release() helper Chuck Lever
` (2 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2026-03-13 19:41 UTC (permalink / raw)
To: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When IOVA-based DMA mapping is unavailable (e.g., IOMMU
passthrough mode), rdma_rw_ctx_init_bvec() falls back to
checking rdma_rw_io_needs_mr() with the raw bvec count.
Unlike the scatterlist path in rdma_rw_ctx_init(), which
passes a post-DMA-mapping entry count that reflects
coalescing of physically contiguous pages, the bvec path
passes the pre-mapping page count. This overstates the
number of DMA entries, causing every multi-bvec RDMA READ
to consume an MR from the QP's pool.
Under NFS WRITE workloads the server performs RDMA READs
to pull data from the client. With the inflated MR demand,
the pool is rapidly exhausted, ib_mr_pool_get() returns
NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma
treats this as a DMA mapping failure, closes the connection,
and the client reconnects -- producing a cycle of 71% RPC
retransmissions and ~100 reconnections per test run. RDMA
WRITEs (NFS READ direction) are unaffected because
DMA_TO_DEVICE never triggers the max_sgl_rd check.
Remove the rdma_rw_io_needs_mr() gate from the bvec path
entirely, so that bvec RDMA operations always use the
map_wrs path (direct WR posting without MR allocation).
The bvec caller has no post-DMA-coalescing segment count
available -- xdr_buf and svc_rqst hold pages as individual
pointers, and physical contiguity is discovered only during
DMA mapping -- so the raw page count cannot serve as a
reliable input to rdma_rw_io_needs_mr(). iWARP devices,
which require MRs unconditionally, are handled by an
earlier check in rdma_rw_ctx_init_bvec() and are unaffected.
Fixes: bea28ac14cab ("RDMA/core: add MR support for bvec-based RDMA operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
drivers/infiniband/core/rw.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c01d5e605053..4fafe393a48c 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -701,14 +701,16 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return ret;
/*
- * IOVA mapping not available. Check if MR registration provides
- * better performance than multiple SGE entries.
+ * IOVA not available; fall back to the map_wrs path, which maps
+ * each bvec as a direct SGE. This is always correct: the MR path
+ * is a throughput optimization, not a correctness requirement.
+ * (iWARP, which does require MRs, is handled by the check above.)
+ *
+ * The rdma_rw_io_needs_mr() gate is not used here because nr_bvec
+ * is a raw page count that overstates DMA entry demand -- the bvec
+ * caller has no post-DMA-coalescing segment count, and feeding the
+ * inflated count into the MR path exhausts the pool on RDMA READs.
*/
- if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
- return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
- nr_bvec, &iter, remote_addr,
- rkey, dir);
-
return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
remote_addr, rkey, dir);
}
--
2.53.0
* [PATCH v3 3/4] SUNRPC: Add svc_rqst_page_release() helper
2026-03-13 19:41 [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-13 19:41 ` [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion Chuck Lever
2026-03-13 19:41 ` [PATCH v3 2/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
@ 2026-03-13 19:42 ` Chuck Lever
2026-03-17 14:26 ` Christoph Hellwig
2026-03-13 19:42 ` [PATCH v3 4/4] svcrdma: Use contiguous pages for RDMA Read sink buffers Chuck Lever
2026-03-16 20:15 ` [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Leon Romanovsky
4 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2026-03-13 19:42 UTC (permalink / raw)
To: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When replacing rq_pages[] entries during RPC processing,
old pages are queued in a per-rqst folio batch rather than
released individually. The add-or-flush sequence is
open-coded, exposing folio batch internals to callers, and
svc_tcp_restore_pages() still frees replaced pages one at
a time.
Introduce svc_rqst_page_release() to encapsulate the
batched release mechanism. Convert the call sites in
svc_rqst_replace_page() and svc_tcp_restore_pages().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 15 +++++++++++++++
net/sunrpc/svc.c | 7 ++-----
net/sunrpc/svcsock.c | 2 +-
3 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4dc14c7a711b..7a5c9433fda3 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -483,6 +483,21 @@ int svc_generic_rpcbind_set(struct net *net,
#define RPC_MAX_ADDRBUFLEN (63U)
+/**
+ * svc_rqst_page_release - release a page associated with an RPC transaction
+ * @rqstp: RPC transaction context
+ * @page: page to release
+ *
+ * Released pages are batched and freed together, reducing
+ * allocator pressure under heavy RPC workloads.
+ */
+static inline void svc_rqst_page_release(struct svc_rqst *rqstp,
+ struct page *page)
+{
+ if (!folio_batch_add(&rqstp->rq_fbatch, page_folio(page)))
+ __folio_batch_release(&rqstp->rq_fbatch);
+}
+
/*
* When we want to reduce the size of the reserved space in the response
* buffer, we need to take into account the size of any checksum data that
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index d8ccb8e4b5c2..3e57959c1779 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -955,11 +955,8 @@ bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
return false;
}
- if (*rqstp->rq_next_page) {
- if (!folio_batch_add(&rqstp->rq_fbatch,
- page_folio(*rqstp->rq_next_page)))
- __folio_batch_release(&rqstp->rq_fbatch);
- }
+ if (*rqstp->rq_next_page)
+ svc_rqst_page_release(rqstp, *rqstp->rq_next_page);
get_page(page);
*(rqstp->rq_next_page++) = page;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f28c6076f7e8..ce28af88e632 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -994,7 +994,7 @@ static size_t svc_tcp_restore_pages(struct svc_sock *svsk,
npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
for (i = 0; i < npages; i++) {
if (rqstp->rq_pages[i] != NULL)
- put_page(rqstp->rq_pages[i]);
+ svc_rqst_page_release(rqstp, rqstp->rq_pages[i]);
BUG_ON(svsk->sk_pages[i] == NULL);
rqstp->rq_pages[i] = svsk->sk_pages[i];
svsk->sk_pages[i] = NULL;
--
2.53.0
* [PATCH v3 4/4] svcrdma: Use contiguous pages for RDMA Read sink buffers
2026-03-13 19:41 [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
` (2 preceding siblings ...)
2026-03-13 19:42 ` [PATCH v3 3/4] SUNRPC: Add svc_rqst_page_release() helper Chuck Lever
@ 2026-03-13 19:42 ` Chuck Lever
2026-03-17 14:28 ` Christoph Hellwig
2026-03-16 20:15 ` [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Leon Romanovsky
4 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2026-03-13 19:42 UTC (permalink / raw)
To: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever, Christoph Hellwig
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_build_read_segment() constructs RDMA Read sink
buffers by consuming pages one-at-a-time from rq_pages[]
and building one bvec per page. A 64KB NFS READ payload
produces 16 separate bvecs, 16 DMA mappings, and
potentially multiple RDMA Read WRs.
A single higher-order allocation followed by split_page()
yields physically contiguous memory while preserving
per-page refcounts. A single bvec spanning the contiguous
range causes rdma_rw_ctx_init_bvec() to take the
rdma_rw_init_single_wr_bvec() fast path: one DMA mapping,
one SGE, one WR.
The split sub-pages replace the original rq_pages[] entries,
so all downstream page tracking, completion handling, and
xdr_buf assembly remain unchanged.
Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back
through decreasing orders. If even order-1 fails, the
existing per-page path handles the segment.
When nr_pages is not a power of two, get_order() rounds up
and the allocation yields more pages than needed. The extra
split pages replace existing rq_pages[] entries (freed via
put_page() first), so there is no net increase in per-
request page consumption. Successive segments reuse the
same padding slots, preventing accumulation. The
rq_maxpages guard rejects any allocation that would
overrun the array, falling back to the per-page path.
Under memory pressure, __GFP_NORETRY causes the higher-
order allocation to fail without stalling.
The contiguous path is attempted when the segment starts
page-aligned (rc_pageoff == 0) and spans at least two
pages. NFS WRITE segments carry application-modified byte
ranges of arbitrary length, so the optimization is not
restricted to power-of-two page counts.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_rw.c | 220 ++++++++++++++++++++++++++++++
1 file changed, 220 insertions(+)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 4ec2f9ae06aa..63fcf677c96c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -732,6 +732,216 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
return xdr->len;
}
+#if PAGE_SIZE < SZ_64K
+
+/*
+ * Limit contiguous RDMA Read sink allocations to 64KB
+ * (order-4 on 4KB-page systems). Higher orders risk
+ * allocation failure under __GFP_NORETRY, which would
+ * negate the benefit of the contiguous fast path.
+ */
+#define SVC_RDMA_CONTIG_MAX_ORDER get_order(SZ_64K)
+
+/**
+ * svc_rdma_alloc_read_pages - Allocate physically contiguous pages
+ * @nr_pages: number of pages needed
+ * @order: on success, set to the allocation order
+ *
+ * Attempts a higher-order allocation, falling back to smaller orders.
+ * The returned pages are split immediately so each sub-page has its
+ * own refcount and can be freed independently.
+ *
+ * Returns a pointer to the first page on success, or NULL if even
+ * order-1 allocation fails.
+ */
+static struct page *
+svc_rdma_alloc_read_pages(unsigned int nr_pages, unsigned int *order)
+{
+ unsigned int o;
+ struct page *page;
+
+ o = get_order(nr_pages << PAGE_SHIFT);
+ if (o > SVC_RDMA_CONTIG_MAX_ORDER)
+ o = SVC_RDMA_CONTIG_MAX_ORDER;
+
+ while (o >= 1) {
+ page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN,
+ o);
+ if (page) {
+ split_page(page, o);
+ *order = o;
+ return page;
+ }
+ o--;
+ }
+ return NULL;
+}
+
+/**
+ * svc_rdma_fill_contig_bvec - Replace rq_pages with a contiguous allocation
+ * @rqstp: RPC transaction context
+ * @head: context for ongoing I/O
+ * @bv: bvec entry to fill
+ * @pages_left: number of data pages remaining in the segment
+ * @len_left: bytes remaining in the segment
+ *
+ * On success, fills @bv with a bvec spanning the contiguous range and
+ * advances rc_curpage/rc_page_count. Returns the byte length covered,
+ * or zero if the allocation failed or would overrun rq_maxpages.
+ */
+static unsigned int
+svc_rdma_fill_contig_bvec(struct svc_rqst *rqstp,
+ struct svc_rdma_recv_ctxt *head,
+ struct bio_vec *bv, unsigned int pages_left,
+ unsigned int len_left)
+{
+ unsigned int order, alloc_nr, chunk_pages, chunk_len, i;
+ struct page *page;
+
+ page = svc_rdma_alloc_read_pages(pages_left, &order);
+ if (!page)
+ return 0;
+ alloc_nr = 1 << order;
+
+ if (head->rc_curpage + alloc_nr > rqstp->rq_maxpages) {
+ for (i = 0; i < alloc_nr; i++)
+ __free_page(page + i);
+ return 0;
+ }
+
+ for (i = 0; i < alloc_nr; i++) {
+ svc_rqst_page_release(rqstp,
+ rqstp->rq_pages[head->rc_curpage + i]);
+ rqstp->rq_pages[head->rc_curpage + i] = page + i;
+ }
+
+ chunk_pages = min(alloc_nr, pages_left);
+ chunk_len = min_t(unsigned int, chunk_pages << PAGE_SHIFT, len_left);
+ bvec_set_page(bv, page, chunk_len, 0);
+ head->rc_page_count += chunk_pages;
+ head->rc_curpage += chunk_pages;
+ return chunk_len;
+}
+
+/**
+ * svc_rdma_fill_page_bvec - Add a single rq_page to the bvec array
+ * @head: context for ongoing I/O
+ * @ctxt: R/W context whose bvec array is being filled
+ * @cur: page to add
+ * @bvec_idx: pointer to current bvec index, not advanced on merge
+ * @len_left: bytes remaining in the segment
+ *
+ * If @cur is physically contiguous with the preceding bvec, it is
+ * merged by extending that bvec's length. Otherwise a new bvec
+ * entry is created. Returns the byte length covered.
+ */
+static unsigned int
+svc_rdma_fill_page_bvec(struct svc_rdma_recv_ctxt *head,
+ struct svc_rdma_rw_ctxt *ctxt, struct page *cur,
+ unsigned int *bvec_idx, unsigned int len_left)
+{
+ unsigned int chunk_len = min_t(unsigned int, PAGE_SIZE, len_left);
+
+ head->rc_page_count++;
+ head->rc_curpage++;
+
+ if (*bvec_idx > 0) {
+ struct bio_vec *prev = &ctxt->rw_bvec[*bvec_idx - 1];
+
+ if (page_to_phys(prev->bv_page) + prev->bv_offset +
+ prev->bv_len == page_to_phys(cur)) {
+ prev->bv_len += chunk_len;
+ return chunk_len;
+ }
+ }
+
+ bvec_set_page(&ctxt->rw_bvec[*bvec_idx], cur, chunk_len, 0);
+ (*bvec_idx)++;
+ return chunk_len;
+}
+
+/**
+ * svc_rdma_build_read_segment_contig - Build RDMA Read WR with contiguous pages
+ * @rqstp: RPC transaction context
+ * @head: context for ongoing I/O
+ * @segment: co-ordinates of remote memory to be read
+ *
+ * Greedily allocates higher-order pages to cover the segment,
+ * building one bvec per contiguous chunk. Each allocation is
+ * split so sub-pages have independent refcounts. When a
+ * higher-order allocation fails, remaining pages are covered
+ * individually, merging adjacent pages into the preceding bvec
+ * when they are physically contiguous. The split sub-pages
+ * replace entries in rq_pages[] so downstream cleanup is
+ * unchanged.
+ *
+ * Returns:
+ * %0: the Read WR was constructed successfully
+ * %-ENOMEM: allocation failed
+ * %-EIO: a DMA mapping error occurred
+ */
+static int svc_rdma_build_read_segment_contig(struct svc_rqst *rqstp,
+ struct svc_rdma_recv_ctxt *head,
+ const struct svc_rdma_segment *segment)
+{
+ struct svcxprt_rdma *rdma = svc_rdma_rqst_rdma(rqstp);
+ struct svc_rdma_chunk_ctxt *cc = &head->rc_cc;
+ unsigned int nr_data_pages, bvec_idx;
+ struct svc_rdma_rw_ctxt *ctxt;
+ unsigned int len_left;
+ int ret;
+
+ nr_data_pages = PAGE_ALIGN(segment->rs_length) >> PAGE_SHIFT;
+ if (head->rc_curpage + nr_data_pages > rqstp->rq_maxpages)
+ return -ENOMEM;
+
+ ctxt = svc_rdma_get_rw_ctxt(rdma, nr_data_pages);
+ if (!ctxt)
+ return -ENOMEM;
+
+ bvec_idx = 0;
+ len_left = segment->rs_length;
+ while (len_left) {
+ unsigned int pages_left = PAGE_ALIGN(len_left) >> PAGE_SHIFT;
+ unsigned int chunk_len = 0;
+
+ if (pages_left >= 2)
+ chunk_len = svc_rdma_fill_contig_bvec(rqstp, head,
+ &ctxt->rw_bvec[bvec_idx],
+ pages_left, len_left);
+ if (chunk_len) {
+ bvec_idx++;
+ } else {
+ struct page *cur =
+ rqstp->rq_pages[head->rc_curpage];
+ chunk_len = svc_rdma_fill_page_bvec(head, ctxt, cur,
+ &bvec_idx,
+ len_left);
+ }
+
+ len_left -= chunk_len;
+ }
+
+ ctxt->rw_nents = bvec_idx;
+
+ head->rc_pageoff = offset_in_page(segment->rs_length);
+ if (head->rc_pageoff)
+ head->rc_curpage--;
+
+ ret = svc_rdma_rw_ctx_init(rdma, ctxt, segment->rs_offset,
+ segment->rs_handle, segment->rs_length,
+ DMA_FROM_DEVICE);
+ if (ret < 0)
+ return -EIO;
+ percpu_counter_inc(&svcrdma_stat_read);
+
+ list_add(&ctxt->rw_list, &cc->cc_rwctxts);
+ cc->cc_sqecount += ret;
+ return 0;
+}
+
+#endif /* PAGE_SIZE < SZ_64K */
+
/**
* svc_rdma_build_read_segment - Build RDMA Read WQEs to pull one RDMA segment
* @rqstp: RPC transaction context
@@ -758,6 +968,16 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
if (check_add_overflow(head->rc_pageoff, len, &total))
return -EINVAL;
nr_bvec = PAGE_ALIGN(total) >> PAGE_SHIFT;
+
+#if PAGE_SIZE < SZ_64K
+ if (head->rc_pageoff == 0 && nr_bvec >= 2) {
+ ret = svc_rdma_build_read_segment_contig(rqstp, head,
+ segment);
+ if (ret != -ENOMEM)
+ return ret;
+ }
+#endif
+
ctxt = svc_rdma_get_rw_ctxt(rdma, nr_bvec);
if (!ctxt)
return -ENOMEM;
--
2.53.0
* Re: [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-13 19:41 [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
` (3 preceding siblings ...)
2026-03-13 19:42 ` [PATCH v3 4/4] svcrdma: Use contiguous pages for RDMA Read sink buffers Chuck Lever
@ 2026-03-16 20:15 ` Leon Romanovsky
2026-03-16 20:24 ` Chuck Lever
4 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2026-03-16 20:15 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, linux-rdma, Chuck Lever
On Fri, Mar 13, 2026 at 03:41:57PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> This series now carries two MR exhaustion fixes and a proposal for
> using contiguous pages for RDMA Read sink buffers in svcrdma.
>
> Fixes for the MR exhaustion issues should go into 7.0-rc and stable,
> and the contiguous page patches can wait for the next merge window.
>
> Base commit: v7.0-rc3
> ---
> Changes since v2:
> - Fix similar exhaustion issue for SGL
> - Add patch that introduces svc_rqst_page_release
>
> Changes since v1:
> - Clarify code comments
> - Allocate contiguous pages for RDMA Read sink buffers
>
> Chuck Lever (4):
> RDMA/rw: Fall back to direct SGE on MR pool exhaustion
> RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
I applied these two to wip/leon-for-rc.
Thanks
> SUNRPC: Add svc_rqst_page_release() helper
> svcrdma: Use contiguous pages for RDMA Read sink buffers
>
> drivers/infiniband/core/rw.c | 43 ++++--
> include/linux/sunrpc/svc.h | 15 ++
> net/sunrpc/svc.c | 7 +-
> net/sunrpc/svcsock.c | 2 +-
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 220 ++++++++++++++++++++++++++++++
> 5 files changed, 268 insertions(+), 19 deletions(-)
>
> --
> 2.53.0
>
* Re: [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-16 20:15 ` [PATCH v3 0/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Leon Romanovsky
@ 2026-03-16 20:24 ` Chuck Lever
0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-16 20:24 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, linux-rdma, Chuck Lever
On Mon, Mar 16, 2026, at 4:15 PM, Leon Romanovsky wrote:
> On Fri, Mar 13, 2026 at 03:41:57PM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> This series now carries two MR exhaustion fixes and a proposal for
>> using contiguous pages for RDMA Read sink buffers in svcrdma.
>>
>> Fixes for the MR exhaustion issues should go into 7.0-rc and stable,
>> and the contiguous page patches can wait for the next merge window.
>>
>> Base commit: v7.0-rc3
>> ---
>> Changes since v2:
>> - Fix similar exhaustion issue for SGL
>> - Add patch that introduces svc_rqst_page_release
>>
>> Changes since v1:
>> - Clarify code comments
>> - Allocate contiguous pages for RDMA Read sink buffers
>>
>> Chuck Lever (4):
>> RDMA/rw: Fall back to direct SGE on MR pool exhaustion
>> RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
>
> I applied these two to wip/leon-for-rc.
>
> Thanks
Thanks, Leon. I will apply the other two to nfsd-testing.
>> SUNRPC: Add svc_rqst_page_release() helper
>> svcrdma: Use contiguous pages for RDMA Read sink buffers
>>
>> drivers/infiniband/core/rw.c | 43 ++++--
>> include/linux/sunrpc/svc.h | 15 ++
>> net/sunrpc/svc.c | 7 +-
>> net/sunrpc/svcsock.c | 2 +-
>> net/sunrpc/xprtrdma/svc_rdma_rw.c | 220 ++++++++++++++++++++++++++++++
>> 5 files changed, 268 insertions(+), 19 deletions(-)
>>
>> --
>> 2.53.0
>>
--
Chuck Lever
* Re: [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion
2026-03-13 19:41 ` [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion Chuck Lever
@ 2026-03-17 14:24 ` Christoph Hellwig
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:24 UTC (permalink / raw)
To: Chuck Lever
Cc: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-rdma,
Chuck Lever
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v3 2/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-13 19:41 ` [PATCH v3 2/4] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
@ 2026-03-17 14:25 ` Christoph Hellwig
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:25 UTC (permalink / raw)
To: Chuck Lever
Cc: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-rdma,
Chuck Lever
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v3 3/4] SUNRPC: Add svc_rqst_page_release() helper
2026-03-13 19:42 ` [PATCH v3 3/4] SUNRPC: Add svc_rqst_page_release() helper Chuck Lever
@ 2026-03-17 14:26 ` Christoph Hellwig
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:26 UTC (permalink / raw)
To: Chuck Lever
Cc: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-rdma,
Chuck Lever
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v3 4/4] svcrdma: Use contiguous pages for RDMA Read sink buffers
2026-03-13 19:42 ` [PATCH v3 4/4] svcrdma: Use contiguous pages for RDMA Read sink buffers Chuck Lever
@ 2026-03-17 14:28 ` Christoph Hellwig
2026-03-17 15:26 ` Chuck Lever
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2026-03-17 14:28 UTC (permalink / raw)
To: Chuck Lever
Cc: Leon Romanovsky, Christoph Hellwig, NeilBrown, Jeff Layton,
Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-rdma,
Chuck Lever, Christoph Hellwig
On Fri, Mar 13, 2026 at 03:42:01PM -0400, Chuck Lever wrote:
> Suggested-by: Christoph Hellwig <hch@infradead.org>
I think that's a bit too much credit. I just wondered why sunrpc can't
coalesce pages itself.
> +#if PAGE_SIZE < SZ_64K
> +
> +/*
> + * Limit contiguous RDMA Read sink allocations to 64KB
> + * (order-4 on 4KB-page systems). Higher orders risk
> + * allocation failure under __GFP_NORETRY, which would
> + * negate the benefit of the contiguous fast path.
> + */
> +#define SVC_RDMA_CONTIG_MAX_ORDER get_order(SZ_64K)
Isn't the limit really an order and thus grows with the page size,
instead of based on a fixed size?
> + o = get_order(nr_pages << PAGE_SHIFT);
> + if (o > SVC_RDMA_CONTIG_MAX_ORDER)
> + o = SVC_RDMA_CONTIG_MAX_ORDER;
Use min()?
* Re: [PATCH v3 4/4] svcrdma: Use contiguous pages for RDMA Read sink buffers
2026-03-17 14:28 ` Christoph Hellwig
@ 2026-03-17 15:26 ` Chuck Lever
0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-17 15:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, NeilBrown, Jeff Layton, Olga Kornievskaia,
Dai Ngo, Tom Talpey, linux-nfs, linux-rdma, Chuck Lever,
Christoph Hellwig
On Tue, Mar 17, 2026, at 10:28 AM, Christoph Hellwig wrote:
> On Fri, Mar 13, 2026 at 03:42:01PM -0400, Chuck Lever wrote:
>> +#if PAGE_SIZE < SZ_64K
>> +
>> +/*
>> + * Limit contiguous RDMA Read sink allocations to 64KB
>> + * (order-4 on 4KB-page systems). Higher orders risk
>> + * allocation failure under __GFP_NORETRY, which would
>> + * negate the benefit of the contiguous fast path.
>> + */
>> +#define SVC_RDMA_CONTIG_MAX_ORDER get_order(SZ_64K)
>
> Isn't the limit really an order and thus grows with the page size,
> instead of based on a fixed size?
So on a platform with 16KB pages, an order-4 allocation request
(256KB) is still desirable for our purpose here?
--
Chuck Lever