From: Chuck Lever <cel@kernel.org>
To: Leon Romanovsky <leon@kernel.org>, Christoph Hellwig <hch@lst.de>,
NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
Chuck Lever <chuck.lever@oracle.com>
Subject: [PATCH v3 1/4] RDMA/rw: Fall back to direct SGE on MR pool exhaustion
Date: Fri, 13 Mar 2026 15:41:58 -0400
Message-ID: <20260313194201.5818-2-cel@kernel.org>
In-Reply-To: <20260313194201.5818-1-cel@kernel.org>
From: Chuck Lever <chuck.lever@oracle.com>
When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs()
produces no coalescing: each scatterlist page maps 1:1 to a DMA
entry, so sgt.nents equals the raw page count. With 4 KB pages, a
1 MB transfer yields 256 DMA entries. If that count exceeds the
device's max_sgl_rd threshold (an optimization hint from mlx5
firmware), rdma_rw_io_needs_mr() steers the operation into the MR
registration path. Each such operation consumes one or more MRs
from a pool sized at max_rdma_ctxs -- roughly one MR per
concurrent context. Under write-intensive workloads that issue
many concurrent RDMA READs, the pool is rapidly exhausted,
ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns
-EAGAIN. Upper layer protocols treat this as a fatal DMA mapping
failure and tear down the connection.
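
As a rough illustration of the arithmetic above, the following
userspace sketch counts uncoalesced DMA entries and compares the
count against a read-SGE hint. The helper names and the threshold
value of 30 used in the note below are assumptions made for
illustration; they are not kernel APIs, and the value is not taken
from any particular device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: number of DMA entries when the mapping does no
 * coalescing, i.e. one entry per page touched by the transfer.
 */
static size_t dma_entries_uncoalesced(size_t xfer_bytes, size_t page_size)
{
	assert(page_size > 0);
	/* Round up: a partial trailing page still needs its own entry. */
	return (xfer_bytes + page_size - 1) / page_size;
}

/*
 * Hypothetical stand-in for the max_sgl_rd comparison. A hint of 0 is
 * treated here as "device gave no hint", so the MR path is not chosen.
 */
static bool exceeds_sgl_rd_hint(size_t nents, size_t max_sgl_rd)
{
	return max_sgl_rd != 0 && nents > max_sgl_rd;
}
```

With these helpers, dma_entries_uncoalesced(1 << 20, 4096) evaluates
to 256, and exceeds_sgl_rd_hint(256, 30) is true, which corresponds
to the case where the operation is steered into the MR path.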

The max_sgl_rd check is a performance optimization, not a
correctness requirement: the device can handle large SGE counts
via direct posting, just less efficiently than with MR
registration. When the MR pool cannot satisfy a request, falling
back to the direct SGE (map_wrs) path avoids the connection
reset while preserving the MR optimization for the common case
where pool resources are available.

Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from
rdma_rw_init_mr_wrs() triggers direct SGE posting instead of
propagating the error. iWARP devices, which mandate MR
registration for RDMA READs, and force_mr debug mode continue
to treat -EAGAIN as terminal.
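
The intended decision can be modeled in userspace roughly as
follows. This is a simplified sketch, not the kernel code: struct
rw_model, enum rw_path and choose_path() are hypothetical names, the
MR init result is supplied as an input rather than computed, and the
direct SGE paths are assumed to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the outcome of context initialization. */
enum rw_path { PATH_MR, PATH_MAP_WRS, PATH_SINGLE_WR, PATH_FAILED };

/* Hypothetical inputs standing in for the kernel-side checks. */
struct rw_model {
	bool needs_mr;		/* models rdma_rw_io_needs_mr() */
	bool is_iwarp;		/* models rdma_protocol_iwarp() */
	bool force_mr;		/* models the rdma_rw_force_mr debug knob */
	int mr_ret;		/* models the rdma_rw_init_mr_wrs() result */
	unsigned int sg_cnt;	/* number of scatterlist entries */
};

static enum rw_path choose_path(const struct rw_model *m)
{
	assert(m != NULL);

	if (m->needs_mr) {
		/* Success, or any failure other than -EAGAIN, is final. */
		if (m->mr_ret != -EAGAIN)
			return m->mr_ret < 0 ? PATH_FAILED : PATH_MR;

		/* MRs are mandatory here, so pool exhaustion is terminal. */
		if (m->is_iwarp || m->force_mr)
			return PATH_FAILED;

		/* Otherwise pool exhaustion is recoverable: fall through. */
	}

	return m->sg_cnt > 1 ? PATH_MAP_WRS : PATH_SINGLE_WR;
}
```

In this model, a 256-entry request that hits -EAGAIN on a non-iWARP
device without force_mr ends up on the map_wrs path instead of
failing.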

Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
drivers/infiniband/core/rw.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index fc45c384833f..c01d5e605053 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -608,14 +608,29 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 	if (rdma_rw_io_needs_mr(qp->device, port_num, dir, sg_cnt)) {
 		ret = rdma_rw_init_mr_wrs(ctx, qp, port_num, sg, sg_cnt,
 				sg_offset, remote_addr, rkey, dir);
-	} else if (sg_cnt > 1) {
-		ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset,
-				remote_addr, rkey, dir);
-	} else {
-		ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
-				remote_addr, rkey, dir);
+		/*
+		 * If MR init succeeded or failed for a reason other
+		 * than pool exhaustion, that result is final.
+		 *
+		 * Pool exhaustion (-EAGAIN) from the max_sgl_rd
+		 * optimization is recoverable: fall back to
+		 * direct SGE posting. iWARP and force_mr require
+		 * MRs unconditionally, so -EAGAIN is terminal.
+		 */
+		if (ret != -EAGAIN ||
+		    rdma_protocol_iwarp(qp->device, port_num) ||
+		    unlikely(rdma_rw_force_mr))
+			goto out;
 	}
 
+	if (sg_cnt > 1)
+		ret = rdma_rw_init_map_wrs(ctx, qp, sg, sg_cnt, sg_offset,
+				remote_addr, rkey, dir);
+	else
+		ret = rdma_rw_init_single_wr(ctx, qp, sg, sg_offset,
+				remote_addr, rkey, dir);
+
+out:
 	if (ret < 0)
 		goto out_unmap_sg;
 	return ret;
--
2.53.0