From: Chuck Lever <cel@kernel.org>
To: NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
Leon Romanovsky <leon@kernel.org>, Christoph Hellwig <hch@lst.de>
Cc: <linux-nfs@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
Chuck Lever <chuck.lever@oracle.com>
Subject: [PATCH v2 1/2] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
Date: Thu, 12 Mar 2026 09:40:07 -0400
Message-ID: <20260312134008.7387-2-cel@kernel.org>
In-Reply-To: <20260312134008.7387-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

When IOVA-based DMA mapping is unavailable (e.g., IOMMU
passthrough mode), rdma_rw_ctx_init_bvec() falls back to
checking rdma_rw_io_needs_mr() with the raw bvec count.
Unlike the scatterlist path in rdma_rw_ctx_init(), which
passes a post-DMA-mapping entry count that reflects
coalescing of physically contiguous pages, the bvec path
passes the pre-mapping page count. This overstates the
number of DMA entries, causing every multi-bvec RDMA READ
to consume an MR from the QP's pool.

Under NFS WRITE workloads the server performs RDMA READs
to pull data from the client. With the inflated MR demand,
the pool is rapidly exhausted, ib_mr_pool_get() returns
NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma
treats this as a DMA mapping failure, closes the connection,
and the client reconnects -- producing a cycle of 71% RPC
retransmissions and ~100 reconnections per test run. RDMA
WRITEs (NFS READ direction) are unaffected because
DMA_TO_DEVICE never triggers the max_sgl_rd check.

Remove the rdma_rw_io_needs_mr() gate from the bvec path
entirely, so that bvec RDMA operations always use the
map_wrs path (direct WR posting without MR allocation).
The bvec caller has no post-DMA-coalescing segment count
available -- xdr_buf and svc_rqst hold pages as individual
pointers, and physical contiguity is discovered only during
DMA mapping -- so the raw page count cannot serve as a
reliable input to rdma_rw_io_needs_mr(). iWARP devices,
which require MRs unconditionally, are handled by an
earlier check in rdma_rw_ctx_init_bvec() and are unaffected.

Fixes: bea28ac14cab ("RDMA/core: add MR support for bvec-based RDMA operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
drivers/infiniband/core/rw.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index fc45c384833f..1b74293adec1 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -686,14 +686,16 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return ret;
/*
- * IOVA mapping not available. Check if MR registration provides
- * better performance than multiple SGE entries.
+ * IOVA not available; fall back to the map_wrs path, which maps
+ * each bvec as a direct SGE. This is always correct: the MR path
+ * is a throughput optimization, not a correctness requirement.
+ * (iWARP, which does require MRs, is handled by the check above.)
+ *
+ * The rdma_rw_io_needs_mr() gate is not used here because nr_bvec
+ * is a raw page count that overstates DMA entry demand -- the bvec
+ * caller has no post-DMA-coalescing segment count, and feeding the
+ * inflated count into the MR path exhausts the pool on RDMA READs.
*/
- if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
- return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
- nr_bvec, &iter, remote_addr,
- rkey, dir);
-
return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
remote_addr, rkey, dir);
}
--
2.52.0

Thread overview: 7+ messages
2026-03-12 13:40 [PATCH v2 0/2] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-12 13:40 ` Chuck Lever [this message]
2026-03-12 13:40 ` [PATCH v2 2/2] svcrdma: Use contiguous pages for RDMA Read sink buffers Chuck Lever
2026-03-13 8:51 ` kernel test robot
2026-03-13 12:34 ` Chuck Lever
2026-03-13 17:31 ` kernel test robot
2026-03-14 5:06 ` kernel test robot