From: Chuck Lever <cel@kernel.org>
To: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
Leon Romanovsky <leon@kernel.org>, Christoph Hellwig <hch@lst.de>,
linux-rdma@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
Date: Tue, 10 Mar 2026 10:36:43 -0400 [thread overview]
Message-ID: <d0fefc47-b60f-47ba-8f2f-7eb05b1bb86d@kernel.org> (raw)
In-Reply-To: <abAftjplHdwdwrkd@infradead.org>
On 3/10/26 9:42 AM, Christoph Hellwig wrote:
> On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
>> Under NFS WRITE workloads the server performs RDMA READs to
>> pull data from the client. With the inflated MR demand, the
>> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
>> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
>> a DMA mapping failure, closes the connection, and the client
>> reconnects -- producing a cycle of 71% RPC retransmissions and
>> ~100 reconnections per test run. RDMA WRITEs (NFS READ
>> direction) are unaffected because DMA_TO_DEVICE never triggers
>> the max_sgl_rd check.
>
> So this changelog extensively describes the problem, but it doesn't
> actually say how you fix it.
I didn't want to waste everyone's time, but I can add that.
>> + *
>> + * TODO: A bulk DMA mapping API for bvecs analogous to
>> + * dma_map_sgtable() would provide a proper post-DMA-
>> + * coalescing segment count here, enabling the map_wrs
>> + * path in more cases.
>
> This isn't really something the DMA layer can easily do without getting
> as inefficient as the sgtable based path. What the block layer does
> here is to simply keep a higher level count of merged segments. The
> other option would be to not create multiple bvecs for contiguous
> regions, which is what modern file systems do in general, and why the
> above block layer nr_phys_segments based optimization isn't actually
> used all that much these days.
Technically, NFSD isn't a file system; it's a protocol adapter.
> Why can't NFS send a single bvec for contiguous ranges?
Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
bvecs from rqstp->rq_pages[], which is an array of individual struct
page pointers. Each bvec entry covers at most one page.
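To illustrate the shape of that loop (a simplified userspace sketch with
stand-in types and a hypothetical helper name, not the actual kernel code
in svc_rdma_build_read_segment()), filling one bvec entry per element of
a flat page-pointer array looks roughly like:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's struct page and struct bio_vec. */
struct page { unsigned long pfn; };

struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

#define PAGE_SIZE 4096u

/*
 * Build one bvec entry per page from a flat page-pointer array,
 * mirroring how the RDMA READ path consumes rqstp->rq_pages[].
 * Each entry covers at most one page, whether or not adjacent
 * pages happen to be physically contiguous.
 * Returns the number of bvec entries filled.
 */
static unsigned int build_per_page_bvecs(struct bio_vec *bvecs,
					 struct page **pages,
					 unsigned int npages,
					 unsigned int total_len)
{
	unsigned int i, len, remaining = total_len;

	for (i = 0; i < npages && remaining; i++) {
		len = remaining < PAGE_SIZE ? remaining : PAGE_SIZE;
		bvecs[i].bv_page = pages[i];
		bvecs[i].bv_len = len;		/* at most one page */
		bvecs[i].bv_offset = 0;
		remaining -= len;
	}
	return i;
}
```

A 9000-byte payload therefore always consumes three bvec entries here,
even if the three backing pages are physically contiguous.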
This is because I/O payloads arrive in an xdr_buf, which represents its
page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
is likewise a flat array of single-page pointers. These pages are
allocated individually (typically from the page allocator via
alloc_pages()), so there's no guarantee of physical contiguity. Even if
adjacent pages happen to be contiguous, the code has no way to know that
without inspecting PFNs (which is exactly what the DMA mapping layer
does).
So currently svcrdma can't send a single bvec for contiguous ranges
because the contiguity information doesn't exist at the NFSD or RPC
layer. Contiguity is (re)discovered only at DMA map time.
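As a toy illustration of that rediscovery step (a userspace sketch, not
the real DMA mapping API): given the PFN of each single-page entry,
counting the segments a mapping layer would produce after coalescing
physically adjacent pages amounts to:

```c
#include <assert.h>

/*
 * Count the DMA segments a mapping layer would produce after
 * coalescing physically adjacent pages. pfns[] holds the page
 * frame number of each single-page entry, in I/O order.
 */
static unsigned int count_coalesced_segments(const unsigned long *pfns,
					     unsigned int npages)
{
	unsigned int i, segs;

	if (npages == 0)
		return 0;
	segs = 1;
	for (i = 1; i < npages; i++) {
		/* A new segment starts wherever pages are not adjacent. */
		if (pfns[i] != pfns[i - 1] + 1)
			segs++;
	}
	return segs;
}
```

When the allocator happens to hand back contiguous runs, the segment
count collapses well below the bvec count; that is exactly the
information the bvec path cannot see ahead of time.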
The alternative is to build an SGL for mapping the bvec so that rw.c can
get the real contiguity of the pages before proceeding. But that seems
icky.
Long term, I expect that NFSD will need to preserve the folios it gets
from file systems and pass those to the RPC transports without
translating them to an array of page pointers.
--
Chuck Lever
Thread overview: 8+ messages
2026-03-10 3:46 [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-10 13:42 ` Christoph Hellwig
2026-03-10 14:36 ` Chuck Lever [this message]
2026-03-10 18:37 ` Leon Romanovsky
2026-03-10 18:49 ` Chuck Lever
2026-03-10 19:31 ` Leon Romanovsky
2026-03-10 19:56 ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
2026-03-10 20:27 ` Leon Romanovsky