From: Leon Romanovsky <leon@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Jason Gunthorpe <jgg@nvidia.com>, Christoph Hellwig <hch@lst.de>,
linux-rdma@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
Date: Tue, 10 Mar 2026 20:37:56 +0200 [thread overview]
Message-ID: <20260310183756.GH12611@unreal> (raw)
In-Reply-To: <d0fefc47-b60f-47ba-8f2f-7eb05b1bb86d@kernel.org>
On Tue, Mar 10, 2026 at 10:36:43AM -0400, Chuck Lever wrote:
> On 3/10/26 9:42 AM, Christoph Hellwig wrote:
> > On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
> >> Under NFS WRITE workloads the server performs RDMA READs to
> >> pull data from the client. With the inflated MR demand, the
> >> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
> >> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
> >> a DMA mapping failure, closes the connection, and the client
> >> reconnects -- producing a cycle of 71% RPC retransmissions and
> >> ~100 reconnections per test run. RDMA WRITEs (NFS READ
> >> direction) are unaffected because DMA_TO_DEVICE never triggers
> >> the max_sgl_rd check.
> >
> > So this changelog extensively describes the problem, but it doesn't
> > actually say how you fix it.
>
> I didn't want to waste everyone's time, but I can add that.
>
>
> >> + *
> >> + * TODO: A bulk DMA mapping API for bvecs analogous to
> >> + * dma_map_sgtable() would provide a proper post-DMA-
> >> + * coalescing segment count here, enabling the map_wrs
> >> + * path in more cases.
> >
> > This isn't really something the DMA layer can easily do without getting
> > as inefficient as the sgtable based path. What the block layer does
> > here is to simply keep a higher level count of merged segments. The
> > other option would be to not create multiple bvecs for contiguous
> > regions, which is what modern file systems do in general, and why the
> > above block layer nr_phys_segments based optimization isn't actually
> > used all that much these days.
>
> Technically, NFSD isn't a file system; it's a protocol adapter.
>
>
> > Why can't NFS send a single bvec for contiguous ranges?
>
> Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
> bvecs from rqstp->rq_pages[], which is an array of individual struct
> page pointers. Each bvec entry covers at most one page.
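[The per-page construction described above can be pictured with a small
userspace sketch. The struct layouts and the helper name below are
illustrative stand-ins, not the kernel's actual definitions -- the point
is only that one bvec entry is emitted per entry of a flat page array,
so no entry ever spans more than one page:]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Minimal stand-ins for kernel types; shapes are illustrative only. */
struct page { unsigned long pfn; };
struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/*
 * Fill one bvec entry per page, mirroring how the svcrdma Read path
 * walks rqstp->rq_pages[]: each entry covers at most PAGE_SIZE bytes,
 * regardless of whether adjacent pages are physically contiguous.
 * Returns the number of entries used.
 */
static unsigned int build_per_page_bvecs(struct bio_vec *bv,
					 struct page **pages,
					 unsigned int npages,
					 size_t remaining)
{
	unsigned int i;

	for (i = 0; i < npages && remaining; i++) {
		unsigned int len = remaining < PAGE_SIZE ?
				   (unsigned int)remaining : PAGE_SIZE;

		bv[i].bv_page = pages[i];
		bv[i].bv_offset = 0;
		bv[i].bv_len = len;
		remaining -= len;
	}
	return i;
}
```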
>
> This is because I/O payloads arrive in an xdr_buf, which represents its
> page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
> is likewise a flat array of single-page pointers. These pages are
> allocated individually (typically from the page allocator via
> alloc_pages()), so there's no guarantee of physical contiguity. Even if
> adjacent pages happen to be contiguous, the code has no way to know that
> without inspecting PFNs (which is exactly what the DMA mapping layer
> does).
>
> So currently svcrdma can't send a single bvec for contiguous ranges
> because the contiguity information doesn't exist at the NFSD or RPC
> layer. Contiguity is (re)discovered only at DMA map time.
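[The "rediscovery" step can be modeled as a simple merge count over page
frame numbers: per-page entries whose PFNs are adjacent collapse into one
DMA segment. This is a userspace model, not kernel code, and
count_merged_segments is a hypothetical helper:]

```c
#include <assert.h>

/*
 * Count the DMA segments that remain after coalescing: consecutive
 * entries whose page frame numbers are adjacent merge into a single
 * contiguous segment, which is exactly the contiguity information the
 * DMA mapping layer recovers from the PFNs at map time.
 */
static unsigned int count_merged_segments(const unsigned long *pfns,
					  unsigned int n)
{
	unsigned int segs = 0, i;

	for (i = 0; i < n; i++)
		if (i == 0 || pfns[i] != pfns[i - 1] + 1)
			segs++;	/* a gap in PFNs starts a new segment */
	return segs;
}
```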
>
> The alternative is to build an SGL for mapping the bvec so that rw.c can
> get the real contiguity of the pages before proceeding. But that seems
> icky.
>
> Long term, I expect that NFSD will need to preserve the folios it gets
> from file systems and pass those to the RPC transports without
> translating them to an array of page pointers.
Folios sound like the correct approach to me; why do you mark it as "long term"?
Thanks
>
>
> --
> Chuck Lever
Thread overview: 8+ messages
2026-03-10 3:46 [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-10 13:42 ` Christoph Hellwig
2026-03-10 14:36 ` Chuck Lever
2026-03-10 18:37 ` Leon Romanovsky [this message]
2026-03-10 18:49 ` Chuck Lever
2026-03-10 19:31 ` Leon Romanovsky
2026-03-10 19:56 ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
2026-03-10 20:27 ` Leon Romanovsky