From: Chuck Lever <cel@kernel.org>
To: Leon Romanovsky <leon@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	Jason Gunthorpe <jgg@nvidia.com>, Christoph Hellwig <hch@lst.de>,
	linux-rdma@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
Date: Tue, 10 Mar 2026 14:49:46 -0400	[thread overview]
Message-ID: <2ae88f79-36a1-4b43-bf40-da86425046a5@kernel.org> (raw)
In-Reply-To: <20260310183756.GH12611@unreal>

On 3/10/26 2:37 PM, Leon Romanovsky wrote:
> On Tue, Mar 10, 2026 at 10:36:43AM -0400, Chuck Lever wrote:
>> On 3/10/26 9:42 AM, Christoph Hellwig wrote:
>>> On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
>>>> Under NFS WRITE workloads the server performs RDMA READs to
>>>> pull data from the client. With the inflated MR demand, the
>>>> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
>>>> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
>>>> a DMA mapping failure, closes the connection, and the client
>>>> reconnects -- producing a cycle of 71% RPC retransmissions and
>>>> ~100 reconnections per test run. RDMA WRITEs (NFS READ
>>>> direction) are unaffected because DMA_TO_DEVICE never triggers
>>>> the max_sgl_rd check.
>>>
>>> So this changelog extensively describes the problem, but it doesn't
>>> actually say how you fix it.
>>
>> I didn't want to waste everyone's time, but I can add that.
>>
>>
>>>> +	 *
>>>> +	 * TODO: A bulk DMA mapping API for bvecs analogous to
>>>> +	 * dma_map_sgtable() would provide a proper post-DMA-
>>>> +	 * coalescing segment count here, enabling the map_wrs
>>>> +	 * path in more cases.
>>>
>>> This isn't really something the DMA layer can easily do without getting
>>> as inefficient as the sgtable based path.  What the block layer does
>>> here is to simply keep a higher level count of merged segments.  The
>>> other option would be to not create multiple bvecs for contiguous
>>> regions, which is what modern file systems do in general, and why the
>>> above block layer nr_phys_segments based optimization isn't actually
>>> used all that much these days.
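
To make sure we're talking about the same idea: a userspace sketch of
that higher-level merged-segment count, i.e. coalesce physically
adjacent pages as they are appended and keep a running
nr_phys_segments, so no later PFN scan is needed. All names here are
made up for illustration; this is not block layer code.

```c
/* Simplified model: merge a page into the previous segment when its
 * PFN is physically adjacent, otherwise start a new segment, and
 * track the segment count as we go. */
struct seg {
	unsigned long pfn;	/* first page frame in the segment */
	unsigned int npages;	/* contiguous pages in the segment */
};

struct seg_list {
	struct seg segs[16];
	unsigned int nr_phys_segments;
};

static void seg_list_add_page(struct seg_list *sl, unsigned long pfn)
{
	unsigned int n = sl->nr_phys_segments;

	/* Adjacent to the tail segment? Merge instead of appending. */
	if (n && sl->segs[n - 1].pfn + sl->segs[n - 1].npages == pfn) {
		sl->segs[n - 1].npages++;
		return;
	}
	sl->segs[n].pfn = pfn;
	sl->segs[n].npages = 1;
	sl->nr_phys_segments++;
}
```

The point being: the count is cheap to maintain at build time, but
only if the layer building the vector sees PFNs at all.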
>>
>> Technically, NFSD isn't a file system, it's a protocol adapter.
>>
>>
>>> Why can't NFS send a single bvec for contiguous ranges?
>>
>> Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
>> bvecs from rqstp->rq_pages[], which is an array of individual struct
>> page pointers. Each bvec entry covers at most one page.
>>
>> This is because I/O payloads arrive in an xdr_buf, which represents its
>> page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
>> is likewise a flat array of single-page pointers. These pages are
>> allocated individually (typically from the page allocator via
>> alloc_pages()), so there's no guarantee of physical contiguity. Even if
>> adjacent pages happen to be contiguous, the code has no way to know that
>> without inspecting PFNs (which is exactly what the DMA mapping layer
>> does).
>>
>> So currently svcrdma can't send a single bvec for contiguous ranges
>> because the contiguity information doesn't exist at the NFSD or RPC
>> layer. Contiguity is (re)discovered only at DMA map time.
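
Concretely, the run-length information that only the DMA mapping
layer can recover looks like this. A userspace sketch with a
hypothetical helper name, not svcrdma code:

```c
/* Given the flat array of page frame numbers behind something like
 * rq_pages[], count the physically contiguous runs. A run boundary
 * is any page whose PFN does not follow its predecessor. This is
 * exactly what gets rediscovered at DMA map time; nothing above that
 * layer tracks it. */
static unsigned int count_contig_runs(const unsigned long *pfns,
				      unsigned int npages)
{
	unsigned int i, runs = 0;

	for (i = 0; i < npages; i++)
		if (i == 0 || pfns[i] != pfns[i - 1] + 1)
			runs++;
	return runs;
}
```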
>>
>> The alternative is to build an SGL for mapping the bvec so that rw.c can
>> get the real contiguity of the pages before proceeding. But that seems
>> icky.
>>
>> Long term, I expect that NFSD will need to preserve the folios it gets
>> from file systems and pass those to the RPC transports without
>> translating them to an array of page pointers.
> 
> Folios sound like the correct approach to me; why do you mark it as "long term"?

All four NFS maintainers are aware of this need, and I /think/ we are
all on board with the "folio/bvec" approach. But there are a number of
reasons this is not just a "go write the code" kind of problem:

- xdr_buf is used by both the NFS client and NFS server stacks, and they
are separately maintained.

- xdr_buf is used from nearly the top to the bottom of these stacks, so
making this kind of change will be painstaking.

- The XDR layer is full of helper APIs that deal with xdr_buf page
arrays that would need attention.

- I need to understand whether addressing the DMA map problem has
benefits for more broadly-deployed RPC transports such as TCP.


I'll have to think about whether we can add something clever to the
xdr_buf that only NFSD uses, and have that act as the conduit from
file system to RPC transport. That could serve as a prototype.


-- 
Chuck Lever


Thread overview: 8+ messages
2026-03-10  3:46 [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-10 13:42 ` Christoph Hellwig
2026-03-10 14:36   ` Chuck Lever
2026-03-10 18:37     ` Leon Romanovsky
2026-03-10 18:49       ` Chuck Lever [this message]
2026-03-10 19:31         ` Leon Romanovsky
2026-03-10 19:56           ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
2026-03-10 20:27             ` Leon Romanovsky
