* [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
@ 2026-03-10 3:46 Chuck Lever
2026-03-10 13:42 ` Christoph Hellwig
0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2026-03-10 3:46 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig
Cc: linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When IOVA-based DMA mapping is unavailable (e.g. IOMMU passthrough
mode), rdma_rw_ctx_init_bvec() falls back to checking
rdma_rw_io_needs_mr() with the raw bvec count. Unlike the
scatterlist path in rdma_rw_ctx_init(), which passes a
post-DMA-mapping entry count that reflects coalescing of
physically contiguous pages, the bvec path passes the
pre-mapping page count. This overstates the number of DMA
entries, causing every multi-bvec RDMA READ to consume an MR
from the QP's pool.
Under NFS WRITE workloads the server performs RDMA READs to
pull data from the client. With the inflated MR demand, the
pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
a DMA mapping failure, closes the connection, and the client
reconnects -- producing a cycle of 71% RPC retransmissions and
~100 reconnections per test run. RDMA WRITEs (NFS READ
direction) are unaffected because DMA_TO_DEVICE never triggers
the max_sgl_rd check.
Fixes: bea28ac14cab ("RDMA/core: add MR support for bvec-based RDMA operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
drivers/infiniband/core/rw.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index fc45c384833f..9e227b7746a1 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -686,14 +686,15 @@ int rdma_rw_ctx_init_bvec(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return ret;
/*
- * IOVA mapping not available. Check if MR registration provides
- * better performance than multiple SGE entries.
+ * IOVA not available; map each bvec individually. Do not
+ * check max_sgl_rd here: nr_bvec is a raw page count that
+ * overstates DMA entry demand and exhausts the MR pool.
+ *
+ * TODO: A bulk DMA mapping API for bvecs analogous to
+ * dma_map_sgtable() would provide a proper post-DMA-
+ * coalescing segment count here, enabling the map_wrs
+ * path in more cases.
*/
- if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
- return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
- nr_bvec, &iter, remote_addr,
- rkey, dir);
-
return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
remote_addr, rkey, dir);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-10 3:46 [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
@ 2026-03-10 13:42 ` Christoph Hellwig
2026-03-10 14:36 ` Chuck Lever
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2026-03-10 13:42 UTC (permalink / raw)
To: Chuck Lever
Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
Chuck Lever
On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
> Under NFS WRITE workloads the server performs RDMA READs to
> pull data from the client. With the inflated MR demand, the
> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
> a DMA mapping failure, closes the connection, and the client
> reconnects -- producing a cycle of 71% RPC retransmissions and
> ~100 reconnections per test run. RDMA WRITEs (NFS READ
> direction) are unaffected because DMA_TO_DEVICE never triggers
> the max_sgl_rd check.
So this changelog extensively describes the problem, but it doesn't
actually say how you fix it.
> /*
> + * IOVA not available; map each bvec individually. Do not
> + * check max_sgl_rd here: nr_bvec is a raw page count that
> + * overstates DMA entry demand and exhausts the MR pool.
It fails to explain why that is fine?
> + *
> + * TODO: A bulk DMA mapping API for bvecs analogous to
> + * dma_map_sgtable() would provide a proper post-DMA-
> + * coalescing segment count here, enabling the map_wrs
> + * path in more cases.
This isn't really something the DMA layer can easily do without getting
as inefficient as the sgtable based path. What the block layer does
here is to simply keep a higher level count of merged segments. The
other option would be to not create multiple bvecs for contiguous
regions, which is what modern file systems do in general, and why the
above block layer nr_phys_segments based optimization isn't actually
used all that much these days.
Why can't NFS send a single bvec for contiguous ranges?
> - if (rdma_rw_io_needs_mr(dev, port_num, dir, nr_bvec))
> - return rdma_rw_init_mr_wrs_bvec(ctx, qp, port_num, bvecs,
> - nr_bvec, &iter, remote_addr,
> - rkey, dir);
> -
> return rdma_rw_init_map_wrs_bvec(ctx, qp, bvecs, nr_bvec, &iter,
> remote_addr, rkey, dir);
First I thought this breaks iWarp reads, but they are handled earlier,
which might have been useful to mention in the commit log.
* Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-10 13:42 ` Christoph Hellwig
@ 2026-03-10 14:36 ` Chuck Lever
2026-03-10 18:37 ` Leon Romanovsky
0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2026-03-10 14:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jason Gunthorpe, Leon Romanovsky, Christoph Hellwig, linux-rdma,
Chuck Lever
On 3/10/26 9:42 AM, Christoph Hellwig wrote:
> On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
>> Under NFS WRITE workloads the server performs RDMA READs to
>> pull data from the client. With the inflated MR demand, the
>> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
>> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
>> a DMA mapping failure, closes the connection, and the client
>> reconnects -- producing a cycle of 71% RPC retransmissions and
>> ~100 reconnections per test run. RDMA WRITEs (NFS READ
>> direction) are unaffected because DMA_TO_DEVICE never triggers
>> the max_sgl_rd check.
>
> So this changelog extensively describes the problem, but it doesn't
> actually say how you fix it.
I didn't want to waste everyone's time, but I can add that.
>> + *
>> + * TODO: A bulk DMA mapping API for bvecs analogous to
>> + * dma_map_sgtable() would provide a proper post-DMA-
>> + * coalescing segment count here, enabling the map_wrs
>> + * path in more cases.
>
> This isn't really something the DMA layer can easily do without getting
> as inefficient as the sgtable based path. What the block layer does
> here is to simply keep a higher level count of merged segments. The
> other option would be to not create multiple bvecs for continguous
> regions, which is what modern file system do in general, and why the
> above block layer nr_phys_segments based optimization isn't actually
> used all that much these days.
Technically, NFSD isn't a file system, it's a protocol adapter.
> Why can't NFS send a single bvec for contiguous ranges?
Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
bvecs from rqstp->rq_pages[], which is an array of individual struct
page pointers. Each bvec entry covers at most one page.
This is because I/O payloads arrive in an xdr_buf, which represents its
page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
is likewise a flat array of single-page pointers. These pages are
allocated individually (typically from the page allocator via
alloc_pages()), so there's no guarantee of physical contiguity. Even if
adjacent pages happen to be contiguous, the code has no way to know that
without inspecting PFNs (which is exactly what the DMA mapping layer
does).
So currently svcrdma can't send a single bvec for contiguous ranges
because the contiguity information doesn't exist at the NFSD or RPC
layer. Contiguity is (re)discovered only at DMA map time.
The alternative is to build an SGL for mapping the bvec so that rw.c can
get the real contiguity of the pages before proceeding. But that seems
icky.
Long term, I expect that NFSD will need to preserve the folios it gets
from file systems and pass those to the RPC transports without
translating them to an array of page pointers.
--
Chuck Lever
* Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-10 14:36 ` Chuck Lever
@ 2026-03-10 18:37 ` Leon Romanovsky
2026-03-10 18:49 ` Chuck Lever
0 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2026-03-10 18:37 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Jason Gunthorpe, Christoph Hellwig, linux-rdma,
Chuck Lever
On Tue, Mar 10, 2026 at 10:36:43AM -0400, Chuck Lever wrote:
> On 3/10/26 9:42 AM, Christoph Hellwig wrote:
> > On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
> >> Under NFS WRITE workloads the server performs RDMA READs to
> >> pull data from the client. With the inflated MR demand, the
> >> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
> >> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
> >> a DMA mapping failure, closes the connection, and the client
> >> reconnects -- producing a cycle of 71% RPC retransmissions and
> >> ~100 reconnections per test run. RDMA WRITEs (NFS READ
> >> direction) are unaffected because DMA_TO_DEVICE never triggers
> >> the max_sgl_rd check.
> >
> > So this changelog extensively describes the problem, but it doesn't
> > actually say how you fix it.
>
> I didn't want to waste everyone's time, but I can add that.
>
>
> >> + *
> >> + * TODO: A bulk DMA mapping API for bvecs analogous to
> >> + * dma_map_sgtable() would provide a proper post-DMA-
> >> + * coalescing segment count here, enabling the map_wrs
> >> + * path in more cases.
> >
> > This isn't really something the DMA layer can easily do without getting
> > as inefficient as the sgtable based path. What the block layer does
> > here is to simply keep a higher level count of merged segments. The
> > other option would be to not create multiple bvecs for contiguous
> > regions, which is what modern file systems do in general, and why the
> > above block layer nr_phys_segments based optimization isn't actually
> > used all that much these days.
>
> Technically, NFSD isn't a file system, it's a protocol adapter.
>
>
> > Why can't NFS send a single bvec for contiguous ranges?
>
> Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
> bvecs from rqstp->rq_pages[], which is an array of individual struct
> page pointers. Each bvec entry covers at most one page.
>
> This is because I/O payloads arrive in an xdr_buf, which represents its
> page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
> is likewise a flat array of single-page pointers. These pages are
> allocated individually (typically from the page allocator via
> alloc_pages()), so there's no guarantee of physical contiguity. Even if
> adjacent pages happen to be contiguous, the code has no way to know that
> without inspecting PFNs (which is exactly what the DMA mapping layer
> does).
>
> So currently svcrdma can't send a single bvec for contiguous ranges
> because the contiguity information doesn't exist at the NFSD or RPC
> layer. Contiguity is (re)discovered only at DMA map time.
>
> The alternative is to build an SGL for mapping the bvec so that rw.c can
> get the real contiguity of the pages before proceeding. But that seems
> icky.
>
> Long term, I expect that NFSD will need to preserve the folios it gets
> from file systems and pass those to the RPC transports without
> translating them to an array of page pointers.
Folios sound like the correct approach to me; why do you mark it as "long term"?
Thanks
>
>
> --
> Chuck Lever
* Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-10 18:37 ` Leon Romanovsky
@ 2026-03-10 18:49 ` Chuck Lever
2026-03-10 19:31 ` Leon Romanovsky
0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2026-03-10 18:49 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christoph Hellwig, Jason Gunthorpe, Christoph Hellwig, linux-rdma,
Chuck Lever
On 3/10/26 2:37 PM, Leon Romanovsky wrote:
> On Tue, Mar 10, 2026 at 10:36:43AM -0400, Chuck Lever wrote:
>> On 3/10/26 9:42 AM, Christoph Hellwig wrote:
>>> On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
>>>> Under NFS WRITE workloads the server performs RDMA READs to
>>>> pull data from the client. With the inflated MR demand, the
>>>> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
>>>> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
>>>> a DMA mapping failure, closes the connection, and the client
>>>> reconnects -- producing a cycle of 71% RPC retransmissions and
>>>> ~100 reconnections per test run. RDMA WRITEs (NFS READ
>>>> direction) are unaffected because DMA_TO_DEVICE never triggers
>>>> the max_sgl_rd check.
>>>
>>> So this changelog extensively describes the problem, but it doesn't
>>> actually say how you fix it.
>>
>> I didn't want to waste everyone's time, but I can add that.
>>
>>
>>>> + *
>>>> + * TODO: A bulk DMA mapping API for bvecs analogous to
>>>> + * dma_map_sgtable() would provide a proper post-DMA-
>>>> + * coalescing segment count here, enabling the map_wrs
>>>> + * path in more cases.
>>>
>>> This isn't really something the DMA layer can easily do without getting
>>> as inefficient as the sgtable based path. What the block layer does
>>> here is to simply keep a higher level count of merged segments. The
>>> other option would be to not create multiple bvecs for contiguous
>>> regions, which is what modern file systems do in general, and why the
>>> above block layer nr_phys_segments based optimization isn't actually
>>> used all that much these days.
>>
>> Technically, NFSD isn't a file system, it's a protocol adapter.
>>
>>
>>> Why can't NFS send a single bvec for contiguous ranges?
>>
>> Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
>> bvecs from rqstp->rq_pages[], which is an array of individual struct
>> page pointers. Each bvec entry covers at most one page.
>>
>> This is because I/O payloads arrive in an xdr_buf, which represents its
>> page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
>> is likewise a flat array of single-page pointers. These pages are
>> allocated individually (typically from the page allocator via
>> alloc_pages()), so there's no guarantee of physical contiguity. Even if
>> adjacent pages happen to be contiguous, the code has no way to know that
>> without inspecting PFNs (which is exactly what the DMA mapping layer
>> does).
>>
>> So currently svcrdma can't send a single bvec for contiguous ranges
>> because the contiguity information doesn't exist at the NFSD or RPC
>> layer. Contiguity is (re)discovered only at DMA map time.
>>
>> The alternative is to build an SGL for mapping the bvec so that rw.c can
>> get the real contiguity of the pages before proceeding. But that seems
>> icky.
>>
>> Long term, I expect that NFSD will need to preserve the folios it gets
>> from file systems and pass those to the RPC transports without
>> translating them to an array of page pointers.
>
> Folios sound like the correct approach to me; why do you mark it as "long term"?
All four NFS maintainers are aware of this need, and I /think/ we are
all on board with the "folio/bvec" approach. But there are a number of
reasons this is not just a "go write the code" kind of problem:
- xdr_buf is used by both the NFS client and NFS server stacks, and they
are separately maintained.
- xdr_buf is used from nearly the top to the bottom of these stacks, so
making this kind of change will be painstaking.
- The XDR layer is full of helper APIs that deal with xdr_buf page
arrays that would need attention.
- I need to understand whether addressing the DMA map problem has
benefits for more broadly-deployed RPC transports such as TCP.
I'll have to think if we can just add something clever to the xdr_buf
that only NFSD can use and then have that act as the conduit from file
system to RPC transport. That could act as a prototype.
--
Chuck Lever
* Re: [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
2026-03-10 18:49 ` Chuck Lever
@ 2026-03-10 19:31 ` Leon Romanovsky
2026-03-10 19:56 ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
0 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2026-03-10 19:31 UTC (permalink / raw)
To: Chuck Lever
Cc: Christoph Hellwig, Jason Gunthorpe, Christoph Hellwig, linux-rdma,
Chuck Lever
On Tue, Mar 10, 2026 at 02:49:46PM -0400, Chuck Lever wrote:
> On 3/10/26 2:37 PM, Leon Romanovsky wrote:
> > On Tue, Mar 10, 2026 at 10:36:43AM -0400, Chuck Lever wrote:
> >> On 3/10/26 9:42 AM, Christoph Hellwig wrote:
> >>> On Mon, Mar 09, 2026 at 11:46:21PM -0400, Chuck Lever wrote:
> >>>> Under NFS WRITE workloads the server performs RDMA READs to
> >>>> pull data from the client. With the inflated MR demand, the
> >>>> pool is rapidly exhausted, ib_mr_pool_get() returns NULL, and
> >>>> rdma_rw_init_one_mr() returns -EAGAIN. svcrdma treats this as
> >>>> a DMA mapping failure, closes the connection, and the client
> >>>> reconnects -- producing a cycle of 71% RPC retransmissions and
> >>>> ~100 reconnections per test run. RDMA WRITEs (NFS READ
> >>>> direction) are unaffected because DMA_TO_DEVICE never triggers
> >>>> the max_sgl_rd check.
> >>>
> >>> So this changelog extensively describes the problem, but it doesn't
> >>> actually say how you fix it.
> >>
> >> I didn't want to waste everyone's time, but I can add that.
> >>
> >>
> >>>> + *
> >>>> + * TODO: A bulk DMA mapping API for bvecs analogous to
> >>>> + * dma_map_sgtable() would provide a proper post-DMA-
> >>>> + * coalescing segment count here, enabling the map_wrs
> >>>> + * path in more cases.
> >>>
> >>> This isn't really something the DMA layer can easily do without getting
> >>> as inefficient as the sgtable based path. What the block layer does
> >>> here is to simply keep a higher level count of merged segments. The
> >>> other option would be to not create multiple bvecs for contiguous
> >>> regions, which is what modern file systems do in general, and why the
> >>> above block layer nr_phys_segments based optimization isn't actually
> >>> used all that much these days.
> >>
> >> Technically, NFSD isn't a file system, it's a protocol adapter.
> >>
> >>
> >>> Why can't NFS send a single bvec for contiguous ranges?
> >>
> >> Have a look at svc_rdma_build_read_segment(). The RDMA READ path builds
> >> bvecs from rqstp->rq_pages[], which is an array of individual struct
> >> page pointers. Each bvec entry covers at most one page.
> >>
> >> This is because I/O payloads arrive in an xdr_buf, which represents its
> >> page data as a struct page ** array (xdr->pages), and svc_rqst::rq_pages
> >> is likewise a flat array of single-page pointers. These pages are
> >> allocated individually (typically from the page allocator via
> >> alloc_pages()), so there's no guarantee of physical contiguity. Even if
> >> adjacent pages happen to be contiguous, the code has no way to know that
> >> without inspecting PFNs (which is exactly what the DMA mapping layer
> >> does).
> >>
> >> So currently svcrdma can't send a single bvec for contiguous ranges
> >> because the contiguity information doesn't exist at the NFSD or RPC
> >> layer. Contiguity is (re)discovered only at DMA map time.
> >>
> >> The alternative is to build an SGL for mapping the bvec so that rw.c can
> >> get the real contiguity of the pages before proceeding. But that seems
> >> icky.
> >>
> >> Long term, I expect that NFSD will need to preserve the folios it gets
> >> from file systems and pass those to the RPC transports without
> >> translating them to an array of page pointers.
> >
> > Folios sound like the correct approach to me; why do you mark it as "long term"?
>
> All four NFS maintainers are aware of this need, and I /think/ we are
> all on board with the "folio/bvec" approach. But there are a number of
> reasons this is not just a "go write the code" kind of problem:
>
> - xdr_buf is used by both the NFS client and NFS server stacks, and they
> are separately maintained.
>
> - xdr_buf is used from nearly the top to the bottom of these stacks, so
> making this kind of change will be painstaking.
>
> - The XDR layer is full of helper APIs that deal with xdr_buf page
> arrays that would need attention.
>
> - I need to understand whether addressing the DMA map problem has
> benefits for more broadly-deployed RPC transports such as TCP.
>
>
> I'll have to think if we can just add something clever to the xdr_buf
> that only NFSD can use and then have that act as the conduit from file
> system to RPC transport. That could act as a prototype.
Thanks for the summary. It explains.
>
>
> --
> Chuck Lever
* [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers
2026-03-10 19:31 ` Leon Romanovsky
@ 2026-03-10 19:56 ` Chuck Lever
2026-03-10 20:27 ` Leon Romanovsky
0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2026-03-10 19:56 UTC (permalink / raw)
To: Leon Romanovsky, Christoph Hellwig; +Cc: linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_build_read_segment() constructs RDMA Read sink
buffers by consuming pages one-at-a-time from rq_pages[]
and building one bvec per page. A 64KB NFS READ payload
produces 16 separate bvecs, 16 DMA mappings, and
potentially multiple RDMA Read WRs.
A single higher-order allocation followed by split_page()
yields physically contiguous memory while preserving
per-page refcounts. A single bvec spanning the contiguous
range causes rdma_rw_ctx_init_bvec() to take the
rdma_rw_init_single_wr_bvec() fast path: one DMA mapping,
one SGE, one WR.
The split sub-pages replace the original rq_pages[] entries,
so all downstream page tracking, completion handling, and
xdr_buf assembly remain unchanged.
Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back
through decreasing orders. If even order-1 fails, the
existing per-page path handles the segment.
The compound path is attempted only when the segment starts
page-aligned (rc_pageoff == 0) and spans at least two pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_rw.c | 120 ++++++++++++++++++++++++++++++
1 file changed, 120 insertions(+)
What if svcrdma did something derpy like this?
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 9e17700fae2a..42de7151ae68 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -754,6 +754,118 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
return xdr->len;
}
+#define SVC_RDMA_COMPOUND_MAX_ORDER 4 /* 64KB max */
+
+/**
+ * svc_rdma_alloc_read_pages - Allocate physically contiguous pages
+ * @nr_pages: number of pages needed
+ * @order: on success, set to the allocation order
+ *
+ * Attempts a higher-order allocation, falling back to smaller orders.
+ * The returned pages are split immediately so each sub-page has its
+ * own refcount and can be freed independently.
+ *
+ * Returns a pointer to the first page on success, or NULL if even
+ * order-1 allocation fails.
+ */
+static struct page *
+svc_rdma_alloc_read_pages(unsigned int nr_pages, unsigned int *order)
+{
+ unsigned int o;
+ struct page *page;
+
+ o = get_order(nr_pages << PAGE_SHIFT);
+ if (o > SVC_RDMA_COMPOUND_MAX_ORDER)
+ o = SVC_RDMA_COMPOUND_MAX_ORDER;
+
+ while (o >= 1) {
+ page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN,
+ o);
+ if (page) {
+ split_page(page, o);
+ *order = o;
+ return page;
+ }
+ o--;
+ }
+ return NULL;
+}
+
+/**
+ * svc_rdma_build_read_segment_compound - Build a single RDMA Read WR using compound pages
+ * @rqstp: RPC transaction context
+ * @head: context for ongoing I/O
+ * @segment: co-ordinates of remote memory to be read
+ *
+ * Allocates a higher-order page and splits it, then builds a single
+ * bvec spanning the contiguous physical range. The split sub-pages
+ * replace entries in rq_pages[] so downstream cleanup is unchanged.
+ *
+ * Returns:
+ * %0: the Read WR was constructed successfully
+ * %-EINVAL: not enough rq_pages slots
+ * %-ENOMEM: compound allocation or rw_ctxt allocation failed
+ * %-EIO: a DMA mapping error occurred
+ */
+static int svc_rdma_build_read_segment_compound(struct svc_rqst *rqstp,
+ struct svc_rdma_recv_ctxt *head,
+ const struct svc_rdma_segment *segment)
+{
+ struct svcxprt_rdma *rdma = svc_rdma_rqst_rdma(rqstp);
+ struct svc_rdma_chunk_ctxt *cc = &head->rc_cc;
+ unsigned int order, alloc_nr, nr_data_pages, i;
+ struct svc_rdma_rw_ctxt *ctxt;
+ struct page *page;
+ int ret;
+
+ nr_data_pages = PAGE_ALIGN(segment->rs_length) >> PAGE_SHIFT;
+
+ page = svc_rdma_alloc_read_pages(nr_data_pages, &order);
+ if (!page)
+ return -ENOMEM;
+ alloc_nr = 1 << order;
+
+ if (alloc_nr < nr_data_pages ||
+ head->rc_curpage + alloc_nr > rqstp->rq_maxpages) {
+ for (i = 0; i < alloc_nr; i++)
+ __free_page(page + i);
+ return -ENOMEM;
+ }
+
+ ctxt = svc_rdma_get_rw_ctxt(rdma, 1);
+ if (!ctxt) {
+ for (i = 0; i < alloc_nr; i++)
+ __free_page(page + i);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < alloc_nr; i++) {
+ put_page(rqstp->rq_pages[head->rc_curpage + i]);
+ rqstp->rq_pages[head->rc_curpage + i] = page + i;
+ }
+
+ bvec_set_page(&ctxt->rw_bvec[0], page, segment->rs_length, 0);
+ ctxt->rw_nents = 1;
+
+ head->rc_page_count += nr_data_pages;
+ head->rc_pageoff = offset_in_page(segment->rs_length);
+ if (head->rc_pageoff)
+ head->rc_curpage += nr_data_pages - 1;
+ else
+ head->rc_curpage += nr_data_pages;
+
+ ret = svc_rdma_rw_ctx_init(rdma, ctxt, segment->rs_offset,
+ segment->rs_handle, segment->rs_length,
+ DMA_FROM_DEVICE);
+ if (ret < 0)
+ return -EIO;
+ percpu_counter_inc(&svcrdma_stat_read);
+
+ list_add(&ctxt->rw_list, &cc->cc_rwctxts);
+ cc->cc_sqecount += ret;
+ return 0;
+}
+
/**
* svc_rdma_build_read_segment - Build RDMA Read WQEs to pull one RDMA segment
* @rqstp: RPC transaction context
@@ -780,6 +892,14 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
if (check_add_overflow(head->rc_pageoff, len, &total))
return -EINVAL;
nr_bvec = PAGE_ALIGN(total) >> PAGE_SHIFT;
+
+ if (head->rc_pageoff == 0 && nr_bvec >= 2) {
+ ret = svc_rdma_build_read_segment_compound(rqstp, head,
+ segment);
+ if (ret != -ENOMEM)
+ return ret;
+ }
+
ctxt = svc_rdma_get_rw_ctxt(rdma, nr_bvec);
if (!ctxt)
return -ENOMEM;
--
2.53.0
* Re: [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers
2026-03-10 19:56 ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
@ 2026-03-10 20:27 ` Leon Romanovsky
0 siblings, 0 replies; 8+ messages in thread
From: Leon Romanovsky @ 2026-03-10 20:27 UTC (permalink / raw)
To: Chuck Lever; +Cc: Christoph Hellwig, linux-rdma, Chuck Lever
On Tue, Mar 10, 2026 at 03:56:50PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> svc_rdma_build_read_segment() constructs RDMA Read sink
> buffers by consuming pages one-at-a-time from rq_pages[]
> and building one bvec per page. A 64KB NFS READ payload
> produces 16 separate bvecs, 16 DMA mappings, and
> potentially multiple RDMA Read WRs.
>
> A single higher-order allocation followed by split_page()
> yields physically contiguous memory while preserving
> per-page refcounts. A single bvec spanning the contiguous
> range causes rdma_rw_ctx_init_bvec() to take the
> rdma_rw_init_single_wr_bvec() fast path: one DMA mapping,
> one SGE, one WR.
Nice, clever.
>
> The split sub-pages replace the original rq_pages[] entries,
> so all downstream page tracking, completion handling, and
> xdr_buf assembly remain unchanged.
>
> Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back
> through decreasing orders. If even order-1 fails, the
> existing per-page path handles the segment.
>
> The compound path is attempted only when the segment starts
> page-aligned (rc_pageoff == 0) and spans at least two pages.
In cases where you request three pages (order-2), one page will be wasted.
This can lead to problems later, especially under stress testing.
Thanks
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 120 ++++++++++++++++++++++++++++++
> 1 file changed, 120 insertions(+)
>
> What if svcrdma did something derpy like this?
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> index 9e17700fae2a..42de7151ae68 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> @@ -754,6 +754,118 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
> return xdr->len;
> }
>
> +#define SVC_RDMA_COMPOUND_MAX_ORDER 4 /* 64KB max */
> +
> +/**
> + * svc_rdma_alloc_read_pages - Allocate physically contiguous pages
> + * @nr_pages: number of pages needed
> + * @order: on success, set to the allocation order
> + *
> + * Attempts a higher-order allocation, falling back to smaller orders.
> + * The returned pages are split immediately so each sub-page has its
> + * own refcount and can be freed independently.
> + *
> + * Returns a pointer to the first page on success, or NULL if even
> + * order-1 allocation fails.
> + */
> +static struct page *
> +svc_rdma_alloc_read_pages(unsigned int nr_pages, unsigned int *order)
> +{
> + unsigned int o;
> + struct page *page;
> +
> + o = get_order(nr_pages << PAGE_SHIFT);
> + if (o > SVC_RDMA_COMPOUND_MAX_ORDER)
> + o = SVC_RDMA_COMPOUND_MAX_ORDER;
> +
> + while (o >= 1) {
> + page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN,
> + o);
> + if (page) {
> + split_page(page, o);
> + *order = o;
> + return page;
> + }
> + o--;
> + }
> + return NULL;
> +}
> +
> +/**
> + * svc_rdma_build_read_segment_compound - Build a single RDMA Read WR using compound pages
> + * @rqstp: RPC transaction context
> + * @head: context for ongoing I/O
> + * @segment: co-ordinates of remote memory to be read
> + *
> + * Allocates a higher-order page and splits it, then builds a single
> + * bvec spanning the contiguous physical range. The split sub-pages
> + * replace entries in rq_pages[] so downstream cleanup is unchanged.
> + *
> + * Returns:
> + * %0: the Read WR was constructed successfully
> + * %-EINVAL: not enough rq_pages slots
> + * %-ENOMEM: compound allocation or rw_ctxt allocation failed
> + * %-EIO: a DMA mapping error occurred
> + */
> +static int svc_rdma_build_read_segment_compound(struct svc_rqst *rqstp,
> + struct svc_rdma_recv_ctxt *head,
> + const struct svc_rdma_segment *segment)
> +{
> + struct svcxprt_rdma *rdma = svc_rdma_rqst_rdma(rqstp);
> + struct svc_rdma_chunk_ctxt *cc = &head->rc_cc;
> + unsigned int order, alloc_nr, nr_data_pages, i;
> + struct svc_rdma_rw_ctxt *ctxt;
> + struct page *page;
> + int ret;
> +
> + nr_data_pages = PAGE_ALIGN(segment->rs_length) >> PAGE_SHIFT;
> +
> + page = svc_rdma_alloc_read_pages(nr_data_pages, &order);
> + if (!page)
> + return -ENOMEM;
> + alloc_nr = 1 << order;
> +
> + if (alloc_nr < nr_data_pages ||
> + head->rc_curpage + alloc_nr > rqstp->rq_maxpages) {
> + for (i = 0; i < alloc_nr; i++)
> + __free_page(page + i);
> + return -ENOMEM;
> + }
> +
> + ctxt = svc_rdma_get_rw_ctxt(rdma, 1);
> + if (!ctxt) {
> + for (i = 0; i < alloc_nr; i++)
> + __free_page(page + i);
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < alloc_nr; i++) {
> + put_page(rqstp->rq_pages[head->rc_curpage + i]);
> + rqstp->rq_pages[head->rc_curpage + i] = page + i;
> + }
> +
> + bvec_set_page(&ctxt->rw_bvec[0], page, segment->rs_length, 0);
> + ctxt->rw_nents = 1;
> +
> + head->rc_page_count += nr_data_pages;
> + head->rc_pageoff = offset_in_page(segment->rs_length);
> + if (head->rc_pageoff)
> + head->rc_curpage += nr_data_pages - 1;
> + else
> + head->rc_curpage += nr_data_pages;
> +
> + ret = svc_rdma_rw_ctx_init(rdma, ctxt, segment->rs_offset,
> + segment->rs_handle, segment->rs_length,
> + DMA_FROM_DEVICE);
> + if (ret < 0)
> + return -EIO;
> + percpu_counter_inc(&svcrdma_stat_read);
> +
> + list_add(&ctxt->rw_list, &cc->cc_rwctxts);
> + cc->cc_sqecount += ret;
> + return 0;
> +}
> +
> /**
> * svc_rdma_build_read_segment - Build RDMA Read WQEs to pull one RDMA segment
> * @rqstp: RPC transaction context
> @@ -780,6 +892,14 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
> if (check_add_overflow(head->rc_pageoff, len, &total))
> return -EINVAL;
> nr_bvec = PAGE_ALIGN(total) >> PAGE_SHIFT;
> +
> + if (head->rc_pageoff == 0 && nr_bvec >= 2) {
> + ret = svc_rdma_build_read_segment_compound(rqstp, head,
> + segment);
> + if (ret != -ENOMEM)
> + return ret;
> + }
> +
> ctxt = svc_rdma_get_rw_ctxt(rdma, nr_bvec);
> if (!ctxt)
> return -ENOMEM;
> --
> 2.53.0
>