public inbox for linux-rdma@vger.kernel.org
From: Leon Romanovsky <leon@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-rdma@vger.kernel.org, Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers
Date: Tue, 10 Mar 2026 22:27:06 +0200
Message-ID: <20260310202706.GQ12611@unreal>
In-Reply-To: <20260310195650.15785-1-cel@kernel.org>

On Tue, Mar 10, 2026 at 03:56:50PM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> svc_rdma_build_read_segment() constructs RDMA Read sink
> buffers by consuming pages one-at-a-time from rq_pages[]
> and building one bvec per page. A 64KB NFS READ payload
> produces 16 separate bvecs, 16 DMA mappings, and
> potentially multiple RDMA Read WRs.
> 
> A single higher-order allocation followed by split_page()
> yields physically contiguous memory while preserving
> per-page refcounts. A single bvec spanning the contiguous
> range causes rdma_rw_ctx_init_bvec() to take the
> rdma_rw_init_single_wr_bvec() fast path: one DMA mapping,
> one SGE, one WR.

Nice, clever.

> 
> The split sub-pages replace the original rq_pages[] entries,
> so all downstream page tracking, completion handling, and
> xdr_buf assembly remain unchanged.
> 
> Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back
> through decreasing orders. If even order-1 fails, the
> existing per-page path handles the segment.
> 
> The compound path is attempted only when the segment starts
> page-aligned (rc_pageoff == 0) and spans at least two pages.

When a segment needs three pages, get_order() rounds the request up to
order-2 (four pages), so one page is wasted. This can lead to problems
later, especially under stress testing.

Thanks

> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_rw.c | 120 ++++++++++++++++++++++++++++++
>  1 file changed, 120 insertions(+)
> 
> What if svcrdma did something derpy like this?
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> index 9e17700fae2a..42de7151ae68 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> @@ -754,6 +754,118 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
>  	return xdr->len;
>  }
>  
> +#define SVC_RDMA_COMPOUND_MAX_ORDER	4	/* 64KB max */
> +
> +/**
> + * svc_rdma_alloc_read_pages - Allocate physically contiguous pages
> + * @nr_pages: number of pages needed
> + * @order: on success, set to the allocation order
> + *
> + * Attempts a higher-order allocation, falling back to smaller orders.
> + * The returned pages are split immediately so each sub-page has its
> + * own refcount and can be freed independently.
> + *
> + * Returns a pointer to the first page on success, or NULL if even
> + * order-1 allocation fails.
> + */
> +static struct page *
> +svc_rdma_alloc_read_pages(unsigned int nr_pages, unsigned int *order)
> +{
> +	unsigned int o;
> +	struct page *page;
> +
> +	o = get_order(nr_pages << PAGE_SHIFT);
> +	if (o > SVC_RDMA_COMPOUND_MAX_ORDER)
> +		o = SVC_RDMA_COMPOUND_MAX_ORDER;
> +
> +	while (o >= 1) {
> +		page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN,
> +				   o);
> +		if (page) {
> +			split_page(page, o);
> +			*order = o;
> +			return page;
> +		}
> +		o--;
> +	}
> +	return NULL;
> +}
> +
> +/**
> + * svc_rdma_build_read_segment_compound - Build a single RDMA Read WR using compound pages
> + * @rqstp: RPC transaction context
> + * @head: context for ongoing I/O
> + * @segment: co-ordinates of remote memory to be read
> + *
> + * Allocates a higher-order page and splits it, then builds a single
> + * bvec spanning the contiguous physical range. The split sub-pages
> + * replace entries in rq_pages[] so downstream cleanup is unchanged.
> + *
> + * Returns:
> + *   %0: the Read WR was constructed successfully
> + *   %-EINVAL: not enough rq_pages slots
> + *   %-ENOMEM: compound allocation or rw_ctxt allocation failed
> + *   %-EIO: a DMA mapping error occurred
> + */
> +static int svc_rdma_build_read_segment_compound(struct svc_rqst *rqstp,
> +						struct svc_rdma_recv_ctxt *head,
> +						const struct svc_rdma_segment *segment)
> +{
> +	struct svcxprt_rdma *rdma = svc_rdma_rqst_rdma(rqstp);
> +	struct svc_rdma_chunk_ctxt *cc = &head->rc_cc;
> +	unsigned int order, alloc_nr, nr_data_pages, i;
> +	struct svc_rdma_rw_ctxt *ctxt;
> +	struct page *page;
> +	int ret;
> +
> +	nr_data_pages = PAGE_ALIGN(segment->rs_length) >> PAGE_SHIFT;
> +
> +	page = svc_rdma_alloc_read_pages(nr_data_pages, &order);
> +	if (!page)
> +		return -ENOMEM;
> +	alloc_nr = 1 << order;
> +
> +	if (alloc_nr < nr_data_pages ||
> +	    head->rc_curpage + alloc_nr > rqstp->rq_maxpages) {
> +		for (i = 0; i < alloc_nr; i++)
> +			__free_page(page + i);
> +		return -ENOMEM;
> +	}
> +
> +	ctxt = svc_rdma_get_rw_ctxt(rdma, 1);
> +	if (!ctxt) {
> +		for (i = 0; i < alloc_nr; i++)
> +			__free_page(page + i);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < alloc_nr; i++) {
> +		put_page(rqstp->rq_pages[head->rc_curpage + i]);
> +		rqstp->rq_pages[head->rc_curpage + i] = page + i;
> +	}
> +
> +	bvec_set_page(&ctxt->rw_bvec[0], page, segment->rs_length, 0);
> +	ctxt->rw_nents = 1;
> +
> +	head->rc_page_count += nr_data_pages;
> +	head->rc_pageoff = offset_in_page(segment->rs_length);
> +	if (head->rc_pageoff)
> +		head->rc_curpage += nr_data_pages - 1;
> +	else
> +		head->rc_curpage += nr_data_pages;
> +
> +	ret = svc_rdma_rw_ctx_init(rdma, ctxt, segment->rs_offset,
> +				   segment->rs_handle, segment->rs_length,
> +				   DMA_FROM_DEVICE);
> +	if (ret < 0)
> +		return -EIO;
> +	percpu_counter_inc(&svcrdma_stat_read);
> +
> +	list_add(&ctxt->rw_list, &cc->cc_rwctxts);
> +	cc->cc_sqecount += ret;
> +	return 0;
> +}
> +
>  /**
>   * svc_rdma_build_read_segment - Build RDMA Read WQEs to pull one RDMA segment
>   * @rqstp: RPC transaction context
> @@ -780,6 +892,14 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
>  	if (check_add_overflow(head->rc_pageoff, len, &total))
>  		return -EINVAL;
>  	nr_bvec = PAGE_ALIGN(total) >> PAGE_SHIFT;
> +
> +	if (head->rc_pageoff == 0 && nr_bvec >= 2) {
> +		ret = svc_rdma_build_read_segment_compound(rqstp, head,
> +							   segment);
> +		if (ret != -ENOMEM)
> +			return ret;
> +	}
> +
>  	ctxt = svc_rdma_get_rw_ctxt(rdma, nr_bvec);
>  	if (!ctxt)
>  		return -ENOMEM;
> -- 
> 2.53.0
> 

Thread overview: 8+ messages
2026-03-10  3:46 [PATCH] RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path Chuck Lever
2026-03-10 13:42 ` Christoph Hellwig
2026-03-10 14:36   ` Chuck Lever
2026-03-10 18:37     ` Leon Romanovsky
2026-03-10 18:49       ` Chuck Lever
2026-03-10 19:31         ` Leon Romanovsky
2026-03-10 19:56           ` [RFC PATCH] svcrdma: Use compound pages for RDMA Read sink buffers Chuck Lever
2026-03-10 20:27             ` Leon Romanovsky [this message]
