* [RFC PATCH 1/6] sunrpc: Tighten bounds checking in svc_rqst_replace_page
2026-02-22 16:19 [RFC PATCH 0/6] Optimize NFSD buffer page management Chuck Lever
@ 2026-02-22 16:19 ` Chuck Lever
2026-02-22 16:19 ` [RFC PATCH 2/6] sunrpc: Allocate a separate Reply page array Chuck Lever
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2026-02-22 16:19 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_replace_page() builds the Reply buffer by advancing
rq_next_page through the response page range. The bounds
check validates rq_next_page against the full rq_pages array,
but the valid range for rq_next_page is
[rq_respages, rq_page_end]. Use those bounds instead.
This is correct today because rq_respages and rq_page_end
both point into rq_pages, and it prepares for a subsequent
change that separates the Reply page array from rq_pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 346ac560dcc2..05ba4040a24a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -935,11 +935,11 @@ svc_set_num_threads(struct svc_serv *serv, unsigned int min_threads,
EXPORT_SYMBOL_GPL(svc_set_num_threads);
/**
- * svc_rqst_replace_page - Replace one page in rq_pages[]
+ * svc_rqst_replace_page - Replace one page in rq_respages[]
* @rqstp: svc_rqst with pages to replace
* @page: replacement page
*
- * When replacing a page in rq_pages, batch the release of the
+ * When replacing a page in rq_respages, batch the release of the
* replaced pages to avoid hammering the page allocator.
*
* Return values:
@@ -948,8 +948,8 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
*/
bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
{
- struct page **begin = rqstp->rq_pages;
- struct page **end = &rqstp->rq_pages[rqstp->rq_maxpages];
+ struct page **begin = rqstp->rq_respages;
+ struct page **end = rqstp->rq_page_end;
if (unlikely(rqstp->rq_next_page < begin || rqstp->rq_next_page > end)) {
trace_svc_replace_page_err(rqstp);
--
2.53.0
* [RFC PATCH 2/6] sunrpc: Allocate a separate Reply page array
2026-02-22 16:19 [RFC PATCH 0/6] Optimize NFSD buffer page management Chuck Lever
2026-02-22 16:19 ` [RFC PATCH 1/6] sunrpc: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
@ 2026-02-22 16:19 ` Chuck Lever
2026-02-23 0:15 ` NeilBrown
2026-02-22 16:19 ` [RFC PATCH 3/6] sunrpc: Handle NULL entries in svc_rqst_release_pages Chuck Lever
` (3 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2026-02-22 16:19 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
struct svc_rqst uses a single dynamically-allocated page array
(rq_pages) for both the incoming RPC Call message and the
outgoing RPC Reply message. rq_respages is a sliding pointer
into rq_pages that each transport receive path must compute
based on how many pages the Call consumed. This boundary
tracking is a source of confusion and bugs, and prevents an
RPC transaction from having both a large Call and a large
Reply simultaneously.
Allocate rq_respages as its own page array, eliminating the
boundary arithmetic. This decouples Call and Reply buffer
lifetimes, following the precedent set by rq_bvec (a separate
dynamically-allocated array for I/O vectors).
rq_next_page is initialized in svc_alloc_arg() and
svc_process() during Reply construction, and in
svc_rdma_recvfrom() as a precaution on error paths.
Transport receive paths no longer compute it from the
Call size.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 43 ++++++++++++-------------
net/sunrpc/svc.c | 27 +++++++++++++---
net/sunrpc/svc_xprt.c | 36 +++++++++++++++------
net/sunrpc/svcsock.c | 6 ----
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 +++------
5 files changed, 73 insertions(+), 54 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4dc14c7a711b..b1fb728724f5 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -134,25 +134,24 @@ enum {
extern u32 svc_max_payload(const struct svc_rqst *rqstp);
/*
- * RPC Requests and replies are stored in one or more pages.
- * We maintain an array of pages for each server thread.
- * Requests are copied into these pages as they arrive. Remaining
- * pages are available to write the reply into.
+ * RPC Call and Reply messages each have their own page array.
+ * rq_pages holds the incoming Call message; rq_respages holds
+ * the outgoing Reply message. Both arrays are sized to
+ * svc_serv_maxpages() entries and are allocated dynamically.
*
- * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each server thread
- * needs to allocate more to replace those used in sending. To help keep track
- * of these pages we have a receive list where all pages initialy live, and a
- * send list where pages are moved to when there are to be part of a reply.
+ * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each
+ * server thread needs to allocate more to replace those used in
+ * sending.
*
- * We use xdr_buf for holding responses as it fits well with NFS
- * read responses (that have a header, and some data pages, and possibly
- * a tail) and means we can share some client side routines.
+ * xdr_buf holds responses; the structure fits NFS read responses
+ * (header, data pages, optional tail) and enables sharing of
+ * client-side routines.
*
- * The xdr_buf.head kvec always points to the first page in the rq_*pages
- * list. The xdr_buf.pages pointer points to the second page on that
- * list. xdr_buf.tail points to the end of the first page.
- * This assumes that the non-page part of an rpc reply will fit
- * in a page - NFSd ensures this. lockd also has no trouble.
+ * The xdr_buf.head kvec always points to the first page in the
+ * rq_*pages list. The xdr_buf.pages pointer points to the second
+ * page on that list. xdr_buf.tail points to the end of the first
+ * page. This assumes that the non-page part of an rpc reply will
+ * fit in a page - NFSd ensures this. lockd also has no trouble.
*/
/**
@@ -162,10 +161,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* Returns a count of pages or vectors that can hold the maximum
* size RPC message for @serv.
*
- * Each request/reply pair can have at most one "payload", plus two
- * pages, one for the request, and one for the reply.
- * nfsd_splice_actor() might need an extra page when a READ payload
- * is not page-aligned.
+ * Each page array can hold at most one payload plus two
+ * overhead pages (one for the RPC header, one for tail data).
+ * nfsd_splice_actor() might need an extra page when a READ
+ * payload is not page-aligned.
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
@@ -201,9 +200,9 @@ struct svc_rqst {
struct xdr_stream rq_res_stream;
struct folio *rq_scratch_folio;
struct xdr_buf rq_res;
- unsigned long rq_maxpages; /* num of entries in rq_pages */
+ unsigned long rq_maxpages; /* entries per page array */
struct page * *rq_pages;
- struct page * *rq_respages; /* points into rq_pages */
+ struct page * *rq_respages; /* Reply buffer pages */
struct page * *rq_next_page; /* next reply page to use */
struct page * *rq_page_end; /* one past the last page */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 05ba4040a24a..f850a2af90c2 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -639,13 +639,21 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
{
rqstp->rq_maxpages = svc_serv_maxpages(serv);
- /* rq_pages' last entry is NULL for historical reasons. */
rqstp->rq_pages = kcalloc_node(rqstp->rq_maxpages + 1,
sizeof(struct page *),
GFP_KERNEL, node);
if (!rqstp->rq_pages)
return false;
+ rqstp->rq_respages = kcalloc_node(rqstp->rq_maxpages + 1,
+ sizeof(struct page *),
+ GFP_KERNEL, node);
+ if (!rqstp->rq_respages) {
+ kfree(rqstp->rq_pages);
+ rqstp->rq_pages = NULL;
+ return false;
+ }
+
return true;
}
@@ -657,10 +665,19 @@ svc_release_buffer(struct svc_rqst *rqstp)
{
unsigned long i;
- for (i = 0; i < rqstp->rq_maxpages; i++)
- if (rqstp->rq_pages[i])
- put_page(rqstp->rq_pages[i]);
- kfree(rqstp->rq_pages);
+ if (rqstp->rq_pages) {
+ for (i = 0; i < rqstp->rq_maxpages; i++)
+ if (rqstp->rq_pages[i])
+ put_page(rqstp->rq_pages[i]);
+ kfree(rqstp->rq_pages);
+ }
+
+ if (rqstp->rq_respages) {
+ for (i = 0; i < rqstp->rq_maxpages; i++)
+ if (rqstp->rq_respages[i])
+ put_page(rqstp->rq_respages[i]);
+ kfree(rqstp->rq_respages);
+ }
}
static void
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 56a663b8939f..cd38f09c1803 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -650,14 +650,13 @@ static void svc_check_conn_limits(struct svc_serv *serv)
}
}
-static bool svc_alloc_arg(struct svc_rqst *rqstp)
+static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
+ unsigned long npages)
{
- struct xdr_buf *arg = &rqstp->rq_arg;
- unsigned long pages, filled, ret;
+ unsigned long filled, ret;
- pages = rqstp->rq_maxpages;
- for (filled = 0; filled < pages; filled = ret) {
- ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
+ for (filled = 0; filled < npages; filled = ret) {
+ ret = alloc_pages_bulk(GFP_KERNEL, npages, pages);
if (ret > filled)
/* Made progress, don't sleep yet */
continue;
@@ -667,11 +666,29 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
set_current_state(TASK_RUNNING);
return false;
}
- trace_svc_alloc_arg_err(pages, ret);
+ trace_svc_alloc_arg_err(npages, ret);
memalloc_retry_wait(GFP_KERNEL);
}
- rqstp->rq_page_end = &rqstp->rq_pages[pages];
- rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
+ return true;
+}
+
+static bool svc_alloc_arg(struct svc_rqst *rqstp)
+{
+ struct xdr_buf *arg = &rqstp->rq_arg;
+ unsigned long pages;
+
+ pages = rqstp->rq_maxpages;
+
+ if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
+ return false;
+ if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
+ return false;
+ rqstp->rq_next_page = rqstp->rq_respages;
+ rqstp->rq_page_end = &rqstp->rq_respages[pages];
+ /* svc_rqst_replace_page() dereferences *rq_next_page even
+ * at rq_page_end; NULL prevents releasing a garbage page.
+ */
+ rqstp->rq_respages[pages] = NULL;
/* Make arg->head point to first page and arg->pages point to rest */
arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
@@ -1277,7 +1294,6 @@ static noinline int svc_deferred_recv(struct svc_rqst *rqstp)
rqstp->rq_addrlen = dr->addrlen;
/* Save off transport header len in case we get deferred again */
rqstp->rq_daddr = dr->daddr;
- rqstp->rq_respages = rqstp->rq_pages;
rqstp->rq_xprt_ctxt = dr->xprt_ctxt;
dr->xprt_ctxt = NULL;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index d61cd9b40491..10a298f440cc 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -351,8 +351,6 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
- rqstp->rq_respages = &rqstp->rq_pages[i];
- rqstp->rq_next_page = rqstp->rq_respages + 1;
iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
if (seek) {
@@ -677,13 +675,9 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
if (len <= rqstp->rq_arg.head[0].iov_len) {
rqstp->rq_arg.head[0].iov_len = len;
rqstp->rq_arg.page_len = 0;
- rqstp->rq_respages = rqstp->rq_pages+1;
} else {
rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
- rqstp->rq_respages = rqstp->rq_pages + 1 +
- DIV_ROUND_UP(rqstp->rq_arg.page_len, PAGE_SIZE);
}
- rqstp->rq_next_page = rqstp->rq_respages+1;
if (serv->sv_stats)
serv->sv_stats->netudpcnt++;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e7e4a39ca6c6..3081a37a5896 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -861,18 +861,12 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp,
unsigned int i;
/* Transfer the Read chunk pages into @rqstp.rq_pages, replacing
- * the rq_pages that were already allocated for this rqstp.
+ * the receive buffer pages already allocated for this rqstp.
*/
- release_pages(rqstp->rq_respages, ctxt->rc_page_count);
+ release_pages(rqstp->rq_pages, ctxt->rc_page_count);
for (i = 0; i < ctxt->rc_page_count; i++)
rqstp->rq_pages[i] = ctxt->rc_pages[i];
- /* Update @rqstp's result send buffer to start after the
- * last page in the RDMA Read payload.
- */
- rqstp->rq_respages = &rqstp->rq_pages[ctxt->rc_page_count];
- rqstp->rq_next_page = rqstp->rq_respages + 1;
-
/* Prevent svc_rdma_recv_ctxt_put() from releasing the
* pages in ctxt::rc_pages a second time.
*/
@@ -931,10 +925,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
struct svc_rdma_recv_ctxt *ctxt;
int ret;
- /* Prevent svc_xprt_release() from releasing pages in rq_pages
- * when returning 0 or an error.
+ /* Precaution: a zero page count on error return causes
+ * svc_rqst_release_pages() to release nothing.
*/
- rqstp->rq_respages = rqstp->rq_pages;
rqstp->rq_next_page = rqstp->rq_respages;
rqstp->rq_xprt_ctxt = NULL;
--
2.53.0
* Re: [RFC PATCH 2/6] sunrpc: Allocate a separate Reply page array
2026-02-22 16:19 ` [RFC PATCH 2/6] sunrpc: Allocate a separate Reply page array Chuck Lever
@ 2026-02-23 0:15 ` NeilBrown
2026-02-23 14:43 ` Chuck Lever
0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2026-02-23 0:15 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Mon, 23 Feb 2026, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> struct svc_rqst uses a single dynamically-allocated page array
> (rq_pages) for both the incoming RPC Call message and the
> outgoing RPC Reply message. rq_respages is a sliding pointer
> into rq_pages that each transport receive path must compute
> based on how many pages the Call consumed. This boundary
> tracking is a source of confusion and bugs, and prevents an
> RPC transaction from having both a large Call and a large
> Reply simultaneously.
>
> Allocate rq_respages as its own page array, eliminating the
> boundary arithmetic. This decouples Call and Reply buffer
> lifetimes, following the precedent set by rq_bvec (a separate
> dynamically-allocated array for I/O vectors).
>
> rq_next_page is initialized in svc_alloc_arg() and
> svc_process() during Reply construction, and in
> svc_rdma_recvfrom() as a precaution on error paths.
> Transport receive paths no longer compute it from the
> Call size.
I do like that you have split the one array in two - it is certainly
conceptually cleaner.
I don't like that you now allocate twice as many pages. This could have
a significant impact on a smaller server, and should at least be
highlighted in the commit description.
For NFSv3, this is a complete waste.
For NFSv4 we do have the option of returning NFS4ERR_RESOURCE for ops
that we cannot reply to.
So could we allocate reply pages on demand, either stealing them from
the end of the request buffer, or using kmalloc when the request buffer
has no spare?
More comments below..
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> include/linux/sunrpc/svc.h | 43 ++++++++++++-------------
> net/sunrpc/svc.c | 27 +++++++++++++---
> net/sunrpc/svc_xprt.c | 36 +++++++++++++++------
> net/sunrpc/svcsock.c | 6 ----
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 +++------
> 5 files changed, 73 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 4dc14c7a711b..b1fb728724f5 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -134,25 +134,24 @@ enum {
> extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>
> /*
> - * RPC Requests and replies are stored in one or more pages.
> - * We maintain an array of pages for each server thread.
> - * Requests are copied into these pages as they arrive. Remaining
> - * pages are available to write the reply into.
> + * RPC Call and Reply messages each have their own page array.
> + * rq_pages holds the incoming Call message; rq_respages holds
> + * the outgoing Reply message. Both arrays are sized to
> + * svc_serv_maxpages() entries and are allocated dynamically.
> *
> - * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each server thread
> - * needs to allocate more to replace those used in sending. To help keep track
> - * of these pages we have a receive list where all pages initialy live, and a
> - * send list where pages are moved to when there are to be part of a reply.
> + * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each
> + * server thread needs to allocate more to replace those used in
> + * sending.
> *
> - * We use xdr_buf for holding responses as it fits well with NFS
> - * read responses (that have a header, and some data pages, and possibly
> - * a tail) and means we can share some client side routines.
> + * xdr_buf holds responses; the structure fits NFS read responses
> + * (header, data pages, optional tail) and enables sharing of
> + * client-side routines.
> *
> - * The xdr_buf.head kvec always points to the first page in the rq_*pages
> - * list. The xdr_buf.pages pointer points to the second page on that
> - * list. xdr_buf.tail points to the end of the first page.
> - * This assumes that the non-page part of an rpc reply will fit
> - * in a page - NFSd ensures this. lockd also has no trouble.
> + * The xdr_buf.head kvec always points to the first page in the
> + * rq_*pages list. The xdr_buf.pages pointer points to the second
> + * page on that list. xdr_buf.tail points to the end of the first
> + * page. This assumes that the non-page part of an rpc reply will
> + * fit in a page - NFSd ensures this. lockd also has no trouble.
> */
>
> /**
> @@ -162,10 +161,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> * Returns a count of pages or vectors that can hold the maximum
> * size RPC message for @serv.
> *
> - * Each request/reply pair can have at most one "payload", plus two
> - * pages, one for the request, and one for the reply.
> - * nfsd_splice_actor() might need an extra page when a READ payload
> - * is not page-aligned.
> + * Each page array can hold at most one payload plus two
> + * overhead pages (one for the RPC header, one for tail data).
> + * nfsd_splice_actor() might need an extra page when a READ
> + * payload is not page-aligned.
> */
> static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
> {
> @@ -201,9 +200,9 @@ struct svc_rqst {
> struct xdr_stream rq_res_stream;
> struct folio *rq_scratch_folio;
> struct xdr_buf rq_res;
> - unsigned long rq_maxpages; /* num of entries in rq_pages */
> + unsigned long rq_maxpages; /* entries per page array */
> struct page * *rq_pages;
Should we now document above as /* Request buffer pages */ ??
Maybe even rename it to rq_reqpages ???
> - struct page * *rq_respages; /* points into rq_pages */
> + struct page * *rq_respages; /* Reply buffer pages */
> struct page * *rq_next_page; /* next reply page to use */
> struct page * *rq_page_end; /* one past the last page */
"one past the last request page" or "... last page in rq_respages".
>
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 05ba4040a24a..f850a2af90c2 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -639,13 +639,21 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
> {
> rqstp->rq_maxpages = svc_serv_maxpages(serv);
>
> - /* rq_pages' last entry is NULL for historical reasons. */
> rqstp->rq_pages = kcalloc_node(rqstp->rq_maxpages + 1,
You've removed the comment justifying the "+ 1" but you haven't removed
the "+ 1" !!
> sizeof(struct page *),
> GFP_KERNEL, node);
> if (!rqstp->rq_pages)
> return false;
>
> + rqstp->rq_respages = kcalloc_node(rqstp->rq_maxpages + 1,
> + sizeof(struct page *),
> + GFP_KERNEL, node);
Please justify this +1 - or remove it.
> + if (!rqstp->rq_respages) {
> + kfree(rqstp->rq_pages);
> + rqstp->rq_pages = NULL;
> + return false;
If I were writing this I would:
	rq_pages = kcalloc(...);
	rq_respages = kcalloc(...);
	if (!rq_pages || !rq_respages) {
		kfree(rq_pages);
		kfree(rq_respages);
		rq_pages = NULL;
		rq_respages = NULL;
		return false;
	}
But you might reasonably prefer your version.
> + }
> +
> return true;
> }
>
> @@ -657,10 +665,19 @@ svc_release_buffer(struct svc_rqst *rqstp)
> {
> unsigned long i;
>
> - for (i = 0; i < rqstp->rq_maxpages; i++)
> - if (rqstp->rq_pages[i])
> - put_page(rqstp->rq_pages[i]);
> - kfree(rqstp->rq_pages);
> + if (rqstp->rq_pages) {
> + for (i = 0; i < rqstp->rq_maxpages; i++)
> + if (rqstp->rq_pages[i])
> + put_page(rqstp->rq_pages[i]);
> + kfree(rqstp->rq_pages);
> + }
> +
> + if (rqstp->rq_respages) {
> + for (i = 0; i < rqstp->rq_maxpages; i++)
> + if (rqstp->rq_respages[i])
> + put_page(rqstp->rq_respages[i]);
> + kfree(rqstp->rq_respages);
> + }
> }
>
> static void
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 56a663b8939f..cd38f09c1803 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -650,14 +650,13 @@ static void svc_check_conn_limits(struct svc_serv *serv)
> }
> }
>
> -static bool svc_alloc_arg(struct svc_rqst *rqstp)
> +static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
> + unsigned long npages)
> {
> - struct xdr_buf *arg = &rqstp->rq_arg;
> - unsigned long pages, filled, ret;
> + unsigned long filled, ret;
>
> - pages = rqstp->rq_maxpages;
> - for (filled = 0; filled < pages; filled = ret) {
> - ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
> + for (filled = 0; filled < npages; filled = ret) {
> + ret = alloc_pages_bulk(GFP_KERNEL, npages, pages);
> if (ret > filled)
> /* Made progress, don't sleep yet */
> continue;
> @@ -667,11 +666,29 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
> set_current_state(TASK_RUNNING);
> return false;
> }
> - trace_svc_alloc_arg_err(pages, ret);
> + trace_svc_alloc_arg_err(npages, ret);
> memalloc_retry_wait(GFP_KERNEL);
> }
> - rqstp->rq_page_end = &rqstp->rq_pages[pages];
> - rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
> + return true;
> +}
> +
> +static bool svc_alloc_arg(struct svc_rqst *rqstp)
> +{
> + struct xdr_buf *arg = &rqstp->rq_arg;
> + unsigned long pages;
> +
> + pages = rqstp->rq_maxpages;
> +
> + if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
> + return false;
> + if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
> + return false;
> + rqstp->rq_next_page = rqstp->rq_respages;
> + rqstp->rq_page_end = &rqstp->rq_respages[pages];
> + /* svc_rqst_replace_page() dereferences *rq_next_page even
> + * at rq_page_end; NULL prevents releasing a garbage page.
> + */
> + rqstp->rq_respages[pages] = NULL;
Would
rqstp->rq_page_end[0] = NULL;
be a clearer match for the comment?
Thanks,
NeilBrown
>
> /* Make arg->head point to first page and arg->pages point to rest */
> arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
> @@ -1277,7 +1294,6 @@ static noinline int svc_deferred_recv(struct svc_rqst *rqstp)
> rqstp->rq_addrlen = dr->addrlen;
> /* Save off transport header len in case we get deferred again */
> rqstp->rq_daddr = dr->daddr;
> - rqstp->rq_respages = rqstp->rq_pages;
> rqstp->rq_xprt_ctxt = dr->xprt_ctxt;
>
> dr->xprt_ctxt = NULL;
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index d61cd9b40491..10a298f440cc 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -351,8 +351,6 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
>
> for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
> bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
> - rqstp->rq_respages = &rqstp->rq_pages[i];
> - rqstp->rq_next_page = rqstp->rq_respages + 1;
>
> iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
> if (seek) {
> @@ -677,13 +675,9 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> if (len <= rqstp->rq_arg.head[0].iov_len) {
> rqstp->rq_arg.head[0].iov_len = len;
> rqstp->rq_arg.page_len = 0;
> - rqstp->rq_respages = rqstp->rq_pages+1;
> } else {
> rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
> - rqstp->rq_respages = rqstp->rq_pages + 1 +
> - DIV_ROUND_UP(rqstp->rq_arg.page_len, PAGE_SIZE);
> }
> - rqstp->rq_next_page = rqstp->rq_respages+1;
>
> if (serv->sv_stats)
> serv->sv_stats->netudpcnt++;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index e7e4a39ca6c6..3081a37a5896 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -861,18 +861,12 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp,
> unsigned int i;
>
> /* Transfer the Read chunk pages into @rqstp.rq_pages, replacing
> - * the rq_pages that were already allocated for this rqstp.
> + * the receive buffer pages already allocated for this rqstp.
> */
> - release_pages(rqstp->rq_respages, ctxt->rc_page_count);
> + release_pages(rqstp->rq_pages, ctxt->rc_page_count);
> for (i = 0; i < ctxt->rc_page_count; i++)
> rqstp->rq_pages[i] = ctxt->rc_pages[i];
>
> - /* Update @rqstp's result send buffer to start after the
> - * last page in the RDMA Read payload.
> - */
> - rqstp->rq_respages = &rqstp->rq_pages[ctxt->rc_page_count];
> - rqstp->rq_next_page = rqstp->rq_respages + 1;
> -
> /* Prevent svc_rdma_recv_ctxt_put() from releasing the
> * pages in ctxt::rc_pages a second time.
> */
> @@ -931,10 +925,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> struct svc_rdma_recv_ctxt *ctxt;
> int ret;
>
> - /* Prevent svc_xprt_release() from releasing pages in rq_pages
> - * when returning 0 or an error.
> + /* Precaution: a zero page count on error return causes
> + * svc_rqst_release_pages() to release nothing.
> */
> - rqstp->rq_respages = rqstp->rq_pages;
> rqstp->rq_next_page = rqstp->rq_respages;
>
> rqstp->rq_xprt_ctxt = NULL;
> --
> 2.53.0
>
>
* Re: [RFC PATCH 2/6] sunrpc: Allocate a separate Reply page array
2026-02-23 0:15 ` NeilBrown
@ 2026-02-23 14:43 ` Chuck Lever
0 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2026-02-23 14:43 UTC (permalink / raw)
To: NeilBrown
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Sun, Feb 22, 2026, at 7:15 PM, NeilBrown wrote:
> On Mon, 23 Feb 2026, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> struct svc_rqst uses a single dynamically-allocated page array
>> (rq_pages) for both the incoming RPC Call message and the
>> outgoing RPC Reply message. rq_respages is a sliding pointer
>> into rq_pages that each transport receive path must compute
>> based on how many pages the Call consumed. This boundary
>> tracking is a source of confusion and bugs, and prevents an
>> RPC transaction from having both a large Call and a large
>> Reply simultaneously.
>>
>> Allocate rq_respages as its own page array, eliminating the
>> boundary arithmetic. This decouples Call and Reply buffer
>> lifetimes, following the precedent set by rq_bvec (a separate
>> dynamically-allocated array for I/O vectors).
>>
>> rq_next_page is initialized in svc_alloc_arg() and
>> svc_process() during Reply construction, and in
>> svc_rdma_recvfrom() as a precaution on error paths.
>> Transport receive paths no longer compute it from the
>> Call size.
>
> I do like that you have split the one array in two - it is certainly
> conceptually cleaner.
>
> I don't like that you now allocate twice as many pages. This could have
> a significant impact on a smaller server, and should at least be
> highlighted in the commit description.
The number of pages nailed down by NFSD is directly controlled by the
number of nfsd threads. So on small-memory servers, run fewer threads.
We have a dynamic thread count facility now, which keeps the number
of threads minimal on an idle server.
For 16 threads and 1MB maximum payload size, that's an extra 16MB of
nailed memory. That might have been a lot of memory to nail down on
typical systems 25 years ago, but these days, I don't think it's that
significant and it's entirely under the control of the NFSD
administrator.
> For NFSv3, this is a complete waste.
Perhaps, but I don't see a way around that. nfsd threads are
fungible: they are called upon to handle all NFS versions, and
thus have to be prepared to deal with either NFSv3 or NFSv4
requests.
> For NFSv4 we do have the option of returning NFS4ERR_RESOURCE for ops
> that we cannot reply to.
NFS4ERR_RESOURCE is NFSv4.0-only. NFSv4.1 and above advertise their
maximum message sizes via CREATE_SESSION, and currently those numbers
are a lie.
> So could we allocate reply pages on demand, either stealing them from
> the end of the request buffer, or using kmalloc when the request buffer
> has no spare?
The most pernicious cost of the second array is the CPU time spent
repopulating it, and subsequent patches in this series deal with that.
If NFSD starts stealing pages out of rq_pages for building the reply,
then alloc_pages_bulk() will have to walk the entire array to ensure
all pages are filled, and that's something I'd like to avoid.
As far as allocating pages on demand, there is a cost to that as
well (in terms of contention with other memory allocators) that I
want to keep out of the I/O paths.
--
Chuck Lever
* [RFC PATCH 3/6] sunrpc: Handle NULL entries in svc_rqst_release_pages
2026-02-22 16:19 [RFC PATCH 0/6] Optimize NFSD buffer page management Chuck Lever
2026-02-22 16:19 ` [RFC PATCH 1/6] sunrpc: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
2026-02-22 16:19 ` [RFC PATCH 2/6] sunrpc: Allocate a separate Reply page array Chuck Lever
@ 2026-02-22 16:19 ` Chuck Lever
2026-02-22 16:20 ` [RFC PATCH 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages Chuck Lever
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2026-02-22 16:19 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_release_pages() releases response pages between rq_respages
and rq_next_page. It currently passes the entire range to
release_pages(), which does not expect NULL entries.
A subsequent patch preserves the rq_next_page pointer in
svc_rdma_save_io_pages() so that it accurately records how many
response pages were consumed. After that change, the range
[rq_respages, rq_next_page) can contain NULL entries where pages
have already been transferred to a send context.
Iterate through the range entry by entry, skipping NULLs, to handle
this case correctly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svc.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index f850a2af90c2..620de9abedbb 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -989,18 +989,24 @@ EXPORT_SYMBOL_GPL(svc_rqst_replace_page);
* svc_rqst_release_pages - Release Reply buffer pages
* @rqstp: RPC transaction context
*
- * Release response pages that might still be in flight after
- * svc_send, and any spliced filesystem-owned pages.
+ * Release response pages in the range [rq_respages, rq_next_page).
+ * NULL entries in this range are skipped, allowing transports to
+ * transfer pages to a send context before this function runs.
*/
void svc_rqst_release_pages(struct svc_rqst *rqstp)
{
- int i, count = rqstp->rq_next_page - rqstp->rq_respages;
+ struct page **pp;
- if (count) {
- release_pages(rqstp->rq_respages, count);
- for (i = 0; i < count; i++)
- rqstp->rq_respages[i] = NULL;
+ for (pp = rqstp->rq_respages; pp < rqstp->rq_next_page; pp++) {
+ if (*pp) {
+ if (!folio_batch_add(&rqstp->rq_fbatch,
+ page_folio(*pp)))
+ __folio_batch_release(&rqstp->rq_fbatch);
+ *pp = NULL;
+ }
}
+ if (rqstp->rq_fbatch.nr)
+ __folio_batch_release(&rqstp->rq_fbatch);
}
/**
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
2026-02-22 16:19 [RFC PATCH 0/6] Optimize NFSD buffer page management Chuck Lever
` (2 preceding siblings ...)
2026-02-22 16:19 ` [RFC PATCH 3/6] sunrpc: Handle NULL entries in svc_rqst_release_pages Chuck Lever
@ 2026-02-22 16:20 ` Chuck Lever
2026-02-22 16:20 ` [RFC PATCH 5/6] sunrpc: Track consumed rq_pages entries Chuck Lever
2026-02-22 16:20 ` [RFC PATCH 6/6] sunrpc: Optimize rq_respages allocation in svc_alloc_arg Chuck Lever
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2026-02-22 16:20 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_save_io_pages() transfers response pages to the send
context and sets those slots to NULL. It then resets rq_next_page
to equal rq_respages, hiding the NULL region from
svc_rqst_release_pages().
Now that svc_rqst_release_pages() handles NULL entries, this
reset is no longer necessary. Removing it preserves the
invariant that the range [rq_respages, rq_next_page) accurately
describes how many response pages were consumed, enabling a
subsequent optimization in svc_alloc_arg() that refills only
the consumed range.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 914cd263c2f1..17c8429da9d5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -858,7 +858,8 @@ int svc_rdma_map_reply_msg(struct svcxprt_rdma *rdma,
/* The svc_rqst and all resources it owns are released as soon as
* svc_rdma_sendto returns. Transfer pages under I/O to the ctxt
- * so they are released by the Send completion handler.
+ * so they are released only after Send completion, and not by
+ * svc_rqst_release_pages().
*/
static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
struct svc_rdma_send_ctxt *ctxt)
@@ -870,9 +871,6 @@ static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
ctxt->sc_pages[i] = rqstp->rq_respages[i];
rqstp->rq_respages[i] = NULL;
}
-
- /* Prevent svc_xprt_release from releasing pages in rq_pages */
- rqstp->rq_next_page = rqstp->rq_respages;
}
/* Prepare the portion of the RPC Reply that will be transmitted
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* [RFC PATCH 5/6] sunrpc: Track consumed rq_pages entries
2026-02-22 16:19 [RFC PATCH 0/6] Optimize NFSD buffer page management Chuck Lever
` (3 preceding siblings ...)
2026-02-22 16:20 ` [RFC PATCH 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages Chuck Lever
@ 2026-02-22 16:20 ` Chuck Lever
2026-02-23 0:19 ` NeilBrown
2026-02-22 16:20 ` [RFC PATCH 6/6] sunrpc: Optimize rq_respages allocation in svc_alloc_arg Chuck Lever
5 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2026-02-22 16:20 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Transports that consume pages from the rq_pages array --
svc_tcp_save_pages() for TCP fragment reassembly and
svc_rdma_clear_rqst_pages() for RDMA Read I/O -- NULL
the consumed entries starting at rq_pages[0]. A new
rq_pages_nfree field in struct svc_rqst records the
count of entries consumed.
svc_alloc_arg() uses rq_pages_nfree to refill only
the consumed entries rather than scanning the full
rq_pages array. In steady state, the transport consume
path NULLs a handful of entries per RPC, so the
allocator visits only those entries instead of the
full ~259 slots (for 1MB messages).
svc_init_buffer() initializes rq_pages_nfree to
rq_maxpages so that the first svc_alloc_arg() call
populates every entry.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 8 ++++++++
net/sunrpc/svc.c | 1 +
net/sunrpc/svc_xprt.c | 11 ++++++++---
net/sunrpc/svcsock.c | 1 +
net/sunrpc/xprtrdma/svc_rdma_rw.c | 1 +
5 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index b1fb728724f5..bb2029e396db 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -143,6 +143,13 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* server thread needs to allocate more to replace those used in
* sending.
*
+ * Transport page-consumption contract:
+ *
+ * Transports that consume rq_pages entries (for TCP fragment
+ * reassembly or RDMA Read I/O) NULL entries starting at
+ * rq_pages[0] and set rq_pages_nfree to the count of entries
+ * consumed. svc_alloc_arg() refills only that many entries.
+ *
* xdr_buf holds responses; the structure fits NFS read responses
* (header, data pages, optional tail) and enables sharing of
* client-side routines.
@@ -201,6 +208,7 @@ struct svc_rqst {
struct folio *rq_scratch_folio;
struct xdr_buf rq_res;
unsigned long rq_maxpages; /* entries per page array */
+ unsigned long rq_pages_nfree; /* NULL rq_pages to refill */
struct page * *rq_pages;
struct page * *rq_respages; /* Reply buffer pages */
struct page * *rq_next_page; /* next reply page to use */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 620de9abedbb..a8c26634ecf1 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -654,6 +654,7 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
return false;
}
+ rqstp->rq_pages_nfree = rqstp->rq_maxpages;
return true;
}
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index cd38f09c1803..9d8f6adcfe1f 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -675,12 +675,17 @@ static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
static bool svc_alloc_arg(struct svc_rqst *rqstp)
{
struct xdr_buf *arg = &rqstp->rq_arg;
- unsigned long pages;
+ unsigned long pages, nfree;
pages = rqstp->rq_maxpages;
- if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
- return false;
+ nfree = rqstp->rq_pages_nfree;
+ if (nfree) {
+ if (!svc_fill_pages(rqstp, rqstp->rq_pages, nfree))
+ return false;
+ rqstp->rq_pages_nfree = 0;
+ }
+
if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
return false;
rqstp->rq_next_page = rqstp->rq_respages;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 10a298f440cc..f9140ac8ed99 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1009,6 +1009,7 @@ static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
svsk->sk_pages[i] = rqstp->rq_pages[i];
rqstp->rq_pages[i] = NULL;
}
+ rqstp->rq_pages_nfree = npages;
}
static void svc_tcp_clear_pages(struct svc_sock *svsk)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 4ec2f9ae06aa..cf4a1762b629 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -1107,6 +1107,7 @@ static void svc_rdma_clear_rqst_pages(struct svc_rqst *rqstp,
head->rc_pages[i] = rqstp->rq_pages[i];
rqstp->rq_pages[i] = NULL;
}
+ rqstp->rq_pages_nfree = head->rc_page_count;
}
/**
--
2.53.0
* Re: [RFC PATCH 5/6] sunrpc: Track consumed rq_pages entries
2026-02-22 16:20 ` [RFC PATCH 5/6] sunrpc: Track consumed rq_pages entries Chuck Lever
@ 2026-02-23 0:19 ` NeilBrown
0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2026-02-23 0:19 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
Chuck Lever
On Mon, 23 Feb 2026, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Transports that consume pages from the rq_pages array --
> svc_tcp_save_pages() for TCP fragment reassembly and
> svc_rdma_clear_rqst_pages() for RDMA Read I/O -- NULL
> the consumed entries starting at rq_pages[0]. A new
> rq_pages_nfree field in struct svc_rqst records the
> count of entries consumed.
Maybe I'm confused, but isn't it rq_respages which might get consumed.
rq_pages - containing the request - doesn't .... does it?
Thanks
NeilBrown
* [RFC PATCH 6/6] sunrpc: Optimize rq_respages allocation in svc_alloc_arg
2026-02-22 16:19 [RFC PATCH 0/6] Optimize NFSD buffer page management Chuck Lever
` (4 preceding siblings ...)
2026-02-22 16:20 ` [RFC PATCH 5/6] sunrpc: Track consumed rq_pages entries Chuck Lever
@ 2026-02-22 16:20 ` Chuck Lever
5 siblings, 0 replies; 10+ messages in thread
From: Chuck Lever @ 2026-02-22 16:20 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_alloc_arg() invokes alloc_pages_bulk() with the full
rq_maxpages count (~259 for 1MB messages) for the rq_respages
array, causing a full-array scan despite most slots holding valid
pages.
svc_rqst_release_pages() NULLs only the range
[rq_respages, rq_next_page) after each RPC, so only that range
contains NULL entries. Limit the rq_respages fill in svc_alloc_arg()
to that range instead of scanning the full array.
svc_init_buffer() initializes rq_next_page to point one past the
end of the rq_respages array, so the first svc_alloc_arg() call
fills every slot.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 4 ++++
net/sunrpc/svc.c | 1 +
net/sunrpc/svc_xprt.c | 8 +++++++-
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index bb2029e396db..35e9b34e2190 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -150,6 +150,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* rq_pages[0] and set rq_pages_nfree to the count of entries
* consumed. svc_alloc_arg() refills only that many entries.
*
+ * For rq_respages, svc_rqst_release_pages() NULLs entries in
+ * [rq_respages, rq_next_page) after each RPC. svc_alloc_arg()
+ * refills only that range.
+ *
* xdr_buf holds responses; the structure fits NFS read responses
* (header, data pages, optional tail) and enables sharing of
* client-side routines.
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index a8c26634ecf1..ef18365c9d9a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -655,6 +655,7 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
}
rqstp->rq_pages_nfree = rqstp->rq_maxpages;
+ rqstp->rq_next_page = rqstp->rq_respages + rqstp->rq_maxpages;
return true;
}
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 9d8f6adcfe1f..9b3425f57068 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -686,8 +686,14 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
rqstp->rq_pages_nfree = 0;
}
- if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
+ if (WARN_ON_ONCE(rqstp->rq_next_page < rqstp->rq_respages))
return false;
+ nfree = rqstp->rq_next_page - rqstp->rq_respages;
+ if (nfree) {
+ if (!svc_fill_pages(rqstp, rqstp->rq_respages, nfree))
+ return false;
+ }
+
rqstp->rq_next_page = rqstp->rq_respages;
rqstp->rq_page_end = &rqstp->rq_respages[pages];
/* svc_rqst_replace_page() dereferences *rq_next_page even
--
2.53.0