* [PATCH v2 0/6] Optimize NFSD buffer page management
@ 2026-02-26 14:47 Chuck Lever
2026-02-26 14:47 ` [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
This series solves two problems. First:
NFSv3 operations have complementary Request and Response sizes.
When a Request message is large, the corresponding Response
message is small, and vice versa. The sum of the two message
sizes is never more than the maximum transport payload size. So
NFSD could get away with maintaining a single array of pages,
split between the RPC Send and Receive buffers.
NFSv4 is not as cut and dried. An NFSv4 client may construct an
NFSv4 COMPOUND that is arbitrarily complex, mixing operations
that can have large Request size with operations that have a
large Response size. The resulting server-side buffer size
requirement can be larger than the maximum transport payload size.
Therefore we must increase the allocated RPC Call landing zone and
the RPC Reply construction zone to ensure that arbitrary NFSv4
COMPOUNDs can be handled.
Second:
Due to the above, and because NFSD can now handle payload sizes
considerably larger than 1MB, the number of array entries that
alloc_pages_bulk() walks through to reset the rqst page arrays
after each RPC completes has increased dramatically.
But we observe that the mean size of NFS requests remains smaller
than a few pages. If only a few pages are consumed while processing
each RPC, then traversing all of the pages in the page arrays for
refills is wasted effort. The CPU cost of walking these arrays is
noticeable in "perf" captures.
It would be more efficient to keep track of which entries need to
be refilled, since that number is likely to be small in the common
case, and to use alloc_pages_bulk() to fill only those entries.
---
Changes since RFC:
- Clarify a number of comments based on review (NeilBrown)
- Possible NFSv3 waste is still open for discussion
Chuck Lever (6):
SUNRPC: Tighten bounds checking in svc_rqst_replace_page
SUNRPC: Allocate a separate Reply page array
SUNRPC: Handle NULL entries in svc_rqst_release_pages
svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
SUNRPC: Track consumed rq_pages entries
SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
include/linux/sunrpc/svc.h | 61 +++++++++++++++----------
net/sunrpc/svc.c | 59 +++++++++++++++++-------
net/sunrpc/svc_xprt.c | 47 +++++++++++++++----
net/sunrpc/svcsock.c | 7 +--
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 ++----
net/sunrpc/xprtrdma/svc_rdma_rw.c | 1 +
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 +--
7 files changed, 125 insertions(+), 71 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
@ 2026-02-26 14:47 ` Chuck Lever
2026-03-10 18:01 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 2/6] SUNRPC: Allocate a separate Reply page array Chuck Lever
` (5 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_replace_page() builds the Reply buffer by advancing
rq_next_page through the response page range. The bounds
check validates rq_next_page against the full rq_pages array,
but the valid range for rq_next_page is
[rq_respages, rq_page_end].
Use those bounds instead.
This is correct today because rq_respages and rq_page_end
both point into rq_pages, and it prepares for a subsequent
change that separates the Reply page array from rq_pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svc.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index d8ccb8e4b5c2..f7ec02457328 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -934,11 +934,11 @@ svc_set_num_threads(struct svc_serv *serv, unsigned int min_threads,
EXPORT_SYMBOL_GPL(svc_set_num_threads);
/**
- * svc_rqst_replace_page - Replace one page in rq_pages[]
+ * svc_rqst_replace_page - Replace one page in rq_respages[]
* @rqstp: svc_rqst with pages to replace
* @page: replacement page
*
- * When replacing a page in rq_pages, batch the release of the
+ * When replacing a page in rq_respages, batch the release of the
* replaced pages to avoid hammering the page allocator.
*
* Return values:
@@ -947,8 +947,8 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
*/
bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
{
- struct page **begin = rqstp->rq_pages;
- struct page **end = &rqstp->rq_pages[rqstp->rq_maxpages];
+ struct page **begin = rqstp->rq_respages;
+ struct page **end = rqstp->rq_page_end;
if (unlikely(rqstp->rq_next_page < begin || rqstp->rq_next_page > end)) {
trace_svc_replace_page_err(rqstp);
--
2.53.0
* [PATCH v2 2/6] SUNRPC: Allocate a separate Reply page array
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
2026-02-26 14:47 ` [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
@ 2026-02-26 14:47 ` Chuck Lever
2026-03-10 18:10 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 3/6] SUNRPC: Handle NULL entries in svc_rqst_release_pages Chuck Lever
` (4 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
struct svc_rqst uses a single dynamically-allocated page array
(rq_pages) for both the incoming RPC Call message and the outgoing
RPC Reply message. rq_respages is a sliding pointer into rq_pages
that each transport receive path must compute based on how many
pages the Call consumed. This boundary tracking is a source of
confusion and bugs, and prevents an RPC transaction from having
both a large Call and a large Reply simultaneously.
Allocate rq_respages as its own page array, eliminating the boundary
arithmetic. This decouples Call and Reply buffer lifetimes,
following the precedent set by rq_bvec (a separate dynamically-
allocated array for I/O vectors).
Each svc_rqst now pins twice as many pages as before. For a server
running 16 threads with a 1MB maximum payload, the additional cost
is roughly 16MB of pinned memory. The new dynamic svc thread count
facility keeps this overhead minimal on an idle server. A subsequent
patch in this series limits per-request repopulation to only the
pages released during the previous RPC, avoiding a full-array scan
on each call to svc_alloc_arg().
Note: We've considered several alternatives to maintaining a full
second array. Each alternative reintroduces either boundary logic
complexity or I/O-path allocation pressure.
rq_next_page is initialized in svc_alloc_arg() and svc_process()
during Reply construction, and in svc_rdma_recvfrom() as a
precaution on error paths. Transport receive paths no longer compute
it from the Call size.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 47 ++++++++++++-------------
net/sunrpc/svc.c | 29 ++++++++++++---
net/sunrpc/svc_xprt.c | 36 +++++++++++++------
net/sunrpc/svcsock.c | 6 ----
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 +++-----
5 files changed, 77 insertions(+), 56 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4dc14c7a711b..3559de664f64 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -134,25 +134,24 @@ enum {
extern u32 svc_max_payload(const struct svc_rqst *rqstp);
/*
- * RPC Requests and replies are stored in one or more pages.
- * We maintain an array of pages for each server thread.
- * Requests are copied into these pages as they arrive. Remaining
- * pages are available to write the reply into.
+ * RPC Call and Reply messages each have their own page array.
+ * rq_pages holds the incoming Call message; rq_respages holds
+ * the outgoing Reply message. Both arrays are sized to
+ * svc_serv_maxpages() entries and are allocated dynamically.
*
- * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each server thread
- * needs to allocate more to replace those used in sending. To help keep track
- * of these pages we have a receive list where all pages initialy live, and a
- * send list where pages are moved to when there are to be part of a reply.
+ * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each
+ * server thread needs to allocate more to replace those used in
+ * sending.
*
- * We use xdr_buf for holding responses as it fits well with NFS
- * read responses (that have a header, and some data pages, and possibly
- * a tail) and means we can share some client side routines.
+ * xdr_buf holds responses; the structure fits NFS read responses
+ * (header, data pages, optional tail) and enables sharing of
+ * client-side routines.
*
- * The xdr_buf.head kvec always points to the first page in the rq_*pages
- * list. The xdr_buf.pages pointer points to the second page on that
- * list. xdr_buf.tail points to the end of the first page.
- * This assumes that the non-page part of an rpc reply will fit
- * in a page - NFSd ensures this. lockd also has no trouble.
+ * The xdr_buf.head kvec always points to the first page in the
+ * rq_*pages list. The xdr_buf.pages pointer points to the second
+ * page on that list. xdr_buf.tail points to the end of the first
+ * page. This assumes that the non-page part of an rpc reply will
+ * fit in a page - NFSd ensures this. lockd also has no trouble.
*/
/**
@@ -162,10 +161,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* Returns a count of pages or vectors that can hold the maximum
* size RPC message for @serv.
*
- * Each request/reply pair can have at most one "payload", plus two
- * pages, one for the request, and one for the reply.
- * nfsd_splice_actor() might need an extra page when a READ payload
- * is not page-aligned.
+ * Each page array can hold at most one payload plus two
+ * overhead pages (one for the RPC header, one for tail data).
+ * nfsd_splice_actor() might need an extra page when a READ
+ * payload is not page-aligned.
*/
static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
{
@@ -201,11 +200,11 @@ struct svc_rqst {
struct xdr_stream rq_res_stream;
struct folio *rq_scratch_folio;
struct xdr_buf rq_res;
- unsigned long rq_maxpages; /* num of entries in rq_pages */
- struct page * *rq_pages;
- struct page * *rq_respages; /* points into rq_pages */
+ unsigned long rq_maxpages; /* entries per page array */
+ struct page * *rq_pages; /* Call buffer pages */
+ struct page * *rq_respages; /* Reply buffer pages */
struct page * *rq_next_page; /* next reply page to use */
- struct page * *rq_page_end; /* one past the last page */
+ struct page * *rq_page_end; /* one past the last reply page */
struct folio_batch rq_fbatch;
struct bio_vec *rq_bvec;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index f7ec02457328..9abef638b1e0 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -638,13 +638,23 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
{
rqstp->rq_maxpages = svc_serv_maxpages(serv);
- /* rq_pages' last entry is NULL for historical reasons. */
+ /* +1 for a NULL sentinel readable by nfsd_splice_actor() */
rqstp->rq_pages = kcalloc_node(rqstp->rq_maxpages + 1,
sizeof(struct page *),
GFP_KERNEL, node);
if (!rqstp->rq_pages)
return false;
+ /* +1 for a NULL sentinel at rq_page_end (see svc_rqst_replace_page) */
+ rqstp->rq_respages = kcalloc_node(rqstp->rq_maxpages + 1,
+ sizeof(struct page *),
+ GFP_KERNEL, node);
+ if (!rqstp->rq_respages) {
+ kfree(rqstp->rq_pages);
+ rqstp->rq_pages = NULL;
+ return false;
+ }
+
return true;
}
@@ -656,10 +666,19 @@ svc_release_buffer(struct svc_rqst *rqstp)
{
unsigned long i;
- for (i = 0; i < rqstp->rq_maxpages; i++)
- if (rqstp->rq_pages[i])
- put_page(rqstp->rq_pages[i]);
- kfree(rqstp->rq_pages);
+ if (rqstp->rq_pages) {
+ for (i = 0; i < rqstp->rq_maxpages; i++)
+ if (rqstp->rq_pages[i])
+ put_page(rqstp->rq_pages[i]);
+ kfree(rqstp->rq_pages);
+ }
+
+ if (rqstp->rq_respages) {
+ for (i = 0; i < rqstp->rq_maxpages; i++)
+ if (rqstp->rq_respages[i])
+ put_page(rqstp->rq_respages[i]);
+ kfree(rqstp->rq_respages);
+ }
}
static void
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 56a663b8939f..e027765f4307 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -650,14 +650,13 @@ static void svc_check_conn_limits(struct svc_serv *serv)
}
}
-static bool svc_alloc_arg(struct svc_rqst *rqstp)
+static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
+ unsigned long npages)
{
- struct xdr_buf *arg = &rqstp->rq_arg;
- unsigned long pages, filled, ret;
+ unsigned long filled, ret;
- pages = rqstp->rq_maxpages;
- for (filled = 0; filled < pages; filled = ret) {
- ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
+ for (filled = 0; filled < npages; filled = ret) {
+ ret = alloc_pages_bulk(GFP_KERNEL, npages, pages);
if (ret > filled)
/* Made progress, don't sleep yet */
continue;
@@ -667,11 +666,29 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
set_current_state(TASK_RUNNING);
return false;
}
- trace_svc_alloc_arg_err(pages, ret);
+ trace_svc_alloc_arg_err(npages, ret);
memalloc_retry_wait(GFP_KERNEL);
}
- rqstp->rq_page_end = &rqstp->rq_pages[pages];
- rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
+ return true;
+}
+
+static bool svc_alloc_arg(struct svc_rqst *rqstp)
+{
+ struct xdr_buf *arg = &rqstp->rq_arg;
+ unsigned long pages;
+
+ pages = rqstp->rq_maxpages;
+
+ if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
+ return false;
+ if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
+ return false;
+ rqstp->rq_next_page = rqstp->rq_respages;
+ rqstp->rq_page_end = &rqstp->rq_respages[pages];
+ /* svc_rqst_replace_page() dereferences *rq_next_page even
+ * at rq_page_end; NULL prevents releasing a garbage page.
+ */
+ rqstp->rq_page_end[0] = NULL;
/* Make arg->head point to first page and arg->pages point to rest */
arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
@@ -1277,7 +1294,6 @@ static noinline int svc_deferred_recv(struct svc_rqst *rqstp)
rqstp->rq_addrlen = dr->addrlen;
/* Save off transport header len in case we get deferred again */
rqstp->rq_daddr = dr->daddr;
- rqstp->rq_respages = rqstp->rq_pages;
rqstp->rq_xprt_ctxt = dr->xprt_ctxt;
dr->xprt_ctxt = NULL;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f28c6076f7e8..c86f28f720f7 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -351,8 +351,6 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
- rqstp->rq_respages = &rqstp->rq_pages[i];
- rqstp->rq_next_page = rqstp->rq_respages + 1;
iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
if (seek) {
@@ -677,13 +675,9 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
if (len <= rqstp->rq_arg.head[0].iov_len) {
rqstp->rq_arg.head[0].iov_len = len;
rqstp->rq_arg.page_len = 0;
- rqstp->rq_respages = rqstp->rq_pages+1;
} else {
rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
- rqstp->rq_respages = rqstp->rq_pages + 1 +
- DIV_ROUND_UP(rqstp->rq_arg.page_len, PAGE_SIZE);
}
- rqstp->rq_next_page = rqstp->rq_respages+1;
if (serv->sv_stats)
serv->sv_stats->netudpcnt++;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e7e4a39ca6c6..3081a37a5896 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -861,18 +861,12 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp,
unsigned int i;
/* Transfer the Read chunk pages into @rqstp.rq_pages, replacing
- * the rq_pages that were already allocated for this rqstp.
+ * the receive buffer pages already allocated for this rqstp.
*/
- release_pages(rqstp->rq_respages, ctxt->rc_page_count);
+ release_pages(rqstp->rq_pages, ctxt->rc_page_count);
for (i = 0; i < ctxt->rc_page_count; i++)
rqstp->rq_pages[i] = ctxt->rc_pages[i];
- /* Update @rqstp's result send buffer to start after the
- * last page in the RDMA Read payload.
- */
- rqstp->rq_respages = &rqstp->rq_pages[ctxt->rc_page_count];
- rqstp->rq_next_page = rqstp->rq_respages + 1;
-
/* Prevent svc_rdma_recv_ctxt_put() from releasing the
* pages in ctxt::rc_pages a second time.
*/
@@ -931,10 +925,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
struct svc_rdma_recv_ctxt *ctxt;
int ret;
- /* Prevent svc_xprt_release() from releasing pages in rq_pages
- * when returning 0 or an error.
+ /* Precaution: a zero page count on error return causes
+ * svc_rqst_release_pages() to release nothing.
*/
- rqstp->rq_respages = rqstp->rq_pages;
rqstp->rq_next_page = rqstp->rq_respages;
rqstp->rq_xprt_ctxt = NULL;
--
2.53.0
* [PATCH v2 3/6] SUNRPC: Handle NULL entries in svc_rqst_release_pages
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
2026-02-26 14:47 ` [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
2026-02-26 14:47 ` [PATCH v2 2/6] SUNRPC: Allocate a separate Reply page array Chuck Lever
@ 2026-02-26 14:47 ` Chuck Lever
2026-03-10 18:11 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages Chuck Lever
` (3 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rqst_release_pages() releases response pages between rq_respages
and rq_next_page. It currently passes the entire range to
release_pages(), which does not expect NULL entries.
A subsequent patch preserves the rq_next_page pointer in
svc_rdma_save_io_pages() so that it accurately records how many
response pages were consumed. After that change, the range
[rq_respages, rq_next_page)
can contain NULL entries where pages have already been transferred
to a send context.
Iterate through the range entry by entry, skipping NULLs, to handle
this case correctly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svc.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 9abef638b1e0..0ce16e9abdf6 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -990,18 +990,24 @@ EXPORT_SYMBOL_GPL(svc_rqst_replace_page);
* svc_rqst_release_pages - Release Reply buffer pages
* @rqstp: RPC transaction context
*
- * Release response pages that might still be in flight after
- * svc_send, and any spliced filesystem-owned pages.
+ * Release response pages in the range [rq_respages, rq_next_page).
+ * NULL entries in this range are skipped, allowing transports to
+ * transfer pages to a send context before this function runs.
*/
void svc_rqst_release_pages(struct svc_rqst *rqstp)
{
- int i, count = rqstp->rq_next_page - rqstp->rq_respages;
+ struct page **pp;
- if (count) {
- release_pages(rqstp->rq_respages, count);
- for (i = 0; i < count; i++)
- rqstp->rq_respages[i] = NULL;
+ for (pp = rqstp->rq_respages; pp < rqstp->rq_next_page; pp++) {
+ if (*pp) {
+ if (!folio_batch_add(&rqstp->rq_fbatch,
+ page_folio(*pp)))
+ __folio_batch_release(&rqstp->rq_fbatch);
+ *pp = NULL;
+ }
}
+ if (rqstp->rq_fbatch.nr)
+ __folio_batch_release(&rqstp->rq_fbatch);
}
/**
--
2.53.0
* [PATCH v2 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
` (2 preceding siblings ...)
2026-02-26 14:47 ` [PATCH v2 3/6] SUNRPC: Handle NULL entries in svc_rqst_release_pages Chuck Lever
@ 2026-02-26 14:47 ` Chuck Lever
2026-03-10 18:13 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 5/6] SUNRPC: Track consumed rq_pages entries Chuck Lever
` (2 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_save_io_pages() transfers response pages to the send
context and sets those slots to NULL. It then resets rq_next_page to
equal rq_respages, hiding the NULL region from
svc_rqst_release_pages().
Now that svc_rqst_release_pages() handles NULL entries, this reset
is no longer necessary. Removing it preserves the invariant that the
range [rq_respages, rq_next_page) accurately describes how many
response pages were consumed, enabling a subsequent optimization in
svc_alloc_arg() that refills only the consumed range.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 914cd263c2f1..17c8429da9d5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -858,7 +858,8 @@ int svc_rdma_map_reply_msg(struct svcxprt_rdma *rdma,
/* The svc_rqst and all resources it owns are released as soon as
* svc_rdma_sendto returns. Transfer pages under I/O to the ctxt
- * so they are released by the Send completion handler.
+ * so they are released only after Send completion, and not by
+ * svc_rqst_release_pages().
*/
static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
struct svc_rdma_send_ctxt *ctxt)
@@ -870,9 +871,6 @@ static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
ctxt->sc_pages[i] = rqstp->rq_respages[i];
rqstp->rq_respages[i] = NULL;
}
-
- /* Prevent svc_xprt_release from releasing pages in rq_pages */
- rqstp->rq_next_page = rqstp->rq_respages;
}
/* Prepare the portion of the RPC Reply that will be transmitted
--
2.53.0
* [PATCH v2 5/6] SUNRPC: Track consumed rq_pages entries
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
` (3 preceding siblings ...)
2026-02-26 14:47 ` [PATCH v2 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages Chuck Lever
@ 2026-02-26 14:47 ` Chuck Lever
2026-03-10 18:16 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 6/6] SUNRPC: Optimize rq_respages allocation in svc_alloc_arg Chuck Lever
2026-03-10 18:19 ` [PATCH v2 0/6] Optimize NFSD buffer page management Jeff Layton
6 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The rq_pages array holds pages allocated for incoming RPC requests.
Two transport receive paths NULL entries in rq_pages to prevent
svc_rqst_release_pages() from freeing pages that the transport has
taken ownership of:
- svc_tcp_save_pages() moves partial request data pages to
svsk->sk_pages during multi-fragment TCP reassembly.
- svc_rdma_clear_rqst_pages() moves request data pages to
head->rc_pages because they are targets of active RDMA Read WRs.
A new rq_pages_nfree field in struct svc_rqst records how many
entries were NULLed. svc_alloc_arg() uses it to refill only those
entries rather than scanning the full rq_pages array. In steady
state, the transport NULLs a handful of entries per RPC, so the
allocator visits only those entries instead of the full ~259 slots
(for 1MB messages).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 10 ++++++++++
net/sunrpc/svc.c | 1 +
net/sunrpc/svc_xprt.c | 11 ++++++++---
net/sunrpc/svcsock.c | 1 +
net/sunrpc/xprtrdma/svc_rdma_rw.c | 1 +
5 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 3559de664f64..b5a842dd97a4 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -143,6 +143,15 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* server thread needs to allocate more to replace those used in
* sending.
*
+ * rq_pages request page contract:
+ *
+ * Transport receive paths that move request data pages out of
+ * rq_pages -- TCP multi-fragment reassembly (svc_tcp_save_pages)
+ * and RDMA Read I/O (svc_rdma_clear_rqst_pages) -- NULL those
+ * entries to prevent svc_rqst_release_pages() from freeing pages
+ * still in transport use, and set rq_pages_nfree to the count.
+ * svc_alloc_arg() refills only that many rq_pages entries.
+ *
* xdr_buf holds responses; the structure fits NFS read responses
* (header, data pages, optional tail) and enables sharing of
* client-side routines.
@@ -201,6 +210,7 @@ struct svc_rqst {
struct folio *rq_scratch_folio;
struct xdr_buf rq_res;
unsigned long rq_maxpages; /* entries per page array */
+ unsigned long rq_pages_nfree; /* rq_pages entries NULLed by transport */
struct page * *rq_pages; /* Call buffer pages */
struct page * *rq_respages; /* Reply buffer pages */
struct page * *rq_next_page; /* next reply page to use */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 0ce16e9abdf6..6e57e35fa6d6 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -655,6 +655,7 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
return false;
}
+ rqstp->rq_pages_nfree = rqstp->rq_maxpages;
return true;
}
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index e027765f4307..795b5729525f 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -675,12 +675,17 @@ static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
static bool svc_alloc_arg(struct svc_rqst *rqstp)
{
struct xdr_buf *arg = &rqstp->rq_arg;
- unsigned long pages;
+ unsigned long pages, nfree;
pages = rqstp->rq_maxpages;
- if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
- return false;
+ nfree = rqstp->rq_pages_nfree;
+ if (nfree) {
+ if (!svc_fill_pages(rqstp, rqstp->rq_pages, nfree))
+ return false;
+ rqstp->rq_pages_nfree = 0;
+ }
+
if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
return false;
rqstp->rq_next_page = rqstp->rq_respages;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c86f28f720f7..2ce43f9995f1 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1009,6 +1009,7 @@ static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
svsk->sk_pages[i] = rqstp->rq_pages[i];
rqstp->rq_pages[i] = NULL;
}
+ rqstp->rq_pages_nfree = npages;
}
static void svc_tcp_clear_pages(struct svc_sock *svsk)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 4ec2f9ae06aa..cf4a1762b629 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -1107,6 +1107,7 @@ static void svc_rdma_clear_rqst_pages(struct svc_rqst *rqstp,
head->rc_pages[i] = rqstp->rq_pages[i];
rqstp->rq_pages[i] = NULL;
}
+ rqstp->rq_pages_nfree = head->rc_page_count;
}
/**
--
2.53.0
* [PATCH v2 6/6] SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
` (4 preceding siblings ...)
2026-02-26 14:47 ` [PATCH v2 5/6] SUNRPC: Track consumed rq_pages entries Chuck Lever
@ 2026-02-26 14:47 ` Chuck Lever
2026-03-10 18:18 ` Jeff Layton
2026-03-10 18:19 ` [PATCH v2 0/6] Optimize NFSD buffer page management Jeff Layton
6 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2026-02-26 14:47 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_alloc_arg() invokes alloc_pages_bulk() with the full rq_maxpages
count (~259 for 1MB messages) for the rq_respages array, causing a
full-array scan despite most slots holding valid pages.
svc_rqst_release_pages() NULLs only the range
[rq_respages, rq_next_page)
after each RPC, so only that range contains NULL entries. Limit the
rq_respages fill in svc_alloc_arg() to that range instead of
scanning the full array.
svc_init_buffer() initializes rq_next_page to span the entire
rq_respages array, so the first svc_alloc_arg() call fills all
slots.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 4 ++++
net/sunrpc/svc.c | 1 +
net/sunrpc/svc_xprt.c | 8 +++++++-
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index b5a842dd97a4..7315c529f88a 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -152,6 +152,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* still in transport use, and set rq_pages_nfree to the count.
* svc_alloc_arg() refills only that many rq_pages entries.
*
+ * For rq_respages, svc_rqst_release_pages() NULLs entries in
+ * [rq_respages, rq_next_page) after each RPC. svc_alloc_arg()
+ * refills only that range.
+ *
* xdr_buf holds responses; the structure fits NFS read responses
* (header, data pages, optional tail) and enables sharing of
* client-side routines.
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 6e57e35fa6d6..5e0b5ec2fd52 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -656,6 +656,7 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
}
rqstp->rq_pages_nfree = rqstp->rq_maxpages;
+ rqstp->rq_next_page = rqstp->rq_respages + rqstp->rq_maxpages;
return true;
}
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 795b5729525f..b16e710926c1 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -686,8 +686,14 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
rqstp->rq_pages_nfree = 0;
}
- if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
+ if (WARN_ON_ONCE(rqstp->rq_next_page < rqstp->rq_respages))
return false;
+ nfree = rqstp->rq_next_page - rqstp->rq_respages;
+ if (nfree) {
+ if (!svc_fill_pages(rqstp, rqstp->rq_respages, nfree))
+ return false;
+ }
+
rqstp->rq_next_page = rqstp->rq_respages;
rqstp->rq_page_end = &rqstp->rq_respages[pages];
/* svc_rqst_replace_page() dereferences *rq_next_page even
--
2.53.0
* Re: [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page
2026-02-26 14:47 ` [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
@ 2026-03-10 18:01 ` Jeff Layton
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:01 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> svc_rqst_replace_page() builds the Reply buffer by advancing
> rq_next_page through the response page range. The bounds
> check validates rq_next_page against the full rq_pages array,
> but the valid range for rq_next_page is
>
> [rq_respages, rq_page_end].
>
> Use those bounds instead.
>
> This is correct today because rq_respages and rq_page_end
> both point into rq_pages, and it prepares for a subsequent
> change that separates the Reply page array from rq_pages.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/svc.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index d8ccb8e4b5c2..f7ec02457328 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -934,11 +934,11 @@ svc_set_num_threads(struct svc_serv *serv, unsigned int min_threads,
> EXPORT_SYMBOL_GPL(svc_set_num_threads);
>
> /**
> - * svc_rqst_replace_page - Replace one page in rq_pages[]
> + * svc_rqst_replace_page - Replace one page in rq_respages[]
> * @rqstp: svc_rqst with pages to replace
> * @page: replacement page
> *
> - * When replacing a page in rq_pages, batch the release of the
> + * When replacing a page in rq_respages, batch the release of the
> * replaced pages to avoid hammering the page allocator.
> *
> * Return values:
> @@ -947,8 +947,8 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
> */
> bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
> {
> - struct page **begin = rqstp->rq_pages;
> - struct page **end = &rqstp->rq_pages[rqstp->rq_maxpages];
> + struct page **begin = rqstp->rq_respages;
> + struct page **end = rqstp->rq_page_end;
>
> if (unlikely(rqstp->rq_next_page < begin || rqstp->rq_next_page > end)) {
> trace_svc_replace_page_err(rqstp);
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 2/6] SUNRPC: Allocate a separate Reply page array
2026-02-26 14:47 ` [PATCH v2 2/6] SUNRPC: Allocate a separate Reply page array Chuck Lever
@ 2026-03-10 18:10 ` Jeff Layton
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:10 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> struct svc_rqst uses a single dynamically-allocated page array
> (rq_pages) for both the incoming RPC Call message and the outgoing
> RPC Reply message. rq_respages is a sliding pointer into rq_pages
> that each transport receive path must compute based on how many
> pages the Call consumed. This boundary tracking is a source of
> confusion and bugs, and prevents an RPC transaction from having
> both a large Call and a large Reply simultaneously.
>
> Allocate rq_respages as its own page array, eliminating the boundary
> arithmetic. This decouples Call and Reply buffer lifetimes,
> following the precedent set by rq_bvec (a separate dynamically-
> allocated array for I/O vectors).
>
> Each svc_rqst now pins twice as many pages as before. For a server
> running 16 threads with a 1MB maximum payload, the additional cost
> is roughly 16MB of pinned memory. The new dynamic svc thread count
> facility keeps this overhead minimal on an idle server. A subsequent
> patch in this series limits per-request repopulation to only the
> pages released during the previous RPC, avoiding a full-array scan
> on each call to svc_alloc_arg().
>
> Note: We've considered several alternatives to maintaining a full
> second array. Each alternative reintroduces either boundary logic
> complexity or I/O-path allocation pressure.
>
> rq_next_page is initialized in svc_alloc_arg() and svc_process()
> during Reply construction, and in svc_rdma_recvfrom() as a
> precaution on error paths. Transport receive paths no longer compute
> it from the Call size.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> include/linux/sunrpc/svc.h | 47 ++++++++++++-------------
> net/sunrpc/svc.c | 29 ++++++++++++---
> net/sunrpc/svc_xprt.c | 36 +++++++++++++------
> net/sunrpc/svcsock.c | 6 ----
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 +++-----
> 5 files changed, 77 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 4dc14c7a711b..3559de664f64 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -134,25 +134,24 @@ enum {
> extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>
> /*
> - * RPC Requests and replies are stored in one or more pages.
> - * We maintain an array of pages for each server thread.
> - * Requests are copied into these pages as they arrive. Remaining
> - * pages are available to write the reply into.
> + * RPC Call and Reply messages each have their own page array.
> + * rq_pages holds the incoming Call message; rq_respages holds
> + * the outgoing Reply message. Both arrays are sized to
> + * svc_serv_maxpages() entries and are allocated dynamically.
> *
> - * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each server thread
> - * needs to allocate more to replace those used in sending. To help keep track
> - * of these pages we have a receive list where all pages initialy live, and a
> - * send list where pages are moved to when there are to be part of a reply.
> + * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each
> + * server thread needs to allocate more to replace those used in
> + * sending.
> *
> - * We use xdr_buf for holding responses as it fits well with NFS
> - * read responses (that have a header, and some data pages, and possibly
> - * a tail) and means we can share some client side routines.
> + * xdr_buf holds responses; the structure fits NFS read responses
> + * (header, data pages, optional tail) and enables sharing of
> + * client-side routines.
> *
> - * The xdr_buf.head kvec always points to the first page in the rq_*pages
> - * list. The xdr_buf.pages pointer points to the second page on that
> - * list. xdr_buf.tail points to the end of the first page.
> - * This assumes that the non-page part of an rpc reply will fit
> - * in a page - NFSd ensures this. lockd also has no trouble.
> + * The xdr_buf.head kvec always points to the first page in the
> + * rq_*pages list. The xdr_buf.pages pointer points to the second
> + * page on that list. xdr_buf.tail points to the end of the first
> + * page. This assumes that the non-page part of an rpc reply will
> + * fit in a page - NFSd ensures this. lockd also has no trouble.
> */
>
> /**
> @@ -162,10 +161,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> * Returns a count of pages or vectors that can hold the maximum
> * size RPC message for @serv.
> *
> - * Each request/reply pair can have at most one "payload", plus two
> - * pages, one for the request, and one for the reply.
> - * nfsd_splice_actor() might need an extra page when a READ payload
> - * is not page-aligned.
> + * Each page array can hold at most one payload plus two
> + * overhead pages (one for the RPC header, one for tail data).
This is worded a little awkwardly. It sounds sort of like you can only
have a single payload page. Maybe:
"Each page array can hold a set of pages for the payload, plus two
overhead pages..."
> + * nfsd_splice_actor() might need an extra page when a READ
> + * payload is not page-aligned.
> */
> static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
> {
> @@ -201,11 +200,11 @@ struct svc_rqst {
> struct xdr_stream rq_res_stream;
> struct folio *rq_scratch_folio;
> struct xdr_buf rq_res;
> - unsigned long rq_maxpages; /* num of entries in rq_pages */
> - struct page * *rq_pages;
> - struct page * *rq_respages; /* points into rq_pages */
> + unsigned long rq_maxpages; /* entries per page array */
> + struct page * *rq_pages; /* Call buffer pages */
> + struct page * *rq_respages; /* Reply buffer pages */
> struct page * *rq_next_page; /* next reply page to use */
> - struct page * *rq_page_end; /* one past the last page */
> + struct page * *rq_page_end; /* one past the last reply page */
>
> struct folio_batch rq_fbatch;
> struct bio_vec *rq_bvec;
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index f7ec02457328..9abef638b1e0 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -638,13 +638,23 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
> {
> rqstp->rq_maxpages = svc_serv_maxpages(serv);
>
> - /* rq_pages' last entry is NULL for historical reasons. */
> + /* +1 for a NULL sentinel readable by nfsd_splice_actor() */
> rqstp->rq_pages = kcalloc_node(rqstp->rq_maxpages + 1,
> sizeof(struct page *),
> GFP_KERNEL, node);
> if (!rqstp->rq_pages)
> return false;
>
> + /* +1 for a NULL sentinel at rq_page_end (see svc_rqst_replace_page) */
> + rqstp->rq_respages = kcalloc_node(rqstp->rq_maxpages + 1,
> + sizeof(struct page *),
> + GFP_KERNEL, node);
> + if (!rqstp->rq_respages) {
> + kfree(rqstp->rq_pages);
> + rqstp->rq_pages = NULL;
> + return false;
> + }
> +
> return true;
> }
>
> @@ -656,10 +666,19 @@ svc_release_buffer(struct svc_rqst *rqstp)
> {
> unsigned long i;
>
> - for (i = 0; i < rqstp->rq_maxpages; i++)
> - if (rqstp->rq_pages[i])
> - put_page(rqstp->rq_pages[i]);
> - kfree(rqstp->rq_pages);
> + if (rqstp->rq_pages) {
> + for (i = 0; i < rqstp->rq_maxpages; i++)
> + if (rqstp->rq_pages[i])
> + put_page(rqstp->rq_pages[i]);
> + kfree(rqstp->rq_pages);
> + }
> +
> + if (rqstp->rq_respages) {
> + for (i = 0; i < rqstp->rq_maxpages; i++)
> + if (rqstp->rq_respages[i])
> + put_page(rqstp->rq_respages[i]);
> + kfree(rqstp->rq_respages);
> + }
> }
>
> static void
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 56a663b8939f..e027765f4307 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -650,14 +650,13 @@ static void svc_check_conn_limits(struct svc_serv *serv)
> }
> }
>
> -static bool svc_alloc_arg(struct svc_rqst *rqstp)
> +static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
> + unsigned long npages)
> {
> - struct xdr_buf *arg = &rqstp->rq_arg;
> - unsigned long pages, filled, ret;
> + unsigned long filled, ret;
>
> - pages = rqstp->rq_maxpages;
> - for (filled = 0; filled < pages; filled = ret) {
> - ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
> + for (filled = 0; filled < npages; filled = ret) {
> + ret = alloc_pages_bulk(GFP_KERNEL, npages, pages);
> if (ret > filled)
> /* Made progress, don't sleep yet */
> continue;
> @@ -667,11 +666,29 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
> set_current_state(TASK_RUNNING);
> return false;
> }
> - trace_svc_alloc_arg_err(pages, ret);
> + trace_svc_alloc_arg_err(npages, ret);
> memalloc_retry_wait(GFP_KERNEL);
> }
> - rqstp->rq_page_end = &rqstp->rq_pages[pages];
> - rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
> + return true;
> +}
> +
> +static bool svc_alloc_arg(struct svc_rqst *rqstp)
> +{
> + struct xdr_buf *arg = &rqstp->rq_arg;
> + unsigned long pages;
> +
> + pages = rqstp->rq_maxpages;
> +
> + if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
> + return false;
> + if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
> + return false;
> + rqstp->rq_next_page = rqstp->rq_respages;
> + rqstp->rq_page_end = &rqstp->rq_respages[pages];
> + /* svc_rqst_replace_page() dereferences *rq_next_page even
> + * at rq_page_end; NULL prevents releasing a garbage page.
> + */
> + rqstp->rq_page_end[0] = NULL;
>
> /* Make arg->head point to first page and arg->pages point to rest */
> arg->head[0].iov_base = page_address(rqstp->rq_pages[0]);
> @@ -1277,7 +1294,6 @@ static noinline int svc_deferred_recv(struct svc_rqst *rqstp)
> rqstp->rq_addrlen = dr->addrlen;
> /* Save off transport header len in case we get deferred again */
> rqstp->rq_daddr = dr->daddr;
> - rqstp->rq_respages = rqstp->rq_pages;
> rqstp->rq_xprt_ctxt = dr->xprt_ctxt;
>
> dr->xprt_ctxt = NULL;
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index f28c6076f7e8..c86f28f720f7 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -351,8 +351,6 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
>
> for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
> bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
> - rqstp->rq_respages = &rqstp->rq_pages[i];
> - rqstp->rq_next_page = rqstp->rq_respages + 1;
>
> iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
> if (seek) {
> @@ -677,13 +675,9 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> if (len <= rqstp->rq_arg.head[0].iov_len) {
> rqstp->rq_arg.head[0].iov_len = len;
> rqstp->rq_arg.page_len = 0;
> - rqstp->rq_respages = rqstp->rq_pages+1;
> } else {
> rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
> - rqstp->rq_respages = rqstp->rq_pages + 1 +
> - DIV_ROUND_UP(rqstp->rq_arg.page_len, PAGE_SIZE);
> }
> - rqstp->rq_next_page = rqstp->rq_respages+1;
>
> if (serv->sv_stats)
> serv->sv_stats->netudpcnt++;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index e7e4a39ca6c6..3081a37a5896 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -861,18 +861,12 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp,
> unsigned int i;
>
> /* Transfer the Read chunk pages into @rqstp.rq_pages, replacing
> - * the rq_pages that were already allocated for this rqstp.
> + * the receive buffer pages already allocated for this rqstp.
> */
> - release_pages(rqstp->rq_respages, ctxt->rc_page_count);
> + release_pages(rqstp->rq_pages, ctxt->rc_page_count);
> for (i = 0; i < ctxt->rc_page_count; i++)
> rqstp->rq_pages[i] = ctxt->rc_pages[i];
>
> - /* Update @rqstp's result send buffer to start after the
> - * last page in the RDMA Read payload.
> - */
> - rqstp->rq_respages = &rqstp->rq_pages[ctxt->rc_page_count];
> - rqstp->rq_next_page = rqstp->rq_respages + 1;
> -
> /* Prevent svc_rdma_recv_ctxt_put() from releasing the
> * pages in ctxt::rc_pages a second time.
> */
> @@ -931,10 +925,9 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> struct svc_rdma_recv_ctxt *ctxt;
> int ret;
>
> - /* Prevent svc_xprt_release() from releasing pages in rq_pages
> - * when returning 0 or an error.
> + /* Precaution: a zero page count on error return causes
> + * svc_rqst_release_pages() to release nothing.
> */
> - rqstp->rq_respages = rqstp->rq_pages;
> rqstp->rq_next_page = rqstp->rq_respages;
>
> rqstp->rq_xprt_ctxt = NULL;
The rest looks good to me though:
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 3/6] SUNRPC: Handle NULL entries in svc_rqst_release_pages
2026-02-26 14:47 ` [PATCH v2 3/6] SUNRPC: Handle NULL entries in svc_rqst_release_pages Chuck Lever
@ 2026-03-10 18:11 ` Jeff Layton
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:11 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> svc_rqst_release_pages() releases response pages between rq_respages
> and rq_next_page. It currently passes the entire range to
> release_pages(), which does not expect NULL entries.
>
> A subsequent patch preserves the rq_next_page pointer in
> svc_rdma_save_io_pages() so that it accurately records how many
> response pages were consumed. After that change, the range
>
> [rq_respages, rq_next_page)
>
> can contain NULL entries where pages have already been transferred
> to a send context.
>
> Iterate through the range entry by entry, skipping NULLs, to handle
> this case correctly.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/svc.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 9abef638b1e0..0ce16e9abdf6 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -990,18 +990,24 @@ EXPORT_SYMBOL_GPL(svc_rqst_replace_page);
> * svc_rqst_release_pages - Release Reply buffer pages
> * @rqstp: RPC transaction context
> *
> - * Release response pages that might still be in flight after
> - * svc_send, and any spliced filesystem-owned pages.
> + * Release response pages in the range [rq_respages, rq_next_page).
> + * NULL entries in this range are skipped, allowing transports to
> + * transfer pages to a send context before this function runs.
> */
> void svc_rqst_release_pages(struct svc_rqst *rqstp)
> {
> - int i, count = rqstp->rq_next_page - rqstp->rq_respages;
> + struct page **pp;
>
> - if (count) {
> - release_pages(rqstp->rq_respages, count);
> - for (i = 0; i < count; i++)
> - rqstp->rq_respages[i] = NULL;
> + for (pp = rqstp->rq_respages; pp < rqstp->rq_next_page; pp++) {
> + if (*pp) {
> + if (!folio_batch_add(&rqstp->rq_fbatch,
> + page_folio(*pp)))
> + __folio_batch_release(&rqstp->rq_fbatch);
> + *pp = NULL;
> + }
> }
> + if (rqstp->rq_fbatch.nr)
> + __folio_batch_release(&rqstp->rq_fbatch);
> }
>
> /**
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
2026-02-26 14:47 ` [PATCH v2 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages Chuck Lever
@ 2026-03-10 18:13 ` Jeff Layton
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:13 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> svc_rdma_save_io_pages() transfers response pages to the send
> context and sets those slots to NULL. It then resets rq_next_page to
> equal rq_respages, hiding the NULL region from
> svc_rqst_release_pages().
>
> Now that svc_rqst_release_pages() handles NULL entries, this reset
> is no longer necessary. Removing it preserves the invariant that the
> range [rq_respages, rq_next_page) accurately describes how many
> response pages were consumed, enabling a subsequent optimization in
> svc_alloc_arg() that refills only the consumed range.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 914cd263c2f1..17c8429da9d5 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -858,7 +858,8 @@ int svc_rdma_map_reply_msg(struct svcxprt_rdma *rdma,
>
> /* The svc_rqst and all resources it owns are released as soon as
> * svc_rdma_sendto returns. Transfer pages under I/O to the ctxt
> - * so they are released by the Send completion handler.
> + * so they are released only after Send completion, and not by
> + * svc_rqst_release_pages().
> */
> static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
> struct svc_rdma_send_ctxt *ctxt)
> @@ -870,9 +871,6 @@ static void svc_rdma_save_io_pages(struct svc_rqst *rqstp,
> ctxt->sc_pages[i] = rqstp->rq_respages[i];
> rqstp->rq_respages[i] = NULL;
> }
> -
> - /* Prevent svc_xprt_release from releasing pages in rq_pages */
> - rqstp->rq_next_page = rqstp->rq_respages;
> }
>
> /* Prepare the portion of the RPC Reply that will be transmitted
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 5/6] SUNRPC: Track consumed rq_pages entries
2026-02-26 14:47 ` [PATCH v2 5/6] SUNRPC: Track consumed rq_pages entries Chuck Lever
@ 2026-03-10 18:16 ` Jeff Layton
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:16 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The rq_pages array holds pages allocated for incoming RPC requests.
> Two transport receive paths NULL entries in rq_pages to prevent
> svc_rqst_release_pages() from freeing pages that the transport has
> taken ownership of:
>
> - svc_tcp_save_pages() moves partial request data pages to
> svsk->sk_pages during multi-fragment TCP reassembly.
>
> - svc_rdma_clear_rqst_pages() moves request data pages to
> head->rc_pages because they are targets of active RDMA Read WRs.
>
> A new rq_pages_nfree field in struct svc_rqst records how many
> entries were NULLed. svc_alloc_arg() uses it to refill only those
> entries rather than scanning the full rq_pages array. In steady
> state, the transport NULLs a handful of entries per RPC, so the
> allocator visits only those entries instead of the full ~259 slots
> (for 1MB messages).
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> include/linux/sunrpc/svc.h | 10 ++++++++++
> net/sunrpc/svc.c | 1 +
> net/sunrpc/svc_xprt.c | 11 ++++++++---
> net/sunrpc/svcsock.c | 1 +
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 1 +
> 5 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 3559de664f64..b5a842dd97a4 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -143,6 +143,15 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> * server thread needs to allocate more to replace those used in
> * sending.
> *
> + * rq_pages request page contract:
> + *
> + * Transport receive paths that move request data pages out of
> + * rq_pages -- TCP multi-fragment reassembly (svc_tcp_save_pages)
> + * and RDMA Read I/O (svc_rdma_clear_rqst_pages) -- NULL those
> + * entries to prevent svc_rqst_release_pages() from freeing pages
> + * still in transport use, and set rq_pages_nfree to the count.
> + * svc_alloc_arg() refills only that many rq_pages entries.
> + *
> * xdr_buf holds responses; the structure fits NFS read responses
> * (header, data pages, optional tail) and enables sharing of
> * client-side routines.
> @@ -201,6 +210,7 @@ struct svc_rqst {
> struct folio *rq_scratch_folio;
> struct xdr_buf rq_res;
> unsigned long rq_maxpages; /* entries per page array */
> + unsigned long rq_pages_nfree; /* rq_pages entries NULLed by transport */
> struct page * *rq_pages; /* Call buffer pages */
> struct page * *rq_respages; /* Reply buffer pages */
> struct page * *rq_next_page; /* next reply page to use */
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 0ce16e9abdf6..6e57e35fa6d6 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -655,6 +655,7 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
> return false;
> }
>
> + rqstp->rq_pages_nfree = rqstp->rq_maxpages;
> return true;
> }
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index e027765f4307..795b5729525f 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -675,12 +675,17 @@ static bool svc_fill_pages(struct svc_rqst *rqstp, struct page **pages,
> static bool svc_alloc_arg(struct svc_rqst *rqstp)
> {
> struct xdr_buf *arg = &rqstp->rq_arg;
> - unsigned long pages;
> + unsigned long pages, nfree;
>
> pages = rqstp->rq_maxpages;
>
> - if (!svc_fill_pages(rqstp, rqstp->rq_pages, pages))
> - return false;
> + nfree = rqstp->rq_pages_nfree;
> + if (nfree) {
> + if (!svc_fill_pages(rqstp, rqstp->rq_pages, nfree))
> + return false;
> + rqstp->rq_pages_nfree = 0;
> + }
> +
> if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
> return false;
> rqstp->rq_next_page = rqstp->rq_respages;
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index c86f28f720f7..2ce43f9995f1 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1009,6 +1009,7 @@ static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
> svsk->sk_pages[i] = rqstp->rq_pages[i];
> rqstp->rq_pages[i] = NULL;
> }
> + rqstp->rq_pages_nfree = npages;
> }
>
> static void svc_tcp_clear_pages(struct svc_sock *svsk)
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> index 4ec2f9ae06aa..cf4a1762b629 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> @@ -1107,6 +1107,7 @@ static void svc_rdma_clear_rqst_pages(struct svc_rqst *rqstp,
> head->rc_pages[i] = rqstp->rq_pages[i];
> rqstp->rq_pages[i] = NULL;
> }
> + rqstp->rq_pages_nfree = head->rc_page_count;
> }
>
> /**
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 6/6] SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
2026-02-26 14:47 ` [PATCH v2 6/6] SUNRPC: Optimize rq_respages allocation in svc_alloc_arg Chuck Lever
@ 2026-03-10 18:18 ` Jeff Layton
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:18 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> svc_alloc_arg() invokes alloc_pages_bulk() with the full rq_maxpages
> count (~259 for 1MB messages) for the rq_respages array, causing a
> full-array scan despite most slots holding valid pages.
>
> svc_rqst_release_pages() NULLs only the range
>
> [rq_respages, rq_next_page)
>
> after each RPC, so only that range contains NULL entries. Limit the
> rq_respages fill in svc_alloc_arg() to that range instead of
> scanning the full array.
>
> svc_init_buffer() initializes rq_next_page to span the entire
> rq_respages array, so the first svc_alloc_arg() call fills all
> slots.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> include/linux/sunrpc/svc.h | 4 ++++
> net/sunrpc/svc.c | 1 +
> net/sunrpc/svc_xprt.c | 8 +++++++-
> 3 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index b5a842dd97a4..7315c529f88a 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -152,6 +152,10 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> * still in transport use, and set rq_pages_nfree to the count.
> * svc_alloc_arg() refills only that many rq_pages entries.
> *
> + * For rq_respages, svc_rqst_release_pages() NULLs entries in
> + * [rq_respages, rq_next_page) after each RPC. svc_alloc_arg()
> + * refills only that range.
> + *
> * xdr_buf holds responses; the structure fits NFS read responses
> * (header, data pages, optional tail) and enables sharing of
> * client-side routines.
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 6e57e35fa6d6..5e0b5ec2fd52 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -656,6 +656,7 @@ svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
> }
>
> rqstp->rq_pages_nfree = rqstp->rq_maxpages;
> + rqstp->rq_next_page = rqstp->rq_respages + rqstp->rq_maxpages;
> return true;
> }
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 795b5729525f..b16e710926c1 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -686,8 +686,14 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
> rqstp->rq_pages_nfree = 0;
> }
>
> - if (!svc_fill_pages(rqstp, rqstp->rq_respages, pages))
> + if (WARN_ON_ONCE(rqstp->rq_next_page < rqstp->rq_respages))
> return false;
> + nfree = rqstp->rq_next_page - rqstp->rq_respages;
> + if (nfree) {
> + if (!svc_fill_pages(rqstp, rqstp->rq_respages, nfree))
> + return false;
> + }
> +
> rqstp->rq_next_page = rqstp->rq_respages;
> rqstp->rq_page_end = &rqstp->rq_respages[pages];
> /* svc_rqst_replace_page() dereferences *rq_next_page even
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 0/6] Optimize NFSD buffer page management
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
` (5 preceding siblings ...)
2026-02-26 14:47 ` [PATCH v2 6/6] SUNRPC: Optimize rq_respages allocation in svc_alloc_arg Chuck Lever
@ 2026-03-10 18:19 ` Jeff Layton
2026-03-10 18:24 ` Chuck Lever
6 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2026-03-10 18:19 UTC (permalink / raw)
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> This series solves two problems. First:
>
> NFSv3 operations have complementary Request and Response sizes.
> When a Request message is large, the corresponding Response
> message is small, and vice versa. The sum of the two message
> sizes is never more than the maximum transport payload size. So
> NFSD could get away with maintaining a single array of pages,
> split between the RPC send and Receive buffer.
>
> NFSv4 is not as cut and dried. An NFSv4 client may construct an
> NFSv4 COMPOUND that is arbitrarily complex, mixing operations
> that can have large Request size with operations that have a
> large Response size. The resulting server-side buffer size
> requirement can be larger than the maximum transport payload size.
>
> Therefore we must increase the allocated RPC Call landing zone and
> the RPC Reply construction zone to ensure that arbitrary NFSv4
> COMPOUNDs can be handled.
>
> Second:
>
> Due to the above, and because NFSD can now handle payload sizes
> considerably larger than 1MB, the number of array entries that
> alloc_bulk_pages() walks through to reset the rqst page arrays
> after each RPC completes has increased dramatically.
>
> But we observe that the mean size of NFS requests remains smaller
> than a few pages. If only a few pages are consumed while processing
> each RPC, then traversing all of the pages in the page arrays for
> refills is wasted effort. The CPU cost of walking these arrays is
> noticeable in "perf" captures.
>
> It would be more efficient to keep track of which entries need to
> be refilled, since that is likely to be a small number in the most
> common case, and use alloc_bulk_pages() to fill only those entries.
>
> ---
>
> Changes since RFC:
> - Clarify a number of comments based on review (NeilBrown)
> - Possible NFSv3 waste is still open for discussion
>
> Chuck Lever (6):
> SUNRPC: Tighten bounds checking in svc_rqst_replace_page
> SUNRPC: Allocate a separate Reply page array
> SUNRPC: Handle NULL entries in svc_rqst_release_pages
> svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
> SUNRPC: Track consumed rq_pages entries
> SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
>
> include/linux/sunrpc/svc.h | 61 +++++++++++++++----------
> net/sunrpc/svc.c | 59 +++++++++++++++++-------
> net/sunrpc/svc_xprt.c | 47 +++++++++++++++----
> net/sunrpc/svcsock.c | 7 +--
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 ++----
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 1 +
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 +--
> 7 files changed, 125 insertions(+), 71 deletions(-)
The patches all look good to me. I like getting rid of the old "sliding
window", which was error prone. The extra memory consumption sort of
sucks, but an extra 1M per thread should be no big deal for most nfsd
deployments.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 0/6] Optimize NFSD buffer page management
2026-03-10 18:19 ` [PATCH v2 0/6] Optimize NFSD buffer page management Jeff Layton
@ 2026-03-10 18:24 ` Chuck Lever
0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever @ 2026-03-10 18:24 UTC (permalink / raw)
To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
On 3/10/26 2:19 PM, Jeff Layton wrote:
> On Thu, 2026-02-26 at 09:47 -0500, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> This series solves two problems. First:
>>
>> NFSv3 operations have complementary Request and Response sizes.
>> When a Request message is large, the corresponding Response
>> message is small, and vice versa. The sum of the two message
>> sizes is never more than the maximum transport payload size. So
>> NFSD could get away with maintaining a single array of pages,
>> split between the RPC send and Receive buffer.
>>
>> NFSv4 is not as cut and dried. An NFSv4 client may construct an
>> NFSv4 COMPOUND that is arbitrarily complex, mixing operations
>> that can have large Request size with operations that have a
>> large Response size. The resulting server-side buffer size
>> requirement can be larger than the maximum transport payload size.
>>
>> Therefore we must increase the allocated RPC Call landing zone and
>> the RPC Reply construction zone to ensure that arbitrary NFSv4
>> COMPOUNDs can be handled.
>>
>> Second:
>>
>> Due to the above, and because NFSD can now handle payload sizes
>> considerably larger than 1MB, the number of array entries that
>> alloc_bulk_pages() walks through to reset the rqst page arrays
>> after each RPC completes has increased dramatically.
>>
>> But we observe that the mean size of NFS requests remains smaller
>> than a few pages. If only a few pages are consumed while processing
>> each RPC, then traversing all of the pages in the page arrays for
>> refills is wasted effort. The CPU cost of walking these arrays is
>> noticeable in "perf" captures.
>>
>> It would be more efficient to keep track of which entries need to
>> be refilled, since that is likely to be a small number in the most
>> common case, and use alloc_bulk_pages() to fill only those entries.
>>
>> ---
>>
>> Changes since RFC:
>> - Clarify a number of comments based on review (NeilBrown)
>> - Possible NFSv3 waste is still open for discussion
>>
>> Chuck Lever (6):
>> SUNRPC: Tighten bounds checking in svc_rqst_replace_page
>> SUNRPC: Allocate a separate Reply page array
>> SUNRPC: Handle NULL entries in svc_rqst_release_pages
>> svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
>> SUNRPC: Track consumed rq_pages entries
>> SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
>>
>> include/linux/sunrpc/svc.h | 61 +++++++++++++++----------
>> net/sunrpc/svc.c | 59 +++++++++++++++++-------
>> net/sunrpc/svc_xprt.c | 47 +++++++++++++++----
>> net/sunrpc/svcsock.c | 7 +--
>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 ++----
>> net/sunrpc/xprtrdma/svc_rdma_rw.c | 1 +
>> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 +--
>> 7 files changed, 125 insertions(+), 71 deletions(-)
>
> The patches all look good to me. I like getting rid of the old "sliding
> window", which was error-prone. The extra memory consumption sort of
> sucks, but an extra 1MB per thread should be no big deal for most nfsd
> deployments.
>
Thanks for having a look. My feeling is that the new dynamic nfsd
thread count mechanism should mitigate much of the extra memory
consumption. Meanwhile, I don't expect this to be the final stop on
the train of optimizations in this area.
--
Chuck Lever
Thread overview: 15+ messages
2026-02-26 14:47 [PATCH v2 0/6] Optimize NFSD buffer page management Chuck Lever
2026-02-26 14:47 ` [PATCH v2 1/6] SUNRPC: Tighten bounds checking in svc_rqst_replace_page Chuck Lever
2026-03-10 18:01 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 2/6] SUNRPC: Allocate a separate Reply page array Chuck Lever
2026-03-10 18:10 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 3/6] SUNRPC: Handle NULL entries in svc_rqst_release_pages Chuck Lever
2026-03-10 18:11 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 4/6] svcrdma: preserve rq_next_page in svc_rdma_save_io_pages Chuck Lever
2026-03-10 18:13 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 5/6] SUNRPC: Track consumed rq_pages entries Chuck Lever
2026-03-10 18:16 ` Jeff Layton
2026-02-26 14:47 ` [PATCH v2 6/6] SUNRPC: Optimize rq_respages allocation in svc_alloc_arg Chuck Lever
2026-03-10 18:18 ` Jeff Layton
2026-03-10 18:19 ` [PATCH v2 0/6] Optimize NFSD buffer page management Jeff Layton
2026-03-10 18:24 ` Chuck Lever