* [PATCH v4 00/14] Allocate payload arrays dynamically
@ 2025-04-28 19:36 cel
  2025-04-28 19:36 ` [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP cel
                   ` (15 more replies)
  0 siblings, 16 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

In order to make RPCSVC_MAXPAYLOAD larger (or variable in size), we
need to do something clever with the payload arrays embedded in
struct svc_rqst and elsewhere.

My preference is to keep these arrays allocated all the time because
allocating them on demand increases the risk of a memory allocation
failure during a large I/O. This is a quick-and-dirty approach that
might be replaced once NFSD is converted to use large folios.

The downside of this design choice is that it pins a few pages per
NFSD thread (which is already the case today). But note that because
RPCSVC_MAXPAGES is 259, each array is just over a page in size, so
power-of-2 allocator round-up wastes a significant amount of memory
beyond the end of each array. This gets worse as the MAXPAGES value
is doubled or quadrupled.
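
For illustration, here is a quick userspace sketch of that round-up
waste. It assumes 16-byte kvec/bio_vec entries and power-of-2 kmalloc
buckets; none of this code is part of the series:

#include <stdio.h>

static size_t bucket(size_t n)
{
	size_t b = 32;		/* smallest bucket of interest */

	while (b < n)
		b <<= 1;	/* power-of-2 round up */
	return b;
}

int main(void)
{
	const size_t entry = 16;	/* sizeof(struct kvec) on 64-bit */
	const size_t maxpages[] = { 259, 518, 1036 };

	for (int i = 0; i < 3; i++) {
		size_t used = maxpages[i] * entry;
		size_t alloc = bucket(used);

		printf("%4zu entries: %5zu used, %5zu allocated, %5zu wasted\n",
		       maxpages[i], used, alloc, alloc - used);
	}
	return 0;
}

At 259 entries roughly a page is wasted per array per thread, and the
waste doubles along with MAXPAGES.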

This series also addresses similar issues in the socket and RDMA
transports.

v4 is "code complete", unless there are new code change requests.
I'm not convinced that adding XDR pad alignment to svc_reserve()
is good, but I'm willing to consider it further.

It turns out there is already a tuneable for the maximum read and
write size in NFSD:

  /proc/fs/nfsd/max_block_size

Since there is an existing user space API for this, my initial
arguments against adding a tuneable are moot. max_block_size should
be adequate for this purpose, and enabling it to be set to larger
values should not impact the kernel-user space API in any way.

Changes since v3:
* Improved the rdma_rw context count estimate
* Dropped "NFSD: Remove NFSSVC_MAXBLKSIZE from .pc_xdrressize"
* Cleaned up the max size macros a bit
* Completed the implementation of adjustable max_block_size

Changes since v2:
* Address Jeff's review comments
* Address Neil's review comments
* Start removing a few uses of NFSSVC_MAXBLKSIZE

Chuck Lever (14):
  svcrdma: Reduce the number of rdma_rw contexts per-QP
  sunrpc: Add a helper to derive maxpages from sv_max_mesg
  sunrpc: Remove backchannel check in svc_init_buffer()
  sunrpc: Replace the rq_pages array with dynamically-allocated memory
  sunrpc: Replace the rq_vec array with dynamically-allocated memory
  sunrpc: Replace the rq_bvec array with dynamically-allocated memory
  sunrpc: Adjust size of socket's receive page array dynamically
  svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
  svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
  sunrpc: Remove the RPCSVC_MAXPAGES macro
  NFSD: Remove NFSD_BUFSIZE
  NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
  NFSD: Add a "default" block size
  SUNRPC: Bump the maximum payload size for the server

 fs/nfsd/nfs4proc.c                       |  2 +-
 fs/nfsd/nfs4state.c                      |  2 +-
 fs/nfsd/nfs4xdr.c                        |  2 +-
 fs/nfsd/nfsd.h                           | 24 ++++-------
 fs/nfsd/nfsproc.c                        |  4 +-
 fs/nfsd/nfssvc.c                         |  2 +-
 fs/nfsd/nfsxdr.c                         |  4 +-
 fs/nfsd/vfs.c                            |  2 +-
 include/linux/sunrpc/svc.h               | 45 +++++++++++++--------
 include/linux/sunrpc/svc_rdma.h          |  6 ++-
 include/linux/sunrpc/svcsock.h           |  4 +-
 net/sunrpc/svc.c                         | 51 +++++++++++++++---------
 net/sunrpc/svc_xprt.c                    | 10 +----
 net/sunrpc/svcsock.c                     | 15 ++++---
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  8 +++-
 net/sunrpc/xprtrdma/svc_rdma_rw.c        |  2 +-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 16 ++++++--
 net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 ++++---
 18 files changed, 122 insertions(+), 91 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
@ 2025-04-28 19:36 ` cel
  2025-05-06 13:08   ` Christoph Hellwig
  2025-04-28 19:36 ` [PATCH v4 02/14] sunrpc: Add a helper to derive maxpages from sv_max_mesg cel
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

There is an upper bound on the number of rdma_rw contexts that can
be created per QP.

This upper bound arises because rdma_create_qp() adds one or
more additional SQEs for each ctxt that the ULP requests via
qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
the order of qp_attr.cap.max_send_wr plus a factor times
qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
on whether MR operations are required before RDMA Reads.

This limit is not visible to RDMA consumers via dev->attrs. When the
limit is surpassed, QP creation fails with -ENOMEM. For example:

svcrdma's estimate of the number of rdma_rw contexts it needs is
three times RPCSVC_MAXPAGES. When MAXPAGES is about 260, the
internally-computed SQ length should be:

64 credits + 10 backlog + 3 * (3 * 260) = 2414

which is well below the advertised qp_max_wr of 32768.

If RPCSVC_MAXPAYLOAD is increased to 4MB, that's about 1040 pages:

64 credits + 10 backlog + 3 * (3 * 1040) = 9434

However, QP creation fails. Dynamic printk for mlx5 shows:

calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)

Although 9326 is far below qp_max_wr, QP creation still fails:
the driver rounds the computed ring size (9326 * 256 / 64 = 37304
64-byte units) up to the next power of two, 65536, which exceeds
its 32768 limit.

Because the total SQ length calculation is opaque to RDMA consumers,
there doesn't seem to be much that can be done about this except for
consumers to try to keep the requested rdma_rw ctxt count low.
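
To make the rounding above concrete, here is a small userspace sketch
of the ring sizing reported by the mlx5 printk (256 bytes per WQE,
64-byte basic blocks, power-of-two ring size -- figures taken from the
log line above, not from driver internals):

#include <stdio.h>

static unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	/* WQE count reported by the driver for the 4MB case */
	unsigned long wqes = 9326;
	unsigned long bbs = wqes * 256 / 64;	/* 64-byte basic blocks */

	printf("%lu WQEs -> %lu BBs -> ring size %lu (limit 32768)\n",
	       wqes, bbs, roundup_pow2(bbs));
	return 0;
}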

Fixes: 2da0f610e733 ("svcrdma: Increase the per-transport rw_ctx count")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5940a56023d1..3d7f1413df02 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -406,12 +406,12 @@ static void svc_rdma_xprt_done(struct rpcrdma_notification *rn)
  */
 static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 {
+	unsigned int ctxts, rq_depth, maxpayload;
 	struct svcxprt_rdma *listen_rdma;
 	struct svcxprt_rdma *newxprt = NULL;
 	struct rdma_conn_param conn_param;
 	struct rpcrdma_connect_private pmsg;
 	struct ib_qp_init_attr qp_attr;
-	unsigned int ctxts, rq_depth;
 	struct ib_device *dev;
 	int ret = 0;
 	RPC_IFDEBUG(struct sockaddr *sap);
@@ -462,12 +462,14 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		newxprt->sc_max_bc_requests = 2;
 	}
 
-	/* Arbitrarily estimate the number of rw_ctxs needed for
-	 * this transport. This is enough rw_ctxs to make forward
-	 * progress even if the client is using one rkey per page
-	 * in each Read chunk.
+	/* Arbitrary estimate of the needed number of rdma_rw contexts.
 	 */
-	ctxts = 3 * RPCSVC_MAXPAGES;
+	maxpayload = min(xprt->xpt_server->sv_max_payload,
+			 RPCSVC_MAXPAYLOAD_RDMA);
+	ctxts = newxprt->sc_max_requests * 3 *
+		rdma_rw_mr_factor(dev, newxprt->sc_port_num,
+				  maxpayload >> PAGE_SHIFT);
+
 	newxprt->sc_sq_depth = rq_depth + ctxts;
 	if (newxprt->sc_sq_depth > dev->attrs.max_qp_wr)
 		newxprt->sc_sq_depth = dev->attrs.max_qp_wr;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 02/14] sunrpc: Add a helper to derive maxpages from sv_max_mesg
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
  2025-04-28 19:36 ` [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP cel
@ 2025-04-28 19:36 ` cel
  2025-05-06 13:10   ` Christoph Hellwig
  2025-04-28 19:36 ` [PATCH v4 03/14] sunrpc: Remove backchannel check in svc_init_buffer() cel
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This page count is to be used to allocate various arrays of pages,
bio_vecs, and kvecs, replacing the fixed RPCSVC_MAXPAGES value.

The documenting comment is somewhat stale -- of course NFSv4
COMPOUND procedures may have multiple payloads.
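
As a worked example of the helper's arithmetic (userspace sketch
only; it assumes 4KB pages and an sv_max_mesg of one 1MB maximum
payload plus one page, which is an assumption, not taken from this
patch):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long sv_max_mesg = 1024 * 1024 + PAGE_SIZE;	/* assumed */
	unsigned long maxpages = DIV_ROUND_UP(sv_max_mesg, PAGE_SIZE) + 2 + 1;

	printf("svc_serv_maxpages() would return %lu\n", maxpages);	/* 260 */
	return 0;
}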

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 74658cca0f38..e83ac14267e8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -159,6 +159,23 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
 #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
 				+ 2 + 1)
 
+/**
+ * svc_serv_maxpages - maximum pages/kvecs needed for one RPC message
+ * @serv: RPC service context
+ *
+ * Returns a count of pages or vectors that can hold the maximum
+ * size RPC message for @serv.
+ *
+ * Each request/reply pair can have at most one "payload", plus two
+ * pages, one for the request, and one for the reply.
+ * nfsd_splice_actor() might need an extra page when a READ payload
+ * is not page-aligned.
+ */
+static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
+{
+	return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+}
+
 /*
  * The context of a single thread, including the request currently being
  * processed.
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 03/14] sunrpc: Remove backchannel check in svc_init_buffer()
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
  2025-04-28 19:36 ` [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP cel
  2025-04-28 19:36 ` [PATCH v4 02/14] sunrpc: Add a helper to derive maxpages from sv_max_mesg cel
@ 2025-04-28 19:36 ` cel
  2025-05-06 13:11   ` Christoph Hellwig
  2025-04-28 19:36 ` [PATCH v4 04/14] sunrpc: Replace the rq_pages array with dynamically-allocated memory cel
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The server's backchannel uses struct svc_rqst, but does not use the
pages in svc_rqst::rq_pages. Its rq_arg::pages and rq_res::pages
come from the RPC client's page allocator. Currently,
svc_init_buffer() skips allocating pages in rq_pages for that
reason.

However, svc_rqst::rq_pages is filled anyway when a backchannel
svc_rqst is passed to svc_recv() and then to svc_alloc_arg().

This isn't really a problem at the moment, except that these pages
are allocated but then never used, as far as I can tell.

The problem is that later in this series, in addition to populating
the entries of rq_pages[], svc_init_buffer() will also allocate the
memory underlying the rq_pages[] array itself. If that allocation is
skipped, then svc_alloc_arg() chases a NULL pointer for ingress
backchannel requests.

This approach avoids introducing extra conditional logic in
svc_alloc_arg(), which is a hot path.

Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svc.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index e7f9c295d13c..8ce3e6b3df6a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -640,10 +640,6 @@ svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
 {
 	unsigned long pages, ret;
 
-	/* bc_xprt uses fore channel allocated buffers */
-	if (svc_is_backchannel(rqstp))
-		return true;
-
 	pages = size / PAGE_SIZE + 1; /* extra page as we hold both request and reply.
 				       * We assume one is at most one page
 				       */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 04/14] sunrpc: Replace the rq_pages array with dynamically-allocated memory
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (2 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 03/14] sunrpc: Remove backchannel check in svc_init_buffer() cel
@ 2025-04-28 19:36 ` cel
  2025-04-30  4:53   ` NeilBrown
  2025-04-28 19:36 ` [PATCH v4 05/14] sunrpc: Replace the rq_vec " cel
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_pages[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_pages[] array is 2080 bytes. This patch replaces that with
a single 8-byte pointer field.
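
A quick check of that figure (userspace sketch, assuming 8-byte
pointers and RPCSVC_MAXPAGES == 259; the in-kernel array is declared
with one extra slot):

#include <stdio.h>

int main(void)
{
	unsigned long entries = 259 + 1;	/* RPCSVC_MAXPAGES + 1 */

	printf("%lu bytes\n", entries * sizeof(void *));	/* 2080 */
	return 0;
}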

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h        |  3 ++-
 net/sunrpc/svc.c                  | 34 ++++++++++++++++++-------------
 net/sunrpc/svc_xprt.c             | 10 +--------
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  2 +-
 4 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e83ac14267e8..ea3a33eec29b 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -205,7 +205,8 @@ struct svc_rqst {
 	struct xdr_stream	rq_res_stream;
 	struct page		*rq_scratch_page;
 	struct xdr_buf		rq_res;
-	struct page		*rq_pages[RPCSVC_MAXPAGES + 1];
+	unsigned long		rq_maxpages;	/* num of entries in rq_pages */
+	struct page *		*rq_pages;
 	struct page *		*rq_respages;	/* points into rq_pages */
 	struct page *		*rq_next_page; /* next reply page to use */
 	struct page *		*rq_page_end;  /* one past the last page */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 8ce3e6b3df6a..682e11c9be36 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -636,20 +636,25 @@ svc_destroy(struct svc_serv **servp)
 EXPORT_SYMBOL_GPL(svc_destroy);
 
 static bool
-svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
+svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
 {
-	unsigned long pages, ret;
+	unsigned long ret;
 
-	pages = size / PAGE_SIZE + 1; /* extra page as we hold both request and reply.
-				       * We assume one is at most one page
-				       */
-	WARN_ON_ONCE(pages > RPCSVC_MAXPAGES);
-	if (pages > RPCSVC_MAXPAGES)
-		pages = RPCSVC_MAXPAGES;
+	/* Add an extra page, as rq_pages holds both request and reply.
+	 * We assume one of those is at most one page.
+	 */
+	rqstp->rq_maxpages = svc_serv_maxpages(serv) + 1;
 
-	ret = alloc_pages_bulk_node(GFP_KERNEL, node, pages,
+	/* rq_pages' last entry is NULL for historical reasons. */
+	rqstp->rq_pages = kcalloc_node(rqstp->rq_maxpages + 1,
+				       sizeof(struct page *),
+				       GFP_KERNEL, node);
+	if (!rqstp->rq_pages)
+		return false;
+
+	ret = alloc_pages_bulk_node(GFP_KERNEL, node, rqstp->rq_maxpages,
 				    rqstp->rq_pages);
-	return ret == pages;
+	return ret == rqstp->rq_maxpages;
 }
 
 /*
@@ -658,11 +663,12 @@ svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
 static void
 svc_release_buffer(struct svc_rqst *rqstp)
 {
-	unsigned int i;
+	unsigned long i;
 
-	for (i = 0; i < ARRAY_SIZE(rqstp->rq_pages); i++)
+	for (i = 0; i < rqstp->rq_maxpages; i++)
 		if (rqstp->rq_pages[i])
 			put_page(rqstp->rq_pages[i]);
+	kfree(rqstp->rq_pages);
 }
 
 static void
@@ -704,7 +710,7 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	if (!rqstp->rq_resp)
 		goto out_enomem;
 
-	if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node))
+	if (!svc_init_buffer(rqstp, serv, node))
 		goto out_enomem;
 
 	rqstp->rq_err = -EAGAIN; /* No error yet */
@@ -896,7 +902,7 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
 bool svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
 {
 	struct page **begin = rqstp->rq_pages;
-	struct page **end = &rqstp->rq_pages[RPCSVC_MAXPAGES];
+	struct page **end = &rqstp->rq_pages[rqstp->rq_maxpages];
 
 	if (unlikely(rqstp->rq_next_page < begin || rqstp->rq_next_page > end)) {
 		trace_svc_replace_page_err(rqstp);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ae25405d8bd2..23547ed25269 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -651,18 +651,10 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 
 static bool svc_alloc_arg(struct svc_rqst *rqstp)
 {
-	struct svc_serv *serv = rqstp->rq_server;
 	struct xdr_buf *arg = &rqstp->rq_arg;
 	unsigned long pages, filled, ret;
 
-	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
-	if (pages > RPCSVC_MAXPAGES) {
-		pr_warn_once("svc: warning: pages=%lu > RPCSVC_MAXPAGES=%lu\n",
-			     pages, RPCSVC_MAXPAGES);
-		/* use as many pages as possible */
-		pages = RPCSVC_MAXPAGES;
-	}
-
+	pages = rqstp->rq_maxpages;
 	for (filled = 0; filled < pages; filled = ret) {
 		ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
 		if (ret > filled)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 40797114d50a..661b3fe2779f 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -765,7 +765,7 @@ static int svc_rdma_build_read_segment(struct svc_rqst *rqstp,
 		}
 		len -= seg_len;
 
-		if (len && ((head->rc_curpage + 1) > ARRAY_SIZE(rqstp->rq_pages)))
+		if (len && ((head->rc_curpage + 1) > rqstp->rq_maxpages))
 			goto out_overrun;
 	}
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 05/14] sunrpc: Replace the rq_vec array with dynamically-allocated memory
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (3 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 04/14] sunrpc: Replace the rq_pages array with dynamically-allocated memory cel
@ 2025-04-28 19:36 ` cel
  2025-05-06 13:29   ` Christoph Hellwig
  2025-04-28 19:36 ` [PATCH v4 06/14] sunrpc: Replace the rq_bvec " cel
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_vec[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

The rq_vec array is sized assuming request processing will need at
most one kvec per page in a maximum-sized RPC message.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_vec[] array is 4144 bytes. This patch replaces that array
with a single 8-byte pointer field.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c              | 2 +-
 include/linux/sunrpc/svc.h | 2 +-
 net/sunrpc/svc.c           | 8 +++++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9abdc4b75813..4eaac3aa7e15 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1094,7 +1094,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		++v;
 		base = 0;
 	}
-	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
+	WARN_ON_ONCE(v > rqstp->rq_maxpages);
 
 	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
 	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index ea3a33eec29b..f663d58abd7a 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -212,7 +212,7 @@ struct svc_rqst {
 	struct page *		*rq_page_end;  /* one past the last page */
 
 	struct folio_batch	rq_fbatch;
-	struct kvec		rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
+	struct kvec		*rq_vec;
 	struct bio_vec		rq_bvec[RPCSVC_MAXPAGES];
 
 	__be32			rq_xid;		/* transmission id */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 682e11c9be36..5808d4b97547 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -675,6 +675,7 @@ static void
 svc_rqst_free(struct svc_rqst *rqstp)
 {
 	folio_batch_release(&rqstp->rq_fbatch);
+	kfree(rqstp->rq_vec);
 	svc_release_buffer(rqstp);
 	if (rqstp->rq_scratch_page)
 		put_page(rqstp->rq_scratch_page);
@@ -713,6 +714,11 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	if (!svc_init_buffer(rqstp, serv, node))
 		goto out_enomem;
 
+	rqstp->rq_vec = kcalloc_node(rqstp->rq_maxpages, sizeof(struct kvec),
+				      GFP_KERNEL, node);
+	if (!rqstp->rq_vec)
+		goto out_enomem;
+
 	rqstp->rq_err = -EAGAIN; /* No error yet */
 
 	serv->sv_nrthreads += 1;
@@ -1750,7 +1756,7 @@ unsigned int svc_fill_write_vector(struct svc_rqst *rqstp,
 		++pages;
 	}
 
-	WARN_ON_ONCE(i > ARRAY_SIZE(rqstp->rq_vec));
+	WARN_ON_ONCE(i > rqstp->rq_maxpages);
 	return i;
 }
 EXPORT_SYMBOL_GPL(svc_fill_write_vector);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 06/14] sunrpc: Replace the rq_bvec array with dynamically-allocated memory
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (4 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 05/14] sunrpc: Replace the rq_vec " cel
@ 2025-04-28 19:36 ` cel
  2025-04-28 19:36 ` [PATCH v4 07/14] sunrpc: Adjust size of socket's receive page array dynamically cel
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_bvec[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

The rq_bvec[] array contains enough bio_vecs to handle each page in
a maximum size RPC message.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_bvec[] array is 4144 bytes. This patch replaces that array
with a single 8-byte pointer field.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h | 2 +-
 net/sunrpc/svc.c           | 7 +++++++
 net/sunrpc/svcsock.c       | 7 +++----
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index f663d58abd7a..4e6074bb0573 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -213,7 +213,7 @@ struct svc_rqst {
 
 	struct folio_batch	rq_fbatch;
 	struct kvec		*rq_vec;
-	struct bio_vec		rq_bvec[RPCSVC_MAXPAGES];
+	struct bio_vec		*rq_bvec;
 
 	__be32			rq_xid;		/* transmission id */
 	u32			rq_prog;	/* program number */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 5808d4b97547..0741e506c35c 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -675,6 +675,7 @@ static void
 svc_rqst_free(struct svc_rqst *rqstp)
 {
 	folio_batch_release(&rqstp->rq_fbatch);
+	kfree(rqstp->rq_bvec);
 	kfree(rqstp->rq_vec);
 	svc_release_buffer(rqstp);
 	if (rqstp->rq_scratch_page)
@@ -719,6 +720,12 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	if (!rqstp->rq_vec)
 		goto out_enomem;
 
+	rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages,
+				      sizeof(struct bio_vec),
+				      GFP_KERNEL, node);
+	if (!rqstp->rq_bvec)
+		goto out_enomem;
+
 	rqstp->rq_err = -EAGAIN; /* No error yet */
 
 	serv->sv_nrthreads += 1;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 72e5a01df3d3..c846341bb08c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -713,8 +713,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
 	if (svc_xprt_is_dead(xprt))
 		goto out_notconn;
 
-	count = xdr_buf_to_bvec(rqstp->rq_bvec,
-				ARRAY_SIZE(rqstp->rq_bvec), xdr);
+	count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, xdr);
 
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
 		      count, rqstp->rq_res.len);
@@ -1219,8 +1218,8 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
 	memcpy(buf, &marker, sizeof(marker));
 	bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
 
-	count = xdr_buf_to_bvec(rqstp->rq_bvec + 1,
-				ARRAY_SIZE(rqstp->rq_bvec) - 1, &rqstp->rq_res);
+	count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
+				&rqstp->rq_res);
 
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
 		      1 + count, sizeof(marker) + rqstp->rq_res.len);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 07/14] sunrpc: Adjust size of socket's receive page array dynamically
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (5 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 06/14] sunrpc: Replace the rq_bvec " cel
@ 2025-04-28 19:36 ` cel
  2025-04-28 19:36 ` [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages cel
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, make sk_pages a flexible array.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h | 4 +++-
 net/sunrpc/svcsock.c           | 8 ++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index bf45d9e8492a..963bbe251e52 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -40,7 +40,9 @@ struct svc_sock {
 
 	struct completion	sk_handshake_done;
 
-	struct page *		sk_pages[RPCSVC_MAXPAGES];	/* received data */
+	/* received data */
+	unsigned long		sk_maxpages;
+	struct page *		sk_pages[] __counted_by(sk_maxpages);
 };
 
 static inline u32 svc_sock_reclen(struct svc_sock *svsk)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c846341bb08c..5432e4a2f858 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1339,7 +1339,8 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		svsk->sk_marker = xdr_zero;
 		svsk->sk_tcplen = 0;
 		svsk->sk_datalen = 0;
-		memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
+		memset(&svsk->sk_pages[0], 0,
+		       svsk->sk_maxpages * sizeof(struct page *));
 
 		tcp_sock_set_nodelay(sk);
 
@@ -1378,10 +1379,13 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
 	struct svc_sock	*svsk;
 	struct sock	*inet;
 	int		pmap_register = !(flags & SVC_SOCK_ANONYMOUS);
+	unsigned long	pages;
 
-	svsk = kzalloc(sizeof(*svsk), GFP_KERNEL);
+	pages = svc_serv_maxpages(serv);
+	svsk = kzalloc(struct_size(svsk, sk_pages, pages), GFP_KERNEL);
 	if (!svsk)
 		return ERR_PTR(-ENOMEM);
+	svsk->sk_maxpages = pages;
 
 	inet = sock->sk;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (6 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 07/14] sunrpc: Adjust size of socket's receive page array dynamically cel
@ 2025-04-28 19:36 ` cel
  2025-05-06 13:31   ` Christoph Hellwig
  2025-04-28 19:36 ` [PATCH v4 09/14] svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages cel
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Allow allocation of more entries in the rc_pages[] array when the
maximum size of an RPC message is increased.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h         | 3 ++-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 8 ++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 619fc0bd837a..1016f2feddc4 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -202,7 +202,8 @@ struct svc_rdma_recv_ctxt {
 	struct svc_rdma_pcl	rc_reply_pcl;
 
 	unsigned int		rc_page_count;
-	struct page		*rc_pages[RPCSVC_MAXPAGES];
+	unsigned long		rc_maxpages;
+	struct page		*rc_pages[] __counted_by(rc_maxpages);
 };
 
 /*
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 292022f0976e..e7e4a39ca6c6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -120,12 +120,16 @@ svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
 {
 	int node = ibdev_to_node(rdma->sc_cm_id->device);
 	struct svc_rdma_recv_ctxt *ctxt;
+	unsigned long pages;
 	dma_addr_t addr;
 	void *buffer;
 
-	ctxt = kzalloc_node(sizeof(*ctxt), GFP_KERNEL, node);
+	pages = svc_serv_maxpages(rdma->sc_xprt.xpt_server);
+	ctxt = kzalloc_node(struct_size(ctxt, rc_pages, pages),
+			    GFP_KERNEL, node);
 	if (!ctxt)
 		goto fail0;
+	ctxt->rc_maxpages = pages;
 	buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL, node);
 	if (!buffer)
 		goto fail1;
@@ -497,7 +501,7 @@ static bool xdr_check_write_chunk(struct svc_rdma_recv_ctxt *rctxt)
 	 * a computation, perform a simple range check. This is an
 	 * arbitrary but sensible limit (ie, not architectural).
 	 */
-	if (unlikely(segcount > RPCSVC_MAXPAGES))
+	if (unlikely(segcount > rctxt->rc_maxpages))
 		return false;
 
 	p = xdr_inline_decode(&rctxt->rc_stream,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 09/14] svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (7 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages cel
@ 2025-04-28 19:36 ` cel
  2025-04-28 19:36 ` [PATCH v4 10/14] sunrpc: Remove the RPCSVC_MAXPAGES macro cel
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Allow allocation of more entries in the sc_pages[] array when the
maximum size of an RPC message is increased.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h       |  3 ++-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c | 16 +++++++++++++---
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 1016f2feddc4..22704c2e5b9b 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -245,7 +245,8 @@ struct svc_rdma_send_ctxt {
 	void			*sc_xprt_buf;
 	int			sc_page_count;
 	int			sc_cur_sge_no;
-	struct page		*sc_pages[RPCSVC_MAXPAGES];
+	unsigned long		sc_maxpages;
+	struct page		**sc_pages;
 	struct ib_sge		sc_sges[];
 };
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 96154a2367a1..914cd263c2f1 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -118,6 +118,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 {
 	int node = ibdev_to_node(rdma->sc_cm_id->device);
 	struct svc_rdma_send_ctxt *ctxt;
+	unsigned long pages;
 	dma_addr_t addr;
 	void *buffer;
 	int i;
@@ -126,13 +127,19 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 			    GFP_KERNEL, node);
 	if (!ctxt)
 		goto fail0;
+	pages = svc_serv_maxpages(rdma->sc_xprt.xpt_server);
+	ctxt->sc_pages = kcalloc_node(pages, sizeof(struct page *),
+				      GFP_KERNEL, node);
+	if (!ctxt->sc_pages)
+		goto fail1;
+	ctxt->sc_maxpages = pages;
 	buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL, node);
 	if (!buffer)
-		goto fail1;
+		goto fail2;
 	addr = ib_dma_map_single(rdma->sc_pd->device, buffer,
 				 rdma->sc_max_req_size, DMA_TO_DEVICE);
 	if (ib_dma_mapping_error(rdma->sc_pd->device, addr))
-		goto fail2;
+		goto fail3;
 
 	svc_rdma_send_cid_init(rdma, &ctxt->sc_cid);
 
@@ -151,8 +158,10 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 		ctxt->sc_sges[i].lkey = rdma->sc_pd->local_dma_lkey;
 	return ctxt;
 
-fail2:
+fail3:
 	kfree(buffer);
+fail2:
+	kfree(ctxt->sc_pages);
 fail1:
 	kfree(ctxt);
 fail0:
@@ -176,6 +185,7 @@ void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
 				    rdma->sc_max_req_size,
 				    DMA_TO_DEVICE);
 		kfree(ctxt->sc_xprt_buf);
+		kfree(ctxt->sc_pages);
 		kfree(ctxt);
 	}
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 10/14] sunrpc: Remove the RPCSVC_MAXPAGES macro
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (8 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 09/14] svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages cel
@ 2025-04-28 19:36 ` cel
  2025-04-28 19:36 ` [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE cel
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

It is no longer used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4e6074bb0573..e27bc051ec67 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -150,14 +150,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  * list.  xdr_buf.tail points to the end of the first page.
  * This assumes that the non-page part of an rpc reply will fit
  * in a page - NFSd ensures this.  lockd also has no trouble.
- *
- * Each request/reply pair can have at most one "payload", plus two pages,
- * one for the request, and one for the reply.
- * We using ->sendfile to return read data, we might need one extra page
- * if the request is not page-aligned.  So add another '1'.
  */
-#define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
-				+ 2 + 1)
 
 /**
  * svc_serv_maxpages - maximum pages/kvecs needed for one RPC message
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (9 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 10/14] sunrpc: Remove the RPCSVC_MAXPAGES macro cel
@ 2025-04-28 19:36 ` cel
  2025-04-28 21:03   ` Jeff Layton
  2025-05-06 13:32   ` Christoph Hellwig
  2025-04-28 19:37 ` [PATCH v4 12/14] NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro cel
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Clean up: The documenting comment for NFSD_BUFSIZE is quite stale.
NFSD_BUFSIZE is used only for NFSv4 Reply these days; never for
NFSv2 or v3, and never for RPC Calls. Even so, the byte count
estimate does not include the size of the NFSv4 COMPOUND Reply
header or the RPC auth flavor.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4proc.c  |  2 +-
 fs/nfsd/nfs4state.c |  2 +-
 fs/nfsd/nfs4xdr.c   |  2 +-
 fs/nfsd/nfsd.h      | 13 -------------
 4 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index b397246dae7b..59451b405b5c 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -3832,7 +3832,7 @@ static const struct svc_procedure nfsd_procedures4[2] = {
 		.pc_ressize = sizeof(struct nfsd4_compoundres),
 		.pc_release = nfsd4_release_compoundargs,
 		.pc_cachetype = RC_NOCACHE,
-		.pc_xdrressize = NFSD_BUFSIZE/4,
+		.pc_xdrressize = 3+NFSSVC_MAXBLKSIZE/4,
 		.pc_name = "COMPOUND",
 	},
 };
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 59a693f22452..8adcee9dc4d3 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4402,7 +4402,7 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 				    nfserr_rep_too_big;
 	if (xdr_restrict_buflen(xdr, buflen - rqstp->rq_auth_slack))
 		goto out_put_session;
-	svc_reserve(rqstp, buflen);
+	svc_reserve_auth(rqstp, buflen);
 
 	status = nfs_ok;
 	/* Success! accept new slot seqid */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 44e7fb34f433..ac1bc2431f27 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2564,7 +2564,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 	/* Sessions make the DRC unnecessary: */
 	if (argp->minorversion)
 		cachethis = false;
-	svc_reserve(argp->rqstp, max_reply + readbytes);
+	svc_reserve_auth(argp->rqstp, max_reply + readbytes);
 	argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;
 
 	argp->splice_ok = nfsd_read_splice_ok(argp->rqstp);
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index e2997f0ffbc5..91d144655351 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -50,19 +50,6 @@ bool nfsd_support_version(int vers);
 /* NFSv2 is limited by the protocol specification, see RFC 1094 */
 #define NFSSVC_MAXBLKSIZE_V2    (8*1024)
 
-
-/*
- * Largest number of bytes we need to allocate for an NFS
- * call or reply.  Used to control buffer sizes.  We use
- * the length of v3 WRITE, READDIR and READDIR replies
- * which are an RPC header, up to 26 XDR units of reply
- * data, and some page data.
- *
- * Note that accuracy here doesn't matter too much as the
- * size is rounded up to a page size when allocating space.
- */
-#define NFSD_BUFSIZE            ((RPC_MAX_HEADER_WITH_AUTH+26)*XDR_UNIT + NFSSVC_MAXBLKSIZE)
-
 struct readdir_cd {
 	__be32			err;	/* 0, nfserr, or nfserr_eof */
 };
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 12/14] NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (10 preceding siblings ...)
  2025-04-28 19:36 ` [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE cel
@ 2025-04-28 19:37 ` cel
  2025-05-06 13:33   ` Christoph Hellwig
  2025-04-28 19:37 ` [PATCH v4 13/14] NFSD: Add a "default" block size cel
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:37 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The 8192-byte maximum is a protocol-defined limit, and we already
have a symbolic constant, NFS_MAXDATA, whose name matches the limit
defined in the protocol. Replace the duplicate.

No change in behavior is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfsd.h    | 2 --
 fs/nfsd/nfsproc.c | 4 ++--
 fs/nfsd/nfsxdr.c  | 4 ++--
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 91d144655351..2c85b3efe977 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -47,8 +47,6 @@ bool nfsd_support_version(int vers);
  * Maximum blocksizes supported by daemon under various circumstances.
  */
 #define NFSSVC_MAXBLKSIZE       RPCSVC_MAXPAYLOAD
-/* NFSv2 is limited by the protocol specification, see RFC 1094 */
-#define NFSSVC_MAXBLKSIZE_V2    (8*1024)
 
 struct readdir_cd {
 	__be32			err;	/* 0, nfserr, or nfserr_eof */
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 6dda081eb24c..5d842671fe6f 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -211,7 +211,7 @@ nfsd_proc_read(struct svc_rqst *rqstp)
 		SVCFH_fmt(&argp->fh),
 		argp->count, argp->offset);
 
-	argp->count = min_t(u32, argp->count, NFSSVC_MAXBLKSIZE_V2);
+	argp->count = min_t(u32, argp->count, NFS_MAXDATA);
 	argp->count = min_t(u32, argp->count, rqstp->rq_res.buflen);
 
 	resp->pages = rqstp->rq_next_page;
@@ -739,7 +739,7 @@ static const struct svc_procedure nfsd_procedures2[18] = {
 		.pc_argzero = sizeof(struct nfsd_readargs),
 		.pc_ressize = sizeof(struct nfsd_readres),
 		.pc_cachetype = RC_NOCACHE,
-		.pc_xdrressize = ST+AT+1+NFSSVC_MAXBLKSIZE_V2/4,
+		.pc_xdrressize = ST+AT+1+NFS_MAXDATA/4,
 		.pc_name = "READ",
 	},
 	[NFSPROC_WRITECACHE] = {
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 5777f40c7353..fc262ceafca9 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -336,7 +336,7 @@ nfssvc_decode_writeargs(struct svc_rqst *rqstp, struct xdr_stream *xdr)
 	/* opaque data */
 	if (xdr_stream_decode_u32(xdr, &args->len) < 0)
 		return false;
-	if (args->len > NFSSVC_MAXBLKSIZE_V2)
+	if (args->len > NFS_MAXDATA)
 		return false;
 
 	return xdr_stream_subsegment(xdr, &args->payload, args->len);
@@ -540,7 +540,7 @@ nfssvc_encode_statfsres(struct svc_rqst *rqstp, struct xdr_stream *xdr)
 		p = xdr_reserve_space(xdr, XDR_UNIT * 5);
 		if (!p)
 			return false;
-		*p++ = cpu_to_be32(NFSSVC_MAXBLKSIZE_V2);
+		*p++ = cpu_to_be32(NFS_MAXDATA);
 		*p++ = cpu_to_be32(stat->f_bsize);
 		*p++ = cpu_to_be32(stat->f_blocks);
 		*p++ = cpu_to_be32(stat->f_bfree);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 13/14] NFSD: Add a "default" block size
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (11 preceding siblings ...)
  2025-04-28 19:37 ` [PATCH v4 12/14] NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro cel
@ 2025-04-28 19:37 ` cel
  2025-04-28 21:07   ` Jeff Layton
  2025-04-28 19:37 ` [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server cel
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: cel @ 2025-04-28 19:37 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

We'd like to increase the maximum r/wsize that NFSD can support,
but without introducing possible regressions. So let's add a
default setting of 1MB. A subsequent patch will raise the
maximum value but leave the default alone.

No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfsd.h   | 9 +++++++--
 fs/nfsd/nfssvc.c | 2 +-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 2c85b3efe977..614971a700d8 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -44,9 +44,14 @@ bool nfsd_support_version(int vers);
 #include "stats.h"
 
 /*
- * Maximum blocksizes supported by daemon under various circumstances.
+ * Default and maximum payload size (NFS READ or WRITE), in bytes.
+ * The default is historical, and the maximum is an implementation
+ * limit.
  */
-#define NFSSVC_MAXBLKSIZE       RPCSVC_MAXPAYLOAD
+enum {
+	NFSSVC_DEFBLKSIZE       = 1 * 1024 * 1024,
+	NFSSVC_MAXBLKSIZE       = RPCSVC_MAXPAYLOAD,
+};
 
 struct readdir_cd {
 	__be32			err;	/* 0, nfserr, or nfserr_eof */
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 9b3d6cff0e1e..692d2ef30db1 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -582,7 +582,7 @@ static int nfsd_get_default_max_blksize(void)
 	 */
 	target >>= 12;
 
-	ret = NFSSVC_MAXBLKSIZE;
+	ret = NFSSVC_DEFBLKSIZE;
 	while (ret > target && ret >= 8*1024*2)
 		ret /= 2;
 	return ret;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (12 preceding siblings ...)
  2025-04-28 19:37 ` [PATCH v4 13/14] NFSD: Add a "default" block size cel
@ 2025-04-28 19:37 ` cel
  2025-04-28 21:08   ` Jeff Layton
  2025-05-06 13:34   ` Christoph Hellwig
  2025-04-29 13:06 ` [PATCH v4 00/14] Allocate payload arrays dynamically Zhu Yanjun
  2025-04-30  5:11 ` NeilBrown
  15 siblings, 2 replies; 52+ messages in thread
From: cel @ 2025-04-28 19:37 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Increase the maximum server-side RPC payload to 4MB. The default
remains at 1MB.

To adjust the operational maximum, shut down the NFS server. Then
echo a new value into:

  /proc/fs/nfsd/max_block_size

And restart the NFS server.
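
Something along these lines could set the larger cap while nfsd is
stopped (a userspace sketch, not part of the patch; the 4MB value is
illustrative only):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/fs/nfsd/max_block_size", "w");

	if (!f) {
		perror("max_block_size");
		return 1;
	}
	fprintf(f, "%u\n", 4U * 1024 * 1024);	/* 4MB */
	return fclose(f) ? 1 : 0;
}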

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e27bc051ec67..b449eb02e00a 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -119,14 +119,14 @@ void svc_destroy(struct svc_serv **svcp);
  * Linux limit; someone who cares more about NFS/UDP performance
  * can test a larger number.
  *
- * For TCP transports we have more freedom.  A size of 1MB is
- * chosen to match the client limit.  Other OSes are known to
- * have larger limits, but those numbers are probably beyond
- * the point of diminishing returns.
+ * For non-UDP transports we have more freedom.  A size of 4MB is
+ * chosen to accommodate clients that support larger I/O sizes.
  */
-#define RPCSVC_MAXPAYLOAD	(1*1024*1024u)
-#define RPCSVC_MAXPAYLOAD_TCP	RPCSVC_MAXPAYLOAD
-#define RPCSVC_MAXPAYLOAD_UDP	(32*1024u)
+enum {
+	RPCSVC_MAXPAYLOAD	= 4 * 1024 * 1024,
+	RPCSVC_MAXPAYLOAD_TCP	= RPCSVC_MAXPAYLOAD,
+	RPCSVC_MAXPAYLOAD_UDP	= 32 * 1024,
+};
 
 extern u32 svc_max_payload(const struct svc_rqst *rqstp);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE
  2025-04-28 19:36 ` [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE cel
@ 2025-04-28 21:03   ` Jeff Layton
  2025-05-06 13:32   ` Christoph Hellwig
  1 sibling, 0 replies; 52+ messages in thread
From: Jeff Layton @ 2025-04-28 21:03 UTC (permalink / raw)
  To: cel, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

On Mon, 2025-04-28 at 15:36 -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Clean up: The documenting comment for NFSD_BUFSIZE is quite stale.
> NFSD_BUFSIZE is used only for NFSv4 Reply these days; never for
> NFSv2 or v3, and never for RPC Calls. Even so, the byte count
> estimate does not include the size of the NFSv4 COMPOUND Reply
> HEADER or the RPC auth flavor.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfs4proc.c  |  2 +-
>  fs/nfsd/nfs4state.c |  2 +-
>  fs/nfsd/nfs4xdr.c   |  2 +-
>  fs/nfsd/nfsd.h      | 13 -------------
>  4 files changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index b397246dae7b..59451b405b5c 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -3832,7 +3832,7 @@ static const struct svc_procedure nfsd_procedures4[2] = {
>  		.pc_ressize = sizeof(struct nfsd4_compoundres),
>  		.pc_release = nfsd4_release_compoundargs,
>  		.pc_cachetype = RC_NOCACHE,
> -		.pc_xdrressize = NFSD_BUFSIZE/4,
> +		.pc_xdrressize = 3+NFSSVC_MAXBLKSIZE/4,
>  		.pc_name = "COMPOUND",
>  	},
>  };
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 59a693f22452..8adcee9dc4d3 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4402,7 +4402,7 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  				    nfserr_rep_too_big;
>  	if (xdr_restrict_buflen(xdr, buflen - rqstp->rq_auth_slack))
>  		goto out_put_session;
> -	svc_reserve(rqstp, buflen);
> +	svc_reserve_auth(rqstp, buflen);
>  
>  	status = nfs_ok;
>  	/* Success! accept new slot seqid */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 44e7fb34f433..ac1bc2431f27 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -2564,7 +2564,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  	/* Sessions make the DRC unnecessary: */
>  	if (argp->minorversion)
>  		cachethis = false;
> -	svc_reserve(argp->rqstp, max_reply + readbytes);
> +	svc_reserve_auth(argp->rqstp, max_reply + readbytes);
>  	argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;
>  
>  	argp->splice_ok = nfsd_read_splice_ok(argp->rqstp);
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index e2997f0ffbc5..91d144655351 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -50,19 +50,6 @@ bool nfsd_support_version(int vers);
>  /* NFSv2 is limited by the protocol specification, see RFC 1094 */
>  #define NFSSVC_MAXBLKSIZE_V2    (8*1024)
>  
> -
> -/*
> - * Largest number of bytes we need to allocate for an NFS
> - * call or reply.  Used to control buffer sizes.  We use
> - * the length of v3 WRITE, READDIR and READDIR replies
> - * which are an RPC header, up to 26 XDR units of reply
> - * data, and some page data.
> - *
> - * Note that accuracy here doesn't matter too much as the
> - * size is rounded up to a page size when allocating space.
> - */
> -#define NFSD_BUFSIZE            ((RPC_MAX_HEADER_WITH_AUTH+26)*XDR_UNIT + NFSSVC_MAXBLKSIZE)
> -
>  struct readdir_cd {
>  	__be32			err;	/* 0, nfserr, or nfserr_eof */
>  };

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 13/14] NFSD: Add a "default" block size
  2025-04-28 19:37 ` [PATCH v4 13/14] NFSD: Add a "default" block size cel
@ 2025-04-28 21:07   ` Jeff Layton
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff Layton @ 2025-04-28 21:07 UTC (permalink / raw)
  To: cel, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

On Mon, 2025-04-28 at 15:37 -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> We'd like to increase the maximum r/wsize that NFSD can support,
> but without introducing possible regressions. So let's add a
> default setting of 1MB. A subsequent patch will raise the
> maximum value but leave the default alone.
> 
> No behavior change is expected.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfsd.h   | 9 +++++++--
>  fs/nfsd/nfssvc.c | 2 +-
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 2c85b3efe977..614971a700d8 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -44,9 +44,14 @@ bool nfsd_support_version(int vers);
>  #include "stats.h"
>  
>  /*
> - * Maximum blocksizes supported by daemon under various circumstances.
> + * Default and maximum payload size (NFS READ or WRITE), in bytes.
> + * The default is historical, and the maximum is an implementation
> + * limit.
>   */
> -#define NFSSVC_MAXBLKSIZE       RPCSVC_MAXPAYLOAD
> +enum {
> +	NFSSVC_DEFBLKSIZE       = 1 * 1024 * 1024,
> +	NFSSVC_MAXBLKSIZE       = RPCSVC_MAXPAYLOAD,
> +};
>  
>  struct readdir_cd {
>  	__be32			err;	/* 0, nfserr, or nfserr_eof */
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 9b3d6cff0e1e..692d2ef30db1 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -582,7 +582,7 @@ static int nfsd_get_default_max_blksize(void)
>  	 */
>  	target >>= 12;
>  
> -	ret = NFSSVC_MAXBLKSIZE;
> +	ret = NFSSVC_DEFBLKSIZE;
>  	while (ret > target && ret >= 8*1024*2)
>  		ret /= 2;
>  	return ret;

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-04-28 19:37 ` [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server cel
@ 2025-04-28 21:08   ` Jeff Layton
  2025-04-29 15:44     ` Chuck Lever
  2025-05-06 13:34   ` Christoph Hellwig
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff Layton @ 2025-04-28 21:08 UTC (permalink / raw)
  To: cel, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

On Mon, 2025-04-28 at 15:37 -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Increase the maximum server-side RPC payload to 4MB. The default
> remains at 1MB.
> 
> To adjust the operational maximum, shut down the NFS server. Then
> echo a new value into:
> 
>   /proc/fs/nfsd/max_block_size
> 
> And restart the NFS server.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/svc.h | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e27bc051ec67..b449eb02e00a 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -119,14 +119,14 @@ void svc_destroy(struct svc_serv **svcp);
>   * Linux limit; someone who cares more about NFS/UDP performance
>   * can test a larger number.
>   *
> - * For TCP transports we have more freedom.  A size of 1MB is
> - * chosen to match the client limit.  Other OSes are known to
> - * have larger limits, but those numbers are probably beyond
> - * the point of diminishing returns.
> + * For non-UDP transports we have more freedom.  A size of 4MB is
> + * chosen to accommodate clients that support larger I/O sizes.
>   */
> -#define RPCSVC_MAXPAYLOAD	(1*1024*1024u)
> -#define RPCSVC_MAXPAYLOAD_TCP	RPCSVC_MAXPAYLOAD
> -#define RPCSVC_MAXPAYLOAD_UDP	(32*1024u)
> +enum {
> +	RPCSVC_MAXPAYLOAD	= 4 * 1024 * 1024,
> +	RPCSVC_MAXPAYLOAD_TCP	= RPCSVC_MAXPAYLOAD,
> +	RPCSVC_MAXPAYLOAD_UDP	= 32 * 1024,
> +};

I guess the enum is so that the symbol names remain in debuginfo?

>  
>  extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 00/14] Allocate payload arrays dynamically
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (13 preceding siblings ...)
  2025-04-28 19:37 ` [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server cel
@ 2025-04-29 13:06 ` Zhu Yanjun
  2025-04-29 13:41   ` Chuck Lever
  2025-04-30  5:11 ` NeilBrown
  15 siblings, 1 reply; 52+ messages in thread
From: Zhu Yanjun @ 2025-04-29 13:06 UTC (permalink / raw)
  To: cel, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

On 28.04.25 21:36, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> In order to make RPCSVC_MAXPAYLOAD larger (or variable in size), we
> need to do something clever with the payload arrays embedded in
> struct svc_rqst and elsewhere.
> 
> My preference is to keep these arrays allocated all the time because
> allocating them on demand increases the risk of a memory allocation
> failure during a large I/O. This is a quick-and-dirty approach that
> might be replaced once NFSD is converted to use large folios.
> 
> The downside of this design choice is that it pins a few pages per
> NFSD thread (and that's the current situation already). But note
> that because RPCSVC_MAXPAGES is 259, each array is just over a page
> in size, making the allocation waste quite a bit of memory beyond
> the end of the array due to power-of-2 allocator round up. This gets
> worse as the MAXPAGES value is doubled or quadrupled.
> 
> This series also addresses similar issues in the socket and RDMA
> transports.
> 
> v4 is "code complete", unless there are new code change requests.
> I'm not convinced that adding XDR pad alignment to svc_reserve()
> is good, but I'm willing to consider it further.
> 
> It turns out there is already a tuneable for the maximum read and
> write size in NFSD:
> 
>    /proc/fs/nfsd/max_block_size

Hi,

Based on the head commit ca91b9500108 Merge tag 
'v6.15-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd, I applied 
this patch series.

When I built the kernel, the following error popped up.
"
In file included from ./arch/x86/include/asm/bug.h:103,
                  from ./include/linux/bug.h:5,
                  from ./arch/x86/include/asm/paravirt.h:19,
                  from ./arch/x86/include/asm/irqflags.h:102,
                  from ./include/linux/irqflags.h:18,
                  from ./include/linux/spinlock.h:59,
                  from ./include/linux/fs_struct.h:6,
                  from fs/nfsd/nfs4proc.c:35:
fs/nfsd/nfs4proc.c: In function ‘nfsd4_write’:
./include/linux/array_size.h:11:38: warning: division ‘sizeof (struct 
kvec *) / sizeof (struct kvec)’ does not compute the number of array 
elements [-Wsizeof-pointer-div]
    11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + 
__must_be_array(arr))
       |                                      ^
./include/asm-generic/bug.h:111:32: note: in definition of macro 
‘WARN_ON_ONCE’
   111 |         int __ret_warn_on = !!(condition);                      \
       |                                ^~~~~~~~~
fs/nfsd/nfs4proc.c:1231:30: note: in expansion of macro ‘ARRAY_SIZE’
  1231 |         WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
       |                              ^~~~~~~~~~
./include/linux/compiler.h:197:62: error: static assertion failed: "must 
be array"
   197 | #define __BUILD_BUG_ON_ZERO_MSG(e, msg) ((int)sizeof(struct 
{_Static_assert(!(e), msg);}))
       | 
^~~~~~~~~~~~~~
./include/asm-generic/bug.h:111:32: note: in definition of macro 
‘WARN_ON_ONCE’
   111 |         int __ret_warn_on = !!(condition);                      \
       |                                ^~~~~~~~~
./include/linux/compiler.h:202:33: note: in expansion of macro 
‘__BUILD_BUG_ON_ZERO_MSG’
   202 | #define __must_be_array(a) 
__BUILD_BUG_ON_ZERO_MSG(!__is_array(a), \
       |                                 ^~~~~~~~~~~~~~~~~~~~~~~
./include/linux/array_size.h:11:59: note: in expansion of macro 
‘__must_be_array’
    11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + 
__must_be_array(arr))
       | 
^~~~~~~~~~~~~~~
fs/nfsd/nfs4proc.c:1231:30: note: in expansion of macro ‘ARRAY_SIZE’
  1231 |         WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
       |                              ^~~~~~~~~~
make[4]: *** [scripts/Makefile.build:203: fs/nfsd/nfs4proc.o] Error 1
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [scripts/Makefile.build:461: fs/nfsd] Error 2
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.build:461: fs] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [/home/zyanjun/Development/github-linux/Makefile:2011: .] 
Error 2
make: *** [Makefile:248: __sub-make] Error 2
"

The building host is as below:

$ cat /etc/issue.net
Ubuntu 22.04.5 LTS

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ uname -a
Linux lb03055 6.8.0-58-generic #60~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC 
Fri Mar 28 16:09:21 UTC 2 x86_64 x86_64 x86_64 GNU/Linux


Zhu Yanjun

> 
> Since there is an existing user space API for this, my initial
> arguments against adding a tuneable are moot. max_block_size should
> be adequate for this purpose, and enabling it to be set to larger
> values should not impact the kernel-user space API in any way.
> 
> Changes since v3:
> * Improved the rdma_rw context count estimate
> * Dropped "NFSD: Remove NFSSVC_MAXBLKSIZE from .pc_xdrressize"
> * Cleaned up the max size macros a bit
> * Completed the implementation of adjustable max_block_size
> 
> Changes since v2:
> * Address Jeff's review comments
> * Address Neil's review comments
> * Start removing a few uses of NFSSVC_MAXBLKSIZE
> 
> Chuck Lever (14):
>    svcrdma: Reduce the number of rdma_rw contexts per-QP
>    sunrpc: Add a helper to derive maxpages from sv_max_mesg
>    sunrpc: Remove backchannel check in svc_init_buffer()
>    sunrpc: Replace the rq_pages array with dynamically-allocated memory
>    sunrpc: Replace the rq_vec array with dynamically-allocated memory
>    sunrpc: Replace the rq_bvec array with dynamically-allocated memory
>    sunrpc: Adjust size of socket's receive page array dynamically
>    svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
>    svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
>    sunrpc: Remove the RPCSVC_MAXPAGES macro
>    NFSD: Remove NFSD_BUFSIZE
>    NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
>    NFSD: Add a "default" block size
>    SUNRPC: Bump the maximum payload size for the server
> 
>   fs/nfsd/nfs4proc.c                       |  2 +-
>   fs/nfsd/nfs4state.c                      |  2 +-
>   fs/nfsd/nfs4xdr.c                        |  2 +-
>   fs/nfsd/nfsd.h                           | 24 ++++-------
>   fs/nfsd/nfsproc.c                        |  4 +-
>   fs/nfsd/nfssvc.c                         |  2 +-
>   fs/nfsd/nfsxdr.c                         |  4 +-
>   fs/nfsd/vfs.c                            |  2 +-
>   include/linux/sunrpc/svc.h               | 45 +++++++++++++--------
>   include/linux/sunrpc/svc_rdma.h          |  6 ++-
>   include/linux/sunrpc/svcsock.h           |  4 +-
>   net/sunrpc/svc.c                         | 51 +++++++++++++++---------
>   net/sunrpc/svc_xprt.c                    | 10 +----
>   net/sunrpc/svcsock.c                     | 15 ++++---
>   net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  8 +++-
>   net/sunrpc/xprtrdma/svc_rdma_rw.c        |  2 +-
>   net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 16 ++++++--
>   net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 ++++---
>   18 files changed, 122 insertions(+), 91 deletions(-)
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 00/14] Allocate payload arrays dynamically
  2025-04-29 13:06 ` [PATCH v4 00/14] Allocate payload arrays dynamically Zhu Yanjun
@ 2025-04-29 13:41   ` Chuck Lever
  2025-04-29 13:52     ` Zhu Yanjun
  0 siblings, 1 reply; 52+ messages in thread
From: Chuck Lever @ 2025-04-29 13:41 UTC (permalink / raw)
  To: Zhu Yanjun, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

On 4/29/25 9:06 AM, Zhu Yanjun wrote:
> On 28.04.25 21:36, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> In order to make RPCSVC_MAXPAYLOAD larger (or variable in size), we
>> need to do something clever with the payload arrays embedded in
>> struct svc_rqst and elsewhere.
>>
>> My preference is to keep these arrays allocated all the time because
>> allocating them on demand increases the risk of a memory allocation
>> failure during a large I/O. This is a quick-and-dirty approach that
>> might be replaced once NFSD is converted to use large folios.
>>
>> The downside of this design choice is that it pins a few pages per
>> NFSD thread (and that's the current situation already). But note
>> that because RPCSVC_MAXPAGES is 259, each array is just over a page
>> in size, making the allocation waste quite a bit of memory beyond
>> the end of the array due to power-of-2 allocator round up. This gets
>> worse as the MAXPAGES value is doubled or quadrupled.
>>
>> This series also addresses similar issues in the socket and RDMA
>> transports.
>>
>> v4 is "code complete", unless there are new code change requests.
>> I'm not convinced that adding XDR pad alignment to svc_reserve()
>> is good, but I'm willing to consider it further.
>>
>> It turns out there is already a tuneable for the maximum read and
>> write size in NFSD:
>>
>>    /proc/fs/nfsd/max_block_size
> 
> Hi,
> 
> Based on the head commit ca91b9500108 Merge tag 'v6.15-rc4-ksmbd-server-
> fixes' of git://git.samba.org/ksmbd, I applied this patch series.
> 
> When I built the kernel, the following error will pop out.
> "
> In file included from ./arch/x86/include/asm/bug.h:103,
>                  from ./include/linux/bug.h:5,
>                  from ./arch/x86/include/asm/paravirt.h:19,
>                  from ./arch/x86/include/asm/irqflags.h:102,
>                  from ./include/linux/irqflags.h:18,
>                  from ./include/linux/spinlock.h:59,
>                  from ./include/linux/fs_struct.h:6,
>                  from fs/nfsd/nfs4proc.c:35:
> fs/nfsd/nfs4proc.c: In function ‘nfsd4_write’:
> ./include/linux/array_size.h:11:38: warning: division ‘sizeof (struct
> kvec *) / sizeof (struct kvec)’ does not compute the number of array
> elements [-Wsizeof-pointer-div]
>    11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) +
> __must_be_array(arr))
>       |                                      ^
> ./include/asm-generic/bug.h:111:32: note: in definition of macro
> ‘WARN_ON_ONCE’
>   111 |         int __ret_warn_on = !!(condition);                      \
>       |                                ^~~~~~~~~
> fs/nfsd/nfs4proc.c:1231:30: note: in expansion of macro ‘ARRAY_SIZE’
>  1231 |         WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>       |                              ^~~~~~~~~~
> ./include/linux/compiler.h:197:62: error: static assertion failed: "must
> be array"
>   197 | #define __BUILD_BUG_ON_ZERO_MSG(e, msg) ((int)sizeof(struct
> {_Static_assert(!(e), msg);}))
>       | ^~~~~~~~~~~~~~
> ./include/asm-generic/bug.h:111:32: note: in definition of macro
> ‘WARN_ON_ONCE’
>   111 |         int __ret_warn_on = !!(condition);                      \
>       |                                ^~~~~~~~~
> ./include/linux/compiler.h:202:33: note: in expansion of macro
> ‘__BUILD_BUG_ON_ZERO_MSG’
>   202 | #define __must_be_array(a) __BUILD_BUG_ON_ZERO_MSG(!
> __is_array(a), \
>       |                                 ^~~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/array_size.h:11:59: note: in expansion of macro
> ‘__must_be_array’
>    11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) +
> __must_be_array(arr))
>       | ^~~~~~~~~~~~~~~
> fs/nfsd/nfs4proc.c:1231:30: note: in expansion of macro ‘ARRAY_SIZE’
>  1231 |         WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>       |                              ^~~~~~~~~~
> make[4]: *** [scripts/Makefile.build:203: fs/nfsd/nfs4proc.o] Error 1
> make[4]: *** Waiting for unfinished jobs....
> make[3]: *** [scripts/Makefile.build:461: fs/nfsd] Error 2
> make[3]: *** Waiting for unfinished jobs....
> make[2]: *** [scripts/Makefile.build:461: fs] Error 2
> make[2]: *** Waiting for unfinished jobs....
> make[1]: *** [/home/zyanjun/Development/github-linux/Makefile:2011: .]
> Error 2
> make: *** [Makefile:248: __sub-make] Error 2
> "

The patches actually are to be applied to nfsd-testing, which has a
patch that removes the errant WARN_ON_ONCE.

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=a356997303fbea4914bfbdad9645c61d88b28c4d

If you apply that one-liner first, then this series, it should compile
properly.
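
For context, the failure above comes from applying ARRAY_SIZE() to what
the series turns into a pointer. A minimal sketch of the mismatch
(editorial illustration, not code taken from that commit):

	/* Before the series: rq_vec is a true array, so ARRAY_SIZE()
	 * can compute an element count from sizeof().
	 */
	struct kvec rq_vec[RPCSVC_MAXPAGES];

	/* After the series: rq_vec is a pointer, so
	 * sizeof(rq_vec) / sizeof(rq_vec[0]) divides a pointer size by
	 * an element size, and __must_be_array() turns that into the
	 * static assertion failure quoted above.
	 */
	struct kvec *rq_vec;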


> The building host is as below:
> 
> $ cat /etc/issue.net
> Ubuntu 22.04.5 LTS
> 
> $ gcc --version
> gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
> Copyright (C) 2021 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> $ uname -a
> Linux lb03055 6.8.0-58-generic #60~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC
> Fri Mar 28 16:09:21 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> Zhu Yanjun
> 
>>
>> Since there is an existing user space API for this, my initial
>> arguments against adding a tuneable are moot. max_block_size should
>> be adequate for this purpose, and enabling it to be set to larger
>> values should not impact the kernel-user space API in any way.
>>
>> Changes since v3:
>> * Improved the rdma_rw context count estimate
>> * Dropped "NFSD: Remove NFSSVC_MAXBLKSIZE from .pc_xdrressize"
>> * Cleaned up the max size macros a bit
>> * Completed the implementation of adjustable max_block_size
>>
>> Changes since v2:
>> * Address Jeff's review comments
>> * Address Neil's review comments
>> * Start removing a few uses of NFSSVC_MAXBLKSIZE
>>
>> Chuck Lever (14):
>>    svcrdma: Reduce the number of rdma_rw contexts per-QP
>>    sunrpc: Add a helper to derive maxpages from sv_max_mesg
>>    sunrpc: Remove backchannel check in svc_init_buffer()
>>    sunrpc: Replace the rq_pages array with dynamically-allocated memory
>>    sunrpc: Replace the rq_vec array with dynamically-allocated memory
>>    sunrpc: Replace the rq_bvec array with dynamically-allocated memory
>>    sunrpc: Adjust size of socket's receive page array dynamically
>>    svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
>>    svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
>>    sunrpc: Remove the RPCSVC_MAXPAGES macro
>>    NFSD: Remove NFSD_BUFSIZE
>>    NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
>>    NFSD: Add a "default" block size
>>    SUNRPC: Bump the maximum payload size for the server
>>
>>   fs/nfsd/nfs4proc.c                       |  2 +-
>>   fs/nfsd/nfs4state.c                      |  2 +-
>>   fs/nfsd/nfs4xdr.c                        |  2 +-
>>   fs/nfsd/nfsd.h                           | 24 ++++-------
>>   fs/nfsd/nfsproc.c                        |  4 +-
>>   fs/nfsd/nfssvc.c                         |  2 +-
>>   fs/nfsd/nfsxdr.c                         |  4 +-
>>   fs/nfsd/vfs.c                            |  2 +-
>>   include/linux/sunrpc/svc.h               | 45 +++++++++++++--------
>>   include/linux/sunrpc/svc_rdma.h          |  6 ++-
>>   include/linux/sunrpc/svcsock.h           |  4 +-
>>   net/sunrpc/svc.c                         | 51 +++++++++++++++---------
>>   net/sunrpc/svc_xprt.c                    | 10 +----
>>   net/sunrpc/svcsock.c                     | 15 ++++---
>>   net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  8 +++-
>>   net/sunrpc/xprtrdma/svc_rdma_rw.c        |  2 +-
>>   net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 16 ++++++--
>>   net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 ++++---
>>   18 files changed, 122 insertions(+), 91 deletions(-)
>>
> 


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 00/14] Allocate payload arrays dynamically
  2025-04-29 13:41   ` Chuck Lever
@ 2025-04-29 13:52     ` Zhu Yanjun
  0 siblings, 0 replies; 52+ messages in thread
From: Zhu Yanjun @ 2025-04-29 13:52 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever


On 29.04.25 15:41, Chuck Lever wrote:
> On 4/29/25 9:06 AM, Zhu Yanjun wrote:
>> On 28.04.25 21:36, cel@kernel.org wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> In order to make RPCSVC_MAXPAYLOAD larger (or variable in size), we
>>> need to do something clever with the payload arrays embedded in
>>> struct svc_rqst and elsewhere.
>>>
>>> My preference is to keep these arrays allocated all the time because
>>> allocating them on demand increases the risk of a memory allocation
>>> failure during a large I/O. This is a quick-and-dirty approach that
>>> might be replaced once NFSD is converted to use large folios.
>>>
>>> The downside of this design choice is that it pins a few pages per
>>> NFSD thread (and that's the current situation already). But note
>>> that because RPCSVC_MAXPAGES is 259, each array is just over a page
>>> in size, making the allocation waste quite a bit of memory beyond
>>> the end of the array due to power-of-2 allocator round up. This gets
>>> worse as the MAXPAGES value is doubled or quadrupled.
>>>
>>> This series also addresses similar issues in the socket and RDMA
>>> transports.
>>>
>>> v4 is "code complete", unless there are new code change requests.
>>> I'm not convinced that adding XDR pad alignment to svc_reserve()
>>> is good, but I'm willing to consider it further.
>>>
>>> It turns out there is already a tuneable for the maximum read and
>>> write size in NFSD:
>>>
>>>     /proc/fs/nfsd/max_block_size
>> Hi,
>>
>> Based on the head commit ca91b9500108 Merge tag 'v6.15-rc4-ksmbd-server-
>> fixes' of git://git.samba.org/ksmbd, I applied this patch series.
>>
>> When I built the kernel, the following error will pop out.
>> "
>> In file included from ./arch/x86/include/asm/bug.h:103,
>>                   from ./include/linux/bug.h:5,
>>                   from ./arch/x86/include/asm/paravirt.h:19,
>>                   from ./arch/x86/include/asm/irqflags.h:102,
>>                   from ./include/linux/irqflags.h:18,
>>                   from ./include/linux/spinlock.h:59,
>>                   from ./include/linux/fs_struct.h:6,
>>                   from fs/nfsd/nfs4proc.c:35:
>> fs/nfsd/nfs4proc.c: In function ‘nfsd4_write’:
>> ./include/linux/array_size.h:11:38: warning: division ‘sizeof (struct
>> kvec *) / sizeof (struct kvec)’ does not compute the number of array
>> elements [-Wsizeof-pointer-div]
>>     11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) +
>> __must_be_array(arr))
>>        |                                      ^
>> ./include/asm-generic/bug.h:111:32: note: in definition of macro
>> ‘WARN_ON_ONCE’
>>    111 |         int __ret_warn_on = !!(condition);                      \
>>        |                                ^~~~~~~~~
>> fs/nfsd/nfs4proc.c:1231:30: note: in expansion of macro ‘ARRAY_SIZE’
>>   1231 |         WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>>        |                              ^~~~~~~~~~
>> ./include/linux/compiler.h:197:62: error: static assertion failed: "must
>> be array"
>>    197 | #define __BUILD_BUG_ON_ZERO_MSG(e, msg) ((int)sizeof(struct
>> {_Static_assert(!(e), msg);}))
>>        | ^~~~~~~~~~~~~~
>> ./include/asm-generic/bug.h:111:32: note: in definition of macro
>> ‘WARN_ON_ONCE’
>>    111 |         int __ret_warn_on = !!(condition);                      \
>>        |                                ^~~~~~~~~
>> ./include/linux/compiler.h:202:33: note: in expansion of macro
>> ‘__BUILD_BUG_ON_ZERO_MSG’
>>    202 | #define __must_be_array(a) __BUILD_BUG_ON_ZERO_MSG(!
>> __is_array(a), \
>>        |                                 ^~~~~~~~~~~~~~~~~~~~~~~
>> ./include/linux/array_size.h:11:59: note: in expansion of macro
>> ‘__must_be_array’
>>     11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) +
>> __must_be_array(arr))
>>        | ^~~~~~~~~~~~~~~
>> fs/nfsd/nfs4proc.c:1231:30: note: in expansion of macro ‘ARRAY_SIZE’
>>   1231 |         WARN_ON_ONCE(nvecs > ARRAY_SIZE(rqstp->rq_vec));
>>        |                              ^~~~~~~~~~
>> make[4]: *** [scripts/Makefile.build:203: fs/nfsd/nfs4proc.o] Error 1
>> make[4]: *** Waiting for unfinished jobs....
>> make[3]: *** [scripts/Makefile.build:461: fs/nfsd] Error 2
>> make[3]: *** Waiting for unfinished jobs....
>> make[2]: *** [scripts/Makefile.build:461: fs] Error 2
>> make[2]: *** Waiting for unfinished jobs....
>> make[1]: *** [/home/zyanjun/Development/github-linux/Makefile:2011: .]
>> Error 2
>> make: *** [Makefile:248: __sub-make] Error 2
>> "
> The patches actually are to be applied to nfsd-testing, which has a
> patch that removes the errant WARN_ON_ONCE.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=a356997303fbea4914bfbdad9645c61d88b28c4d
>
> If you apply that one-liner first, then this series, it should compile
> properly.

Thanks. Great.

Following your advice, the patch series now compiles properly.

Best Regards,

Zhu Yanjun

>
>
>> The building host is as below:
>>
>> $ cat /etc/issue.net
>> Ubuntu 22.04.5 LTS
>>
>> $ gcc --version
>> gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
>> Copyright (C) 2021 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions.  There is NO
>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>>
>> $ uname -a
>> Linux lb03055 6.8.0-58-generic #60~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC
>> Fri Mar 28 16:09:21 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> Zhu Yanjun
>>
>>> Since there is an existing user space API for this, my initial
>>> arguments against adding a tuneable are moot. max_block_size should
>>> be adequate for this purpose, and enabling it to be set to larger
>>> values should not impact the kernel-user space API in any way.
>>>
>>> Changes since v3:
>>> * Improved the rdma_rw context count estimate
>>> * Dropped "NFSD: Remove NFSSVC_MAXBLKSIZE from .pc_xdrressize"
>>> * Cleaned up the max size macros a bit
>>> * Completed the implementation of adjustable max_block_size
>>>
>>> Changes since v2:
>>> * Address Jeff's review comments
>>> * Address Neil's review comments
>>> * Start removing a few uses of NFSSVC_MAXBLKSIZE
>>>
>>> Chuck Lever (14):
>>>     svcrdma: Reduce the number of rdma_rw contexts per-QP
>>>     sunrpc: Add a helper to derive maxpages from sv_max_mesg
>>>     sunrpc: Remove backchannel check in svc_init_buffer()
>>>     sunrpc: Replace the rq_pages array with dynamically-allocated memory
>>>     sunrpc: Replace the rq_vec array with dynamically-allocated memory
>>>     sunrpc: Replace the rq_bvec array with dynamically-allocated memory
>>>     sunrpc: Adjust size of socket's receive page array dynamically
>>>     svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
>>>     svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages
>>>     sunrpc: Remove the RPCSVC_MAXPAGES macro
>>>     NFSD: Remove NFSD_BUFSIZE
>>>     NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
>>>     NFSD: Add a "default" block size
>>>     SUNRPC: Bump the maximum payload size for the server
>>>
>>>    fs/nfsd/nfs4proc.c                       |  2 +-
>>>    fs/nfsd/nfs4state.c                      |  2 +-
>>>    fs/nfsd/nfs4xdr.c                        |  2 +-
>>>    fs/nfsd/nfsd.h                           | 24 ++++-------
>>>    fs/nfsd/nfsproc.c                        |  4 +-
>>>    fs/nfsd/nfssvc.c                         |  2 +-
>>>    fs/nfsd/nfsxdr.c                         |  4 +-
>>>    fs/nfsd/vfs.c                            |  2 +-
>>>    include/linux/sunrpc/svc.h               | 45 +++++++++++++--------
>>>    include/linux/sunrpc/svc_rdma.h          |  6 ++-
>>>    include/linux/sunrpc/svcsock.h           |  4 +-
>>>    net/sunrpc/svc.c                         | 51 +++++++++++++++---------
>>>    net/sunrpc/svc_xprt.c                    | 10 +----
>>>    net/sunrpc/svcsock.c                     | 15 ++++---
>>>    net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  8 +++-
>>>    net/sunrpc/xprtrdma/svc_rdma_rw.c        |  2 +-
>>>    net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 16 ++++++--
>>>    net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 ++++---
>>>    18 files changed, 122 insertions(+), 91 deletions(-)
>>>
>
-- 
Best Regards,
Yanjun.Zhu


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-04-28 21:08   ` Jeff Layton
@ 2025-04-29 15:44     ` Chuck Lever
  0 siblings, 0 replies; 52+ messages in thread
From: Chuck Lever @ 2025-04-29 15:44 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker
  Cc: linux-nfs, linux-rdma, Chuck Lever

On 4/28/25 5:08 PM, Jeff Layton wrote:
> On Mon, 2025-04-28 at 15:37 -0400, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Increase the maximum server-side RPC payload to 4MB. The default
>> remains at 1MB.
>>
>> To adjust the operational maximum, shut down the NFS server. Then
>> echo a new value into:
>>
>>   /proc/fs/nfsd/max_block_size
>>
>> And restart the NFS server.
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  include/linux/sunrpc/svc.h | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index e27bc051ec67..b449eb02e00a 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -119,14 +119,14 @@ void svc_destroy(struct svc_serv **svcp);
>>   * Linux limit; someone who cares more about NFS/UDP performance
>>   * can test a larger number.
>>   *
>> - * For TCP transports we have more freedom.  A size of 1MB is
>> - * chosen to match the client limit.  Other OSes are known to
>> - * have larger limits, but those numbers are probably beyond
>> - * the point of diminishing returns.
>> + * For non-UDP transports we have more freedom.  A size of 4MB is
>> + * chosen to accommodate clients that support larger I/O sizes.
>>   */
>> -#define RPCSVC_MAXPAYLOAD	(1*1024*1024u)
>> -#define RPCSVC_MAXPAYLOAD_TCP	RPCSVC_MAXPAYLOAD
>> -#define RPCSVC_MAXPAYLOAD_UDP	(32*1024u)
>> +enum {
>> +	RPCSVC_MAXPAYLOAD	= 4 * 1024 * 1024,
>> +	RPCSVC_MAXPAYLOAD_TCP	= RPCSVC_MAXPAYLOAD,
>> +	RPCSVC_MAXPAYLOAD_UDP	= 32 * 1024,
>> +};
> 
> I guess the enum is so that the symbol names remain in debuginfo?

My impression is that, these days, enum is preferred over #define for
this kind of symbolic constant. This part of the change is merely
cleanup.
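
For what it's worth, one place the difference is visible (illustrative
only, assuming a kernel built with debug info):

	/* Enumerators survive into debuginfo, so a debugger can resolve
	 * them by name:
	 *
	 *   (gdb) print (int)RPCSVC_MAXPAYLOAD
	 *   $1 = 4194304
	 *
	 * A #define disappears during preprocessing and leaves no
	 * symbol to look up.
	 */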


>>  extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>  
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

Thanks for the review!

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 04/14] sunrpc: Replace the rq_pages array with dynamically-allocated memory
  2025-04-28 19:36 ` [PATCH v4 04/14] sunrpc: Replace the rq_pages array with dynamically-allocated memory cel
@ 2025-04-30  4:53   ` NeilBrown
  0 siblings, 0 replies; 52+ messages in thread
From: NeilBrown @ 2025-04-30  4:53 UTC (permalink / raw)
  To: cel
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On Tue, 29 Apr 2025, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> As a step towards making NFSD's maximum rsize and wsize variable at
> run-time, replace the fixed-size rq_vec[] array in struct svc_rqst
> with a chunk of dynamically-allocated memory.
> 
> On a system with 8-byte pointers and 4KB pages, pahole reports that
> the rq_pages[] array is 2080 bytes. This patch replaces that with
> a single 8-byte pointer field.
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/svc.h        |  3 ++-
>  net/sunrpc/svc.c                  | 34 ++++++++++++++++++-------------
>  net/sunrpc/svc_xprt.c             | 10 +--------
>  net/sunrpc/xprtrdma/svc_rdma_rw.c |  2 +-
>  4 files changed, 24 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e83ac14267e8..ea3a33eec29b 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -205,7 +205,8 @@ struct svc_rqst {
>  	struct xdr_stream	rq_res_stream;
>  	struct page		*rq_scratch_page;
>  	struct xdr_buf		rq_res;
> -	struct page		*rq_pages[RPCSVC_MAXPAGES + 1];
> +	unsigned long		rq_maxpages;	/* num of entries in rq_pages */
> +	struct page *		*rq_pages;
>  	struct page *		*rq_respages;	/* points into rq_pages */
>  	struct page *		*rq_next_page; /* next reply page to use */
>  	struct page *		*rq_page_end;  /* one past the last page */
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 8ce3e6b3df6a..682e11c9be36 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -636,20 +636,25 @@ svc_destroy(struct svc_serv **servp)
>  EXPORT_SYMBOL_GPL(svc_destroy);
>  
>  static bool
> -svc_init_buffer(struct svc_rqst *rqstp, unsigned int size, int node)
> +svc_init_buffer(struct svc_rqst *rqstp, const struct svc_serv *serv, int node)
>  {
> -	unsigned long pages, ret;
> +	unsigned long ret;
>  
> -	pages = size / PAGE_SIZE + 1; /* extra page as we hold both request and reply.
> -				       * We assume one is at most one page
> -				       */
> -	WARN_ON_ONCE(pages > RPCSVC_MAXPAGES);
> -	if (pages > RPCSVC_MAXPAGES)
> -		pages = RPCSVC_MAXPAGES;
> +	/* Add an extra page, as rq_pages holds both request and reply.
> +	 * We assume one of those is at most one page.
> +	 */
> +	rqstp->rq_maxpages = svc_serv_maxpages(serv) + 1;

The calculation in svc_serv_maxpages() already allows for both request
and reply.  I think the "+ 1" here is wrong.

NeilBrown

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 00/14] Allocate payload arrays dynamically
  2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
                   ` (14 preceding siblings ...)
  2025-04-29 13:06 ` [PATCH v4 00/14] Allocate payload arrays dynamically Zhu Yanjun
@ 2025-04-30  5:11 ` NeilBrown
  2025-04-30 12:45   ` Chuck Lever
  15 siblings, 1 reply; 52+ messages in thread
From: NeilBrown @ 2025-04-30  5:11 UTC (permalink / raw)
  To: cel
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On Tue, 29 Apr 2025, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> In order to make RPCSVC_MAXPAYLOAD larger (or variable in size), we
> need to do something clever with the payload arrays embedded in
> struct svc_rqst and elsewhere.
> 
> My preference is to keep these arrays allocated all the time because
> allocating them on demand increases the risk of a memory allocation
> failure during a large I/O. This is a quick-and-dirty approach that
> might be replaced once NFSD is converted to use large folios.
> 
> The downside of this design choice is that it pins a few pages per
> NFSD thread (and that's the current situation already). But note
> that because RPCSVC_MAXPAGES is 259, each array is just over a page
> in size, making the allocation waste quite a bit of memory beyond
> the end of the array due to power-of-2 allocator round up. This gets
> worse as the MAXPAGES value is doubled or quadrupled.

I wonder if we should special-case those 3 extra.
We don't need any for rq_vec and only need 2 (I think) for rq_bvec.

We could use the arrays only for payload and have dedicated
page/vec/bvec for request, reply, read-padding.
Or maybe we could disallow read requests that would need the extra page
due to alignment.  Would that cost much?

Apart from the one issue I noted separately, I think the series looks
good.

Reviewed-by: NeilBrown <neil@brown.name>

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 00/14] Allocate payload arrays dynamically
  2025-04-30  5:11 ` NeilBrown
@ 2025-04-30 12:45   ` Chuck Lever
  0 siblings, 0 replies; 52+ messages in thread
From: Chuck Lever @ 2025-04-30 12:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On 4/30/25 1:11 AM, NeilBrown wrote:
> On Tue, 29 Apr 2025, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> In order to make RPCSVC_MAXPAYLOAD larger (or variable in size), we
>> need to do something clever with the payload arrays embedded in
>> struct svc_rqst and elsewhere.
>>
>> My preference is to keep these arrays allocated all the time because
>> allocating them on demand increases the risk of a memory allocation
>> failure during a large I/O. This is a quick-and-dirty approach that
>> might be replaced once NFSD is converted to use large folios.
>>
>> The downside of this design choice is that it pins a few pages per
>> NFSD thread (and that's the current situation already). But note
>> that because RPCSVC_MAXPAGES is 259, each array is just over a page
>> in size, making the allocation waste quite a bit of memory beyond
>> the end of the array due to power-of-2 allocator round up. This gets
>> worse as the MAXPAGES value is doubled or quadrupled.
> 
> I wonder if we should special-case those 3 extra.
> We don't need any for rq_vec and only need 2 (I think) for rq_bvec.

For rq_vec, I believe we need one extra entry in case part of the
payload is in the xdr_buf's head iovec.

For rq_bvec, we need one for the transport header, and one each for
the xdr_buf's head and tail iovecs.

But, I agree, the rationales for the size of each of these arrays are
slightly different.
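
Restating the sizing above as a sketch (editorial; the identifiers here
are illustrative and not taken from the series):

	unsigned long payload_pages = maxpayload >> PAGE_SHIFT;
	unsigned long vec_entries  = payload_pages + 1; /* head iovec may hold payload */
	unsigned long bvec_entries = payload_pages + 3; /* xprt header + head + tail */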


> We could use the arrays only for payload and have dedicated
> page/vec/bvec for request, reply, read-padding.

I might not fully understand what you are suggesting, but it has
occurred to me that for NFSv4, both Call and Reply can be large for
one RPC transaction (though that is going to be quite infrequent). A
separate rq_pages[] array each for receive and send is possibly in
NFSD's future.


> Or maybe we could not allow read requests that result in the extra page
> due to alignment needs.  Would that be much cost?
> 
> Apart from the one issue I noted separately, I think the series looks
> good.
> 
> Reviewed-by: NeilBrown <neil@brown.name>

Thanks for having a look.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-04-28 19:36 ` [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP cel
@ 2025-05-06 13:08   ` Christoph Hellwig
  2025-05-06 13:17     ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:08 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever,
	Jason Gunthorpe, Leon Romanovsky

On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
> qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
> the order of the sum of qp_attr.cap.max_send_wr and a factor times
> qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
> on whether MR operations are required before RDMA Reads.
> 
> This limit is not visible to RDMA consumers via dev->attrs. When the
> limit is surpassed, QP creation fails with -ENOMEM. For example:

Can we find a way to expose this limit from the HCA drivers and the
RDMA core?

Having to guess it in ULP feels rather cumbersome.

In the meantime this patch looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> 
> svcrdma's estimate of the number of rdma_rw contexts it needs is
> three times the number of pages in RPCSVC_MAXPAGES. When MAXPAGES
> is about 260, the internally-computed SQ length should be:
> 
> 64 credits + 10 backlog + 3 * (3 * 260) = 2414
> 
> Which is well below the advertised qp_max_wr of 32768.
> 
> If RPCSVC_MAXPAGES is increased to 4MB, that's 1040 pages:
> 
> 64 credits + 10 backlog + 3 * (3 * 1040) = 9434
> 
> However, QP creation fails. Dynamic printk for mlx5 shows:
> 
> calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)
> 
> Although 9326 is still far below qp_max_wr, QP creation still
> fails.
> 
> Because the total SQ length calculation is opaque to RDMA consumers,
> there doesn't seem to be much that can be done about this except for
> consumers to try to keep the requested rdma_rw ctxt count low.
> 
> Fixes: 2da0f610e733 ("svcrdma: Increase the per-transport rw_ctx count")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_transport.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 5940a56023d1..3d7f1413df02 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -406,12 +406,12 @@ static void svc_rdma_xprt_done(struct rpcrdma_notification *rn)
>   */
>  static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>  {
> +	unsigned int ctxts, rq_depth, maxpayload;
>  	struct svcxprt_rdma *listen_rdma;
>  	struct svcxprt_rdma *newxprt = NULL;
>  	struct rdma_conn_param conn_param;
>  	struct rpcrdma_connect_private pmsg;
>  	struct ib_qp_init_attr qp_attr;
> -	unsigned int ctxts, rq_depth;
>  	struct ib_device *dev;
>  	int ret = 0;
>  	RPC_IFDEBUG(struct sockaddr *sap);
> @@ -462,12 +462,14 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>  		newxprt->sc_max_bc_requests = 2;
>  	}
>  
> -	/* Arbitrarily estimate the number of rw_ctxs needed for
> -	 * this transport. This is enough rw_ctxs to make forward
> -	 * progress even if the client is using one rkey per page
> -	 * in each Read chunk.
> +	/* Arbitrary estimate of the needed number of rdma_rw contexts.
>  	 */
> -	ctxts = 3 * RPCSVC_MAXPAGES;
> +	maxpayload = min(xprt->xpt_server->sv_max_payload,
> +			 RPCSVC_MAXPAYLOAD_RDMA);
> +	ctxts = newxprt->sc_max_requests * 3 *
> +		rdma_rw_mr_factor(dev, newxprt->sc_port_num,
> +				  maxpayload >> PAGE_SHIFT);
> +
>  	newxprt->sc_sq_depth = rq_depth + ctxts;
>  	if (newxprt->sc_sq_depth > dev->attrs.max_qp_wr)
>  		newxprt->sc_sq_depth = dev->attrs.max_qp_wr;
> -- 
> 2.49.0
> 
> 
---end quoted text---

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 02/14] sunrpc: Add a helper to derive maxpages from sv_max_mesg
  2025-04-28 19:36 ` [PATCH v4 02/14] sunrpc: Add a helper to derive maxpages from sv_max_mesg cel
@ 2025-05-06 13:10   ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:10 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On Mon, Apr 28, 2025 at 03:36:50PM -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> This page count is to be used to allocate various arrays of pages,
> bio_vecs, and kvecs, replacing the fixed RPCSVC_MAXPAGES value.
> 
> The documenting comment is somewhat stale -- of course NFSv4
> COMPOUND procedures may have multiple payloads.

This helper looks fine, but please don't talk about the kvecs.
The fact that nfs currently only allocates PAGE_SIZE chunks
is a home-grown limitation that shouldn't be there.  I have a series
trying to fix this, but it got stuck, so it might take a while.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 03/14] sunrpc: Remove backchannel check in svc_init_buffer()
  2025-04-28 19:36 ` [PATCH v4 03/14] sunrpc: Remove backchannel check in svc_init_buffer() cel
@ 2025-05-06 13:11   ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:11 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 13:08   ` Christoph Hellwig
@ 2025-05-06 13:17     ` Jason Gunthorpe
  2025-05-06 13:40       ` Christoph Hellwig
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-05-06 13:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: cel, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever,
	Leon Romanovsky

On Tue, May 06, 2025 at 06:08:59AM -0700, Christoph Hellwig wrote:
> On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
> > qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
> > the order of the sum of qp_attr.cap.max_send_wr and a factor times
> > qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
> > on whether MR operations are required before RDMA Reads.
> > 
> > This limit is not visible to RDMA consumers via dev->attrs. When the
> > limit is surpassed, QP creation fails with -ENOMEM. For example:
> 
> Can we find a way to expose this limit from the HCA drivers and the
> RDMA core?

Shouldn't it be max_qp_wr?

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/14] sunrpc: Replace the rq_vec array with dynamically-allocated memory
  2025-04-28 19:36 ` [PATCH v4 05/14] sunrpc: Replace the rq_vec " cel
@ 2025-05-06 13:29   ` Christoph Hellwig
  2025-05-06 16:31     ` Chuck Lever
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:29 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On Mon, Apr 28, 2025 at 03:36:53PM -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> As a step towards making NFSD's maximum rsize and wsize variable at
> run-time, replace the fixed-size rq_vec[] array in struct svc_rqst
> with a chunk of dynamically-allocated memory.
> 
> The rq_vec array is sized assuming request processing will need at
> most one kvec per page in a maximum-sized RPC message.
> 
> On a system with 8-byte pointers and 4KB pages, pahole reports that
> the rq_vec[] array is 4144 bytes. This patch replaces that array
> with a single 8-byte pointer field.

The right thing to do here is to kill this array.  There is no
reason to use kvecs in the VFS read/write APIs these days; we can
use bio_vecs just fine, for which we already have another allocation.

Instead this should use the same bio_vec array as the svcsock code.

And given that both are only used by the server and never the client
maybe they should both only be conditionally allocated?
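
Roughly the direction being suggested, as a hedged sketch (it assumes
rqstp->rq_bvec already describes the payload pages; nr_bvecs, file,
offset and count are placeholders, and none of this is code from the
series):

	struct iov_iter iter;
	ssize_t ret;

	/* Feed the existing bio_vec array straight into the VFS instead
	 * of building a parallel kvec array first.
	 */
	iov_iter_bvec(&iter, ITER_DEST, rqstp->rq_bvec, nr_bvecs, count);
	ret = vfs_iter_read(file, &iter, &offset, 0);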


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
  2025-04-28 19:36 ` [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages cel
@ 2025-05-06 13:31   ` Christoph Hellwig
  2025-05-06 15:20     ` Chuck Lever
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:31 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On Mon, Apr 28, 2025 at 03:36:56PM -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Allow allocation of more entries in the rc_pages[] array when the
> maximum size of an RPC message is increased.

Can we maybe also look into a way to not allocate the pages in the
rqst first just to free them when they get replaced with those from the
RDMA receive context?  Currently a lot of memory is wasted and a
pointless burden is put on the page allocator when using the RDMA
transport on the server side.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE
  2025-04-28 19:36 ` [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE cel
  2025-04-28 21:03   ` Jeff Layton
@ 2025-05-06 13:32   ` Christoph Hellwig
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:32 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

Any reason the subject prefix switches from lower to upper case at this
point?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 12/14] NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro
  2025-04-28 19:37 ` [PATCH v4 12/14] NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro cel
@ 2025-05-06 13:33   ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:33 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-04-28 19:37 ` [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server cel
  2025-04-28 21:08   ` Jeff Layton
@ 2025-05-06 13:34   ` Christoph Hellwig
  2025-05-06 13:52     ` Chuck Lever
  1 sibling, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:34 UTC (permalink / raw)
  To: cel
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On Mon, Apr 28, 2025 at 03:37:02PM -0400, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Increase the maximum server-side RPC payload to 4MB. The default
> remains at 1MB.
> 
> To adjust the operational maximum, shut down the NFS server. Then
> echo a new value into:
> 
>   /proc/fs/nfsd/max_block_size
> 
> And restart the NFS server.

Are you going to wire this up to a config file in nfs-utils that
gets set before the daemon starts?  Because otherwise this is a
pretty horrible user interface.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 13:17     ` Jason Gunthorpe
@ 2025-05-06 13:40       ` Christoph Hellwig
  2025-05-06 13:55         ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-06 13:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, cel, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever, Leon Romanovsky

On Tue, May 06, 2025 at 10:17:22AM -0300, Jason Gunthorpe wrote:
> On Tue, May 06, 2025 at 06:08:59AM -0700, Christoph Hellwig wrote:
> > On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
> > > qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
> > > the order of the sum of qp_attr.cap.max_send_wr and a factor times
> > > qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
> > > on whether MR operations are required before RDMA Reads.
> > > 
> > > This limit is not visible to RDMA consumers via dev->attrs. When the
> > > limit is surpassed, QP creation fails with -ENOMEM. For example:
> > 
> > Can we find a way to expose this limit from the HCA drivers and the
> > RDMA core?
> 
> Shouldn't it be max_qp_wr?

Does that allow for arbitrary combination of different WRs?  If so
we'd just need a RW API helper to calculate how many WRs it needs
for each operation for the given device and flags and compare to that,
yes.

(unless my memory is rusty, it's been a while since I touched RDMA code)
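
Something along these lines, as a purely hypothetical sketch
(rdma_rw_mr_factor() exists today; the helper itself does not):

	/* How many SQ WRs does moving 'maxpages' worth of data consume
	 * on this device?  Devices that need MRs for RDMA Read pay for
	 * REG + READ + LOCAL_INV per MR, hence the factor of three; the
	 * non-MR case would instead depend on the device's SGE limits.
	 */
	static unsigned int rdma_rw_sq_wrs(struct ib_device *dev, u32 port_num,
					   unsigned int maxpages)
	{
		return 3 * rdma_rw_mr_factor(dev, port_num, maxpages);
	}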

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-05-06 13:34   ` Christoph Hellwig
@ 2025-05-06 13:52     ` Chuck Lever
  2025-05-06 13:54       ` Jeff Layton
  2025-05-07  7:42       ` Christoph Hellwig
  0 siblings, 2 replies; 52+ messages in thread
From: Chuck Lever @ 2025-05-06 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On 5/6/25 9:34 AM, Christoph Hellwig wrote:
> On Mon, Apr 28, 2025 at 03:37:02PM -0400, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Increase the maximum server-side RPC payload to 4MB. The default
>> remains at 1MB.
>>
>> To adjust the operational maximum, shut down the NFS server. Then
>> echo a new value into:
>>
>>   /proc/fs/nfsd/max_block_size
>>
>> And restart the NFS server.
> 
> Are you going to wire this up to a config file in nfs-utils that
> gets set before the daemon starts?

That's up to SteveD -- it might be added to /etc/nfs.conf.


> Because otherwise this is a pretty horrible user interface.

This is an API that has existed forever.

I don't even like that this maximum can be tuned. After a period of
experimentation, I was going to set the default to a higher value and
be done with it, because I can't think of a reason why it needs to be
shifted up or down after that.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-05-06 13:52     ` Chuck Lever
@ 2025-05-06 13:54       ` Jeff Layton
  2025-05-06 13:59         ` Chuck Lever
  2025-05-07  7:42       ` Christoph Hellwig
  1 sibling, 1 reply; 52+ messages in thread
From: Jeff Layton @ 2025-05-06 13:54 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Anna Schumaker,
	linux-nfs, linux-rdma, Chuck Lever

On Tue, 2025-05-06 at 09:52 -0400, Chuck Lever wrote:
> On 5/6/25 9:34 AM, Christoph Hellwig wrote:
> > On Mon, Apr 28, 2025 at 03:37:02PM -0400, cel@kernel.org wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Increase the maximum server-side RPC payload to 4MB. The default
> > > remains at 1MB.
> > > 
> > > To adjust the operational maximum, shut down the NFS server. Then
> > > echo a new value into:
> > > 
> > >   /proc/fs/nfsd/max_block_size
> > > 
> > > And restart the NFS server.
> > 
> > Are you going to wire this up to a config file in nfs-utils that
> > gets set before the daemon starts?
> 
> That's up to SteveD -- it might be added to /etc/nfs.conf.
> 
> 

Can we also add this to the netlink interface for nfsd and nfsdctl?

> > Because otherwise this is a pretty horrible user interface.
> 
> This is an API that has existed forever.
> 
> I don't even like that this maximum can be tuned. After a period of
> experimentation, I was going to set the default to a higher value and
> be done with it, because I can't think of a reason why it needs to be
> shifted up or down after that.
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 13:40       ` Christoph Hellwig
@ 2025-05-06 13:55         ` Jason Gunthorpe
  2025-05-06 14:13           ` Chuck Lever
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-05-06 13:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: cel, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever,
	Leon Romanovsky

On Tue, May 06, 2025 at 06:40:25AM -0700, Christoph Hellwig wrote:
> On Tue, May 06, 2025 at 10:17:22AM -0300, Jason Gunthorpe wrote:
> > On Tue, May 06, 2025 at 06:08:59AM -0700, Christoph Hellwig wrote:
> > > On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
> > > > qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
> > > > the order of the sum of qp_attr.cap.max_send_wr and a factor times
> > > > qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
> > > > on whether MR operations are required before RDMA Reads.
> > > > 
> > > > This limit is not visible to RDMA consumers via dev->attrs. When the
> > > > limit is surpassed, QP creation fails with -ENOMEM. For example:
> > > 
> > > Can we find a way to expose this limit from the HCA drivers and the
> > > RDMA core?
> > 
> > Shouldn't it be max_qp_wr?
> 
> Does that allow for arbitrary combination of different WRs?  

I think it is supposed to be the maximum QP WR depth you can create..

A QP shouldn't behave differently depending on the WR operation, each
one takes one WR entry.

Chuck do you know differently?

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-05-06 13:54       ` Jeff Layton
@ 2025-05-06 13:59         ` Chuck Lever
  0 siblings, 0 replies; 52+ messages in thread
From: Chuck Lever @ 2025-05-06 13:59 UTC (permalink / raw)
  To: Jeff Layton, Christoph Hellwig
  Cc: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Anna Schumaker,
	linux-nfs, linux-rdma, Chuck Lever

On 5/6/25 9:54 AM, Jeff Layton wrote:
> On Tue, 2025-05-06 at 09:52 -0400, Chuck Lever wrote:
>> On 5/6/25 9:34 AM, Christoph Hellwig wrote:
>>> On Mon, Apr 28, 2025 at 03:37:02PM -0400, cel@kernel.org wrote:
>>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>>
>>>> Increase the maximum server-side RPC payload to 4MB. The default
>>>> remains at 1MB.
>>>>
>>>> To adjust the operational maximum, shut down the NFS server. Then
>>>> echo a new value into:
>>>>
>>>>   /proc/fs/nfsd/max_block_size
>>>>
>>>> And restart the NFS server.
>>>
>>> Are you going to wire this up to a config file in nfs-utils that
>>> gets set before the daemon starts?
>>
>> That's up to SteveD -- it might be added to /etc/nfs.conf.
>>
>>
> 
> Can we also add this to the netlink interface for nfsd and nfsdctl?

Sure, that's possible, however:

The purpose of this series is only to enable experimentation (aside from
the other nice clean-ups).

Once that is complete, what are the use cases for admins to increase or
decrease this value? (Not a rhetorical question: I'd like to invite some
discussion about that).

As always, these interfaces need documentation and long-term support. I
would like to get some technical rationale on the table before we
commit to the support costs.


>>> Because otherwise this is a pretty horrible user interface.
>>
>> This is an API that has existed forever.
>>
>> I don't even like that this maximum can be tuned. After a period of
>> experimentation, I was going to set the default to a higher value and
>> be done with it, because I can't think of a reason why it needs to be
>> shifted up or down after that.
>>
> 


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 13:55         ` Jason Gunthorpe
@ 2025-05-06 14:13           ` Chuck Lever
  2025-05-06 14:17             ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Chuck Lever @ 2025-05-06 14:13 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever,
	Leon Romanovsky

On 5/6/25 9:55 AM, Jason Gunthorpe wrote:
> On Tue, May 06, 2025 at 06:40:25AM -0700, Christoph Hellwig wrote:
>> On Tue, May 06, 2025 at 10:17:22AM -0300, Jason Gunthorpe wrote:
>>> On Tue, May 06, 2025 at 06:08:59AM -0700, Christoph Hellwig wrote:
>>>> On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
>>>>> qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
>>>>> the order of the sum of qp_attr.cap.max_send_wr and a factor times
>>>>> qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
>>>>> on whether MR operations are required before RDMA Reads.
>>>>>
>>>>> This limit is not visible to RDMA consumers via dev->attrs. When the
>>>>> limit is surpassed, QP creation fails with -ENOMEM. For example:
>>>>
>>>> Can we find a way to expose this limit from the HCA drivers and the
>>>> RDMA core?
>>>
>>> Shouldn't it be max_qp_wr?
>>
>> Does that allow for arbitrary combination of different WRs?  
> 
> I think it is supposed to be the maximum QP WR depth you can create..
> 
> A QP shouldn't behave differently depending on the WR operation, each
> one takes one WR entry.
> 
> Chuck do you know differently?

qp_attr.cap.max_rdma_ctxs reserves a number of SQEs over and above
qp_attr.cap.max_send_wr. The sum of those two cannot exceed max_qp_wr,
of course.

But there is a multiplier, due to whether the device wants a
registration and invalidation WR in addition to each RDMA Read WR.

Further, in drivers/infiniband/hw/mlx5/qp.c :: calc_sq_size

        wq_size = roundup_pow_of_two(attr->cap.max_send_wr * wqe_size);
        qp->sq.wqe_cnt = wq_size / MLX5_SEND_WQE_BB;
        if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {
                mlx5_ib_dbg(dev, "send queue size (%d * %d / %d -> %d) exceeds limits(%d)\n",
                            attr->cap.max_send_wr, wqe_size, MLX5_SEND_WQE_BB,
                            qp->sq.wqe_cnt,
                            1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz));
                return -ENOMEM;
        }

So when svcrdma requests a large number of ctxts on top of a Send
Queue size of 135, svc_rdma_accept() fails and the debug message above
pops out.

In this patch I'm trying to include the reg/inv multiplier in the
calculation, but that doesn't seem to be enough to make "accept"
reliable, IMO due to this extra calculation in calc_sq_size().
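
For illustration, a minimal sketch of that arithmetic (the helper names
and the reg_inv_factor parameter are placeholders, not existing kernel
symbols; rdma_rw_init_qp() applies the real multiplier, which can be up
to three, internally):

#include <rdma/ib_verbs.h>

/*
 * Illustrative only: the Send Queue depth the rdma_rw layer ends up
 * requesting for the caps a ULP passes to ib_create_qp().
 */
static u32 estimated_sq_depth(const struct ib_qp_init_attr *attr,
                              u32 reg_inv_factor)
{
        return attr->cap.max_send_wr +
               reg_inv_factor * attr->cap.max_rdma_ctxs;
}

/*
 * A ULP can at least sanity-check against the advertised device limit,
 * though, as discussed later in this thread, that limit can itself be
 * optimistic on mlx5.
 */
static bool sq_depth_fits(const struct ib_device *dev,
                          const struct ib_qp_init_attr *attr,
                          u32 reg_inv_factor)
{
        return estimated_sq_depth(attr, reg_inv_factor) <=
               (u32)dev->attrs.max_qp_wr;
}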

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 14:13           ` Chuck Lever
@ 2025-05-06 14:17             ` Jason Gunthorpe
  2025-05-06 14:19               ` Chuck Lever
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-05-06 14:17 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever, Leon Romanovsky

On Tue, May 06, 2025 at 10:13:00AM -0400, Chuck Lever wrote:
> On 5/6/25 9:55 AM, Jason Gunthorpe wrote:
> > On Tue, May 06, 2025 at 06:40:25AM -0700, Christoph Hellwig wrote:
> >> On Tue, May 06, 2025 at 10:17:22AM -0300, Jason Gunthorpe wrote:
> >>> On Tue, May 06, 2025 at 06:08:59AM -0700, Christoph Hellwig wrote:
> >>>> On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
> >>>>> qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
> >>>>> the order of the sum of qp_attr.cap.max_send_wr and a factor times
> >>>>> qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
> >>>>> on whether MR operations are required before RDMA Reads.
> >>>>>
> >>>>> This limit is not visible to RDMA consumers via dev->attrs. When the
> >>>>> limit is surpassed, QP creation fails with -ENOMEM. For example:
> >>>>
> >>>> Can we find a way to expose this limit from the HCA drivers and the
> >>>> RDMA core?
> >>>
> >>> Shouldn't it be max_qp_wr?
> >>
> >> Does that allow for arbitrary combination of different WRs?  
> > 
> > I think it is supposed to be the maximum QP WR depth you can create..
> > 
> > A QP shouldn't behave differently depending on the WR operation, each
> > one takes one WR entry.
> > 
> > Chuck do you know differently?
> 
> qp_attr.cap.max_rdma_ctxs reserves a number of SQEs over and above
> qp_attr.cap.max_send_wr. The sum of those two cannot exceed max_qp_wr,
> of course.

Yes

> But there is a multiplier, due to whether the device wants a
> registration and invalidation WR in addition to each RDMA Read WR.

Yes, but both of these are in the rdma rw layer
 
> Further, in drivers/infiniband/hw/mlx5/qp.c :: calc_sq_size
> 
>         wq_size = roundup_pow_of_two(attr->cap.max_send_wr * wqe_size);
>         qp->sq.wqe_cnt = wq_size / MLX5_SEND_WQE_BB;
>         if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {

And this log_max_qp_sz should be used to derive attr.max_qp_wr

> In this patch I'm trying to include the reg/inv multiplier in the
> calculation, but that doesn't seem to be enough to make "accept"
> reliable, IMO due to this extra calculation in calc_sq_size().

Did ib_create_qp get called with more than max_qp_wr ?

Or is max_qp_wr not working?

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 14:17             ` Jason Gunthorpe
@ 2025-05-06 14:19               ` Chuck Lever
  2025-05-06 14:22                 ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Chuck Lever @ 2025-05-06 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever, Leon Romanovsky

On 5/6/25 10:17 AM, Jason Gunthorpe wrote:
> On Tue, May 06, 2025 at 10:13:00AM -0400, Chuck Lever wrote:
>> On 5/6/25 9:55 AM, Jason Gunthorpe wrote:
>>> On Tue, May 06, 2025 at 06:40:25AM -0700, Christoph Hellwig wrote:
>>>> On Tue, May 06, 2025 at 10:17:22AM -0300, Jason Gunthorpe wrote:
>>>>> On Tue, May 06, 2025 at 06:08:59AM -0700, Christoph Hellwig wrote:
>>>>>> On Mon, Apr 28, 2025 at 03:36:49PM -0400, cel@kernel.org wrote:
>>>>>>> qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
>>>>>>> the order of the sum of qp_attr.cap.max_send_wr and a factor times
>>>>>>> qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
>>>>>>> on whether MR operations are required before RDMA Reads.
>>>>>>>
>>>>>>> This limit is not visible to RDMA consumers via dev->attrs. When the
>>>>>>> limit is surpassed, QP creation fails with -ENOMEM. For example:
>>>>>>
>>>>>> Can we find a way to expose this limit from the HCA drivers and the
>>>>>> RDMA core?
>>>>>
>>>>> Shouldn't it be max_qp_wr?
>>>>
>>>> Does that allow for arbitrary combination of different WRs?  
>>>
>>> I think it is supposed to be the maximum QP WR depth you can create..
>>>
>>> A QP shouldn't behave differently depending on the WR operation, each
>>> one takes one WR entry.
>>>
>>> Chuck do you know differently?
>>
>> qp_attr.cap.max_rdma_ctxs reserves a number of SQEs over and above
>> qp_attr.cap.max_send_wr. The sum of those two cannot exceed max_qp_wr,
>> of course.
> 
> Yes
> 
>> But there is a multiplier, due to whether the device wants a
>> registration and invalidation WR in addition to each RDMA Read WR.
> 
> Yes, but both of these are in the rdma rw layer
>  
>> Further, in drivers/infiniband/hw/mlx5/qp.c :: calc_sq_size
>>
>>         wq_size = roundup_pow_of_two(attr->cap.max_send_wr * wqe_size);
>>         qp->sq.wqe_cnt = wq_size / MLX5_SEND_WQE_BB;
>>         if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {
> 
> And this log_max_qp_sz should be used to derive attr.max_qp_wr
> 
>> In this patch I'm trying to include the reg/inv multiplier in the
>> calculation, but that doesn't seem to be enough to make "accept"
>> reliable, IMO due to this extra calculation in calc_sq_size().
> 
> Did ib_create_qp get called with more than max_qp_wr ?

The request was for, like, 9300 SQEs. max_qp_wr is 32K on my systems.

> Or is max_qp_wr not working?

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 14:19               ` Chuck Lever
@ 2025-05-06 14:22                 ` Jason Gunthorpe
  2025-05-08  8:41                   ` Edward Srouji
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-05-06 14:22 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever, Leon Romanovsky

On Tue, May 06, 2025 at 10:19:06AM -0400, Chuck Lever wrote:
> >> In this patch I'm trying to include the reg/inv multiplier in the
> >> calculation, but that doesn't seem to be enough to make "accept"
> >> reliable, IMO due to this extra calculation in calc_sq_size().
> > 
> > Did ib_create_qp get called with more than max_qp_wr ?
> 
> The request was for, like, 9300 SQEs. max_qp_wr is 32K on my systems.

Sounds like it is broken then..

	props->max_qp_wr	   = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);

So it is ignoring the wqe_size adjustment.. It should adjust by the worst
case result of calc_send_wqe() for the device..

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
  2025-05-06 13:31   ` Christoph Hellwig
@ 2025-05-06 15:20     ` Chuck Lever
  2025-05-07  7:40       ` Christoph Hellwig
  0 siblings, 1 reply; 52+ messages in thread
From: Chuck Lever @ 2025-05-06 15:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On 5/6/25 9:31 AM, Christoph Hellwig wrote:
> On Mon, Apr 28, 2025 at 03:36:56PM -0400, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Allow allocation of more entries in the rc_pages[] array when the
>> maximum size of an RPC message is increased.
> 
> Can we maybe also look into a way to not allocate the pages in the
> rqst first just to free them when they get replaced with those from the
> RDMA receive context?  Currently a lot of memory is wasted and pointless
> burden is put on the page allocator when using the RDMA transport on
> the server side.

You're talking specifically about:

1. svcrdma issues RDMA Read WRs from an svc_rqst thread context. It
   pulls the Read sink pages out of the svc_rqst's rq_pages[] array, and
   then svc_alloc_arg() refills the rq_pages[] array before the thread
   returns to the thread pool

2. When the RDMA Read completes, it is picked up by an svc_rqst thread.
   svcrdma frees the pages in the thread's rq_pages[] array, and
   replaces them with the Read's sink pages.

I've looked at this several times over the years. It's a tough problem
to balance against concerns like preventing a denial of service. For
example, an attempt was made to handle the RDMA Read synchronously in
the same thread that handles the incoming Receive. That had to be
reverted: if the client is slow to furnish the Read payload, it ties up
the svc_rqst thread, which is a DoS vector.

One idea would be for NFSD to maintain a pool of these pages. But I'm
not convinced we could invent anything with lower latency than the
generic bulk page allocator APIs: release_pages() and alloc_pages_bulk().
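
As a concrete point of comparison, a minimal sketch of that
refill/release pattern using the generic bulk APIs (a plain page array
here, not the actual svc_alloc_arg() code; alloc_pages_bulk() is the
current name of what older kernels call alloc_pages_bulk_array()):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Fill only the NULL slots in pages[0..want-1]; returns 0 when full */
static int refill_page_array(struct page **pages, unsigned long want)
{
        unsigned long filled = alloc_pages_bulk(GFP_KERNEL, want, pages);

        return filled == want ? 0 : -ENOMEM;
}

/* Hand a whole batch of pages back to the allocator in one call */
static void drop_page_array(struct page **pages, int nr)
{
        release_pages(pages, nr);
}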


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/14] sunrpc: Replace the rq_vec array with dynamically-allocated memory
  2025-05-06 13:29   ` Christoph Hellwig
@ 2025-05-06 16:31     ` Chuck Lever
  2025-05-07  7:34       ` Christoph Hellwig
  0 siblings, 1 reply; 52+ messages in thread
From: Chuck Lever @ 2025-05-06 16:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On 5/6/25 9:29 AM, Christoph Hellwig wrote:
> On Mon, Apr 28, 2025 at 03:36:53PM -0400, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> As a step towards making NFSD's maximum rsize and wsize variable at
>> run-time, replace the fixed-size rq_vec[] array in struct svc_rqst
>> with a chunk of dynamically-allocated memory.
>>
>> The rq_vec array is sized assuming request processing will need at
>> most one kvec per page in a maximum-sized RPC message.
>>
>> On a system with 8-byte pointers and 4KB pages, pahole reports that
>> the rq_vec[] array is 4144 bytes. This patch replaces that array
>> with a single 8-byte pointer field.
> 
> The right thing to do here is to kill this array.  There is no
> reason to use kvecs in the VFS read/write APIs these days, we can
> use bio_vecs just fine, for which we have another allocation.

Fair enough. That's a little more churn than I wanted to do in this
patch series, but maybe it's easier than I expect.
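
For reference, a minimal sketch of what a bio_vec-based read path can
look like with today's generic helpers (this is only an illustration,
not the NFSD conversion itself):

#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/uio.h>

/* Read into pages described directly by a caller-provided bio_vec array */
static ssize_t read_into_bvecs(struct file *file, struct bio_vec *bvec,
                               unsigned int nr_bvecs, size_t count,
                               loff_t *ppos)
{
        struct iov_iter iter;

        iov_iter_bvec(&iter, ITER_DEST, bvec, nr_bvecs, count);
        return vfs_iter_read(file, &iter, ppos, 0);
}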


> And given that both are only used by the server and never the client
> maybe they should both only be conditionally allocated?

Not sure I follow you here. The client certainly does make extensive use
of xdr_buf::bvec.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 05/14] sunrpc: Replace the rq_vec array with dynamically-allocated memory
  2025-05-06 16:31     ` Chuck Lever
@ 2025-05-07  7:34       ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-07  7:34 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever

On Tue, May 06, 2025 at 12:31:37PM -0400, Chuck Lever wrote:
> > And given that both are only used by the server and never the client
> > maybe they should both only be conditionally allocated?
> 
> Not sure I follow you here. The client certainly does make extensive use
> of xdr_buf::bvec.

Yeah.  It doesn't use rq_bvec, but that's because that is in a
server-only structure..


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages
  2025-05-06 15:20     ` Chuck Lever
@ 2025-05-07  7:40       ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-07  7:40 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever

On Tue, May 06, 2025 at 11:20:44AM -0400, Chuck Lever wrote:
> 1. svcrdma issues RDMA Read WRs from an svc_rqst thread context. It
>    pulls the Read sink pages out of the svc_rqst's rq_pages[] array, and
>    then svc_alloc_arg() refills the rq_pages[] array before the thread
>    returns to the thread pool
> 
> 2. When the RDMA Read completes, it is picked up by an svc_rqst thread.
>    svcrdma frees the pages in the thread's rq_pages[] array, and
>    replaces them with the Read's sink pages.

...

> One idea would be for NFSD to maintain a pool of these pages. But I'm
> not convinced that we could invent anything that's less latent than the
> generic bulk page allocator: release_pages() and alloc_bulk_pages().

I think the main issue is the completion thread, which first allocates
and then frees the pages.  If you instead just have a container for the
pages (or any other future context state), you avoid that. In other
words, lift the concept of a receive context from the RDMA transport
into common code.
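
Purely as a speculative sketch, such a container might look something
like this (the name and fields are placeholders; nothing like this
exists in the tree today):

#include <linux/mm_types.h>

/* Hypothetical common receive context owning the sink pages */
struct svc_recv_pages {
        struct page     **pages;        /* pages owned by this receive */
        unsigned int    npages;         /* entries currently populated */
        void            *xprt_priv;     /* transport-private state */
};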

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-05-06 13:52     ` Chuck Lever
  2025-05-06 13:54       ` Jeff Layton
@ 2025-05-07  7:42       ` Christoph Hellwig
  2025-05-07 14:25         ` Chuck Lever
  1 sibling, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2025-05-07  7:42 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever

On Tue, May 06, 2025 at 09:52:06AM -0400, Chuck Lever wrote:
> > Are you going to wire this up to a config file in nfs-utils that
> > gets set before the daemon starts?
> 
> That's up to SteveD -- it might be added to /etc/nfs.conf.

Well, you should be talking to him, or even including a patch.

> > Because otherwise this is a pretty horrible user interface.
> 
> This is an API that has existed forever.

Huh?  It is a brand new file added by this patch.

> I don't even like that this maximum can be tuned. After a period of
> experimentation, I was going to set the default to a higher value and
> be done with it, because I can't think of a reason why it needs to be
> shifted up or down after that.

Why not?  A tiny desk NAS box has very different resources available
compared to, say, a multi-socket enterprise AI data server.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server
  2025-05-07  7:42       ` Christoph Hellwig
@ 2025-05-07 14:25         ` Chuck Lever
  0 siblings, 0 replies; 52+ messages in thread
From: Chuck Lever @ 2025-05-07 14:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Anna Schumaker, linux-nfs, linux-rdma, Chuck Lever

On 5/7/25 3:42 AM, Christoph Hellwig wrote:
> On Tue, May 06, 2025 at 09:52:06AM -0400, Chuck Lever wrote:
>>> Are you going to wire this up to a config file in nfs-utils that
>>> gets set before the daemon starts?
>>
>> That's up to SteveD -- it might be added to /etc/nfs.conf.
> 
> Well, you should be talking to him or even include a patch.

On this list, we post nfs-utils patches separately, once the kernel API
is nailed. Steve doesn't pull such changes until the kernel changes
have been merged.

But see below: I'm still not convinced this is a tunable that is worth
going to that level of trouble for.


>>> Because otherwise this is a pretty horrible user interface.
>>
>> This is an API that has existed forever.
> 
> Huh?  It is a brand new file added by this patch.

/proc/fs/nfsd/max_block_size was added by commit 596bbe53eb3a ("[PATCH]
knfsd: Allow max size of NFSd payload to be configured") in 2006.

Or are you referring to something else?


>> I don't even like that this maximum can be tuned. After a period of
>> experimentation, I was going to set the default to a higher value and
>> be done with it, because I can't think of a reason why it needs to be
>> shifted up or down after that.
> 
> Why not?  A tiny desk NAS box has very different resources available
> compared to say a multi-socket enterprise AI data server.

I don't believe system memory size is a concern.

a. max_block_size is automatically reduced on small-memory systems. See
   nfsd_get_default_max_blksize(); a rough sketch follows this list.

b. The extra memory allocation is per thread, so a smaller server can
   reduce the standing memory requirements by lowering the number of
   nfsd threads.

c. We're now removing rq_vec, so there's already less memory to allocate
   than before.
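
A paraphrased sketch of that scaling, assuming a fixed per-thread share
of RAM (this is not a verbatim copy of nfsd_get_default_max_blksize();
the 1/4096 share and the 8KB floor are illustrative assumptions):

/* Halve the payload limit until it fits a per-thread share of RAM */
static unsigned long default_max_blksize_sketch(unsigned long ram_bytes,
                                                unsigned long max_payload)
{
        unsigned long target = ram_bytes / 4096;  /* assumed share */
        unsigned long ret = max_payload;

        while (ret > target && ret >= 2 * 8 * 1024)
                ret /= 2;
        return ret;
}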


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-06 14:22                 ` Jason Gunthorpe
@ 2025-05-08  8:41                   ` Edward Srouji
  2025-05-08 12:43                     ` Jason Gunthorpe
  0 siblings, 1 reply; 52+ messages in thread
From: Edward Srouji @ 2025-05-08  8:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Chuck Lever
  Cc: Christoph Hellwig, NeilBrown, Jeff Layton, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs, linux-rdma,
	Chuck Lever, Leon Romanovsky


On 5/6/2025 5:22 PM, Jason Gunthorpe wrote:
> On Tue, May 06, 2025 at 10:19:06AM -0400, Chuck Lever wrote:
>>>> In this patch I'm trying to include the reg/inv multiplier in the
>>>> calculation, but that doesn't seem to be enough to make "accept"
>>>> reliable, IMO due to this extra calculation in calc_sq_size().
>>> Did ib_create_qp get called with more than max_qp_wr ?
>> The request was for, like, 9300 SQEs. max_qp_wr is 32K on my systems.
> Sounds like it is broken then..
>
> 	props->max_qp_wr	   = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);
>
> So it is ignoring the wqe_size adjustment.. It should adjust by the worst
> case result of calc_send_wqe() for the device..
How do you suggest adjusting to the worst case?
How could inline messages be addressed and taken into account?
Even if we ignore the inline size, the worst case could potentially be
less than 1/8th of the max HCA CAP; I'm not sure we want to deliver that
as a limitation to users.
> Jason
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-08  8:41                   ` Edward Srouji
@ 2025-05-08 12:43                     ` Jason Gunthorpe
  2025-05-10 23:12                       ` Edward Srouji
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2025-05-08 12:43 UTC (permalink / raw)
  To: Edward Srouji
  Cc: Chuck Lever, Christoph Hellwig, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Anna Schumaker, linux-nfs,
	linux-rdma, Chuck Lever, Leon Romanovsky

On Thu, May 08, 2025 at 11:41:18AM +0300, Edward Srouji wrote:
> 
> On 5/6/2025 5:22 PM, Jason Gunthorpe wrote:
> > On Tue, May 06, 2025 at 10:19:06AM -0400, Chuck Lever wrote:
> > > > > In this patch I'm trying to include the reg/inv multiplier in the
> > > > > calculation, but that doesn't seem to be enough to make "accept"
> > > > > reliable, IMO due to this extra calculation in calc_sq_size().
> > > > Did ib_create_qp get called with more than max_qp_wr ?
> > > The request was for, like, 9300 SQEs. max_qp_wr is 32K on my systems.
> > Sounds like it is broken then..
> > 
> > 	props->max_qp_wr	   = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);
> > 
> > So it is ignoring the wqe_size adustment.. It should adjust by the worst
> > case result of calc_send_wqe() for the device..
> How do you suggest adjusting to the worst case?
> How inline messages could be addressed and taken into account?

I think assume 0 size inline for computing max sizes

> Even if we ignore the inline size, worst case potentially could be less than
> 1/8th of the max HCA CAP, not sure we want to deliver this as a limitation
> to users.

The math is simply wrong: log_max_qp_sz is not the number of work
queue entries in the queue, it is the number of MLX5_SEND_WQE_BB
units, which is an internal quantity.

For a verbs API the result should be the max number of work queue
entries that can be requested for any of XRC/RC/UC/UD QP types using a
0 inline size, 1 SGL and no other special features.

Even for a simple RC QP, sq_overhead() will return 132, which already
makes props->max_qp_wr uselessly wrong. That 132 goes in here:

		return ALIGN(max_t(int, inl_size, size), MLX5_SEND_WQE_BB);

That comes out as 192, so props->max_qp_wr is off by 3x even for a
simple, no-feature RC QP.

Chuck is getting:

calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)

So I suppose that extra 64 bytes is coming from cap.max_send_sge >= 3?

Without a new API we can't make it fully discoverable, but the way it
is now is clearly wrong.
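
One possible shape of that adjustment, purely illustrative (the
worst_case_wqe_size parameter stands in for what calc_send_wqe() would
compute for a plain RC QP with no inline data and one SGE, 192 bytes in
the example above):

#include <linux/mlx5/device.h>
#include <linux/mlx5/driver.h>
#include <linux/mlx5/qp.h>

/* Derive a max_qp_wr that accounts for the worst-case send WQE size */
static int mlx5_adjusted_max_qp_wr(struct mlx5_core_dev *mdev,
                                   int worst_case_wqe_size)
{
        /* Total SQ budget, in MLX5_SEND_WQE_BB (64-byte) units */
        int max_bbs = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);

        return max_bbs / (worst_case_wqe_size / MLX5_SEND_WQE_BB);
}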

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP
  2025-05-08 12:43                     ` Jason Gunthorpe
@ 2025-05-10 23:12                       ` Edward Srouji
  0 siblings, 0 replies; 52+ messages in thread
From: Edward Srouji @ 2025-05-10 23:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chuck Lever, Christoph Hellwig, NeilBrown, Jeff Layton,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Anna Schumaker,
	linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org,
	Chuck Lever, Leon Romanovsky


On 5/8/2025 3:43 PM, Jason Gunthorpe wrote:
>
> On Thu, May 08, 2025 at 11:41:18AM +0300, Edward Srouji wrote:
>> On 5/6/2025 5:22 PM, Jason Gunthorpe wrote:
>>> On Tue, May 06, 2025 at 10:19:06AM -0400, Chuck Lever wrote:
>>>>>> In this patch I'm trying to include the reg/inv multiplier in the
>>>>>> calculation, but that doesn't seem to be enough to make "accept"
>>>>>> reliable, IMO due to this extra calculation in calc_sq_size().
>>>>> Did ib_create_qp get called with more than max_qp_wr ?
>>>> The request was for, like, 9300 SQEs. max_qp_wr is 32K on my systems.
>>> Sounds like it is broken then..
>>>
>>>      props->max_qp_wr           = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz);
>>>
>>> So it is ignoring the wqe_size adjustment.. It should adjust by the worst
>>> case result of calc_send_wqe() for the device..
>> How do you suggest adjusting to the worst case?
>> How inline messages could be addressed and taken into account?
> I think assume 0 size inline for computing max sizes
>
>> Even if we ignore the inline size, worst case potentially could be less than
>> 1/8th of the max HCA CAP, not sure we want to deliver this as a limitation
>> to users.
> The math is simply wrong - log_max_qp_sz is not the number of work
> queue entries in the queue, it is the number of MLX5_SEND_WQE_BB's
> units which is some internal value.

I agree, no doubt that the returned max_qp_wr is wrong and misleading...

>
> For a verbs API the result should be the max number of work queue
> entries that can be requested for any of XRC/RC/UC/UD QP types using a
> 0 inline size, 1 SGL and no other special features.
>
> Even for a simple RC QP sq_overhead() will return 132 which already
> makes props->max_qp_wr uselessly wrong. 132 goes into here:
>
>                  return ALIGN(max_t(int, inl_size, size), MLX5_SEND_WQE_BB);
>
> Comes out as 192 - so props->max_qp_wr is off by 3x even for a simple
> no-feature RC QP.
>
> Chuck is getting:
>
> calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)
>
> So I suppose that extra 64 bytes is coming from cap.max_send_sge >= 3?
>
> Without a new API we can't make it fully discoverable, but the way it
> is now is clearly wrong.

This is what I was considering: a new API.
The calculation suggested above will return a reasonable value but
obviously won't satisfy all use cases (someone else will probably face a
similar issue later on).
The question is whether it is worth adding a new dedicated API for an
accurate per-use-case calculation.

>
> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2025-05-10 23:12 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-28 19:36 [PATCH v4 00/14] Allocate payload arrays dynamically cel
2025-04-28 19:36 ` [PATCH v4 01/14] svcrdma: Reduce the number of rdma_rw contexts per-QP cel
2025-05-06 13:08   ` Christoph Hellwig
2025-05-06 13:17     ` Jason Gunthorpe
2025-05-06 13:40       ` Christoph Hellwig
2025-05-06 13:55         ` Jason Gunthorpe
2025-05-06 14:13           ` Chuck Lever
2025-05-06 14:17             ` Jason Gunthorpe
2025-05-06 14:19               ` Chuck Lever
2025-05-06 14:22                 ` Jason Gunthorpe
2025-05-08  8:41                   ` Edward Srouji
2025-05-08 12:43                     ` Jason Gunthorpe
2025-05-10 23:12                       ` Edward Srouji
2025-04-28 19:36 ` [PATCH v4 02/14] sunrpc: Add a helper to derive maxpages from sv_max_mesg cel
2025-05-06 13:10   ` Christoph Hellwig
2025-04-28 19:36 ` [PATCH v4 03/14] sunrpc: Remove backchannel check in svc_init_buffer() cel
2025-05-06 13:11   ` Christoph Hellwig
2025-04-28 19:36 ` [PATCH v4 04/14] sunrpc: Replace the rq_pages array with dynamically-allocated memory cel
2025-04-30  4:53   ` NeilBrown
2025-04-28 19:36 ` [PATCH v4 05/14] sunrpc: Replace the rq_vec " cel
2025-05-06 13:29   ` Christoph Hellwig
2025-05-06 16:31     ` Chuck Lever
2025-05-07  7:34       ` Christoph Hellwig
2025-04-28 19:36 ` [PATCH v4 06/14] sunrpc: Replace the rq_bvec " cel
2025-04-28 19:36 ` [PATCH v4 07/14] sunrpc: Adjust size of socket's receive page array dynamically cel
2025-04-28 19:36 ` [PATCH v4 08/14] svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages cel
2025-05-06 13:31   ` Christoph Hellwig
2025-05-06 15:20     ` Chuck Lever
2025-05-07  7:40       ` Christoph Hellwig
2025-04-28 19:36 ` [PATCH v4 09/14] svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages cel
2025-04-28 19:36 ` [PATCH v4 10/14] sunrpc: Remove the RPCSVC_MAXPAGES macro cel
2025-04-28 19:36 ` [PATCH v4 11/14] NFSD: Remove NFSD_BUFSIZE cel
2025-04-28 21:03   ` Jeff Layton
2025-05-06 13:32   ` Christoph Hellwig
2025-04-28 19:37 ` [PATCH v4 12/14] NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro cel
2025-05-06 13:33   ` Christoph Hellwig
2025-04-28 19:37 ` [PATCH v4 13/14] NFSD: Add a "default" block size cel
2025-04-28 21:07   ` Jeff Layton
2025-04-28 19:37 ` [PATCH v4 14/14] SUNRPC: Bump the maximum payload size for the server cel
2025-04-28 21:08   ` Jeff Layton
2025-04-29 15:44     ` Chuck Lever
2025-05-06 13:34   ` Christoph Hellwig
2025-05-06 13:52     ` Chuck Lever
2025-05-06 13:54       ` Jeff Layton
2025-05-06 13:59         ` Chuck Lever
2025-05-07  7:42       ` Christoph Hellwig
2025-05-07 14:25         ` Chuck Lever
2025-04-29 13:06 ` [PATCH v4 00/14] Allocate payload arrays dynamically Zhu Yanjun
2025-04-29 13:41   ` Chuck Lever
2025-04-29 13:52     ` Zhu Yanjun
2025-04-30  5:11 ` NeilBrown
2025-04-30 12:45   ` Chuck Lever

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).