* [PATCH v2 0/2] sunrpc: fix handling of rq_bvec array in svc_rqst
@ 2025-10-08 20:23 Jeff Layton
2025-10-08 20:23 ` [PATCH v2 1/2] sunrpc: account for TCP record marker in rq_bvec array when sending Jeff Layton
2025-10-08 20:23 ` [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker Jeff Layton
0 siblings, 2 replies; 11+ messages in thread
From: Jeff Layton @ 2025-10-08 20:23 UTC
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells
Cc: Brandon Adams, linux-nfs, netdev, linux-kernel, Jeff Layton
I've seen this message pop intermittently on some knfsd servers:
rpc-srv/tcp: nfsd: sent 1045870 when sending 1045868 bytes - shutting down socket
Unfortunately, I've not been able to reproduce this on my own, but I've
noticed a bug that could cause this.
The first patch in this series fixes a bug in rq_bvec handling I noticed
by inspection. The second patch adds a slot to rq_bvec to account for
the slot used for the TCP record marker.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Changes in v2:
- Better changelog message for patch #2
- Link to v1: https://lore.kernel.org/r/20251008-rq_bvec-v1-0-7f23d32d75e5@kernel.org
---
Jeff Layton (2):
sunrpc: account for TCP record marker in rq_bvec array when sending
sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
fs/nfsd/vfs.c | 6 +++---
net/sunrpc/svc.c | 3 ++-
net/sunrpc/svcsock.c | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)
---
base-commit: 177818f176ef904fb18d237d1dbba00c2643aaf2
change-id: 20251008-rq_bvec-b66afd0fdbbb
Best regards,
--
Jeff Layton <jlayton@kernel.org>
* [PATCH v2 1/2] sunrpc: account for TCP record marker in rq_bvec array when sending
2025-10-08 20:23 [PATCH v2 0/2] sunrpc: fix handling of rq_bvec array in svc_rqst Jeff Layton
@ 2025-10-08 20:23 ` Jeff Layton
2025-10-08 21:32 ` NeilBrown
2025-10-08 20:23 ` [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker Jeff Layton
1 sibling, 1 reply; 11+ messages in thread
From: Jeff Layton @ 2025-10-08 20:23 UTC
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells
Cc: Brandon Adams, linux-nfs, netdev, linux-kernel, Jeff Layton
The call to xdr_buf_to_bvec() in svc_tcp_sendmsg() passes the second
slot of the bvec array as the starting slot, but doesn't decrease the
length of the array by one.
Fixes: 59cf7346542b ("sunrpc: Replace the rq_bvec array with dynamically-allocated memory")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
net/sunrpc/svcsock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 7b90abc5cf0ee1520796b2f38fcb977417009830..377fcaaaa061463fc5c85fc09c7a8eab5e06af77 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1244,7 +1244,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
memcpy(buf, &marker, sizeof(marker));
bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
- count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
+ count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages - 1,
&rqstp->rq_res);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
--
2.51.0
* [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-08 20:23 [PATCH v2 0/2] sunrpc: fix handling of rq_bvec array in svc_rqst Jeff Layton
2025-10-08 20:23 ` [PATCH v2 1/2] sunrpc: account for TCP record marker in rq_bvec array when sending Jeff Layton
@ 2025-10-08 20:23 ` Jeff Layton
2025-10-08 21:51 ` NeilBrown
1 sibling, 1 reply; 11+ messages in thread
From: Jeff Layton @ 2025-10-08 20:23 UTC
To: Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells
Cc: Brandon Adams, linux-nfs, netdev, linux-kernel, Jeff Layton
We've seen some occurrences of messages like this in dmesg on some knfsd
servers:
xdr_buf_to_bvec: bio_vec array overflow
Usually followed by messages like this that indicate a short send (note
that this message is from an older kernel and the amount that it reports
attempting to send is short by 4 bytes):
rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
marker. If the send is an unaligned READ call though, then there may not
be enough slots in the rq_bvec array in some cases.
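To make the accounting concrete, the relevant sequence currently looks
roughly like this (an annotated sketch, not a literal excerpt):

	bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker)); /* slot 0: record marker */
	count = xdr_buf_to_bvec(rqstp->rq_bvec + 1,         /* slots 1 .. rq_maxpages - 1 */
				rqstp->rq_maxpages - 1,
				&rqstp->rq_res);

With the marker occupying slot 0, only rq_maxpages - 1 slots remain for
rq_res, and an unaligned payload can straddle one extra page, so the
reply (head + payload pages + tail) can need more entries than that,
which is when xdr_buf_to_bvec() emits the overflow warning quoted above.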
Add a slot to the rq_bvec array, and fix up the array lengths in the
callers that care.
Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
Tested-by: Brandon Adams <brandona@meta.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/vfs.c | 6 +++---
net/sunrpc/svc.c | 3 ++-
net/sunrpc/svcsock.c | 4 ++--
3 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 77f6879c2e063fa79865100bbc2d1e64eb332f42..c4e9300d657cf7fdba23f2f4e4bdaad9cd99d1a3 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1111,7 +1111,7 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
v = 0;
total = dio_end - dio_start;
- while (total && v < rqstp->rq_maxpages &&
+ while (total && v < rqstp->rq_maxpages + 1 &&
rqstp->rq_next_page < rqstp->rq_page_end) {
len = min_t(size_t, total, PAGE_SIZE);
bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
@@ -1200,7 +1200,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
v = 0;
total = *count;
- while (total && v < rqstp->rq_maxpages &&
+ while (total && v < rqstp->rq_maxpages + 1 &&
rqstp->rq_next_page < rqstp->rq_page_end) {
len = min_t(size_t, total, PAGE_SIZE - base);
bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
@@ -1318,7 +1318,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (stable && !fhp->fh_use_wgather)
kiocb.ki_flags |= IOCB_DSYNC;
- nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
+ nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, payload);
iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
since = READ_ONCE(file->f_wb_err);
if (verf)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4704dce7284eccc9e2bc64cf22947666facfa86a..919263a0c04e3f1afa607414bc1893ba02206e38 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -706,7 +706,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
if (!svc_init_buffer(rqstp, serv, node))
goto out_enomem;
- rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages,
+ /* +1 for the TCP record marker */
+ rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages + 1,
sizeof(struct bio_vec),
GFP_KERNEL, node);
if (!rqstp->rq_bvec)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 377fcaaaa061463fc5c85fc09c7a8eab5e06af77..5f8bb11b686bcd7302b94476490ba9b1b9ddc06a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -740,7 +740,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
if (svc_xprt_is_dead(xprt))
goto out_notconn;
- count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, xdr);
+ count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, xdr);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
count, rqstp->rq_res.len);
@@ -1244,7 +1244,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
memcpy(buf, &marker, sizeof(marker));
bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
- count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages - 1,
+ count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
&rqstp->rq_res);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
--
2.51.0
* Re: [PATCH v2 1/2] sunrpc: account for TCP record marker in rq_bvec array when sending
2025-10-08 20:23 ` [PATCH v2 1/2] sunrpc: account for TCP record marker in rq_bvec array when sending Jeff Layton
@ 2025-10-08 21:32 ` NeilBrown
0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2025-10-08 21:32 UTC
To: Jeff Layton
Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel, Jeff Layton
On Thu, 09 Oct 2025, Jeff Layton wrote:
> The call to xdr_buf_to_bvec() in svc_tcp_sendto() passes in the second
> slot to the bvec array as the starting slot, but doesn't decrease the
> length of the array by one.
>
> Fixes: 59cf7346542b ("sunrpc: Replace the rq_bvec array with dynamically-allocated memory")
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> net/sunrpc/svcsock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 7b90abc5cf0ee1520796b2f38fcb977417009830..377fcaaaa061463fc5c85fc09c7a8eab5e06af77 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1244,7 +1244,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
> memcpy(buf, &marker, sizeof(marker));
> bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
>
> - count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
> + count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages - 1,
> &rqstp->rq_res);
oh - that's embarrassing :-(
Reviewed-by: NeilBrown <neil@brown.name>
Thanks,
NeilBrown
>
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
>
> --
> 2.51.0
>
>
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-08 20:23 ` [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker Jeff Layton
@ 2025-10-08 21:51 ` NeilBrown
2025-10-09 0:08 ` Mike Snitzer
2025-10-09 12:22 ` Jeff Layton
0 siblings, 2 replies; 11+ messages in thread
From: NeilBrown @ 2025-10-08 21:51 UTC
To: Jeff Layton
Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel, Jeff Layton
On Thu, 09 Oct 2025, Jeff Layton wrote:
> We've seen some occurrences of messages like this in dmesg on some knfsd
> servers:
>
> xdr_buf_to_bvec: bio_vec array overflow
>
> Usually followed by messages like this that indicate a short send (note
> that this message is from an older kernel and the amount that it reports
> attempting to send is short by 4 bytes):
>
> rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
>
> svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
> marker. If the send is an unaligned READ call though, then there may not
> be enough slots in the rq_bvec array in some cases.
>
> Add a slot to the rq_bvec array, and fix up the array lengths in the
> callers that care.
>
> Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
> Tested-by: Brandon Adams <brandona@meta.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
> fs/nfsd/vfs.c | 6 +++---
> net/sunrpc/svc.c | 3 ++-
> net/sunrpc/svcsock.c | 4 ++--
> 3 files changed, 7 insertions(+), 6 deletions(-)
I can't say that I'm liking this patch.
There are 11 places (in nfsd-testing recently) where rq_maxpages is
used (as opposed to declared or assigned).
3 in nfsd/vfs.c
4 in sunrpc/svc.c
1 in sunrpc/svc_xprt.c
2 in sunrpc/svcsock.c
1 in xprtrdma/svc_rdma_rc.c
Your patch changes six of those to add 1. I guess the others aren't
"callers that care". It would help to have it clearly stated why, or
why not, a caller might care.
But also, what does "rq_maxpages" even mean now?
The comment in svc.h still says "num of entries in rq_pages"
which is certainly no longer the case.
But if it was the case, we should have called it "rq_numpages"
or similar.
But maybe it wasn't meant to be the number of pages in the array,
maybe it was meant to be the maximum number of pages in a request
or a reply.....
No - that is sv_max_mesg, to which we add 2 and 1.
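(For reference, the helper in question reads roughly as below; the
return statement is the one I quote further down, the signature is my
reconstruction:)

	static inline unsigned long svc_serv_maxpages(const struct svc_serv *serv)
	{
		return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
	}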
So I could ask "why not just add another 1 in svc_serv_maxpages()?"
Would the callers that might not care be harmed if rq_maxpages were
one larger than it is?
It seems to me that rq_maxpages is rather confused, and the bug you have
found which requires this patch is some evidence of that confusion. We
should fix the confusion, not just the bug.
So simple question to cut through my waffle:
Would this:
- return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
+ return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
fix the problem. If not, why not? If so, can we just do this?
then look at renaming rq_maxpages to rq_numpages and audit all the uses
(and maybe you have already audited...).
Thanks,
NeilBrown
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 77f6879c2e063fa79865100bbc2d1e64eb332f42..c4e9300d657cf7fdba23f2f4e4bdaad9cd99d1a3 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1111,7 +1111,7 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> v = 0;
> total = dio_end - dio_start;
> - while (total && v < rqstp->rq_maxpages &&
> + while (total && v < rqstp->rq_maxpages + 1 &&
> rqstp->rq_next_page < rqstp->rq_page_end) {
> len = min_t(size_t, total, PAGE_SIZE);
> bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> @@ -1200,7 +1200,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> v = 0;
> total = *count;
> - while (total && v < rqstp->rq_maxpages &&
> + while (total && v < rqstp->rq_maxpages + 1 &&
> rqstp->rq_next_page < rqstp->rq_page_end) {
> len = min_t(size_t, total, PAGE_SIZE - base);
> bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> @@ -1318,7 +1318,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> if (stable && !fhp->fh_use_wgather)
> kiocb.ki_flags |= IOCB_DSYNC;
>
> - nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> + nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, payload);
> iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> since = READ_ONCE(file->f_wb_err);
> if (verf)
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 4704dce7284eccc9e2bc64cf22947666facfa86a..919263a0c04e3f1afa607414bc1893ba02206e38 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -706,7 +706,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> if (!svc_init_buffer(rqstp, serv, node))
> goto out_enomem;
>
> - rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages,
> + /* +1 for the TCP record marker */
> + rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages + 1,
> sizeof(struct bio_vec),
> GFP_KERNEL, node);
> if (!rqstp->rq_bvec)
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 377fcaaaa061463fc5c85fc09c7a8eab5e06af77..5f8bb11b686bcd7302b94476490ba9b1b9ddc06a 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -740,7 +740,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
> if (svc_xprt_is_dead(xprt))
> goto out_notconn;
>
> - count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, xdr);
> + count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, xdr);
>
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
> count, rqstp->rq_res.len);
> @@ -1244,7 +1244,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
> memcpy(buf, &marker, sizeof(marker));
> bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
>
> - count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages - 1,
> + count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
> &rqstp->rq_res);
>
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
>
> --
> 2.51.0
>
>
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-08 21:51 ` NeilBrown
@ 2025-10-09 0:08 ` Mike Snitzer
2025-10-09 12:22 ` Jeff Layton
1 sibling, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2025-10-09 0:08 UTC
To: NeilBrown
Cc: Jeff Layton, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel
On Thu, Oct 09, 2025 at 08:51:25AM +1100, NeilBrown wrote:
> On Thu, 09 Oct 2025, Jeff Layton wrote:
> > We've seen some occurrences of messages like this in dmesg on some knfsd
> > servers:
> >
> > xdr_buf_to_bvec: bio_vec array overflow
> >
> > Usually followed by messages like this that indicate a short send (note
> > that this message is from an older kernel and the amount that it reports
> > attempting to send is short by 4 bytes):
> >
> > rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
> >
> > svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
> > marker. If the send is an unaligned READ call though, then there may not
> > be enough slots in the rq_bvec array in some cases.
> >
> > Add a slot to the rq_bvec array, and fix up the array lengths in the
> > callers that care.
> >
> > Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
> > Tested-by: Brandon Adams <brandona@meta.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/nfsd/vfs.c | 6 +++---
> > net/sunrpc/svc.c | 3 ++-
> > net/sunrpc/svcsock.c | 4 ++--
> > 3 files changed, 7 insertions(+), 6 deletions(-)
>
> I can't say that I'm liking this patch.
>
> There are 11 place where (in nfsd-testing recently) where
> rq_maxpages is used (as opposed to declared or assigned).
>
> 3 in nfsd/vfs.c
> 4 in sunrpc/svc.c
> 1 in sunrpc/svc_xprt.c
> 2 in sunrpc/svcsock.c
> 1 in xprtrdma/svc_rdma_rc.c
>
> Your patch changes six of those to add 1. I guess the others aren't
> "callers that care". It would help to have it clearly stated why, or
> why not, a caller might care.
>
> But also, what does "rq_maxpages" even mean now?
> The comment in svc.h still says "num of entries in rq_pages"
> which is certainly no longer the case.
> But if it was the case, we should have called it "rq_numpages"
> or similar.
> But maybe it wasn't meant to be the number of pages in the array,
> maybe it was meant to be the maximum number of pages is a request
> or a reply.....
> No - that is sv_max_mesg, to which we add 2 and 1.
> So I could ask "why not just add another 1 in svc_serv_maxpages()?"
> Would the callers that might not care be harmed if rq_maxpages were
> one larger than it is?
>
> It seems to me that rq_maxpages is rather confused and the bug you have
> found which requires this patch is some evidence to that confusion. We
> should fix the confusion, not just the bug.
>
> So simple question to cut through my waffle:
> Would this:
> - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>
> fix the problem. If not, why not? If so, can we just do this?
> then look at renaming rq_maxpages to rq_numpages and audit all the uses
> (and maybe you have already audited...).
Right, I recently wanted to do the same:
https://lore.kernel.org/linux-nfs/20250909233315.80318-2-snitzer@kernel.org/
Certainly cleaner and preferable to me.
Otherwise the +1 sprinkled around selectively is really prone to become
a problem for any new users of rq_maxpages.
Mike
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-08 21:51 ` NeilBrown
2025-10-09 0:08 ` Mike Snitzer
@ 2025-10-09 12:22 ` Jeff Layton
2025-10-10 0:10 ` NeilBrown
1 sibling, 1 reply; 11+ messages in thread
From: Jeff Layton @ 2025-10-09 12:22 UTC
To: NeilBrown
Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel
On Thu, 2025-10-09 at 08:51 +1100, NeilBrown wrote:
> On Thu, 09 Oct 2025, Jeff Layton wrote:
> > We've seen some occurrences of messages like this in dmesg on some knfsd
> > servers:
> >
> > xdr_buf_to_bvec: bio_vec array overflow
> >
> > Usually followed by messages like this that indicate a short send (note
> > that this message is from an older kernel and the amount that it reports
> > attempting to send is short by 4 bytes):
> >
> > rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
> >
> > svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
> > marker. If the send is an unaligned READ call though, then there may not
> > be enough slots in the rq_bvec array in some cases.
> >
> > Add a slot to the rq_bvec array, and fix up the array lengths in the
> > callers that care.
> >
> > Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
> > Tested-by: Brandon Adams <brandona@meta.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/nfsd/vfs.c | 6 +++---
> > net/sunrpc/svc.c | 3 ++-
> > net/sunrpc/svcsock.c | 4 ++--
> > 3 files changed, 7 insertions(+), 6 deletions(-)
>
> I can't say that I'm liking this patch.
>
> There are 11 place where (in nfsd-testing recently) where
> rq_maxpages is used (as opposed to declared or assigned).
>
> 3 in nfsd/vfs.c
> 4 in sunrpc/svc.c
> 1 in sunrpc/svc_xprt.c
> 2 in sunrpc/svcsock.c
> 1 in xprtrdma/svc_rdma_rc.c
>
> Your patch changes six of those to add 1. I guess the others aren't
> "callers that care". It would help to have it clearly stated why, or
> why not, a caller might care.
>
> But also, what does "rq_maxpages" even mean now?
> The comment in svc.h still says "num of entries in rq_pages"
> which is certainly no longer the case.
> But if it was the case, we should have called it "rq_numpages"
> or similar.
> But maybe it wasn't meant to be the number of pages in the array,
> maybe it was meant to be the maximum number of pages is a request
> or a reply.....
> No - that is sv_max_mesg, to which we add 2 and 1.
> So I could ask "why not just add another 1 in svc_serv_maxpages()?"
> Would the callers that might not care be harmed if rq_maxpages were
> one larger than it is?
>
> It seems to me that rq_maxpages is rather confused and the bug you have
> found which requires this patch is some evidence to that confusion. We
> should fix the confusion, not just the bug.
>
> So simple question to cut through my waffle:
> Would this:
> - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>
> fix the problem. If not, why not? If so, can we just do this?
> then look at renaming rq_maxpages to rq_numpages and audit all the uses
> (and maybe you have already audited...).
>
I get the objection. I'm not crazy about all of the adjustments either.
rq_maxpages is used to size two fields in the rqstp: rq_pages and
rq_bvec. It turns out that they both want rq_maxpages + 1 slots. The
rq_pages array needs the extra slot for a NULL terminator, and rq_bvec
needs it for the TCP record marker.
The RPC code mostly ignores the last slot in the rq_pages array after it's
allocated, but we need rq_bvec to treat it like any other slot, hence
the adjustment here.
I looked at just doing what you suggest first. It would fix it, but at
the expense of keeping an extra page per nfsd thread. We could couple
your suggested fix with just not allocating that last rq_pages slot,
but we end up having to adjust more places than this change does. Also,
at that point, rq_maxpages is not _really_ the max number of pages.
Maybe what we need to do is move to a separate length field for
rq_bvec? We have some existing holes in svc_rqst that could hold one
and that would make the code more clear. I'll respin this and see how
that looks.
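Roughly something like this; the field name and its exact placement are
only a sketch at this point, not the eventual patch:

	/* in struct svc_rqst (other fields elided) */
	struct bio_vec		*rq_bvec;
	unsigned int		rq_bvec_len;	/* number of entries in rq_bvec */

	/* in svc_prepare_thread(): one extra entry for the TCP record marker */
	rqstp->rq_bvec_len = rqstp->rq_maxpages + 1;
	rqstp->rq_bvec = kcalloc_node(rqstp->rq_bvec_len,
				      sizeof(struct bio_vec),
				      GFP_KERNEL, node);

Callers would then pass rqstp->rq_bvec_len (or rq_bvec_len - 1 after the
marker slot) instead of open-coding adjustments to rq_maxpages.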
Thanks for the review!
>
>
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 77f6879c2e063fa79865100bbc2d1e64eb332f42..c4e9300d657cf7fdba23f2f4e4bdaad9cd99d1a3 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -1111,7 +1111,7 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >
> > v = 0;
> > total = dio_end - dio_start;
> > - while (total && v < rqstp->rq_maxpages &&
> > + while (total && v < rqstp->rq_maxpages + 1 &&
> > rqstp->rq_next_page < rqstp->rq_page_end) {
> > len = min_t(size_t, total, PAGE_SIZE);
> > bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> > @@ -1200,7 +1200,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >
> > v = 0;
> > total = *count;
> > - while (total && v < rqstp->rq_maxpages &&
> > + while (total && v < rqstp->rq_maxpages + 1 &&
> > rqstp->rq_next_page < rqstp->rq_page_end) {
> > len = min_t(size_t, total, PAGE_SIZE - base);
> > bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> > @@ -1318,7 +1318,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > if (stable && !fhp->fh_use_wgather)
> > kiocb.ki_flags |= IOCB_DSYNC;
> >
> > - nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> > + nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, payload);
> > iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > since = READ_ONCE(file->f_wb_err);
> > if (verf)
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index 4704dce7284eccc9e2bc64cf22947666facfa86a..919263a0c04e3f1afa607414bc1893ba02206e38 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -706,7 +706,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > if (!svc_init_buffer(rqstp, serv, node))
> > goto out_enomem;
> >
> > - rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages,
> > + /* +1 for the TCP record marker */
> > + rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages + 1,
> > sizeof(struct bio_vec),
> > GFP_KERNEL, node);
> > if (!rqstp->rq_bvec)
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 377fcaaaa061463fc5c85fc09c7a8eab5e06af77..5f8bb11b686bcd7302b94476490ba9b1b9ddc06a 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -740,7 +740,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
> > if (svc_xprt_is_dead(xprt))
> > goto out_notconn;
> >
> > - count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, xdr);
> > + count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, xdr);
> >
> > iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
> > count, rqstp->rq_res.len);
> > @@ -1244,7 +1244,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
> > memcpy(buf, &marker, sizeof(marker));
> > bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
> >
> > - count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages - 1,
> > + count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
> > &rqstp->rq_res);
> >
> > iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
> >
> > --
> > 2.51.0
> >
> >
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-09 12:22 ` Jeff Layton
@ 2025-10-10 0:10 ` NeilBrown
2025-10-10 10:25 ` Jeff Layton
2025-10-10 12:47 ` Chuck Lever
0 siblings, 2 replies; 11+ messages in thread
From: NeilBrown @ 2025-10-10 0:10 UTC
To: Jeff Layton
Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel
On Thu, 09 Oct 2025, Jeff Layton wrote:
> On Thu, 2025-10-09 at 08:51 +1100, NeilBrown wrote:
> > On Thu, 09 Oct 2025, Jeff Layton wrote:
> > > We've seen some occurrences of messages like this in dmesg on some knfsd
> > > servers:
> > >
> > > xdr_buf_to_bvec: bio_vec array overflow
> > >
> > > Usually followed by messages like this that indicate a short send (note
> > > that this message is from an older kernel and the amount that it reports
> > > attempting to send is short by 4 bytes):
> > >
> > > rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
> > >
> > > svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
> > > marker. If the send is an unaligned READ call though, then there may not
> > > be enough slots in the rq_bvec array in some cases.
> > >
> > > Add a slot to the rq_bvec array, and fix up the array lengths in the
> > > callers that care.
> > >
> > > Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
> > > Tested-by: Brandon Adams <brandona@meta.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > fs/nfsd/vfs.c | 6 +++---
> > > net/sunrpc/svc.c | 3 ++-
> > > net/sunrpc/svcsock.c | 4 ++--
> > > 3 files changed, 7 insertions(+), 6 deletions(-)
> >
> > I can't say that I'm liking this patch.
> >
> > There are 11 place where (in nfsd-testing recently) where
> > rq_maxpages is used (as opposed to declared or assigned).
> >
> > 3 in nfsd/vfs.c
> > 4 in sunrpc/svc.c
> > 1 in sunrpc/svc_xprt.c
> > 2 in sunrpc/svcsock.c
> > 1 in xprtrdma/svc_rdma_rc.c
> >
> > Your patch changes six of those to add 1. I guess the others aren't
> > "callers that care". It would help to have it clearly stated why, or
> > why not, a caller might care.
> >
> > But also, what does "rq_maxpages" even mean now?
> > The comment in svc.h still says "num of entries in rq_pages"
> > which is certainly no longer the case.
> > But if it was the case, we should have called it "rq_numpages"
> > or similar.
> > But maybe it wasn't meant to be the number of pages in the array,
> > maybe it was meant to be the maximum number of pages is a request
> > or a reply.....
> > No - that is sv_max_mesg, to which we add 2 and 1.
> > So I could ask "why not just add another 1 in svc_serv_maxpages()?"
> > Would the callers that might not care be harmed if rq_maxpages were
> > one larger than it is?
> >
> > It seems to me that rq_maxpages is rather confused and the bug you have
> > found which requires this patch is some evidence to that confusion. We
> > should fix the confusion, not just the bug.
> >
> > So simple question to cut through my waffle:
> > Would this:
> > - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> > + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> >
> > fix the problem. If not, why not? If so, can we just do this?
> > then look at renaming rq_maxpages to rq_numpages and audit all the uses
> > (and maybe you have already audited...).
> >
>
> I get the objection. I'm not crazy about all of the adjustments either.
>
> rq_maxpages is used to size two fields in the rqstp: rq_pages and
> rq_bvec. It turns out that they both want rq_maxpages + 1 slots. The
> rq_pages array needs the extra slot for a NULL terminator, and rq_bvec
> needs it for the TCP record marker.
Somehow the above para helped a lot for me to understand what the issue
is here - thanks.
rq_bvec is used for two quite separate purposes.
nfsd/vfs.c uses it to assemble read/write requests to send to the
filesystem.
sunrpc/svcsock.c uses it to assemble send/recv requests to send to the
network.
It might help me if this were documented clearly in svc.h as I seem to
have had to discover several times now :-(
Should these even use the same rq_bvec? I guess it makes sense to share
but we should be cautious about letting the needs of one side infect the
code of the other side.
So if we increase the size of rq_bvec to meet the needs of svcsock.c, do
we need to make *any* code changes to vfs.c? I doubt it.
It bothers me a little bit that svc_tcp_sendmsg() needs to allocate a
frag. But given that it does, could it also allocate a larger bvec if
rq_bvec isn't big enough?
Or should svc_tcp_recvfrom() allocate the frag and make sure the bvec is
big enough ......
Or svc_alloc_arg() could check with each active transport for any
preallocation requirements...
Or svc_create_socket() could update some "bvec_size" field in svc_serv
which svc_alloc_arg() could check and possibly realloc rq_bvec.
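To make that last one concrete, a purely illustrative sketch --
sv_bvec_size and rq_bvec_count are made-up names, and locking/error
handling is elided:

	/* svc_create_socket(): a stream transport wants one extra
	 * entry for the record marker */
	serv->sv_bvec_size = max(serv->sv_bvec_size,
				 svc_serv_maxpages(serv) + 1);

	/* svc_alloc_arg(): grow this thread's array if a transport has
	 * raised the requirement since rq_bvec was allocated */
	if (rqstp->rq_bvec_count < serv->sv_bvec_size) {
		struct bio_vec *bvec;

		bvec = krealloc_array(rqstp->rq_bvec, serv->sv_bvec_size,
				      sizeof(*bvec), GFP_KERNEL);
		if (bvec) {
			rqstp->rq_bvec = bvec;
			rqstp->rq_bvec_count = serv->sv_bvec_size;
		}
	}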
I'm rambling a bit here. I agree with Chuck (and you) that it would be
nice if this need for a larger bvec were kept local to svcsock code if
possible.
But I'm fairly confident that the current problem doesn't justify any
changes to vfs.c. svc.c probably needs to somehow be involved in
rq_bvec being bigger and svcsock.c certainly needs to be able to make
use of the extra space, but that seems to be all that is required.
Thanks,
NeilBrown
>
> The RPC code mostly ignores the last slot in rq_pages array after it's
> allocated, but we need rq_bvec to treat it like any other slot, hence
> the adjustment here.
>
> I looked at just doing what you suggest first. It would fix it, but at
> the expense of keeping an extra page per nfsd thread. We could couple
> your suggested fix with just not allocating that last rq_pages slot,
> but we end up having to adjust more places than this change does. Also,
> at that point, rq_maxpages is not _really_ the max number of pages.
>
> Maybe what we need to do is move to a separate length field for
> rq_bvec? We have some existing holes in svc_rqst that could hold one
> and that would make the code more clear. I'll respin this and see how
> that looks.
>
> Thanks for the review!
>
> >
> >
> > >
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 77f6879c2e063fa79865100bbc2d1e64eb332f42..c4e9300d657cf7fdba23f2f4e4bdaad9cd99d1a3 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -1111,7 +1111,7 @@ nfsd_direct_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > v = 0;
> > > total = dio_end - dio_start;
> > > - while (total && v < rqstp->rq_maxpages &&
> > > + while (total && v < rqstp->rq_maxpages + 1 &&
> > > rqstp->rq_next_page < rqstp->rq_page_end) {
> > > len = min_t(size_t, total, PAGE_SIZE);
> > > bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> > > @@ -1200,7 +1200,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >
> > > v = 0;
> > > total = *count;
> > > - while (total && v < rqstp->rq_maxpages &&
> > > + while (total && v < rqstp->rq_maxpages + 1 &&
> > > rqstp->rq_next_page < rqstp->rq_page_end) {
> > > len = min_t(size_t, total, PAGE_SIZE - base);
> > > bvec_set_page(&rqstp->rq_bvec[v], *rqstp->rq_next_page,
> > > @@ -1318,7 +1318,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > > if (stable && !fhp->fh_use_wgather)
> > > kiocb.ki_flags |= IOCB_DSYNC;
> > >
> > > - nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, payload);
> > > + nvecs = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, payload);
> > > iov_iter_bvec(&iter, ITER_SOURCE, rqstp->rq_bvec, nvecs, *cnt);
> > > since = READ_ONCE(file->f_wb_err);
> > > if (verf)
> > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > index 4704dce7284eccc9e2bc64cf22947666facfa86a..919263a0c04e3f1afa607414bc1893ba02206e38 100644
> > > --- a/net/sunrpc/svc.c
> > > +++ b/net/sunrpc/svc.c
> > > @@ -706,7 +706,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > if (!svc_init_buffer(rqstp, serv, node))
> > > goto out_enomem;
> > >
> > > - rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages,
> > > + /* +1 for the TCP record marker */
> > > + rqstp->rq_bvec = kcalloc_node(rqstp->rq_maxpages + 1,
> > > sizeof(struct bio_vec),
> > > GFP_KERNEL, node);
> > > if (!rqstp->rq_bvec)
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index 377fcaaaa061463fc5c85fc09c7a8eab5e06af77..5f8bb11b686bcd7302b94476490ba9b1b9ddc06a 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -740,7 +740,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
> > > if (svc_xprt_is_dead(xprt))
> > > goto out_notconn;
> > >
> > > - count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages, xdr);
> > > + count = xdr_buf_to_bvec(rqstp->rq_bvec, rqstp->rq_maxpages + 1, xdr);
> > >
> > > iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
> > > count, rqstp->rq_res.len);
> > > @@ -1244,7 +1244,7 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
> > > memcpy(buf, &marker, sizeof(marker));
> > > bvec_set_virt(rqstp->rq_bvec, buf, sizeof(marker));
> > >
> > > - count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages - 1,
> > > + count = xdr_buf_to_bvec(rqstp->rq_bvec + 1, rqstp->rq_maxpages,
> > > &rqstp->rq_res);
> > >
> > > iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, rqstp->rq_bvec,
> > >
> > > --
> > > 2.51.0
> > >
> > >
>
> --
> Jeff Layton <jlayton@kernel.org>
>
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-10 0:10 ` NeilBrown
@ 2025-10-10 10:25 ` Jeff Layton
2025-10-10 10:47 ` NeilBrown
2025-10-10 12:47 ` Chuck Lever
1 sibling, 1 reply; 11+ messages in thread
From: Jeff Layton @ 2025-10-10 10:25 UTC
To: NeilBrown
Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel
On Fri, 2025-10-10 at 11:10 +1100, NeilBrown wrote:
> On Thu, 09 Oct 2025, Jeff Layton wrote:
> > On Thu, 2025-10-09 at 08:51 +1100, NeilBrown wrote:
> > > On Thu, 09 Oct 2025, Jeff Layton wrote:
> > > > We've seen some occurrences of messages like this in dmesg on some knfsd
> > > > servers:
> > > >
> > > > xdr_buf_to_bvec: bio_vec array overflow
> > > >
> > > > Usually followed by messages like this that indicate a short send (note
> > > > that this message is from an older kernel and the amount that it reports
> > > > attempting to send is short by 4 bytes):
> > > >
> > > > rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
> > > >
> > > > svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
> > > > marker. If the send is an unaligned READ call though, then there may not
> > > > be enough slots in the rq_bvec array in some cases.
> > > >
> > > > Add a slot to the rq_bvec array, and fix up the array lengths in the
> > > > callers that care.
> > > >
> > > > Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
> > > > Tested-by: Brandon Adams <brandona@meta.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > > fs/nfsd/vfs.c | 6 +++---
> > > > net/sunrpc/svc.c | 3 ++-
> > > > net/sunrpc/svcsock.c | 4 ++--
> > > > 3 files changed, 7 insertions(+), 6 deletions(-)
> > >
> > > I can't say that I'm liking this patch.
> > >
> > > There are 11 place where (in nfsd-testing recently) where
> > > rq_maxpages is used (as opposed to declared or assigned).
> > >
> > > 3 in nfsd/vfs.c
> > > 4 in sunrpc/svc.c
> > > 1 in sunrpc/svc_xprt.c
> > > 2 in sunrpc/svcsock.c
> > > 1 in xprtrdma/svc_rdma_rc.c
> > >
> > > Your patch changes six of those to add 1. I guess the others aren't
> > > "callers that care". It would help to have it clearly stated why, or
> > > why not, a caller might care.
> > >
> > > But also, what does "rq_maxpages" even mean now?
> > > The comment in svc.h still says "num of entries in rq_pages"
> > > which is certainly no longer the case.
> > > But if it was the case, we should have called it "rq_numpages"
> > > or similar.
> > > But maybe it wasn't meant to be the number of pages in the array,
> > > maybe it was meant to be the maximum number of pages is a request
> > > or a reply.....
> > > No - that is sv_max_mesg, to which we add 2 and 1.
> > > So I could ask "why not just add another 1 in svc_serv_maxpages()?"
> > > Would the callers that might not care be harmed if rq_maxpages were
> > > one larger than it is?
> > >
> > > It seems to me that rq_maxpages is rather confused and the bug you have
> > > found which requires this patch is some evidence to that confusion. We
> > > should fix the confusion, not just the bug.
> > >
> > > So simple question to cut through my waffle:
> > > Would this:
> > > - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> > > + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> > >
> > > fix the problem. If not, why not? If so, can we just do this?
> > > then look at renaming rq_maxpages to rq_numpages and audit all the uses
> > > (and maybe you have already audited...).
> > >
> >
> > I get the objection. I'm not crazy about all of the adjustments either.
> >
> > rq_maxpages is used to size two fields in the rqstp: rq_pages and
> > rq_bvec. It turns out that they both want rq_maxpages + 1 slots. The
> > rq_pages array needs the extra slot for a NULL terminator, and rq_bvec
> > needs it for the TCP record marker.
>
> Somehow the above para helped a lot for me to understand what the issue
> is here - thanks.
>
> rq_bvec is used for two quite separate purposes.
>
> nfsd/vfs.c uses it to assemble read/write requests to send to the
> filesystem.
> sunrpc/svcsock.c uses to assemble send/recv requests to send to the
> network.
>
> It might help me if this were documented clearly in svc.h as I seem to
> have had to discover several times now :-(
>
> Should these even use the same rq_bvec? I guess it makes sense to share
> but we should be cautious about letting the needs of one side infect the
> code of the other side.
>
> So if we increase the size of rq_bvec to meet the needs of svcsock.c, do
> we need to make *any* code changes to vfs.c? I doubt it.
>
> It bothers me a little bit that svc_tcp_sendmsg() needs to allocate a
> frag. But given that it does, could it also allocate a larger bvec if
> rq_bvec isn't big enough?
>
> Or should svc_tcp_recvfrom() allocate the frag and make sure the bvec is
> big enough ......
> Or svc_alloc_arg() could check with each active transport for any
> preallocation requirements...
> Or svc_create_socket() could update some "bvec_size" field in svc_serv
> which svc_alloc_arg() could check an possibly realloc rq_bvec.
>
> I'm rambling a bit here. I agree with Chuck (and you) that it would be
> nice if this need for a larger bvec were kept local to svcsock code if
> possible.
>
> But I'm fairly confident that the current problem doesn't justify any
> changes to vfs.c. svc.c probably needs to somehow be involved in
> rq_bvec being bigger and svcsock.c certainly needs to be able to make
> use of the extra space, but that seems to be all that is required.
>
I sent a v3 patch which adds a separate rq_bvec_len field and uses that
in the places where the code is iterating over the rq_bvec. That does
change places in vfs.c, but I think it makes the code clearer. Are you
OK with that version?
Thanks,
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-10 10:25 ` Jeff Layton
@ 2025-10-10 10:47 ` NeilBrown
0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2025-10-10 10:47 UTC
To: Jeff Layton
Cc: Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Trond Myklebust, Anna Schumaker, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, David Howells,
Brandon Adams, linux-nfs, netdev, linux-kernel
On Fri, 10 Oct 2025, Jeff Layton wrote:
> On Fri, 2025-10-10 at 11:10 +1100, NeilBrown wrote:
> > On Thu, 09 Oct 2025, Jeff Layton wrote:
> > > On Thu, 2025-10-09 at 08:51 +1100, NeilBrown wrote:
> > > > On Thu, 09 Oct 2025, Jeff Layton wrote:
> > > > > We've seen some occurrences of messages like this in dmesg on some knfsd
> > > > > servers:
> > > > >
> > > > > xdr_buf_to_bvec: bio_vec array overflow
> > > > >
> > > > > Usually followed by messages like this that indicate a short send (note
> > > > > that this message is from an older kernel and the amount that it reports
> > > > > attempting to send is short by 4 bytes):
> > > > >
> > > > > rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
> > > > >
> > > > > svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
> > > > > marker. If the send is an unaligned READ call though, then there may not
> > > > > be enough slots in the rq_bvec array in some cases.
> > > > >
> > > > > Add a slot to the rq_bvec array, and fix up the array lengths in the
> > > > > callers that care.
> > > > >
> > > > > Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
> > > > > Tested-by: Brandon Adams <brandona@meta.com>
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > > fs/nfsd/vfs.c | 6 +++---
> > > > > net/sunrpc/svc.c | 3 ++-
> > > > > net/sunrpc/svcsock.c | 4 ++--
> > > > > 3 files changed, 7 insertions(+), 6 deletions(-)
> > > >
> > > > I can't say that I'm liking this patch.
> > > >
> > > > There are 11 place where (in nfsd-testing recently) where
> > > > rq_maxpages is used (as opposed to declared or assigned).
> > > >
> > > > 3 in nfsd/vfs.c
> > > > 4 in sunrpc/svc.c
> > > > 1 in sunrpc/svc_xprt.c
> > > > 2 in sunrpc/svcsock.c
> > > > 1 in xprtrdma/svc_rdma_rc.c
> > > >
> > > > Your patch changes six of those to add 1. I guess the others aren't
> > > > "callers that care". It would help to have it clearly stated why, or
> > > > why not, a caller might care.
> > > >
> > > > But also, what does "rq_maxpages" even mean now?
> > > > The comment in svc.h still says "num of entries in rq_pages"
> > > > which is certainly no longer the case.
> > > > But if it was the case, we should have called it "rq_numpages"
> > > > or similar.
> > > > But maybe it wasn't meant to be the number of pages in the array,
> > > > maybe it was meant to be the maximum number of pages is a request
> > > > or a reply.....
> > > > No - that is sv_max_mesg, to which we add 2 and 1.
> > > > So I could ask "why not just add another 1 in svc_serv_maxpages()?"
> > > > Would the callers that might not care be harmed if rq_maxpages were
> > > > one larger than it is?
> > > >
> > > > It seems to me that rq_maxpages is rather confused and the bug you have
> > > > found which requires this patch is some evidence to that confusion. We
> > > > should fix the confusion, not just the bug.
> > > >
> > > > So simple question to cut through my waffle:
> > > > Would this:
> > > > - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
> > > > + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
> > > >
> > > > fix the problem. If not, why not? If so, can we just do this?
> > > > then look at renaming rq_maxpages to rq_numpages and audit all the uses
> > > > (and maybe you have already audited...).
> > > >
> > >
> > > I get the objection. I'm not crazy about all of the adjustments either.
> > >
> > > rq_maxpages is used to size two fields in the rqstp: rq_pages and
> > > rq_bvec. It turns out that they both want rq_maxpages + 1 slots. The
> > > rq_pages array needs the extra slot for a NULL terminator, and rq_bvec
> > > needs it for the TCP record marker.
> >
> > Somehow the above para helped a lot for me to understand what the issue
> > is here - thanks.
> >
> > rq_bvec is used for two quite separate purposes.
> >
> > nfsd/vfs.c uses it to assemble read/write requests to send to the
> > filesystem.
> > sunrpc/svcsock.c uses to assemble send/recv requests to send to the
> > network.
> >
> > It might help me if this were documented clearly in svc.h as I seem to
> > have had to discover several times now :-(
> >
> > Should these even use the same rq_bvec? I guess it makes sense to share
> > but we should be cautious about letting the needs of one side infect the
> > code of the other side.
> >
> > So if we increase the size of rq_bvec to meet the needs of svcsock.c, do
> > we need to make *any* code changes to vfs.c? I doubt it.
> >
> > It bothers me a little bit that svc_tcp_sendmsg() needs to allocate a
> > frag. But given that it does, could it also allocate a larger bvec if
> > rq_bvec isn't big enough?
> >
> > Or should svc_tcp_recvfrom() allocate the frag and make sure the bvec is
> > big enough ......
> > Or svc_alloc_arg() could check with each active transport for any
> > preallocation requirements...
> > Or svc_create_socket() could update some "bvec_size" field in svc_serv
> > which svc_alloc_arg() could check an possibly realloc rq_bvec.
> >
> > I'm rambling a bit here. I agree with Chuck (and you) that it would be
> > nice if this need for a larger bvec were kept local to svcsock code if
> > possible.
> >
> > But I'm fairly confident that the current problem doesn't justify any
> > changes to vfs.c. svc.c probably needs to somehow be involved in
> > rq_bvec being bigger and svcsock.c certainly needs to be able to make
> > use of the extra space, but that seems to be all that is required.
> >
>
> I sent a v3 patch which adds a separate rq_bvec_len field and uses that
> in the places where the code is iterating over the rq_bvec. That does
> change places in vfs.c, but I think it makes the code clearer. Are you
> OK with that version?
Hmmm.... yes, I think so. The bug itself doesn't really need vfs.c to
be changed. But the deeper bug is that rq_maxpages is being overloaded
to mean multiple things and you are separating rq_bvec_len out from that
meaning, and consistently using rq_bvec_len whenever the length of the
rq_bvec is needed. So in that sense vfs.c does need to be changed.
The commit comment could be improved though. I'll reply separately.
Thanks,
NeilBrown
* Re: [PATCH v2 2/2] sunrpc: add a slot to rqstp->rq_bvec for TCP record marker
2025-10-10 0:10 ` NeilBrown
2025-10-10 10:25 ` Jeff Layton
@ 2025-10-10 12:47 ` Chuck Lever
1 sibling, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2025-10-10 12:47 UTC
To: NeilBrown, Jeff Layton
Cc: Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, David Howells, Brandon Adams,
linux-nfs, netdev, linux-kernel
On 10/9/25 8:10 PM, NeilBrown wrote:
> On Thu, 09 Oct 2025, Jeff Layton wrote:
>> On Thu, 2025-10-09 at 08:51 +1100, NeilBrown wrote:
>>> On Thu, 09 Oct 2025, Jeff Layton wrote:
>>>> We've seen some occurrences of messages like this in dmesg on some knfsd
>>>> servers:
>>>>
>>>> xdr_buf_to_bvec: bio_vec array overflow
>>>>
>>>> Usually followed by messages like this that indicate a short send (note
>>>> that this message is from an older kernel and the amount that it reports
>>>> attempting to send is short by 4 bytes):
>>>>
>>>> rpc-srv/tcp: nfsd: sent 1048155 when sending 1048152 bytes - shutting down socket
>>>>
>>>> svc_tcp_sendmsg() steals a slot in the rq_bvec array for the TCP record
>>>> marker. If the send is an unaligned READ call though, then there may not
>>>> be enough slots in the rq_bvec array in some cases.
>>>>
>>>> Add a slot to the rq_bvec array, and fix up the array lengths in the
>>>> callers that care.
>>>>
>>>> Fixes: e18e157bb5c8 ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
>>>> Tested-by: Brandon Adams <brandona@meta.com>
>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>> ---
>>>> fs/nfsd/vfs.c | 6 +++---
>>>> net/sunrpc/svc.c | 3 ++-
>>>> net/sunrpc/svcsock.c | 4 ++--
>>>> 3 files changed, 7 insertions(+), 6 deletions(-)
>>>
>>> I can't say that I'm liking this patch.
>>>
>>> There are 11 place where (in nfsd-testing recently) where
>>> rq_maxpages is used (as opposed to declared or assigned).
>>>
>>> 3 in nfsd/vfs.c
>>> 4 in sunrpc/svc.c
>>> 1 in sunrpc/svc_xprt.c
>>> 2 in sunrpc/svcsock.c
>>> 1 in xprtrdma/svc_rdma_rc.c
>>>
>>> Your patch changes six of those to add 1. I guess the others aren't
>>> "callers that care". It would help to have it clearly stated why, or
>>> why not, a caller might care.
>>>
>>> But also, what does "rq_maxpages" even mean now?
>>> The comment in svc.h still says "num of entries in rq_pages"
>>> which is certainly no longer the case.
>>> But if it was the case, we should have called it "rq_numpages"
>>> or similar.
>>> But maybe it wasn't meant to be the number of pages in the array,
>>> maybe it was meant to be the maximum number of pages is a request
>>> or a reply.....
>>> No - that is sv_max_mesg, to which we add 2 and 1.
>>> So I could ask "why not just add another 1 in svc_serv_maxpages()?"
>>> Would the callers that might not care be harmed if rq_maxpages were
>>> one larger than it is?
>>>
>>> It seems to me that rq_maxpages is rather confused and the bug you have
>>> found which requires this patch is some evidence to that confusion. We
>>> should fix the confusion, not just the bug.
>>>
>>> So simple question to cut through my waffle:
>>> Would this:
>>> - return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1;
>>> + return DIV_ROUND_UP(serv->sv_max_mesg, PAGE_SIZE) + 2 + 1 + 1;
>>>
>>> fix the problem. If not, why not? If so, can we just do this?
>>> then look at renaming rq_maxpages to rq_numpages and audit all the uses
>>> (and maybe you have already audited...).
>>>
>>
>> I get the objection. I'm not crazy about all of the adjustments either.
>>
>> rq_maxpages is used to size two fields in the rqstp: rq_pages and
>> rq_bvec. It turns out that they both want rq_maxpages + 1 slots. The
>> rq_pages array needs the extra slot for a NULL terminator, and rq_bvec
>> needs it for the TCP record marker.
>
> Somehow the above para helped a lot for me to understand what the issue
> is here - thanks.
>
> rq_bvec is used for two quite separate purposes.
>
> nfsd/vfs.c uses it to assemble read/write requests to send to the
> filesystem.
> sunrpc/svcsock.c uses to assemble send/recv requests to send to the
> network.
>
> It might help me if this were documented clearly in svc.h as I seem to
> have had to discover several times now :-(
>
> Should these even use the same rq_bvec?
Perhaps not, now that you point out that these are two quite independent
use cases.
It might make sense for svcsock.c to allocate a bvec array for each
svc_sock that is the size needed for the given socket type. (UDP doesn't
need a record marker and is limited to 64KB per send, for instance).
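Something like this, very roughly (sk_bvec/sk_bvec_count would be new,
hypothetical fields in struct svc_sock, and the sizing is illustrative
only):

	/* in svc_setup_socket(), once the socket type is known */
	if (sock->type == SOCK_STREAM)
		/* whole reply plus one entry for the record marker */
		svsk->sk_bvec_count = svc_serv_maxpages(serv) + 1;
	else
		/* UDP is bounded by the 64KB datagram limit */
		svsk->sk_bvec_count = DIV_ROUND_UP(64 * 1024, PAGE_SIZE) + 2;

	svsk->sk_bvec = kcalloc(svsk->sk_bvec_count,
				sizeof(*svsk->sk_bvec), GFP_KERNEL);
	/* (allocation failure handling elided) */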
> I guess it makes sense to share
> but we should be cautious about letting the needs of one side infect the
> code of the other side.
--
Chuck Lever