All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] NFSD: Relocate the xdr_reserve_space_vec() call site
@ 2025-09-24 19:51 Chuck Lever
  2025-09-26  1:13 ` NeilBrown
  0 siblings, 1 reply; 3+ messages in thread
From: Chuck Lever @ 2025-09-24 19:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

In order to detect when a direct READ is possible, we need the send
buffer's .page_len to be zero when there is nothing in the buffer's
.pages array yet.

However, when xdr_reserve_space_vec() extends the size of the
xdr_stream to accommodate a READ payload, it adds to the send
buffer's .page_len.

It should be safe to reserve the stream space /after/ the VFS read
operation completes. This is, for example, how an NFSv3 READ works:
the VFS read goes into the rq_bvec, and is then added to the send
xdr_stream later by svcxdr_encode_opaque_pages().

Nit: we could probably get rid of the xdr_truncate_encode(), now
that maxcount reflects the actual count of bytes returned by the
file system.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4xdr.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Here's a brain-dead idea.


diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 41f0d54d6e1b..e3efc7d24aa5 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4475,19 +4475,32 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
 	__be32 zero = xdr_zero;
 	__be32 nfserr;
 
-	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
-		return nfserr_resource;
-
 	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
 				read->rd_offset, &maxcount, base,
 				&read->rd_eof);
 	read->rd_length = maxcount;
 	if (nfserr)
 		return nfserr;
+
+	/*
+	 * svcxdr_encode_opaque_pages() is not used here because
+	 * we don't want to encode subsequent results in this
+	 * COMPOUND into the xdr->buf's tail, but rather those
+	 * results should follow the NFS READ payload in the
+	 * buf's pages.
+	 */
+	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
+		return nfserr_resource;
+
+	/*
+	 * Mark the buffer location of the NFS READ payload so that
+	 * direct placement-capable transports send only the
+	 * payload bytes out-of-band.
+	 */
 	if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
 		return nfserr_io;
-	xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
 
+	xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
 	write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
 			       xdr_pad_size(maxcount));
 	return nfs_ok;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH] NFSD: Relocate the xdr_reserve_space_vec() call site
  2025-09-24 19:51 [RFC PATCH] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
@ 2025-09-26  1:13 ` NeilBrown
  2025-09-26 14:31   ` Chuck Lever
  0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2025-09-26  1:13 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever

On Thu, 25 Sep 2025, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> In order to detect when a direct READ is possible, we need the send
> buffer's .page_len to be zero when there is nothing in the buffer's
> .pages array yet.
> 
> However, when xdr_reserve_space_vec() extends the size of the
> xdr_stream to accommodate a READ payload, it adds to the send
> buffer's .page_len.
> 
> It should be safe to reserve the stream space /after/ the VFS read
> operation completes. This is, for example, how an NFSv3 READ works:
> the VFS read goes into the rq_bvec, and is then added to the send
> xdr_stream later by svcxdr_encode_opaque_pages().

"This is .. how an NFSv3 READ works" is the part of this that stands out
for me.  Increasing the consistence between v3 and v4 must be good.

v2 and v3 readlink, readdir, read; and v4 splice_read
all use svcxdr_encode_opaque_pages().

These are precisely the operations where there is precisely one
page-like component of the reply.  For other v4 operations there are a
mix of page-like and non-pagelike elements.  They use the more
traditional xdr_reserve_space() and xdr_truncate_encode.

direct_read is more like splice_read than it is like iter_read.
This is because there can be only one page-like element.
So it isn't clear to me that it should be integrated with
nfsd_iter_read().  It is quite different from splice_read too.

However I think for the v4 case, direct read fits better in
nsfd4_encode_splice_read() than it does in nfsd4_encode_readv().
In both cases the READ is the only OP that can used the page vec - all
other replies have to fit in the header page.

NeilBrown


> 
> Nit: we could probably get rid of the xdr_truncate_encode(), now
> that maxcount reflects the actual count of bytes returned by the
> file system.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfs4xdr.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> Here's a brain-dead idea.
> 
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 41f0d54d6e1b..e3efc7d24aa5 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -4475,19 +4475,32 @@ static __be32 nfsd4_encode_readv(struct nfsd4_compoundres *resp,
>  	__be32 zero = xdr_zero;
>  	__be32 nfserr;
>  
> -	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
> -		return nfserr_resource;
> -
>  	nfserr = nfsd_iter_read(resp->rqstp, read->rd_fhp, read->rd_nf,
>  				read->rd_offset, &maxcount, base,
>  				&read->rd_eof);
>  	read->rd_length = maxcount;
>  	if (nfserr)
>  		return nfserr;
> +
> +	/*
> +	 * svcxdr_encode_opaque_pages() is not used here because
> +	 * we don't want to encode subsequent results in this
> +	 * COMPOUND into the xdr->buf's tail, but rather those
> +	 * results should follow the NFS READ payload in the
> +	 * buf's pages.
> +	 */
> +	if (xdr_reserve_space_vec(xdr, maxcount) < 0)
> +		return nfserr_resource;
> +
> +	/*
> +	 * Mark the buffer location of the NFS READ payload so that
> +	 * direct placement-capable transports send only the
> +	 * payload bytes out-of-band.
> +	 */
>  	if (svc_encode_result_payload(resp->rqstp, starting_len, maxcount))
>  		return nfserr_io;
> -	xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
>  
> +	xdr_truncate_encode(xdr, starting_len + xdr_align_size(maxcount));
>  	write_bytes_to_xdr_buf(xdr->buf, starting_len + maxcount, &zero,
>  			       xdr_pad_size(maxcount));
>  	return nfs_ok;
> -- 
> 2.51.0
> 
> 


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH] NFSD: Relocate the xdr_reserve_space_vec() call site
  2025-09-26  1:13 ` NeilBrown
@ 2025-09-26 14:31   ` Chuck Lever
  0 siblings, 0 replies; 3+ messages in thread
From: Chuck Lever @ 2025-09-26 14:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever

On 9/25/25 6:13 PM, NeilBrown wrote:
> On Thu, 25 Sep 2025, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> In order to detect when a direct READ is possible, we need the send
>> buffer's .page_len to be zero when there is nothing in the buffer's
>> .pages array yet.
>>
>> However, when xdr_reserve_space_vec() extends the size of the
>> xdr_stream to accommodate a READ payload, it adds to the send
>> buffer's .page_len.
>>
>> It should be safe to reserve the stream space /after/ the VFS read
>> operation completes. This is, for example, how an NFSv3 READ works:
>> the VFS read goes into the rq_bvec, and is then added to the send
>> xdr_stream later by svcxdr_encode_opaque_pages().
> "This is .. how an NFSv3 READ works" is the part of this that stands out
> for me.  Increasing the consistence between v3 and v4 must be good.
> 
> v2 and v3 readlink, readdir, read; and v4 splice_read
> all use svcxdr_encode_opaque_pages().
> 
> These are precisely the operations where there is precisely one
> page-like component of the reply.  For other v4 operations there are a
> mix of page-like and non-pagelike elements.  They use the more
> traditional xdr_reserve_space() and xdr_truncate_encode.
> 
> direct_read is more like splice_read than it is like iter_read.

But only in the NFSv4 case, and only in the sense of /when/ it may be
used, and the similarities are not exact (see below).

The logic of a direct read is identical to vectored and dontcache
reads in all cases -- build a bvec and call vfs_iocb_iter_read(). The
only reason it is split out today is because I want to get the direct
I/O implementation nailed down before de-duplicating the code.

And, we have the nice fallback behavior, currently, where NFSD tries
direct I/O; if that isn't possible, dontcache; and if that isn't
possible, vectored read, which is always possible. I believe there is
still situations where per-I/O fallback is reasonable.


> This is because there can be only one page-like element.
> So it isn't clear to me that it should be integrated with
> nfsd_iter_read().  It is quite different from splice_read too.
> 
> However I think for the v4 case, direct read fits better in
> nsfd4_encode_splice_read() than it does in nfsd4_encode_readv().
> In both cases the READ is the only OP that can used the page vec - all
> other replies have to fit in the header page.

There is the sticky part of this where GSS forbids the use of splice
read. In that case, all NFSv4 READ operations use nfsd4_encode_readv().
The first READ operation in a COMPOUND may still use direct I/O even
though NFSv4 READs do not use nfsd4_encode_splice_read() when krb5i or
krb5p is in use.

IMHO the direct read path still fits best (though I agree, not yet
perfectly) in nfsd_iter_read().

I'll post another revision of the series to continue the discussion.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-09-26 14:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-24 19:51 [RFC PATCH] NFSD: Relocate the xdr_reserve_space_vec() call site Chuck Lever
2025-09-26  1:13 ` NeilBrown
2025-09-26 14:31   ` Chuck Lever

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.