[PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT
@ 2025-07-03 19:53 Jeff Layton
  2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jeff Layton @ 2025-07-03 19:53 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer
  Cc: linux-nfs, linux-kernel, Jeff Layton

Chuck and I were discussing RWF_DONTCACHE and he suggested that this
might be an alternate approach. My main gripe with DONTCACHE was that it
kicks off writeback after every WRITE operation. With NFS, we generally
get a COMMIT operation at some point. Allowing us to batch up writes
until that point has traditionally been considered better for
performance.

Instead of RWF_DONTCACHE, this patch has nfsd issue generic_fadvise(...,
POSIX_FADV_DONTNEED) on the appropriate range after any READ, stable
WRITE or COMMIT operation. This means that it doesn't change how and
when dirty data gets flushed to the disk, but still keeps resident
pagecache to a minimum.

For reference, here are some numbers from a fio run doing sequential
reads and writes, with the server in "normal" buffered I/O mode, with
Mike's RWF_DONTCACHE patch enabled, and with fadvise(...DONTNEED).

Jobfile:

[global]
name=fio-seq-RW
filename=fio-seq-RW
rw=rw
rwmixread=60
rwmixwrite=40
bs=1M
direct=0
numjobs=16
time_based
runtime=300

[file1]
size=100G
ioengine=io_uring
iodepth=16

::::::::::::::::::::::::::::::::::::

3 runs each.

Baseline (nothing enabled):
Run status group 0 (all jobs):
   READ: bw=2999MiB/s (3144MB/s), 185MiB/s-189MiB/s (194MB/s-198MB/s), io=879GiB (944GB), run=300014-300087msec
  WRITE: bw=1998MiB/s (2095MB/s), 124MiB/s-126MiB/s (130MB/s-132MB/s), io=585GiB (629GB), run=300014-300087msec

   READ: bw=2866MiB/s (3005MB/s), 177MiB/s-181MiB/s (185MB/s-190MB/s), io=844GiB (906GB), run=301294-301463msec
  WRITE: bw=1909MiB/s (2002MB/s), 117MiB/s-121MiB/s (123MB/s-127MB/s), io=562GiB (604GB), run=301294-301463msec

   READ: bw=2885MiB/s (3026MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=846GiB (908GB), run=300017-300117msec
  WRITE: bw=1923MiB/s (2016MB/s), 118MiB/s-122MiB/s (124MB/s-128MB/s), io=563GiB (605GB), run=300017-300117msec

RWF_DONTCACHE:
Run status group 0 (all jobs):
   READ: bw=3088MiB/s (3238MB/s), 189MiB/s-195MiB/s (198MB/s-205MB/s), io=906GiB (972GB), run=300015-300276msec
  WRITE: bw=2058MiB/s (2158MB/s), 126MiB/s-129MiB/s (132MB/s-136MB/s), io=604GiB (648GB), run=300015-300276msec

   READ: bw=3116MiB/s (3267MB/s), 191MiB/s-197MiB/s (201MB/s-206MB/s), io=913GiB (980GB), run=300022-300074msec
  WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-137MB/s), io=609GiB (654GB), run=300022-300074msec

   READ: bw=3011MiB/s (3158MB/s), 185MiB/s-191MiB/s (194MB/s-200MB/s), io=886GiB (951GB), run=301049-301133msec
  WRITE: bw=2007MiB/s (2104MB/s), 123MiB/s-127MiB/s (129MB/s-133MB/s), io=590GiB (634GB), run=301049-301133msec

fadvise(..., POSIX_FADV_DONTNEED):
   READ: bw=2918MiB/s (3060MB/s), 180MiB/s-184MiB/s (188MB/s-193MB/s), io=855GiB (918GB), run=300014-300111msec
  WRITE: bw=1944MiB/s (2038MB/s), 120MiB/s-123MiB/s (125MB/s-129MB/s), io=570GiB (612GB), run=300014-300111msec

   READ: bw=2951MiB/s (3095MB/s), 182MiB/s-188MiB/s (191MB/s-197MB/s), io=867GiB (931GB), run=300529-300695msec
  WRITE: bw=1966MiB/s (2061MB/s), 121MiB/s-124MiB/s (127MB/s-130MB/s), io=577GiB (620GB), run=300529-300695msec

   READ: bw=2971MiB/s (3115MB/s), 181MiB/s-188MiB/s (190MB/s-197MB/s), io=871GiB (935GB), run=300015-300077msec
  WRITE: bw=1979MiB/s (2076MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=580GiB (623GB), run=300015-300077msec

::::::::::::::::::::::::::::::

The numbers are pretty close, but it looks like RWF_DONTCACHE edges out
the other modes. Also, with the RWF_DONTCACHE and fadvise() modes the
pagecache utilization stays very low on the server (which is of course,
the point).

I think next I'll test a hybrid mode. Use RWF_DONTCACHE for READ and
stable WRITE operations, and do the fadvise() only after COMMITs.

Plumbing this in for v4 will be "interesting" if we decide this approach
is sound, but it shouldn't be too bad if we only do it after a COMMIT.

Thoughts?

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (2):
      sunrpc: delay pc_release callback until after sending a reply
      nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT

 fs/nfsd/debugfs.c  |  2 ++
 fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/nfsd/nfsd.h     |  1 +
 fs/nfsd/nfsproc.c  |  4 ++--
 fs/nfsd/vfs.c      | 21 ++++++++++++++-----
 fs/nfsd/vfs.h      |  5 +++--
 fs/nfsd/xdr3.h     |  3 +++
 net/sunrpc/svc.c   | 19 ++++++++++++++----
 8 files changed, 92 insertions(+), 22 deletions(-)
---
base-commit: 38ddcbef7f4e9c5aa075c8ccf9f6d5293e027951
change-id: 20250701-nfsd-testing-12e7c8da5f1c

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply
  2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton
@ 2025-07-03 19:53 ` Jeff Layton
  2025-07-03 23:33   ` NeilBrown
  2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton
  2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown
  2 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2025-07-03 19:53 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer
  Cc: linux-nfs, linux-kernel, Jeff Layton

The server-side sunrpc code currently calls pc_release before sending
the reply. A later nfsd patch will change some pc_release callbacks to
do extra work to clean the pagecache. There is no need to delay sending
the reply for this, however.

Change svc_process and svc_process_bc to call pc_release after sending
the reply instead of before.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 net/sunrpc/svc.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index b1fab3a6954437cf751e4725fa52cfc83eddf2ab..103bb6ba8e140fdccd6cab124e715caeb41bb445 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1426,8 +1426,6 @@ svc_process_common(struct svc_rqst *rqstp)
 
 	/* Call the function that processes the request. */
 	rc = process.dispatch(rqstp);
-	if (procp->pc_release)
-		procp->pc_release(rqstp);
 	xdr_finish_decode(xdr);
 
 	if (!rc)
@@ -1526,6 +1524,14 @@ static void svc_drop(struct svc_rqst *rqstp)
 	trace_svc_drop(rqstp);
 }
 
+static void svc_release_rqst(struct svc_rqst *rqstp)
+{
+	const struct svc_procedure *procp = rqstp->rq_procinfo;
+
+	if (procp && procp->pc_release)
+		procp->pc_release(rqstp);
+}
+
 /**
  * svc_process - Execute one RPC transaction
  * @rqstp: RPC transaction context
@@ -1533,7 +1539,7 @@ static void svc_drop(struct svc_rqst *rqstp)
  */
 void svc_process(struct svc_rqst *rqstp)
 {
-	struct kvec		*resv = &rqstp->rq_res.head[0];
+	struct kvec			*resv = &rqstp->rq_res.head[0];
 	__be32 *p;
 
 #if IS_ENABLED(CONFIG_FAIL_SUNRPC)
@@ -1565,9 +1571,12 @@ void svc_process(struct svc_rqst *rqstp)
 	if (unlikely(*p != rpc_call))
 		goto out_baddir;
 
-	if (!svc_process_common(rqstp))
+	if (!svc_process_common(rqstp)) {
+		svc_release_rqst(rqstp);
 		goto out_drop;
+	}
 	svc_send(rqstp);
+	svc_release_rqst(rqstp);
 	return;
 
 out_baddir:
@@ -1635,6 +1644,7 @@ void svc_process_bc(struct rpc_rqst *req, struct svc_rqst *rqstp)
 	if (!proc_error) {
 		/* Processing error: drop the request */
 		xprt_free_bc_request(req);
+		svc_release_rqst(rqstp);
 		return;
 	}
 	/* Finally, send the reply synchronously */
@@ -1648,6 +1658,7 @@ void svc_process_bc(struct rpc_rqst *req, struct svc_rqst *rqstp)
 	timeout.to_maxval = timeout.to_initval;
 	memcpy(&req->rq_snd_buf, &rqstp->rq_res, sizeof(req->rq_snd_buf));
 	task = rpc_run_bc_task(req, &timeout);
+	svc_release_rqst(rqstp);
 
 	if (IS_ERR(task))
 		return;

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply
  2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton
@ 2025-07-03 23:33   ` NeilBrown
  2025-07-04  0:05     ` Jeff Layton
  0 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2025-07-03 23:33 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel,
	Jeff Layton

On Fri, 04 Jul 2025, Jeff Layton wrote:
> The server-side sunrpc code currently calls pc_release before sending
> the reply. A later nfsd patch will change some pc_release callbacks to
> do extra work to clean the pagecache. There is no need to delay sending
> the reply for this, however.
> 
> Change svc_process and svc_process_bc to call pc_release after sending
> the reply instead of before.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  net/sunrpc/svc.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index b1fab3a6954437cf751e4725fa52cfc83eddf2ab..103bb6ba8e140fdccd6cab124e715caeb41bb445 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -1426,8 +1426,6 @@ svc_process_common(struct svc_rqst *rqstp)
>  
>  	/* Call the function that processes the request. */
>  	rc = process.dispatch(rqstp);
> -	if (procp->pc_release)
> -		procp->pc_release(rqstp);
>  	xdr_finish_decode(xdr);
>  
>  	if (!rc)
> @@ -1526,6 +1524,14 @@ static void svc_drop(struct svc_rqst *rqstp)
>  	trace_svc_drop(rqstp);
>  }
>  
> +static void svc_release_rqst(struct svc_rqst *rqstp)
> +{
> +	const struct svc_procedure *procp = rqstp->rq_procinfo;
> +
> +	if (procp && procp->pc_release)
> +		procp->pc_release(rqstp);
> +}
> +
>  /**
>   * svc_process - Execute one RPC transaction
>   * @rqstp: RPC transaction context
> @@ -1533,7 +1539,7 @@ static void svc_drop(struct svc_rqst *rqstp)
>   */
>  void svc_process(struct svc_rqst *rqstp)
>  {
> -	struct kvec		*resv = &rqstp->rq_res.head[0];
> +	struct kvec			*resv = &rqstp->rq_res.head[0];

Commas and Tabs - you can never really have enough of them, can you?

>  	__be32 *p;
>  
>  #if IS_ENABLED(CONFIG_FAIL_SUNRPC)
> @@ -1565,9 +1571,12 @@ void svc_process(struct svc_rqst *rqstp)
>  	if (unlikely(*p != rpc_call))
>  		goto out_baddir;
>  
> -	if (!svc_process_common(rqstp))
> +	if (!svc_process_common(rqstp)) {
> +		svc_release_rqst(rqstp);
>  		goto out_drop;
> +	}
>  	svc_send(rqstp);
> +	svc_release_rqst(rqstp);
>  	return;

Should we, as a general rule, avoid calling any cleanup function more
than once?  When tempted, we DEFINE_FREE() a cleanup function and
declare the variable appropriately.

Though in this case it might be easier to:

  if (svc_process_common(rqstp))
       svc_send(rqstp);
  else
       svc_drop(rqstp);
  svc_rlease_rqst(rqstp);
  return;

svc_process_bc() is a little more awkward.

But in general, delaying the release function until after the send seems
sound, and this patches appears to do it corretly.

Reviewed-by: NeilBrown <neil@brown.name>

NeilBrown

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply
  2025-07-03 23:33   ` NeilBrown
@ 2025-07-04  0:05     ` Jeff Layton
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2025-07-04  0:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel

On Fri, 2025-07-04 at 09:33 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
> > The server-side sunrpc code currently calls pc_release before sending
> > the reply. A later nfsd patch will change some pc_release callbacks to
> > do extra work to clean the pagecache. There is no need to delay sending
> > the reply for this, however.
> > 
> > Change svc_process and svc_process_bc to call pc_release after sending
> > the reply instead of before.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  net/sunrpc/svc.c | 19 +++++++++++++++----
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index b1fab3a6954437cf751e4725fa52cfc83eddf2ab..103bb6ba8e140fdccd6cab124e715caeb41bb445 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -1426,8 +1426,6 @@ svc_process_common(struct svc_rqst *rqstp)
> >  
> >  	/* Call the function that processes the request. */
> >  	rc = process.dispatch(rqstp);
> > -	if (procp->pc_release)
> > -		procp->pc_release(rqstp);
> >  	xdr_finish_decode(xdr);
> >  
> >  	if (!rc)
> > @@ -1526,6 +1524,14 @@ static void svc_drop(struct svc_rqst *rqstp)
> >  	trace_svc_drop(rqstp);
> >  }
> >  
> > +static void svc_release_rqst(struct svc_rqst *rqstp)
> > +{
> > +	const struct svc_procedure *procp = rqstp->rq_procinfo;
> > +
> > +	if (procp && procp->pc_release)
> > +		procp->pc_release(rqstp);
> > +}
> > +
> >  /**
> >   * svc_process - Execute one RPC transaction
> >   * @rqstp: RPC transaction context
> > @@ -1533,7 +1539,7 @@ static void svc_drop(struct svc_rqst *rqstp)
> >   */
> >  void svc_process(struct svc_rqst *rqstp)
> >  {
> > -	struct kvec		*resv = &rqstp->rq_res.head[0];
> > +	struct kvec			*resv = &rqstp->rq_res.head[0];
> 
> Commas and Tabs - you can never really have enough of them, can you?
> 

Not sure what happened there. I'll drop that hunk.

> >  	__be32 *p;
> >  
> >  #if IS_ENABLED(CONFIG_FAIL_SUNRPC)
> > @@ -1565,9 +1571,12 @@ void svc_process(struct svc_rqst *rqstp)
> >  	if (unlikely(*p != rpc_call))
> >  		goto out_baddir;
> >  
> > -	if (!svc_process_common(rqstp))
> > +	if (!svc_process_common(rqstp)) {
> > +		svc_release_rqst(rqstp);
> >  		goto out_drop;
> > +	}
> >  	svc_send(rqstp);
> > +	svc_release_rqst(rqstp);
> >  	return;
> 
> Should we, as a general rule, avoid calling any cleanup function more
> than once?  When tempted, we DEFINE_FREE() a cleanup function and
> declare the variable appropriately.

I'm not opposed to that. I think that change probably deserves a
separate patch.

> Though in this case it might be easier to:
> 
>   if (svc_process_common(rqstp))
>        svc_send(rqstp);
>   else
>        svc_drop(rqstp);
>   svc_rlease_rqst(rqstp);
>   return;
> 

There is another place that does a "goto out_drop in that function. I'm
not sure changing that would improve things, but I'll see how it looks.

> svc_process_bc() is a little more awkward.
> 

Definitely.

> But in general, delaying the release function until after the send seems
> sound, and this patches appears to do it corretly.
> 
> Reviewed-by: NeilBrown <neil@brown.name>
> 
> NeilBrown

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton
  2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton
@ 2025-07-03 19:53 ` Jeff Layton
  2025-07-03 20:07   ` Chuck Lever
  2025-07-03 23:44   ` NeilBrown
  2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown
  2 siblings, 2 replies; 18+ messages in thread
From: Jeff Layton @ 2025-07-03 19:53 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer
  Cc: linux-nfs, linux-kernel, Jeff Layton

Recent testing has shown that keeping pagecache pages around for too
long can be detrimental to performance with nfsd. Clients only rarely
revisit the same data, so the pages tend to just hang around.

This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
range.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/debugfs.c  |  2 ++
 fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/nfsd/nfsd.h     |  1 +
 fs/nfsd/nfsproc.c  |  4 ++--
 fs/nfsd/vfs.c      | 21 ++++++++++++++-----
 fs/nfsd/vfs.h      |  5 +++--
 fs/nfsd/xdr3.h     |  3 +++
 7 files changed, 77 insertions(+), 18 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
 
 	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
 			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
+	debugfs_create_bool("enable-fadvise-dontneed", 0644,
+			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
 }
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -9,6 +9,7 @@
 #include <linux/ext2_fs.h>
 #include <linux/magic.h>
 #include <linux/namei.h>
+#include <linux/fadvise.h>
 
 #include "cache.h"
 #include "xdr3.h"
@@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
-				 &resp->count, &resp->eof);
+				 &resp->count, &resp->eof, &resp->nf);
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
 }
 
+static void
+nfsd3_release_read(struct svc_rqst *rqstp)
+{
+	struct nfsd3_readargs *argp = rqstp->rq_argp;
+	struct nfsd3_readres *resp = rqstp->rq_resp;
+
+	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok)
+		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
+				POSIX_FADV_DONTNEED);
+	if (resp->nf)
+		nfsd_file_put(resp->nf);
+	fh_put(&resp->fh);
+}
+
 /*
  * Write data to a file
  */
@@ -236,12 +251,26 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
 	resp->committed = argp->stable;
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
 				  &argp->payload, &cnt,
-				  resp->committed, resp->verf);
+				  resp->committed, resp->verf, &resp->nf);
 	resp->count = cnt;
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
 }
 
+static void
+nfsd3_release_write(struct svc_rqst *rqstp)
+{
+	struct nfsd3_writeargs *argp = rqstp->rq_argp;
+	struct nfsd3_writeres *resp = rqstp->rq_resp;
+
+	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok && resp->committed)
+		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
+				POSIX_FADV_DONTNEED);
+	if (resp->nf)
+		nfsd_file_put(resp->nf);
+	fh_put(&resp->fh);
+}
+
 /*
  * Implement NFSv3's unchecked, guarded, and exclusive CREATE
  * semantics for regular files. Except for the created file,
@@ -748,7 +777,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
 {
 	struct nfsd3_commitargs *argp = rqstp->rq_argp;
 	struct nfsd3_commitres *resp = rqstp->rq_resp;
-	struct nfsd_file *nf;
 
 	dprintk("nfsd: COMMIT(3)   %s %u@%Lu\n",
 				SVCFH_fmt(&argp->fh),
@@ -757,17 +785,30 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
-					    NFSD_MAY_NOT_BREAK_LEASE, &nf);
+					    NFSD_MAY_NOT_BREAK_LEASE, &resp->nf);
 	if (resp->status)
 		goto out;
-	resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
+	resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset,
 				   argp->count, resp->verf);
-	nfsd_file_put(nf);
 out:
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
 }
 
+static void
+nfsd3_release_commit(struct svc_rqst *rqstp)
+{
+	struct nfsd3_commitargs *argp = rqstp->rq_argp;
+	struct nfsd3_commitres *resp = rqstp->rq_resp;
+
+	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok)
+		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, argp->count,
+				POSIX_FADV_DONTNEED);
+	if (resp->nf)
+		nfsd_file_put(resp->nf);
+	fh_put(&resp->fh);
+}
+
 
 /*
  * NFSv3 Server procedures.
@@ -864,7 +905,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
 		.pc_func = nfsd3_proc_read,
 		.pc_decode = nfs3svc_decode_readargs,
 		.pc_encode = nfs3svc_encode_readres,
-		.pc_release = nfs3svc_release_fhandle,
+		.pc_release = nfsd3_release_read,
 		.pc_argsize = sizeof(struct nfsd3_readargs),
 		.pc_argzero = sizeof(struct nfsd3_readargs),
 		.pc_ressize = sizeof(struct nfsd3_readres),
@@ -876,7 +917,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
 		.pc_func = nfsd3_proc_write,
 		.pc_decode = nfs3svc_decode_writeargs,
 		.pc_encode = nfs3svc_encode_writeres,
-		.pc_release = nfs3svc_release_fhandle,
+		.pc_release = nfsd3_release_write,
 		.pc_argsize = sizeof(struct nfsd3_writeargs),
 		.pc_argzero = sizeof(struct nfsd3_writeargs),
 		.pc_ressize = sizeof(struct nfsd3_writeres),
@@ -1039,7 +1080,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
 		.pc_func = nfsd3_proc_commit,
 		.pc_decode = nfs3svc_decode_commitargs,
 		.pc_encode = nfs3svc_encode_commitres,
-		.pc_release = nfs3svc_release_fhandle,
+		.pc_release = nfsd3_release_commit,
 		.pc_argsize = sizeof(struct nfsd3_commitargs),
 		.pc_argzero = sizeof(struct nfsd3_commitargs),
 		.pc_ressize = sizeof(struct nfsd3_commitres),
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1cd0bed57bc2f27248fd66a8efef692a5e9a390c..288904d88b9245c03eae0aa347e867037b7c85c5 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -152,6 +152,7 @@ static inline void nfsd_debugfs_exit(void) {}
 #endif
 
 extern bool nfsd_disable_splice_read __read_mostly;
+extern bool nfsd_enable_fadvise_dontneed __read_mostly;
 
 extern int nfsd_max_blksize;
 
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 8f71f5748c75b69f15bae5e63799842e0e8b3139..bb8f98adb3e31b10adc4694987f8f5036726bf7f 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -225,7 +225,7 @@ nfsd_proc_read(struct svc_rqst *rqstp)
 	resp->count = argp->count;
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
-				 &resp->count, &eof);
+				 &resp->count, &eof, NULL);
 	if (resp->status == nfs_ok)
 		resp->status = fh_getattr(&resp->fh, &resp->stat);
 	else if (resp->status == nfserr_jukebox)
@@ -258,7 +258,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
-				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
+				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL, NULL);
 	if (resp->status == nfs_ok)
 		resp->status = fh_getattr(&resp->fh, &resp->stat);
 	else if (resp->status == nfserr_jukebox)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ee78b6fb17098b788b07f5cd90598e678244b57e..f23eb3a07bb99dc231be9ea6db4e58b9e328a689 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -49,6 +49,7 @@
 #define NFSDDBG_FACILITY		NFSDDBG_FILEOP
 
 bool nfsd_disable_splice_read __read_mostly;
+bool nfsd_enable_fadvise_dontneed __read_mostly;
 
 /**
  * nfserrno - Map Linux errnos to NFS errnos
@@ -1280,6 +1281,7 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
  * @offset: starting byte offset
  * @count: IN: requested number of bytes; OUT: number of bytes read
  * @eof: OUT: set non-zero if operation reached the end of the file
+ * @pnf: optional return pointer to pass back nfsd_file reference
  *
  * The caller must verify that there is enough space in @rqstp.rq_res
  * to perform this operation.
@@ -1290,7 +1292,8 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
  * returned.
  */
 __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
-		 loff_t offset, unsigned long *count, u32 *eof)
+		 loff_t offset, unsigned long *count, u32 *eof,
+		 struct nfsd_file **pnf)
 {
 	struct nfsd_file	*nf;
 	struct file *file;
@@ -1307,7 +1310,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	else
 		err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
 
-	nfsd_file_put(nf);
+	if (pnf && err == nfs_ok)
+		*pnf = nf;
+	else
+		nfsd_file_put(nf);
 	trace_nfsd_read_done(rqstp, fhp, offset, *count);
 	return err;
 }
@@ -1321,8 +1327,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
  * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
  * @stable: An NFS stable_how value
  * @verf: NFS WRITE verifier
+ * @pnf: optional return pointer to pass back nfsd_file reference
  *
- * Upon return, caller must invoke fh_put on @fhp.
+ * Upon return, caller must invoke fh_put() on @fhp. If it sets @pnf,
+ * then it must also call nfsd_file_put() on the returned reference.
  *
  * Return values:
  *   An nfsstat value in network byte order.
@@ -1330,7 +1338,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 __be32
 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
 	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
-	   __be32 *verf)
+	   __be32 *verf, struct nfsd_file **pnf)
 {
 	struct nfsd_file *nf;
 	__be32 err;
@@ -1343,7 +1351,10 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
 
 	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
 			     stable, verf);
-	nfsd_file_put(nf);
+	if (pnf && err == nfs_ok)
+		*pnf = nf;
+	else
+		nfsd_file_put(nf);
 out:
 	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
 	return err;
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index eff04959606fe55c141ab4a2eed97c7e0716a5f5..2d063ee7786f499f34c39cd3ba7e776bb7562d57 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -127,10 +127,11 @@ __be32		nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 bool		nfsd_read_splice_ok(struct svc_rqst *rqstp);
 __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				loff_t offset, unsigned long *count,
-				u32 *eof);
+				u32 *eof, struct nfsd_file **pnf);
 __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				loff_t offset, const struct xdr_buf *payload,
-				unsigned long *cnt, int stable, __be32 *verf);
+				unsigned long *cnt, int stable, __be32 *verf,
+				struct nfsd_file **pnf);
 __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 				struct nfsd_file *nf, loff_t offset,
 				const struct xdr_buf *payload,
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd755930a1c2e42b826d9132ac2993be..3f4c51983003188be51a0f8c2db2e0acc9a1363b 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -146,6 +146,7 @@ struct nfsd3_readres {
 	unsigned long		count;
 	__u32			eof;
 	struct page		**pages;
+	struct nfsd_file	*nf;
 };
 
 struct nfsd3_writeres {
@@ -154,6 +155,7 @@ struct nfsd3_writeres {
 	unsigned long		count;
 	int			committed;
 	__be32			verf[2];
+	struct nfsd_file	*nf;
 };
 
 struct nfsd3_renameres {
@@ -217,6 +219,7 @@ struct nfsd3_commitres {
 	__be32			status;
 	struct svc_fh		fh;
 	__be32			verf[2];
+	struct nfsd_file	*nf;
 };
 
 struct nfsd3_getaclres {

-- 
2.50.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton
@ 2025-07-03 20:07   ` Chuck Lever
  2025-07-08 14:34     ` Jeff Layton
  2025-07-08 21:07     ` Mike Snitzer
  2025-07-03 23:44   ` NeilBrown
  1 sibling, 2 replies; 18+ messages in thread
From: Chuck Lever @ 2025-07-03 20:07 UTC (permalink / raw)
  To: Jeff Layton, Trond Myklebust, Anna Schumaker, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer
  Cc: linux-nfs, linux-kernel

On 7/3/25 3:53 PM, Jeff Layton wrote:
> Recent testing has shown that keeping pagecache pages around for too
> long can be detrimental to performance with nfsd. Clients only rarely
> revisit the same data, so the pages tend to just hang around.
> 
> This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> range.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/debugfs.c  |  2 ++
>  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
>  fs/nfsd/nfsd.h     |  1 +
>  fs/nfsd/nfsproc.c  |  4 ++--
>  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
>  fs/nfsd/vfs.h      |  5 +++--
>  fs/nfsd/xdr3.h     |  3 +++
>  7 files changed, 77 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);

I prefer that this setting is folded into the new io_cache_read /
io_cache_write tune-ables that Mike's patch adds, rather than adding
a new boolean.

That might make a hybrid "DONTCACHE for READ and fadvise for WRITE"
pretty easy.


>  }
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -9,6 +9,7 @@
>  #include <linux/ext2_fs.h>
>  #include <linux/magic.h>
>  #include <linux/namei.h>
> +#include <linux/fadvise.h>
>  
>  #include "cache.h"
>  #include "xdr3.h"
> @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
>  
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
> -				 &resp->count, &resp->eof);
> +				 &resp->count, &resp->eof, &resp->nf);
>  	resp->status = nfsd3_map_status(resp->status);
>  	return rpc_success;
>  }
>  
> +static void
> +nfsd3_release_read(struct svc_rqst *rqstp)
> +{
> +	struct nfsd3_readargs *argp = rqstp->rq_argp;
> +	struct nfsd3_readres *resp = rqstp->rq_resp;
> +
> +	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok)
> +		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
> +				POSIX_FADV_DONTNEED);
> +	if (resp->nf)
> +		nfsd_file_put(resp->nf);
> +	fh_put(&resp->fh);
> +}
> +
>  /*
>   * Write data to a file
>   */
> @@ -236,12 +251,26 @@ nfsd3_proc_write(struct svc_rqst *rqstp)
>  	resp->committed = argp->stable;
>  	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
>  				  &argp->payload, &cnt,
> -				  resp->committed, resp->verf);
> +				  resp->committed, resp->verf, &resp->nf);
>  	resp->count = cnt;
>  	resp->status = nfsd3_map_status(resp->status);
>  	return rpc_success;
>  }
>  
> +static void
> +nfsd3_release_write(struct svc_rqst *rqstp)
> +{
> +	struct nfsd3_writeargs *argp = rqstp->rq_argp;
> +	struct nfsd3_writeres *resp = rqstp->rq_resp;
> +
> +	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok && resp->committed)
> +		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
> +				POSIX_FADV_DONTNEED);
> +	if (resp->nf)
> +		nfsd_file_put(resp->nf);
> +	fh_put(&resp->fh);
> +}
> +
>  /*
>   * Implement NFSv3's unchecked, guarded, and exclusive CREATE
>   * semantics for regular files. Except for the created file,
> @@ -748,7 +777,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
>  {
>  	struct nfsd3_commitargs *argp = rqstp->rq_argp;
>  	struct nfsd3_commitres *resp = rqstp->rq_resp;
> -	struct nfsd_file *nf;
>  
>  	dprintk("nfsd: COMMIT(3)   %s %u@%Lu\n",
>  				SVCFH_fmt(&argp->fh),
> @@ -757,17 +785,30 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
>  
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
> -					    NFSD_MAY_NOT_BREAK_LEASE, &nf);
> +					    NFSD_MAY_NOT_BREAK_LEASE, &resp->nf);
>  	if (resp->status)
>  		goto out;
> -	resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
> +	resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset,
>  				   argp->count, resp->verf);
> -	nfsd_file_put(nf);
>  out:
>  	resp->status = nfsd3_map_status(resp->status);
>  	return rpc_success;
>  }
>  
> +static void
> +nfsd3_release_commit(struct svc_rqst *rqstp)
> +{
> +	struct nfsd3_commitargs *argp = rqstp->rq_argp;
> +	struct nfsd3_commitres *resp = rqstp->rq_resp;
> +
> +	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok)
> +		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, argp->count,
> +				POSIX_FADV_DONTNEED);
> +	if (resp->nf)
> +		nfsd_file_put(resp->nf);
> +	fh_put(&resp->fh);
> +}
> +
>  
>  /*
>   * NFSv3 Server procedures.
> @@ -864,7 +905,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
>  		.pc_func = nfsd3_proc_read,
>  		.pc_decode = nfs3svc_decode_readargs,
>  		.pc_encode = nfs3svc_encode_readres,
> -		.pc_release = nfs3svc_release_fhandle,
> +		.pc_release = nfsd3_release_read,
>  		.pc_argsize = sizeof(struct nfsd3_readargs),
>  		.pc_argzero = sizeof(struct nfsd3_readargs),
>  		.pc_ressize = sizeof(struct nfsd3_readres),
> @@ -876,7 +917,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
>  		.pc_func = nfsd3_proc_write,
>  		.pc_decode = nfs3svc_decode_writeargs,
>  		.pc_encode = nfs3svc_encode_writeres,
> -		.pc_release = nfs3svc_release_fhandle,
> +		.pc_release = nfsd3_release_write,
>  		.pc_argsize = sizeof(struct nfsd3_writeargs),
>  		.pc_argzero = sizeof(struct nfsd3_writeargs),
>  		.pc_ressize = sizeof(struct nfsd3_writeres),
> @@ -1039,7 +1080,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
>  		.pc_func = nfsd3_proc_commit,
>  		.pc_decode = nfs3svc_decode_commitargs,
>  		.pc_encode = nfs3svc_encode_commitres,
> -		.pc_release = nfs3svc_release_fhandle,
> +		.pc_release = nfsd3_release_commit,
>  		.pc_argsize = sizeof(struct nfsd3_commitargs),
>  		.pc_argzero = sizeof(struct nfsd3_commitargs),
>  		.pc_ressize = sizeof(struct nfsd3_commitres),
> diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
> index 1cd0bed57bc2f27248fd66a8efef692a5e9a390c..288904d88b9245c03eae0aa347e867037b7c85c5 100644
> --- a/fs/nfsd/nfsd.h
> +++ b/fs/nfsd/nfsd.h
> @@ -152,6 +152,7 @@ static inline void nfsd_debugfs_exit(void) {}
>  #endif
>  
>  extern bool nfsd_disable_splice_read __read_mostly;
> +extern bool nfsd_enable_fadvise_dontneed __read_mostly;
>  
>  extern int nfsd_max_blksize;
>  
> diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
> index 8f71f5748c75b69f15bae5e63799842e0e8b3139..bb8f98adb3e31b10adc4694987f8f5036726bf7f 100644
> --- a/fs/nfsd/nfsproc.c
> +++ b/fs/nfsd/nfsproc.c
> @@ -225,7 +225,7 @@ nfsd_proc_read(struct svc_rqst *rqstp)
>  	resp->count = argp->count;
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
> -				 &resp->count, &eof);
> +				 &resp->count, &eof, NULL);
>  	if (resp->status == nfs_ok)
>  		resp->status = fh_getattr(&resp->fh, &resp->stat);
>  	else if (resp->status == nfserr_jukebox)
> @@ -258,7 +258,7 @@ nfsd_proc_write(struct svc_rqst *rqstp)
>  
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->status = nfsd_write(rqstp, &resp->fh, argp->offset,
> -				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL);
> +				  &argp->payload, &cnt, NFS_DATA_SYNC, NULL, NULL);
>  	if (resp->status == nfs_ok)
>  		resp->status = fh_getattr(&resp->fh, &resp->stat);
>  	else if (resp->status == nfserr_jukebox)
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index ee78b6fb17098b788b07f5cd90598e678244b57e..f23eb3a07bb99dc231be9ea6db4e58b9e328a689 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -49,6 +49,7 @@
>  #define NFSDDBG_FACILITY		NFSDDBG_FILEOP
>  
>  bool nfsd_disable_splice_read __read_mostly;
> +bool nfsd_enable_fadvise_dontneed __read_mostly;
>  
>  /**
>   * nfserrno - Map Linux errnos to NFS errnos
> @@ -1280,6 +1281,7 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
>   * @offset: starting byte offset
>   * @count: IN: requested number of bytes; OUT: number of bytes read
>   * @eof: OUT: set non-zero if operation reached the end of the file
> + * @pnf: optional return pointer to pass back nfsd_file reference
>   *
>   * The caller must verify that there is enough space in @rqstp.rq_res
>   * to perform this operation.
> @@ -1290,7 +1292,8 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
>   * returned.
>   */
>  __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> -		 loff_t offset, unsigned long *count, u32 *eof)
> +		 loff_t offset, unsigned long *count, u32 *eof,
> +		 struct nfsd_file **pnf)
>  {
>  	struct nfsd_file	*nf;
>  	struct file *file;
> @@ -1307,7 +1310,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	else
>  		err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof);
>  
> -	nfsd_file_put(nf);
> +	if (pnf && err == nfs_ok)
> +		*pnf = nf;
> +	else
> +		nfsd_file_put(nf);
>  	trace_nfsd_read_done(rqstp, fhp, offset, *count);
>  	return err;
>  }
> @@ -1321,8 +1327,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>   * @cnt: IN: number of bytes to write, OUT: number of bytes actually written
>   * @stable: An NFS stable_how value
>   * @verf: NFS WRITE verifier
> + * @pnf: optional return pointer to pass back nfsd_file reference
>   *
> - * Upon return, caller must invoke fh_put on @fhp.
> + * Upon return, caller must invoke fh_put() on @fhp. If it sets @pnf,
> + * then it must also call nfsd_file_put() on the returned reference.
>   *
>   * Return values:
>   *   An nfsstat value in network byte order.
> @@ -1330,7 +1338,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  __be32
>  nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>  	   const struct xdr_buf *payload, unsigned long *cnt, int stable,
> -	   __be32 *verf)
> +	   __be32 *verf, struct nfsd_file **pnf)
>  {
>  	struct nfsd_file *nf;
>  	__be32 err;
> @@ -1343,7 +1351,10 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
>  
>  	err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt,
>  			     stable, verf);
> -	nfsd_file_put(nf);
> +	if (pnf && err == nfs_ok)
> +		*pnf = nf;
> +	else
> +		nfsd_file_put(nf);
>  out:
>  	trace_nfsd_write_done(rqstp, fhp, offset, *cnt);
>  	return err;
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index eff04959606fe55c141ab4a2eed97c7e0716a5f5..2d063ee7786f499f34c39cd3ba7e776bb7562d57 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -127,10 +127,11 @@ __be32		nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  bool		nfsd_read_splice_ok(struct svc_rqst *rqstp);
>  __be32		nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				loff_t offset, unsigned long *count,
> -				u32 *eof);
> +				u32 *eof, struct nfsd_file **pnf);
>  __be32		nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				loff_t offset, const struct xdr_buf *payload,
> -				unsigned long *cnt, int stable, __be32 *verf);
> +				unsigned long *cnt, int stable, __be32 *verf,
> +				struct nfsd_file **pnf);
>  __be32		nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  				struct nfsd_file *nf, loff_t offset,
>  				const struct xdr_buf *payload,
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index 522067b7fd755930a1c2e42b826d9132ac2993be..3f4c51983003188be51a0f8c2db2e0acc9a1363b 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -146,6 +146,7 @@ struct nfsd3_readres {
>  	unsigned long		count;
>  	__u32			eof;
>  	struct page		**pages;
> +	struct nfsd_file	*nf;
>  };
>  
>  struct nfsd3_writeres {
> @@ -154,6 +155,7 @@ struct nfsd3_writeres {
>  	unsigned long		count;
>  	int			committed;
>  	__be32			verf[2];
> +	struct nfsd_file	*nf;
>  };
>  
>  struct nfsd3_renameres {
> @@ -217,6 +219,7 @@ struct nfsd3_commitres {
>  	__be32			status;
>  	struct svc_fh		fh;
>  	__be32			verf[2];
> +	struct nfsd_file	*nf;
>  };
>  
>  struct nfsd3_getaclres {
> 


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 20:07   ` Chuck Lever
@ 2025-07-08 14:34     ` Jeff Layton
  2025-07-08 21:12       ` Mike Snitzer
  2025-07-08 21:07     ` Mike Snitzer
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2025-07-08 14:34 UTC (permalink / raw)
  To: Chuck Lever, Trond Myklebust, Anna Schumaker, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer
  Cc: linux-nfs, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5447 bytes --]

On Thu, 2025-07-03 at 16:07 -0400, Chuck Lever wrote:
> On 7/3/25 3:53 PM, Jeff Layton wrote:
> > Recent testing has shown that keeping pagecache pages around for too
> > long can be detrimental to performance with nfsd. Clients only rarely
> > revisit the same data, so the pages tend to just hang around.
> > 
> > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> > range.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/nfsd/debugfs.c  |  2 ++
> >  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
> >  fs/nfsd/nfsd.h     |  1 +
> >  fs/nfsd/nfsproc.c  |  4 ++--
> >  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
> >  fs/nfsd/vfs.h      |  5 +++--
> >  fs/nfsd/xdr3.h     |  3 +++
> >  7 files changed, 77 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
> >  
> >  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> >  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> > +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
> 
> I prefer that this setting is folded into the new io_cache_read /
> io_cache_write tune-ables that Mike's patch adds, rather than adding
> a new boolean.
> 
> That might make a hybrid "DONTCACHE for READ and fadvise for WRITE"
> pretty easy.
> 

I ended up rebasing Mike's dontcache branch on top of v6.16-rc5 with
all of Chuck's trees in. I then added the attached patch and did some
testing with a couple of machines I checked out internally at Meta.
This is the throughput results with the fio-seq-RW test with the file
size set to 100G and the duration at 5 mins.

Note that:

read and writes buffered:
   READ: bw=3024MiB/s (3171MB/s), 186MiB/s-191MiB/s (195MB/s-201MB/s), io=889GiB (954GB), run=300012-300966msec
  WRITE: bw=2015MiB/s (2113MB/s), 124MiB/s-128MiB/s (131MB/s-134MB/s), io=592GiB (636GB), run=300012-300966msec

   READ: bw=2902MiB/s (3043MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=851GiB (913GB), run=300027-300118msec
  WRITE: bw=1934MiB/s (2027MB/s), 119MiB/s-122MiB/s (124MB/s-128MB/s), io=567GiB (608GB), run=300027-300118msec

   READ: bw=2897MiB/s (3037MB/s), 178MiB/s-183MiB/s (186MB/s-192MB/s), io=849GiB (911GB), run=300006-300078msec
  WRITE: bw=1930MiB/s (2023MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=565GiB (607GB), run=300006-300078msec

reads and writes RWF_DONTCACHE:
   READ: bw=3090MiB/s (3240MB/s), 190MiB/s-195MiB/s (199MB/s-205MB/s), io=906GiB (972GB), run=300015-300113msec
  WRITE: bw=2060MiB/s (2160MB/s), 126MiB/s-130MiB/s (132MB/s-137MB/s), io=604GiB (648GB), run=300015-300113msec

   READ: bw=3057MiB/s (3205MB/s), 188MiB/s-193MiB/s (198MB/s-203MB/s), io=897GiB (963GB), run=300329-300450msec
  WRITE: bw=2037MiB/s (2136MB/s), 126MiB/s-129MiB/s (132MB/s-135MB/s), io=598GiB (642GB), run=300329-300450msec

   READ: bw=3166MiB/s (3320MB/s), 196MiB/s-200MiB/s (205MB/s-210MB/s), io=928GiB (996GB), run=300021-300090msec
  WRITE: bw=2111MiB/s (2213MB/s), 131MiB/s-133MiB/s (137MB/s-140MB/s), io=619GiB (664GB), run=300021-300090msec

reads and writes witg O_DIRECT:
   READ: bw=3115MiB/s (3267MB/s), 192MiB/s-198MiB/s (201MB/s-208MB/s), io=913GiB (980GB), run=300025-300078msec
  WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-138MB/s), io=609GiB (653GB), run=300025-300078msec

   READ: bw=3189MiB/s (3343MB/s), 197MiB/s-202MiB/s (207MB/s-211MB/s), io=934GiB (1003GB), run=300023-300096msec
  WRITE: bw=2125MiB/s (2228MB/s), 132MiB/s-134MiB/s (138MB/s-140MB/s), io=623GiB (669GB), run=300023-300096msec

   READ: bw=3113MiB/s (3264MB/s), 191MiB/s-197MiB/s (200MB/s-207MB/s), io=912GiB (979GB), run=300020-300098msec
  WRITE: bw=2075MiB/s (2175MB/s), 127MiB/s-131MiB/s (134MB/s-138MB/s), io=608GiB (653GB), run=300020-300098msec

RWF_DONTCACHE on reads and stable writes + fadvise DONTNEED after COMMIT:
   READ: bw=2888MiB/s (3029MB/s), 178MiB/s-182MiB/s (187MB/s-191MB/s), io=846GiB (909GB), run=300012-300109msec
  WRITE: bw=1924MiB/s (2017MB/s), 118MiB/s-121MiB/s (124MB/s-127MB/s), io=564GiB (605GB), run=300012-300109msec

   READ: bw=2899MiB/s (3040MB/s), 180MiB/s-183MiB/s (188MB/s-192MB/s), io=852GiB (915GB), run=300022-300940msec
  WRITE: bw=1931MiB/s (2025MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=567GiB (609GB), run=300022-300940msec

   READ: bw=2902MiB/s (3043MB/s), 179MiB/s-184MiB/s (188MB/s-193MB/s), io=853GiB (916GB), run=300913-301146msec
  WRITE: bw=1933MiB/s (2027MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=568GiB (610GB), run=300913-301146msec


The fadvise case is clearly slower than the others. Interestingly it
also slowed down read performance, which leads me to believe that maybe
the fadvise calls were interfering with concurrent reads. Given the
disappointing numbers, I'll probably drop the last patch.

There is probably a case to be made for patch #1, on the general
principle of expediting sending the reply as much as possible. Chuck,
let me know if you want me to submit that individually.
-- 
Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: 0001-nfsd-add-a-NFSD_IO_FADVISE-setting-to-io_cache_write.patch --]
[-- Type: text/x-patch, Size: 4348 bytes --]

From 14958516bf45f92a8609cb6ad504e92550b416d7 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@kernel.org>
Date: Mon, 7 Jul 2025 11:00:34 -0400
Subject: [PATCH] nfsd: add a NFSD_IO_FADVISE setting to io_cache_write

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/debugfs.c  |  2 ++
 fs/nfsd/nfs3proc.c | 24 +++++++++++++++++++-----
 fs/nfsd/nfsd.h     |  3 ++-
 fs/nfsd/vfs.c      |  4 ++++
 fs/nfsd/xdr3.h     |  1 +
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index fadf1d88f640..d0f7dfe6d621 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -89,6 +89,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
  *   %0: NFS WRITE will use buffered IO (default)
  *   %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
  *   %2: NFS WRITE will use direct IO
+ *   %3: NFS stable WRITE will use dontcache + fadvise DONTNEED after COMMIT
  *
  * The default value of this setting is zero (buffered IO is
  * used). This setting takes immediate effect for all NFS
@@ -107,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
 	case NFSD_IO_BUFFERED:
 	case NFSD_IO_DONTCACHE:
 	case NFSD_IO_DIRECT:
+	case NFSD_IO_FADVISE:
 		nfsd_io_cache_write = val;
 		break;
 	default:
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..6737d22c001d 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -9,6 +9,7 @@
 #include <linux/ext2_fs.h>
 #include <linux/magic.h>
 #include <linux/namei.h>
+#include <linux/fadvise.h>
 
 #include "cache.h"
 #include "xdr3.h"
@@ -748,7 +749,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
 {
 	struct nfsd3_commitargs *argp = rqstp->rq_argp;
 	struct nfsd3_commitres *resp = rqstp->rq_resp;
-	struct nfsd_file *nf;
 
 	dprintk("nfsd: COMMIT(3)   %s %u@%Lu\n",
 				SVCFH_fmt(&argp->fh),
@@ -757,17 +757,31 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
 
 	fh_copy(&resp->fh, &argp->fh);
 	resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
-					    NFSD_MAY_NOT_BREAK_LEASE, &nf);
+					    NFSD_MAY_NOT_BREAK_LEASE, &resp->nf);
 	if (resp->status)
 		goto out;
-	resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
+	resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset,
 				   argp->count, resp->verf);
-	nfsd_file_put(nf);
 out:
 	resp->status = nfsd3_map_status(resp->status);
 	return rpc_success;
 }
 
+static void
+nfsd3_release_commit(struct svc_rqst *rqstp)
+{
+	struct nfsd3_commitargs *argp = rqstp->rq_argp;
+	struct nfsd3_commitres *resp = rqstp->rq_resp;
+
+	fh_put(&resp->fh);
+	if (resp->nf) {
+		if (nfsd_io_cache_write == NFSD_IO_FADVISE)
+			vfs_fadvise(nfsd_file_file(resp->nf), argp->offset,
+				    argp->count, POSIX_FADV_DONTNEED);
+		nfsd_file_put(resp->nf);
+	}
+}
+
 
 /*
  * NFSv3 Server procedures.
@@ -1039,7 +1053,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
 		.pc_func = nfsd3_proc_commit,
 		.pc_decode = nfs3svc_decode_commitargs,
 		.pc_encode = nfs3svc_encode_commitres,
-		.pc_release = nfs3svc_release_fhandle,
+		.pc_release = nfsd3_release_commit,
 		.pc_argsize = sizeof(struct nfsd3_commitargs),
 		.pc_argzero = sizeof(struct nfsd3_commitargs),
 		.pc_ressize = sizeof(struct nfsd3_commitres),
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1ae38c5557c4..b21902e02982 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -156,7 +156,8 @@ extern bool nfsd_disable_splice_read __read_mostly;
 enum {
 	NFSD_IO_BUFFERED = 0,
 	NFSD_IO_DONTCACHE,
-	NFSD_IO_DIRECT
+	NFSD_IO_DIRECT,
+	NFSD_IO_FADVISE
 };
 
 extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 08350070e083..5d4588a106b2 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1271,6 +1271,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	case NFSD_IO_DONTCACHE:
 		kiocb.ki_flags = IOCB_DONTCACHE;
 		break;
+	case NFSD_IO_FADVISE:
+		if (stable)
+			kiocb.ki_flags = IOCB_DONTCACHE;
+		break;
 	case NFSD_IO_BUFFERED:
 		break;
 	}
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..ec91a24e651a 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -217,6 +217,7 @@ struct nfsd3_commitres {
 	__be32			status;
 	struct svc_fh		fh;
 	__be32			verf[2];
+	struct nfsd_file	*nf;
 };
 
 struct nfsd3_getaclres {
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-08 14:34     ` Jeff Layton
@ 2025-07-08 21:12       ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-07-08 21:12 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Chuck Lever, Trond Myklebust, Anna Schumaker, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-kernel

On Tue, Jul 08, 2025 at 10:34:15AM -0400, Jeff Layton wrote:
> On Thu, 2025-07-03 at 16:07 -0400, Chuck Lever wrote:
> > On 7/3/25 3:53 PM, Jeff Layton wrote:
> > > Recent testing has shown that keeping pagecache pages around for too
> > > long can be detrimental to performance with nfsd. Clients only rarely
> > > revisit the same data, so the pages tend to just hang around.
> > > 
> > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> > > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> > > range.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/nfsd/debugfs.c  |  2 ++
> > >  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
> > >  fs/nfsd/nfsd.h     |  1 +
> > >  fs/nfsd/nfsproc.c  |  4 ++--
> > >  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
> > >  fs/nfsd/vfs.h      |  5 +++--
> > >  fs/nfsd/xdr3.h     |  3 +++
> > >  7 files changed, 77 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> > > --- a/fs/nfsd/debugfs.c
> > > +++ b/fs/nfsd/debugfs.c
> > > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
> > >  
> > >  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> > >  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > > +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> > > +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
> > 
> > I prefer that this setting is folded into the new io_cache_read /
> > io_cache_write tune-ables that Mike's patch adds, rather than adding
> > a new boolean.
> > 
> > That might make a hybrid "DONTCACHE for READ and fadvise for WRITE"
> > pretty easy.
> > 
> 
> I ended up rebasing Mike's dontcache branch on top of v6.16-rc5 with
> all of Chuck's trees in. I then added the attached patch and did some
> testing with a couple of machines I checked out internally at Meta.
> This is the throughput results with the fio-seq-RW test with the file
> size set to 100G and the duration at 5 mins.
> 
> Note that:
> 
> read and writes buffered:
>    READ: bw=3024MiB/s (3171MB/s), 186MiB/s-191MiB/s (195MB/s-201MB/s), io=889GiB (954GB), run=300012-300966msec
>   WRITE: bw=2015MiB/s (2113MB/s), 124MiB/s-128MiB/s (131MB/s-134MB/s), io=592GiB (636GB), run=300012-300966msec
> 
>    READ: bw=2902MiB/s (3043MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=851GiB (913GB), run=300027-300118msec
>   WRITE: bw=1934MiB/s (2027MB/s), 119MiB/s-122MiB/s (124MB/s-128MB/s), io=567GiB (608GB), run=300027-300118msec
> 
>    READ: bw=2897MiB/s (3037MB/s), 178MiB/s-183MiB/s (186MB/s-192MB/s), io=849GiB (911GB), run=300006-300078msec
>   WRITE: bw=1930MiB/s (2023MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=565GiB (607GB), run=300006-300078msec
> 
> reads and writes RWF_DONTCACHE:
>    READ: bw=3090MiB/s (3240MB/s), 190MiB/s-195MiB/s (199MB/s-205MB/s), io=906GiB (972GB), run=300015-300113msec
>   WRITE: bw=2060MiB/s (2160MB/s), 126MiB/s-130MiB/s (132MB/s-137MB/s), io=604GiB (648GB), run=300015-300113msec
> 
>    READ: bw=3057MiB/s (3205MB/s), 188MiB/s-193MiB/s (198MB/s-203MB/s), io=897GiB (963GB), run=300329-300450msec
>   WRITE: bw=2037MiB/s (2136MB/s), 126MiB/s-129MiB/s (132MB/s-135MB/s), io=598GiB (642GB), run=300329-300450msec
> 
>    READ: bw=3166MiB/s (3320MB/s), 196MiB/s-200MiB/s (205MB/s-210MB/s), io=928GiB (996GB), run=300021-300090msec
>   WRITE: bw=2111MiB/s (2213MB/s), 131MiB/s-133MiB/s (137MB/s-140MB/s), io=619GiB (664GB), run=300021-300090msec
> 
> reads and writes witg O_DIRECT:
>    READ: bw=3115MiB/s (3267MB/s), 192MiB/s-198MiB/s (201MB/s-208MB/s), io=913GiB (980GB), run=300025-300078msec
>   WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-138MB/s), io=609GiB (653GB), run=300025-300078msec
> 
>    READ: bw=3189MiB/s (3343MB/s), 197MiB/s-202MiB/s (207MB/s-211MB/s), io=934GiB (1003GB), run=300023-300096msec
>   WRITE: bw=2125MiB/s (2228MB/s), 132MiB/s-134MiB/s (138MB/s-140MB/s), io=623GiB (669GB), run=300023-300096msec
> 
>    READ: bw=3113MiB/s (3264MB/s), 191MiB/s-197MiB/s (200MB/s-207MB/s), io=912GiB (979GB), run=300020-300098msec
>   WRITE: bw=2075MiB/s (2175MB/s), 127MiB/s-131MiB/s (134MB/s-138MB/s), io=608GiB (653GB), run=300020-300098msec
> 
> RWF_DONTCACHE on reads and stable writes + fadvise DONTNEED after COMMIT:
>    READ: bw=2888MiB/s (3029MB/s), 178MiB/s-182MiB/s (187MB/s-191MB/s), io=846GiB (909GB), run=300012-300109msec
>   WRITE: bw=1924MiB/s (2017MB/s), 118MiB/s-121MiB/s (124MB/s-127MB/s), io=564GiB (605GB), run=300012-300109msec
> 
>    READ: bw=2899MiB/s (3040MB/s), 180MiB/s-183MiB/s (188MB/s-192MB/s), io=852GiB (915GB), run=300022-300940msec
>   WRITE: bw=1931MiB/s (2025MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=567GiB (609GB), run=300022-300940msec
> 
>    READ: bw=2902MiB/s (3043MB/s), 179MiB/s-184MiB/s (188MB/s-193MB/s), io=853GiB (916GB), run=300913-301146msec
>   WRITE: bw=1933MiB/s (2027MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=568GiB (610GB), run=300913-301146msec
> 
> 
> The fadvise case is clearly slower than the others. Interestingly it
> also slowed down read performance, which leads me to believe that maybe
> the fadvise calls were interfering with concurrent reads. Given the
> disappointing numbers, I'll probably drop the last patch.
> 
> There is probably a case to be made for patch #1, on the general
> principle of expediting sending the reply as much as possible. Chuck,
> let me know if you want me to submit that individually.
> -- 
> Jeff Layton <jlayton@kernel.org>

> From 14958516bf45f92a8609cb6ad504e92550b416d7 Mon Sep 17 00:00:00 2001
> From: Jeff Layton <jlayton@kernel.org>
> Date: Mon, 7 Jul 2025 11:00:34 -0400
> Subject: [PATCH] nfsd: add a NFSD_IO_FADVISE setting to io_cache_write
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Ah, you raced ahead and did what I suggested in my previous reply just
now.  Too bad the performance is lacking... but I applaud you're
trying out a new/interesting idea!

BTW, measuring CPU and memory use is also important to capture to get
full picture.

Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 20:07   ` Chuck Lever
  2025-07-08 14:34     ` Jeff Layton
@ 2025-07-08 21:07     ` Mike Snitzer
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2025-07-08 21:07 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-kernel

On Thu, Jul 03, 2025 at 04:07:51PM -0400, Chuck Lever wrote:
> On 7/3/25 3:53 PM, Jeff Layton wrote:
> > Recent testing has shown that keeping pagecache pages around for too
> > long can be detrimental to performance with nfsd. Clients only rarely
> > revisit the same data, so the pages tend to just hang around.
> > 
> > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> > range.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/nfsd/debugfs.c  |  2 ++
> >  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
> >  fs/nfsd/nfsd.h     |  1 +
> >  fs/nfsd/nfsproc.c  |  4 ++--
> >  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
> >  fs/nfsd/vfs.h      |  5 +++--
> >  fs/nfsd/xdr3.h     |  3 +++
> >  7 files changed, 77 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
> >  
> >  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> >  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> > +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
> 
> I prefer that this setting is folded into the new io_cache_read /
> io_cache_write tune-ables that Mike's patch adds, rather than adding
> a new boolean.
> 
> That might make a hybrid "DONTCACHE for READ and fadvise for WRITE"
> pretty easy.

That'd be really easy.  Jeff, maybe have a look to rebase your changes
on this patchset:
https://lore.kernel.org/linux-nfs/20250708160619.64800-1-snitzer@kernel.org/

Ontop of this patch in particular:
https://lore.kernel.org/linux-nfs/20250708160619.64800-8-snitzer@kernel.org/

My git branch with this patchset at the tip is available here:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing-snitm

Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton
  2025-07-03 20:07   ` Chuck Lever
@ 2025-07-03 23:44   ` NeilBrown
  2025-07-03 23:49     ` Jeff Layton
  2025-07-04  7:26     ` NeilBrown
  1 sibling, 2 replies; 18+ messages in thread
From: NeilBrown @ 2025-07-03 23:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel,
	Jeff Layton

On Fri, 04 Jul 2025, Jeff Layton wrote:
> Recent testing has shown that keeping pagecache pages around for too
> long can be detrimental to performance with nfsd. Clients only rarely
> revisit the same data, so the pages tend to just hang around.
> 
> This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> range.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/debugfs.c  |  2 ++
>  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
>  fs/nfsd/nfsd.h     |  1 +
>  fs/nfsd/nfsproc.c  |  4 ++--
>  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
>  fs/nfsd/vfs.h      |  5 +++--
>  fs/nfsd/xdr3.h     |  3 +++
>  7 files changed, 77 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> --- a/fs/nfsd/debugfs.c
> +++ b/fs/nfsd/debugfs.c
> @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
>  
>  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
>  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
>  }
> diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644
> --- a/fs/nfsd/nfs3proc.c
> +++ b/fs/nfsd/nfs3proc.c
> @@ -9,6 +9,7 @@
>  #include <linux/ext2_fs.h>
>  #include <linux/magic.h>
>  #include <linux/namei.h>
> +#include <linux/fadvise.h>
>  
>  #include "cache.h"
>  #include "xdr3.h"
> @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
>  
>  	fh_copy(&resp->fh, &argp->fh);
>  	resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
> -				 &resp->count, &resp->eof);
> +				 &resp->count, &resp->eof, &resp->nf);
>  	resp->status = nfsd3_map_status(resp->status);
>  	return rpc_success;
>  }
>  
> +static void
> +nfsd3_release_read(struct svc_rqst *rqstp)
> +{
> +	struct nfsd3_readargs *argp = rqstp->rq_argp;
> +	struct nfsd3_readres *resp = rqstp->rq_resp;
> +
> +	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok)
> +		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
> +				POSIX_FADV_DONTNEED);
> +	if (resp->nf)
> +		nfsd_file_put(resp->nf);
> +	fh_put(&resp->fh);

This looks wrong - testing resp->nf after assuming it was non-NULL.
I don't think it *is* wrong because ->state == nfs_ok ensures
->nf is valid. But still....

How about:

    fh_put(resp->fh);
    if (!resp->nf)
         return;
    if (nfsd_enable_fadvise_dontneed)
	generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
		POSIX_FADV_DONTNEED);
    nfsd_file_put(resp->nf);

??
Note that we don't test ->status because that is identical to testing ->nf.
Ditto for other release functions.

Otherwise it makes sense for exploring how to optimise IO.

Reviewed-by: NeilBrown <neil@brown.name>

NeilBrown

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 23:44   ` NeilBrown
@ 2025-07-03 23:49     ` Jeff Layton
  2025-07-04  7:26     ` NeilBrown
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2025-07-03 23:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel

On Fri, 2025-07-04 at 09:44 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
> > Recent testing has shown that keeping pagecache pages around for too
> > long can be detrimental to performance with nfsd. Clients only rarely
> > revisit the same data, so the pages tend to just hang around.
> > 
> > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> > range.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/nfsd/debugfs.c  |  2 ++
> >  fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
> >  fs/nfsd/nfsd.h     |  1 +
> >  fs/nfsd/nfsproc.c  |  4 ++--
> >  fs/nfsd/vfs.c      | 21 ++++++++++++++-----
> >  fs/nfsd/vfs.h      |  5 +++--
> >  fs/nfsd/xdr3.h     |  3 +++
> >  7 files changed, 77 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
> >  
> >  	debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> >  			    nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > +	debugfs_create_bool("enable-fadvise-dontneed", 0644,
> > +			    nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
> >  }
> > diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
> > index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644
> > --- a/fs/nfsd/nfs3proc.c
> > +++ b/fs/nfsd/nfs3proc.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/ext2_fs.h>
> >  #include <linux/magic.h>
> >  #include <linux/namei.h>
> > +#include <linux/fadvise.h>
> >  
> >  #include "cache.h"
> >  #include "xdr3.h"
> > @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
> >  
> >  	fh_copy(&resp->fh, &argp->fh);
> >  	resp->status = nfsd_read(rqstp, &resp->fh, argp->offset,
> > -				 &resp->count, &resp->eof);
> > +				 &resp->count, &resp->eof, &resp->nf);
> >  	resp->status = nfsd3_map_status(resp->status);
> >  	return rpc_success;
> >  }
> >  
> > +static void
> > +nfsd3_release_read(struct svc_rqst *rqstp)
> > +{
> > +	struct nfsd3_readargs *argp = rqstp->rq_argp;
> > +	struct nfsd3_readres *resp = rqstp->rq_resp;
> > +
> > +	if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok)
> > +		generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
> > +				POSIX_FADV_DONTNEED);
> > +	if (resp->nf)
> > +		nfsd_file_put(resp->nf);
> > +	fh_put(&resp->fh);
> 
> This looks wrong - testing resp->nf after assuming it was non-NULL.
> I don't think it *is* wrong because ->state == nfs_ok ensures
> ->nf is valid. But still....
> 

That was my thinking, but I agree that it's a bit fragile.

> How about:
> 
>     fh_put(resp->fh);
>     if (!resp->nf)
>          return;
>     if (nfsd_enable_fadvise_dontneed)
> 	generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count,
> 		POSIX_FADV_DONTNEED);
>     nfsd_file_put(resp->nf);
> 
> ??
> Note that we don't test ->status because that is identical to testing ->nf.
> Ditto for other release functions.
> 

That looks good. I'll plan to do that in the next respin.

> Otherwise it makes sense for exploring how to optimise IO.
> 
> Reviewed-by: NeilBrown <neil@brown.name>
> 
> NeilBrown


Thanks for the reviews!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-03 23:44   ` NeilBrown
  2025-07-03 23:49     ` Jeff Layton
@ 2025-07-04  7:26     ` NeilBrown
  2025-07-05 11:21       ` Jeff Layton
  1 sibling, 1 reply; 18+ messages in thread
From: NeilBrown @ 2025-07-04  7:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel,
	Jeff Layton

On Fri, 04 Jul 2025, NeilBrown wrote:
> 
> Otherwise it makes sense for exploring how to optimise IO.
> 
> Reviewed-by: NeilBrown <neil@brown.name>

Actually - I take that back.  generic_fadvise() is the wrong interface.
It is for filesystems to use if the don't have any special requirements,
and for vfs_fadvise() to use if the file system hasn't give a function
to use.

nfsd should be calling vfs_fadvise().

NeilBrown

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
  2025-07-04  7:26     ` NeilBrown
@ 2025-07-05 11:21       ` Jeff Layton
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Layton @ 2025-07-05 11:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel

On Fri, 2025-07-04 at 17:26 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, NeilBrown wrote:
> > 
> > Otherwise it makes sense for exploring how to optimise IO.
> > 
> > Reviewed-by: NeilBrown <neil@brown.name>
> 
> Actually - I take that back.  generic_fadvise() is the wrong interface.
> It is for filesystems to use if the don't have any special requirements,
> and for vfs_fadvise() to use if the file system hasn't give a function
> to use.
> 
> nfsd should be calling vfs_fadvise().

Good catch. I'll fix that up in the next version.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT
  2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton
  2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton
  2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton
@ 2025-07-03 23:16 ` NeilBrown
  2025-07-03 23:28   ` Chuck Lever
  2025-07-05 11:32   ` Jeff Layton
  2 siblings, 2 replies; 18+ messages in thread
From: NeilBrown @ 2025-07-03 23:16 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel,
	Jeff Layton

On Fri, 04 Jul 2025, Jeff Layton wrote:
> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> might be an alternate approach. My main gripe with DONTCACHE was that it
> kicks off writeback after every WRITE operation. With NFS, we generally
> get a COMMIT operation at some point. Allowing us to batch up writes
> until that point has traditionally been considered better for
> performance.

I wonder if that traditional consideration is justified, give your
subsequent results.  The addition of COMMIT in v3 allowed us to both:
 - delay kicking off writes
 - not wait for writes to complete

I think the second was always primary.  Maybe we didn't consider the
value of the first enough.
Obviously the client caches writes and delays the start of writeback.
Adding another delay on the serve side does not seem to have a clear
justification.  Maybe we *should* kick-off writeback immediately.  There
would still be opportunity for subsequent WRITE requests to be merged
into the writeback queue.

Ideally DONTCACHE should only affect cache usage and the latency of
subsequence READs.  It shouldn't affect WRITE behaviour.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT
  2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown
@ 2025-07-03 23:28   ` Chuck Lever
  2025-07-04  7:34     ` NeilBrown
  2025-07-05 11:32   ` Jeff Layton
  1 sibling, 1 reply; 18+ messages in thread
From: Chuck Lever @ 2025-07-03 23:28 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel

On 7/3/25 7:16 PM, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
>> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
>> might be an alternate approach. My main gripe with DONTCACHE was that it
>> kicks off writeback after every WRITE operation. With NFS, we generally
>> get a COMMIT operation at some point. Allowing us to batch up writes
>> until that point has traditionally been considered better for
>> performance.
> 
> I wonder if that traditional consideration is justified, give your
> subsequent results.  The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
> 
> I think the second was always primary.  Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the serve side does not seem to have a clear
> justification.  Maybe we *should* kick-off writeback immediately.  There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.

Dave Chinner had the same thought a while back. So I've experimented
with starting writes as part of nfsd_write(). Kicking off writes,
even without waiting, is actually pretty costly, and it resulted in
worse performance.

Now that .pc_release is called /after/ the WRITE response has been sent,
though, that might be a place where kicking off writeback could be done
without charging that latency to the client.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT
  2025-07-03 23:28   ` Chuck Lever
@ 2025-07-04  7:34     ` NeilBrown
  0 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2025-07-04  7:34 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel

On Fri, 04 Jul 2025, Chuck Lever wrote:
> On 7/3/25 7:16 PM, NeilBrown wrote:
> > On Fri, 04 Jul 2025, Jeff Layton wrote:
> >> Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> >> might be an alternate approach. My main gripe with DONTCACHE was that it
> >> kicks off writeback after every WRITE operation. With NFS, we generally
> >> get a COMMIT operation at some point. Allowing us to batch up writes
> >> until that point has traditionally been considered better for
> >> performance.
> > 
> > I wonder if that traditional consideration is justified, give your
> > subsequent results.  The addition of COMMIT in v3 allowed us to both:
> >  - delay kicking off writes
> >  - not wait for writes to complete
> > 
> > I think the second was always primary.  Maybe we didn't consider the
> > value of the first enough.
> > Obviously the client caches writes and delays the start of writeback.
> > Adding another delay on the serve side does not seem to have a clear
> > justification.  Maybe we *should* kick-off writeback immediately.  There
> > would still be opportunity for subsequent WRITE requests to be merged
> > into the writeback queue.
> 
> Dave Chinner had the same thought a while back. So I've experimented
> with starting writes as part of nfsd_write(). Kicking off writes,
> even without waiting, is actually pretty costly, and it resulted in
> worse performance.

Was this with filemap_fdatawrite_range_kick() or something else?

> 
> Now that .pc_release is called /after/ the WRITE response has been sent,
> though, that might be a place where kicking off writeback could be done
> without charging that latency to the client.

Certainly after is better.  I was imagining kicking the writeback
threads, but they seems to only do whole filesystems.  Maybe that is OK?

I doubt we want more than one thread kicking off write for any given
inode at a time.  Maybe it would be best to add a "kicking_write" flag
in the filecache and if it is already set, then just add the range to
some range of pending-writes.  Maybe.
I guess I should explore how to test :-)

NeilBrown

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT
  2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown
  2025-07-03 23:28   ` Chuck Lever
@ 2025-07-05 11:32   ` Jeff Layton
  2025-07-10  8:00     ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2025-07-05 11:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel

On Fri, 2025-07-04 at 09:16 +1000, NeilBrown wrote:
> On Fri, 04 Jul 2025, Jeff Layton wrote:
> > Chuck and I were discussing RWF_DONTCACHE and he suggested that this
> > might be an alternate approach. My main gripe with DONTCACHE was that it
> > kicks off writeback after every WRITE operation. With NFS, we generally
> > get a COMMIT operation at some point. Allowing us to batch up writes
> > until that point has traditionally been considered better for
> > performance.
> 
> I wonder if that traditional consideration is justified, give your
> subsequent results.  The addition of COMMIT in v3 allowed us to both:
>  - delay kicking off writes
>  - not wait for writes to complete
> 
> I think the second was always primary.  Maybe we didn't consider the
> value of the first enough.
> Obviously the client caches writes and delays the start of writeback.
> Adding another delay on the serve side does not seem to have a clear
> justification.  Maybe we *should* kick-off writeback immediately.  There
> would still be opportunity for subsequent WRITE requests to be merged
> into the writeback queue.
> 

That is the fundamental question: should we delay writeback or not? It
seems like delaying it is probably best, even in the modern era with
SSDs, but we do need more numbers here (ideally across a range of
workloads).

> Ideally DONTCACHE should only affect cache usage and the latency of
> subsequence READs.  It shouldn't affect WRITE behaviour.
> 

It definitely does affect it today. The ideal thing IMO would be to
just add the dropbehind flag to the folios on writes but not call
filemap_fdatawrite_range_kick() on every write operation.

After a COMMIT the pages should be clean and the vfs_fadvise call
should just drop them from the cache, so this approach shouldn't
materially change how writeback behaves.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT
  2025-07-05 11:32   ` Jeff Layton
@ 2025-07-10  8:00     ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-07-10  8:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs,
	linux-kernel, axboe, kundanthebest, linux-fsdevel

On Sat, Jul 05, 2025 at 07:32:58AM -0400, Jeff Layton wrote:
> That is the fundamental question: should we delay writeback or not? It
> seems like delaying it is probably best, even in the modern era with
> SSDs, but we do need more numbers here (ideally across a range of
> workloads).

If you have asynchronous writeback there's probably no good reason to
delay it per-se.  But it does make sense to wait long enough to have
a large I/O size, especially with some form of parity raid you'll want
to fill up the chunk, but also storage devices themselves will perform
much better with a larger size.  e.g. for HDD you'll want to write 1MB
batches, and similar write sizes also help with for SSDs.  While the
write performance itself might not be much worse with smaller I/O
especially for high quality ones, large I/O helps to reduce the
internal fragmentation and thus later reduces garbage collection
overhead and thus increases life time.

> > Ideally DONTCACHE should only affect cache usage and the latency of
> > subsequence READs.  It shouldn't affect WRITE behaviour.
> > 
> 
> It definitely does affect it today. The ideal thing IMO would be to
> just add the dropbehind flag to the folios on writes but not call
> filemap_fdatawrite_range_kick() on every write operation.

Yes, a mode that sets drop behind but leaves writeback to the
writeback threads can be interesting.  Right now it will still be
bottlenecked by the single writeback thread, but work on this is
underway.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-07-10  8:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton
2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton
2025-07-03 23:33   ` NeilBrown
2025-07-04  0:05     ` Jeff Layton
2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton
2025-07-03 20:07   ` Chuck Lever
2025-07-08 14:34     ` Jeff Layton
2025-07-08 21:12       ` Mike Snitzer
2025-07-08 21:07     ` Mike Snitzer
2025-07-03 23:44   ` NeilBrown
2025-07-03 23:49     ` Jeff Layton
2025-07-04  7:26     ` NeilBrown
2025-07-05 11:21       ` Jeff Layton
2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown
2025-07-03 23:28   ` Chuck Lever
2025-07-04  7:34     ` NeilBrown
2025-07-05 11:32   ` Jeff Layton
2025-07-10  8:00     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).