* [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT @ 2025-07-03 19:53 Jeff Layton 2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Jeff Layton @ 2025-07-03 19:53 UTC (permalink / raw) To: Trond Myklebust, Anna Schumaker, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer Cc: linux-nfs, linux-kernel, Jeff Layton Chuck and I were discussing RWF_DONTCACHE and he suggested that this might be an alternate approach. My main gripe with DONTCACHE was that it kicks off writeback after every WRITE operation. With NFS, we generally get a COMMIT operation at some point. Allowing us to batch up writes until that point has traditionally been considered better for performance. Instead of RWF_DONTCACHE, this patch has nfsd issue generic_fadvise(..., POSIX_FADV_DONTNEED) on the appropriate range after any READ, stable WRITE or COMMIT operation. This means that it doesn't change how and when dirty data gets flushed to the disk, but still keeps resident pagecache to a minimum. For reference, here are some numbers from a fio run doing sequential reads and writes, with the server in "normal" buffered I/O mode, with Mike's RWF_DONTCACHE patch enabled, and with fadvise(...DONTNEED). Jobfile: [global] name=fio-seq-RW filename=fio-seq-RW rw=rw rwmixread=60 rwmixwrite=40 bs=1M direct=0 numjobs=16 time_based runtime=300 [file1] size=100G ioengine=io_uring iodepth=16 :::::::::::::::::::::::::::::::::::: 3 runs each. Baseline (nothing enabled): Run status group 0 (all jobs): READ: bw=2999MiB/s (3144MB/s), 185MiB/s-189MiB/s (194MB/s-198MB/s), io=879GiB (944GB), run=300014-300087msec WRITE: bw=1998MiB/s (2095MB/s), 124MiB/s-126MiB/s (130MB/s-132MB/s), io=585GiB (629GB), run=300014-300087msec READ: bw=2866MiB/s (3005MB/s), 177MiB/s-181MiB/s (185MB/s-190MB/s), io=844GiB (906GB), run=301294-301463msec WRITE: bw=1909MiB/s (2002MB/s), 117MiB/s-121MiB/s (123MB/s-127MB/s), io=562GiB (604GB), run=301294-301463msec READ: bw=2885MiB/s (3026MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=846GiB (908GB), run=300017-300117msec WRITE: bw=1923MiB/s (2016MB/s), 118MiB/s-122MiB/s (124MB/s-128MB/s), io=563GiB (605GB), run=300017-300117msec RWF_DONTCACHE: Run status group 0 (all jobs): READ: bw=3088MiB/s (3238MB/s), 189MiB/s-195MiB/s (198MB/s-205MB/s), io=906GiB (972GB), run=300015-300276msec WRITE: bw=2058MiB/s (2158MB/s), 126MiB/s-129MiB/s (132MB/s-136MB/s), io=604GiB (648GB), run=300015-300276msec READ: bw=3116MiB/s (3267MB/s), 191MiB/s-197MiB/s (201MB/s-206MB/s), io=913GiB (980GB), run=300022-300074msec WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-137MB/s), io=609GiB (654GB), run=300022-300074msec READ: bw=3011MiB/s (3158MB/s), 185MiB/s-191MiB/s (194MB/s-200MB/s), io=886GiB (951GB), run=301049-301133msec WRITE: bw=2007MiB/s (2104MB/s), 123MiB/s-127MiB/s (129MB/s-133MB/s), io=590GiB (634GB), run=301049-301133msec fadvise(..., POSIX_FADV_DONTNEED): READ: bw=2918MiB/s (3060MB/s), 180MiB/s-184MiB/s (188MB/s-193MB/s), io=855GiB (918GB), run=300014-300111msec WRITE: bw=1944MiB/s (2038MB/s), 120MiB/s-123MiB/s (125MB/s-129MB/s), io=570GiB (612GB), run=300014-300111msec READ: bw=2951MiB/s (3095MB/s), 182MiB/s-188MiB/s (191MB/s-197MB/s), io=867GiB (931GB), run=300529-300695msec WRITE: bw=1966MiB/s (2061MB/s), 121MiB/s-124MiB/s (127MB/s-130MB/s), io=577GiB (620GB), run=300529-300695msec READ: bw=2971MiB/s (3115MB/s), 181MiB/s-188MiB/s (190MB/s-197MB/s), io=871GiB (935GB), run=300015-300077msec WRITE: bw=1979MiB/s (2076MB/s), 122MiB/s-125MiB/s (128MB/s-131MB/s), io=580GiB (623GB), run=300015-300077msec :::::::::::::::::::::::::::::: The numbers are pretty close, but it looks like RWF_DONTCACHE edges out the other modes. Also, with the RWF_DONTCACHE and fadvise() modes the pagecache utilization stays very low on the server (which is of course, the point). I think next I'll test a hybrid mode. Use RWF_DONTCACHE for READ and stable WRITE operations, and do the fadvise() only after COMMITs. Plumbing this in for v4 will be "interesting" if we decide this approach is sound, but it shouldn't be too bad if we only do it after a COMMIT. Thoughts? Signed-off-by: Jeff Layton <jlayton@kernel.org> --- Jeff Layton (2): sunrpc: delay pc_release callback until after sending a reply nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT fs/nfsd/debugfs.c | 2 ++ fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- fs/nfsd/nfsd.h | 1 + fs/nfsd/nfsproc.c | 4 ++-- fs/nfsd/vfs.c | 21 ++++++++++++++----- fs/nfsd/vfs.h | 5 +++-- fs/nfsd/xdr3.h | 3 +++ net/sunrpc/svc.c | 19 ++++++++++++++---- 8 files changed, 92 insertions(+), 22 deletions(-) --- base-commit: 38ddcbef7f4e9c5aa075c8ccf9f6d5293e027951 change-id: 20250701-nfsd-testing-12e7c8da5f1c Best regards, -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply 2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton @ 2025-07-03 19:53 ` Jeff Layton 2025-07-03 23:33 ` NeilBrown 2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton 2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown 2 siblings, 1 reply; 18+ messages in thread From: Jeff Layton @ 2025-07-03 19:53 UTC (permalink / raw) To: Trond Myklebust, Anna Schumaker, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer Cc: linux-nfs, linux-kernel, Jeff Layton The server-side sunrpc code currently calls pc_release before sending the reply. A later nfsd patch will change some pc_release callbacks to do extra work to clean the pagecache. There is no need to delay sending the reply for this, however. Change svc_process and svc_process_bc to call pc_release after sending the reply instead of before. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- net/sunrpc/svc.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index b1fab3a6954437cf751e4725fa52cfc83eddf2ab..103bb6ba8e140fdccd6cab124e715caeb41bb445 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -1426,8 +1426,6 @@ svc_process_common(struct svc_rqst *rqstp) /* Call the function that processes the request. */ rc = process.dispatch(rqstp); - if (procp->pc_release) - procp->pc_release(rqstp); xdr_finish_decode(xdr); if (!rc) @@ -1526,6 +1524,14 @@ static void svc_drop(struct svc_rqst *rqstp) trace_svc_drop(rqstp); } +static void svc_release_rqst(struct svc_rqst *rqstp) +{ + const struct svc_procedure *procp = rqstp->rq_procinfo; + + if (procp && procp->pc_release) + procp->pc_release(rqstp); +} + /** * svc_process - Execute one RPC transaction * @rqstp: RPC transaction context @@ -1533,7 +1539,7 @@ static void svc_drop(struct svc_rqst *rqstp) */ void svc_process(struct svc_rqst *rqstp) { - struct kvec *resv = &rqstp->rq_res.head[0]; + struct kvec *resv = &rqstp->rq_res.head[0]; __be32 *p; #if IS_ENABLED(CONFIG_FAIL_SUNRPC) @@ -1565,9 +1571,12 @@ void svc_process(struct svc_rqst *rqstp) if (unlikely(*p != rpc_call)) goto out_baddir; - if (!svc_process_common(rqstp)) + if (!svc_process_common(rqstp)) { + svc_release_rqst(rqstp); goto out_drop; + } svc_send(rqstp); + svc_release_rqst(rqstp); return; out_baddir: @@ -1635,6 +1644,7 @@ void svc_process_bc(struct rpc_rqst *req, struct svc_rqst *rqstp) if (!proc_error) { /* Processing error: drop the request */ xprt_free_bc_request(req); + svc_release_rqst(rqstp); return; } /* Finally, send the reply synchronously */ @@ -1648,6 +1658,7 @@ void svc_process_bc(struct rpc_rqst *req, struct svc_rqst *rqstp) timeout.to_maxval = timeout.to_initval; memcpy(&req->rq_snd_buf, &rqstp->rq_res, sizeof(req->rq_snd_buf)); task = rpc_run_bc_task(req, &timeout); + svc_release_rqst(rqstp); if (IS_ERR(task)) return; -- 2.50.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply 2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton @ 2025-07-03 23:33 ` NeilBrown 2025-07-04 0:05 ` Jeff Layton 0 siblings, 1 reply; 18+ messages in thread From: NeilBrown @ 2025-07-03 23:33 UTC (permalink / raw) To: Jeff Layton Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel, Jeff Layton On Fri, 04 Jul 2025, Jeff Layton wrote: > The server-side sunrpc code currently calls pc_release before sending > the reply. A later nfsd patch will change some pc_release callbacks to > do extra work to clean the pagecache. There is no need to delay sending > the reply for this, however. > > Change svc_process and svc_process_bc to call pc_release after sending > the reply instead of before. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > net/sunrpc/svc.c | 19 +++++++++++++++---- > 1 file changed, 15 insertions(+), 4 deletions(-) > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c > index b1fab3a6954437cf751e4725fa52cfc83eddf2ab..103bb6ba8e140fdccd6cab124e715caeb41bb445 100644 > --- a/net/sunrpc/svc.c > +++ b/net/sunrpc/svc.c > @@ -1426,8 +1426,6 @@ svc_process_common(struct svc_rqst *rqstp) > > /* Call the function that processes the request. */ > rc = process.dispatch(rqstp); > - if (procp->pc_release) > - procp->pc_release(rqstp); > xdr_finish_decode(xdr); > > if (!rc) > @@ -1526,6 +1524,14 @@ static void svc_drop(struct svc_rqst *rqstp) > trace_svc_drop(rqstp); > } > > +static void svc_release_rqst(struct svc_rqst *rqstp) > +{ > + const struct svc_procedure *procp = rqstp->rq_procinfo; > + > + if (procp && procp->pc_release) > + procp->pc_release(rqstp); > +} > + > /** > * svc_process - Execute one RPC transaction > * @rqstp: RPC transaction context > @@ -1533,7 +1539,7 @@ static void svc_drop(struct svc_rqst *rqstp) > */ > void svc_process(struct svc_rqst *rqstp) > { > - struct kvec *resv = &rqstp->rq_res.head[0]; > + struct kvec *resv = &rqstp->rq_res.head[0]; Commas and Tabs - you can never really have enough of them, can you? > __be32 *p; > > #if IS_ENABLED(CONFIG_FAIL_SUNRPC) > @@ -1565,9 +1571,12 @@ void svc_process(struct svc_rqst *rqstp) > if (unlikely(*p != rpc_call)) > goto out_baddir; > > - if (!svc_process_common(rqstp)) > + if (!svc_process_common(rqstp)) { > + svc_release_rqst(rqstp); > goto out_drop; > + } > svc_send(rqstp); > + svc_release_rqst(rqstp); > return; Should we, as a general rule, avoid calling any cleanup function more than once? When tempted, we DEFINE_FREE() a cleanup function and declare the variable appropriately. Though in this case it might be easier to: if (svc_process_common(rqstp)) svc_send(rqstp); else svc_drop(rqstp); svc_rlease_rqst(rqstp); return; svc_process_bc() is a little more awkward. But in general, delaying the release function until after the send seems sound, and this patches appears to do it corretly. Reviewed-by: NeilBrown <neil@brown.name> NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply 2025-07-03 23:33 ` NeilBrown @ 2025-07-04 0:05 ` Jeff Layton 0 siblings, 0 replies; 18+ messages in thread From: Jeff Layton @ 2025-07-04 0:05 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel On Fri, 2025-07-04 at 09:33 +1000, NeilBrown wrote: > On Fri, 04 Jul 2025, Jeff Layton wrote: > > The server-side sunrpc code currently calls pc_release before sending > > the reply. A later nfsd patch will change some pc_release callbacks to > > do extra work to clean the pagecache. There is no need to delay sending > > the reply for this, however. > > > > Change svc_process and svc_process_bc to call pc_release after sending > > the reply instead of before. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > net/sunrpc/svc.c | 19 +++++++++++++++---- > > 1 file changed, 15 insertions(+), 4 deletions(-) > > > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c > > index b1fab3a6954437cf751e4725fa52cfc83eddf2ab..103bb6ba8e140fdccd6cab124e715caeb41bb445 100644 > > --- a/net/sunrpc/svc.c > > +++ b/net/sunrpc/svc.c > > @@ -1426,8 +1426,6 @@ svc_process_common(struct svc_rqst *rqstp) > > > > /* Call the function that processes the request. */ > > rc = process.dispatch(rqstp); > > - if (procp->pc_release) > > - procp->pc_release(rqstp); > > xdr_finish_decode(xdr); > > > > if (!rc) > > @@ -1526,6 +1524,14 @@ static void svc_drop(struct svc_rqst *rqstp) > > trace_svc_drop(rqstp); > > } > > > > +static void svc_release_rqst(struct svc_rqst *rqstp) > > +{ > > + const struct svc_procedure *procp = rqstp->rq_procinfo; > > + > > + if (procp && procp->pc_release) > > + procp->pc_release(rqstp); > > +} > > + > > /** > > * svc_process - Execute one RPC transaction > > * @rqstp: RPC transaction context > > @@ -1533,7 +1539,7 @@ static void svc_drop(struct svc_rqst *rqstp) > > */ > > void svc_process(struct svc_rqst *rqstp) > > { > > - struct kvec *resv = &rqstp->rq_res.head[0]; > > + struct kvec *resv = &rqstp->rq_res.head[0]; > > Commas and Tabs - you can never really have enough of them, can you? > Not sure what happened there. I'll drop that hunk. > > __be32 *p; > > > > #if IS_ENABLED(CONFIG_FAIL_SUNRPC) > > @@ -1565,9 +1571,12 @@ void svc_process(struct svc_rqst *rqstp) > > if (unlikely(*p != rpc_call)) > > goto out_baddir; > > > > - if (!svc_process_common(rqstp)) > > + if (!svc_process_common(rqstp)) { > > + svc_release_rqst(rqstp); > > goto out_drop; > > + } > > svc_send(rqstp); > > + svc_release_rqst(rqstp); > > return; > > Should we, as a general rule, avoid calling any cleanup function more > than once? When tempted, we DEFINE_FREE() a cleanup function and > declare the variable appropriately. I'm not opposed to that. I think that change probably deserves a separate patch. > Though in this case it might be easier to: > > if (svc_process_common(rqstp)) > svc_send(rqstp); > else > svc_drop(rqstp); > svc_rlease_rqst(rqstp); > return; > There is another place that does a "goto out_drop in that function. I'm not sure changing that would improve things, but I'll see how it looks. > svc_process_bc() is a little more awkward. > Definitely. > But in general, delaying the release function until after the send seems > sound, and this patches appears to do it corretly. > > Reviewed-by: NeilBrown <neil@brown.name> > > NeilBrown Thanks for the review! -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton 2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton @ 2025-07-03 19:53 ` Jeff Layton 2025-07-03 20:07 ` Chuck Lever 2025-07-03 23:44 ` NeilBrown 2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown 2 siblings, 2 replies; 18+ messages in thread From: Jeff Layton @ 2025-07-03 19:53 UTC (permalink / raw) To: Trond Myklebust, Anna Schumaker, Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer Cc: linux-nfs, linux-kernel, Jeff Layton Recent testing has shown that keeping pagecache pages around for too long can be detrimental to performance with nfsd. Clients only rarely revisit the same data, so the pages tend to just hang around. This patch changes the pc_release callbacks for NFSv3 READ, WRITE and COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed range. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/nfsd/debugfs.c | 2 ++ fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- fs/nfsd/nfsd.h | 1 + fs/nfsd/nfsproc.c | 4 ++-- fs/nfsd/vfs.c | 21 ++++++++++++++----- fs/nfsd/vfs.h | 5 +++-- fs/nfsd/xdr3.h | 3 +++ 7 files changed, 77 insertions(+), 18 deletions(-) diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 --- a/fs/nfsd/debugfs.c +++ b/fs/nfsd/debugfs.c @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, nfsd_top_dir, NULL, &nfsd_dsr_fops); + debugfs_create_bool("enable-fadvise-dontneed", 0644, + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); } diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644 --- a/fs/nfsd/nfs3proc.c +++ b/fs/nfsd/nfs3proc.c @@ -9,6 +9,7 @@ #include <linux/ext2_fs.h> #include <linux/magic.h> #include <linux/namei.h> +#include <linux/fadvise.h> #include "cache.h" #include "xdr3.h" @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp) fh_copy(&resp->fh, &argp->fh); resp->status = nfsd_read(rqstp, &resp->fh, argp->offset, - &resp->count, &resp->eof); + &resp->count, &resp->eof, &resp->nf); resp->status = nfsd3_map_status(resp->status); return rpc_success; } +static void +nfsd3_release_read(struct svc_rqst *rqstp) +{ + struct nfsd3_readargs *argp = rqstp->rq_argp; + struct nfsd3_readres *resp = rqstp->rq_resp; + + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok) + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, + POSIX_FADV_DONTNEED); + if (resp->nf) + nfsd_file_put(resp->nf); + fh_put(&resp->fh); +} + /* * Write data to a file */ @@ -236,12 +251,26 @@ nfsd3_proc_write(struct svc_rqst *rqstp) resp->committed = argp->stable; resp->status = nfsd_write(rqstp, &resp->fh, argp->offset, &argp->payload, &cnt, - resp->committed, resp->verf); + resp->committed, resp->verf, &resp->nf); resp->count = cnt; resp->status = nfsd3_map_status(resp->status); return rpc_success; } +static void +nfsd3_release_write(struct svc_rqst *rqstp) +{ + struct nfsd3_writeargs *argp = rqstp->rq_argp; + struct nfsd3_writeres *resp = rqstp->rq_resp; + + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok && resp->committed) + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, + POSIX_FADV_DONTNEED); + if (resp->nf) + nfsd_file_put(resp->nf); + fh_put(&resp->fh); +} + /* * Implement NFSv3's unchecked, guarded, and exclusive CREATE * semantics for regular files. Except for the created file, @@ -748,7 +777,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp) { struct nfsd3_commitargs *argp = rqstp->rq_argp; struct nfsd3_commitres *resp = rqstp->rq_resp; - struct nfsd_file *nf; dprintk("nfsd: COMMIT(3) %s %u@%Lu\n", SVCFH_fmt(&argp->fh), @@ -757,17 +785,30 @@ nfsd3_proc_commit(struct svc_rqst *rqstp) fh_copy(&resp->fh, &argp->fh); resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE | - NFSD_MAY_NOT_BREAK_LEASE, &nf); + NFSD_MAY_NOT_BREAK_LEASE, &resp->nf); if (resp->status) goto out; - resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset, + resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset, argp->count, resp->verf); - nfsd_file_put(nf); out: resp->status = nfsd3_map_status(resp->status); return rpc_success; } +static void +nfsd3_release_commit(struct svc_rqst *rqstp) +{ + struct nfsd3_commitargs *argp = rqstp->rq_argp; + struct nfsd3_commitres *resp = rqstp->rq_resp; + + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok) + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, argp->count, + POSIX_FADV_DONTNEED); + if (resp->nf) + nfsd_file_put(resp->nf); + fh_put(&resp->fh); +} + /* * NFSv3 Server procedures. @@ -864,7 +905,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { .pc_func = nfsd3_proc_read, .pc_decode = nfs3svc_decode_readargs, .pc_encode = nfs3svc_encode_readres, - .pc_release = nfs3svc_release_fhandle, + .pc_release = nfsd3_release_read, .pc_argsize = sizeof(struct nfsd3_readargs), .pc_argzero = sizeof(struct nfsd3_readargs), .pc_ressize = sizeof(struct nfsd3_readres), @@ -876,7 +917,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { .pc_func = nfsd3_proc_write, .pc_decode = nfs3svc_decode_writeargs, .pc_encode = nfs3svc_encode_writeres, - .pc_release = nfs3svc_release_fhandle, + .pc_release = nfsd3_release_write, .pc_argsize = sizeof(struct nfsd3_writeargs), .pc_argzero = sizeof(struct nfsd3_writeargs), .pc_ressize = sizeof(struct nfsd3_writeres), @@ -1039,7 +1080,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { .pc_func = nfsd3_proc_commit, .pc_decode = nfs3svc_decode_commitargs, .pc_encode = nfs3svc_encode_commitres, - .pc_release = nfs3svc_release_fhandle, + .pc_release = nfsd3_release_commit, .pc_argsize = sizeof(struct nfsd3_commitargs), .pc_argzero = sizeof(struct nfsd3_commitargs), .pc_ressize = sizeof(struct nfsd3_commitres), diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h index 1cd0bed57bc2f27248fd66a8efef692a5e9a390c..288904d88b9245c03eae0aa347e867037b7c85c5 100644 --- a/fs/nfsd/nfsd.h +++ b/fs/nfsd/nfsd.h @@ -152,6 +152,7 @@ static inline void nfsd_debugfs_exit(void) {} #endif extern bool nfsd_disable_splice_read __read_mostly; +extern bool nfsd_enable_fadvise_dontneed __read_mostly; extern int nfsd_max_blksize; diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c index 8f71f5748c75b69f15bae5e63799842e0e8b3139..bb8f98adb3e31b10adc4694987f8f5036726bf7f 100644 --- a/fs/nfsd/nfsproc.c +++ b/fs/nfsd/nfsproc.c @@ -225,7 +225,7 @@ nfsd_proc_read(struct svc_rqst *rqstp) resp->count = argp->count; fh_copy(&resp->fh, &argp->fh); resp->status = nfsd_read(rqstp, &resp->fh, argp->offset, - &resp->count, &eof); + &resp->count, &eof, NULL); if (resp->status == nfs_ok) resp->status = fh_getattr(&resp->fh, &resp->stat); else if (resp->status == nfserr_jukebox) @@ -258,7 +258,7 @@ nfsd_proc_write(struct svc_rqst *rqstp) fh_copy(&resp->fh, &argp->fh); resp->status = nfsd_write(rqstp, &resp->fh, argp->offset, - &argp->payload, &cnt, NFS_DATA_SYNC, NULL); + &argp->payload, &cnt, NFS_DATA_SYNC, NULL, NULL); if (resp->status == nfs_ok) resp->status = fh_getattr(&resp->fh, &resp->stat); else if (resp->status == nfserr_jukebox) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index ee78b6fb17098b788b07f5cd90598e678244b57e..f23eb3a07bb99dc231be9ea6db4e58b9e328a689 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -49,6 +49,7 @@ #define NFSDDBG_FACILITY NFSDDBG_FILEOP bool nfsd_disable_splice_read __read_mostly; +bool nfsd_enable_fadvise_dontneed __read_mostly; /** * nfserrno - Map Linux errnos to NFS errnos @@ -1280,6 +1281,7 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp) * @offset: starting byte offset * @count: IN: requested number of bytes; OUT: number of bytes read * @eof: OUT: set non-zero if operation reached the end of the file + * @pnf: optional return pointer to pass back nfsd_file reference * * The caller must verify that there is enough space in @rqstp.rq_res * to perform this operation. @@ -1290,7 +1292,8 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp) * returned. */ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, - loff_t offset, unsigned long *count, u32 *eof) + loff_t offset, unsigned long *count, u32 *eof, + struct nfsd_file **pnf) { struct nfsd_file *nf; struct file *file; @@ -1307,7 +1310,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, else err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof); - nfsd_file_put(nf); + if (pnf && err == nfs_ok) + *pnf = nf; + else + nfsd_file_put(nf); trace_nfsd_read_done(rqstp, fhp, offset, *count); return err; } @@ -1321,8 +1327,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, * @cnt: IN: number of bytes to write, OUT: number of bytes actually written * @stable: An NFS stable_how value * @verf: NFS WRITE verifier + * @pnf: optional return pointer to pass back nfsd_file reference * - * Upon return, caller must invoke fh_put on @fhp. + * Upon return, caller must invoke fh_put() on @fhp. If it sets @pnf, + * then it must also call nfsd_file_put() on the returned reference. * * Return values: * An nfsstat value in network byte order. @@ -1330,7 +1338,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, __be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, const struct xdr_buf *payload, unsigned long *cnt, int stable, - __be32 *verf) + __be32 *verf, struct nfsd_file **pnf) { struct nfsd_file *nf; __be32 err; @@ -1343,7 +1351,10 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt, stable, verf); - nfsd_file_put(nf); + if (pnf && err == nfs_ok) + *pnf = nf; + else + nfsd_file_put(nf); out: trace_nfsd_write_done(rqstp, fhp, offset, *cnt); return err; diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h index eff04959606fe55c141ab4a2eed97c7e0716a5f5..2d063ee7786f499f34c39cd3ba7e776bb7562d57 100644 --- a/fs/nfsd/vfs.h +++ b/fs/nfsd/vfs.h @@ -127,10 +127,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, bool nfsd_read_splice_ok(struct svc_rqst *rqstp); __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, unsigned long *count, - u32 *eof); + u32 *eof, struct nfsd_file **pnf); __be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, const struct xdr_buf *payload, - unsigned long *cnt, int stable, __be32 *verf); + unsigned long *cnt, int stable, __be32 *verf, + struct nfsd_file **pnf); __be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, loff_t offset, const struct xdr_buf *payload, diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h index 522067b7fd755930a1c2e42b826d9132ac2993be..3f4c51983003188be51a0f8c2db2e0acc9a1363b 100644 --- a/fs/nfsd/xdr3.h +++ b/fs/nfsd/xdr3.h @@ -146,6 +146,7 @@ struct nfsd3_readres { unsigned long count; __u32 eof; struct page **pages; + struct nfsd_file *nf; }; struct nfsd3_writeres { @@ -154,6 +155,7 @@ struct nfsd3_writeres { unsigned long count; int committed; __be32 verf[2]; + struct nfsd_file *nf; }; struct nfsd3_renameres { @@ -217,6 +219,7 @@ struct nfsd3_commitres { __be32 status; struct svc_fh fh; __be32 verf[2]; + struct nfsd_file *nf; }; struct nfsd3_getaclres { -- 2.50.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton @ 2025-07-03 20:07 ` Chuck Lever 2025-07-08 14:34 ` Jeff Layton 2025-07-08 21:07 ` Mike Snitzer 2025-07-03 23:44 ` NeilBrown 1 sibling, 2 replies; 18+ messages in thread From: Chuck Lever @ 2025-07-03 20:07 UTC (permalink / raw) To: Jeff Layton, Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer Cc: linux-nfs, linux-kernel On 7/3/25 3:53 PM, Jeff Layton wrote: > Recent testing has shown that keeping pagecache pages around for too > long can be detrimental to performance with nfsd. Clients only rarely > revisit the same data, so the pages tend to just hang around. > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed > range. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/nfsd/debugfs.c | 2 ++ > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- > fs/nfsd/nfsd.h | 1 + > fs/nfsd/nfsproc.c | 4 ++-- > fs/nfsd/vfs.c | 21 ++++++++++++++----- > fs/nfsd/vfs.h | 5 +++-- > fs/nfsd/xdr3.h | 3 +++ > 7 files changed, 77 insertions(+), 18 deletions(-) > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 > --- a/fs/nfsd/debugfs.c > +++ b/fs/nfsd/debugfs.c > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) > > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, > nfsd_top_dir, NULL, &nfsd_dsr_fops); > + debugfs_create_bool("enable-fadvise-dontneed", 0644, > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); I prefer that this setting is folded into the new io_cache_read / io_cache_write tune-ables that Mike's patch adds, rather than adding a new boolean. That might make a hybrid "DONTCACHE for READ and fadvise for WRITE" pretty easy. > } > diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c > index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644 > --- a/fs/nfsd/nfs3proc.c > +++ b/fs/nfsd/nfs3proc.c > @@ -9,6 +9,7 @@ > #include <linux/ext2_fs.h> > #include <linux/magic.h> > #include <linux/namei.h> > +#include <linux/fadvise.h> > > #include "cache.h" > #include "xdr3.h" > @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp) > > fh_copy(&resp->fh, &argp->fh); > resp->status = nfsd_read(rqstp, &resp->fh, argp->offset, > - &resp->count, &resp->eof); > + &resp->count, &resp->eof, &resp->nf); > resp->status = nfsd3_map_status(resp->status); > return rpc_success; > } > > +static void > +nfsd3_release_read(struct svc_rqst *rqstp) > +{ > + struct nfsd3_readargs *argp = rqstp->rq_argp; > + struct nfsd3_readres *resp = rqstp->rq_resp; > + > + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok) > + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, > + POSIX_FADV_DONTNEED); > + if (resp->nf) > + nfsd_file_put(resp->nf); > + fh_put(&resp->fh); > +} > + > /* > * Write data to a file > */ > @@ -236,12 +251,26 @@ nfsd3_proc_write(struct svc_rqst *rqstp) > resp->committed = argp->stable; > resp->status = nfsd_write(rqstp, &resp->fh, argp->offset, > &argp->payload, &cnt, > - resp->committed, resp->verf); > + resp->committed, resp->verf, &resp->nf); > resp->count = cnt; > resp->status = nfsd3_map_status(resp->status); > return rpc_success; > } > > +static void > +nfsd3_release_write(struct svc_rqst *rqstp) > +{ > + struct nfsd3_writeargs *argp = rqstp->rq_argp; > + struct nfsd3_writeres *resp = rqstp->rq_resp; > + > + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok && resp->committed) > + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, > + POSIX_FADV_DONTNEED); > + if (resp->nf) > + nfsd_file_put(resp->nf); > + fh_put(&resp->fh); > +} > + > /* > * Implement NFSv3's unchecked, guarded, and exclusive CREATE > * semantics for regular files. Except for the created file, > @@ -748,7 +777,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp) > { > struct nfsd3_commitargs *argp = rqstp->rq_argp; > struct nfsd3_commitres *resp = rqstp->rq_resp; > - struct nfsd_file *nf; > > dprintk("nfsd: COMMIT(3) %s %u@%Lu\n", > SVCFH_fmt(&argp->fh), > @@ -757,17 +785,30 @@ nfsd3_proc_commit(struct svc_rqst *rqstp) > > fh_copy(&resp->fh, &argp->fh); > resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE | > - NFSD_MAY_NOT_BREAK_LEASE, &nf); > + NFSD_MAY_NOT_BREAK_LEASE, &resp->nf); > if (resp->status) > goto out; > - resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset, > + resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset, > argp->count, resp->verf); > - nfsd_file_put(nf); > out: > resp->status = nfsd3_map_status(resp->status); > return rpc_success; > } > > +static void > +nfsd3_release_commit(struct svc_rqst *rqstp) > +{ > + struct nfsd3_commitargs *argp = rqstp->rq_argp; > + struct nfsd3_commitres *resp = rqstp->rq_resp; > + > + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok) > + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, argp->count, > + POSIX_FADV_DONTNEED); > + if (resp->nf) > + nfsd_file_put(resp->nf); > + fh_put(&resp->fh); > +} > + > > /* > * NFSv3 Server procedures. > @@ -864,7 +905,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { > .pc_func = nfsd3_proc_read, > .pc_decode = nfs3svc_decode_readargs, > .pc_encode = nfs3svc_encode_readres, > - .pc_release = nfs3svc_release_fhandle, > + .pc_release = nfsd3_release_read, > .pc_argsize = sizeof(struct nfsd3_readargs), > .pc_argzero = sizeof(struct nfsd3_readargs), > .pc_ressize = sizeof(struct nfsd3_readres), > @@ -876,7 +917,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { > .pc_func = nfsd3_proc_write, > .pc_decode = nfs3svc_decode_writeargs, > .pc_encode = nfs3svc_encode_writeres, > - .pc_release = nfs3svc_release_fhandle, > + .pc_release = nfsd3_release_write, > .pc_argsize = sizeof(struct nfsd3_writeargs), > .pc_argzero = sizeof(struct nfsd3_writeargs), > .pc_ressize = sizeof(struct nfsd3_writeres), > @@ -1039,7 +1080,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { > .pc_func = nfsd3_proc_commit, > .pc_decode = nfs3svc_decode_commitargs, > .pc_encode = nfs3svc_encode_commitres, > - .pc_release = nfs3svc_release_fhandle, > + .pc_release = nfsd3_release_commit, > .pc_argsize = sizeof(struct nfsd3_commitargs), > .pc_argzero = sizeof(struct nfsd3_commitargs), > .pc_ressize = sizeof(struct nfsd3_commitres), > diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h > index 1cd0bed57bc2f27248fd66a8efef692a5e9a390c..288904d88b9245c03eae0aa347e867037b7c85c5 100644 > --- a/fs/nfsd/nfsd.h > +++ b/fs/nfsd/nfsd.h > @@ -152,6 +152,7 @@ static inline void nfsd_debugfs_exit(void) {} > #endif > > extern bool nfsd_disable_splice_read __read_mostly; > +extern bool nfsd_enable_fadvise_dontneed __read_mostly; > > extern int nfsd_max_blksize; > > diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c > index 8f71f5748c75b69f15bae5e63799842e0e8b3139..bb8f98adb3e31b10adc4694987f8f5036726bf7f 100644 > --- a/fs/nfsd/nfsproc.c > +++ b/fs/nfsd/nfsproc.c > @@ -225,7 +225,7 @@ nfsd_proc_read(struct svc_rqst *rqstp) > resp->count = argp->count; > fh_copy(&resp->fh, &argp->fh); > resp->status = nfsd_read(rqstp, &resp->fh, argp->offset, > - &resp->count, &eof); > + &resp->count, &eof, NULL); > if (resp->status == nfs_ok) > resp->status = fh_getattr(&resp->fh, &resp->stat); > else if (resp->status == nfserr_jukebox) > @@ -258,7 +258,7 @@ nfsd_proc_write(struct svc_rqst *rqstp) > > fh_copy(&resp->fh, &argp->fh); > resp->status = nfsd_write(rqstp, &resp->fh, argp->offset, > - &argp->payload, &cnt, NFS_DATA_SYNC, NULL); > + &argp->payload, &cnt, NFS_DATA_SYNC, NULL, NULL); > if (resp->status == nfs_ok) > resp->status = fh_getattr(&resp->fh, &resp->stat); > else if (resp->status == nfserr_jukebox) > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index ee78b6fb17098b788b07f5cd90598e678244b57e..f23eb3a07bb99dc231be9ea6db4e58b9e328a689 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -49,6 +49,7 @@ > #define NFSDDBG_FACILITY NFSDDBG_FILEOP > > bool nfsd_disable_splice_read __read_mostly; > +bool nfsd_enable_fadvise_dontneed __read_mostly; > > /** > * nfserrno - Map Linux errnos to NFS errnos > @@ -1280,6 +1281,7 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > * @offset: starting byte offset > * @count: IN: requested number of bytes; OUT: number of bytes read > * @eof: OUT: set non-zero if operation reached the end of the file > + * @pnf: optional return pointer to pass back nfsd_file reference > * > * The caller must verify that there is enough space in @rqstp.rq_res > * to perform this operation. > @@ -1290,7 +1292,8 @@ bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > * returned. > */ > __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > - loff_t offset, unsigned long *count, u32 *eof) > + loff_t offset, unsigned long *count, u32 *eof, > + struct nfsd_file **pnf) > { > struct nfsd_file *nf; > struct file *file; > @@ -1307,7 +1310,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > else > err = nfsd_iter_read(rqstp, fhp, file, offset, count, 0, eof); > > - nfsd_file_put(nf); > + if (pnf && err == nfs_ok) > + *pnf = nf; > + else > + nfsd_file_put(nf); > trace_nfsd_read_done(rqstp, fhp, offset, *count); > return err; > } > @@ -1321,8 +1327,10 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > * @cnt: IN: number of bytes to write, OUT: number of bytes actually written > * @stable: An NFS stable_how value > * @verf: NFS WRITE verifier > + * @pnf: optional return pointer to pass back nfsd_file reference > * > - * Upon return, caller must invoke fh_put on @fhp. > + * Upon return, caller must invoke fh_put() on @fhp. If it sets @pnf, > + * then it must also call nfsd_file_put() on the returned reference. > * > * Return values: > * An nfsstat value in network byte order. > @@ -1330,7 +1338,7 @@ __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > __be32 > nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, > const struct xdr_buf *payload, unsigned long *cnt, int stable, > - __be32 *verf) > + __be32 *verf, struct nfsd_file **pnf) > { > struct nfsd_file *nf; > __be32 err; > @@ -1343,7 +1351,10 @@ nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset, > > err = nfsd_vfs_write(rqstp, fhp, nf, offset, payload, cnt, > stable, verf); > - nfsd_file_put(nf); > + if (pnf && err == nfs_ok) > + *pnf = nf; > + else > + nfsd_file_put(nf); > out: > trace_nfsd_write_done(rqstp, fhp, offset, *cnt); > return err; > diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h > index eff04959606fe55c141ab4a2eed97c7e0716a5f5..2d063ee7786f499f34c39cd3ba7e776bb7562d57 100644 > --- a/fs/nfsd/vfs.h > +++ b/fs/nfsd/vfs.h > @@ -127,10 +127,11 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > bool nfsd_read_splice_ok(struct svc_rqst *rqstp); > __be32 nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > loff_t offset, unsigned long *count, > - u32 *eof); > + u32 *eof, struct nfsd_file **pnf); > __be32 nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > loff_t offset, const struct xdr_buf *payload, > - unsigned long *cnt, int stable, __be32 *verf); > + unsigned long *cnt, int stable, __be32 *verf, > + struct nfsd_file **pnf); > __be32 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, > struct nfsd_file *nf, loff_t offset, > const struct xdr_buf *payload, > diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h > index 522067b7fd755930a1c2e42b826d9132ac2993be..3f4c51983003188be51a0f8c2db2e0acc9a1363b 100644 > --- a/fs/nfsd/xdr3.h > +++ b/fs/nfsd/xdr3.h > @@ -146,6 +146,7 @@ struct nfsd3_readres { > unsigned long count; > __u32 eof; > struct page **pages; > + struct nfsd_file *nf; > }; > > struct nfsd3_writeres { > @@ -154,6 +155,7 @@ struct nfsd3_writeres { > unsigned long count; > int committed; > __be32 verf[2]; > + struct nfsd_file *nf; > }; > > struct nfsd3_renameres { > @@ -217,6 +219,7 @@ struct nfsd3_commitres { > __be32 status; > struct svc_fh fh; > __be32 verf[2]; > + struct nfsd_file *nf; > }; > > struct nfsd3_getaclres { > -- Chuck Lever ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 20:07 ` Chuck Lever @ 2025-07-08 14:34 ` Jeff Layton 2025-07-08 21:12 ` Mike Snitzer 2025-07-08 21:07 ` Mike Snitzer 1 sibling, 1 reply; 18+ messages in thread From: Jeff Layton @ 2025-07-08 14:34 UTC (permalink / raw) To: Chuck Lever, Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer Cc: linux-nfs, linux-kernel [-- Attachment #1: Type: text/plain, Size: 5447 bytes --] On Thu, 2025-07-03 at 16:07 -0400, Chuck Lever wrote: > On 7/3/25 3:53 PM, Jeff Layton wrote: > > Recent testing has shown that keeping pagecache pages around for too > > long can be detrimental to performance with nfsd. Clients only rarely > > revisit the same data, so the pages tend to just hang around. > > > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and > > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed > > range. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/nfsd/debugfs.c | 2 ++ > > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- > > fs/nfsd/nfsd.h | 1 + > > fs/nfsd/nfsproc.c | 4 ++-- > > fs/nfsd/vfs.c | 21 ++++++++++++++----- > > fs/nfsd/vfs.h | 5 +++-- > > fs/nfsd/xdr3.h | 3 +++ > > 7 files changed, 77 insertions(+), 18 deletions(-) > > > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c > > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 > > --- a/fs/nfsd/debugfs.c > > +++ b/fs/nfsd/debugfs.c > > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) > > > > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, > > nfsd_top_dir, NULL, &nfsd_dsr_fops); > > + debugfs_create_bool("enable-fadvise-dontneed", 0644, > > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); > > I prefer that this setting is folded into the new io_cache_read / > io_cache_write tune-ables that Mike's patch adds, rather than adding > a new boolean. > > That might make a hybrid "DONTCACHE for READ and fadvise for WRITE" > pretty easy. > I ended up rebasing Mike's dontcache branch on top of v6.16-rc5 with all of Chuck's trees in. I then added the attached patch and did some testing with a couple of machines I checked out internally at Meta. This is the throughput results with the fio-seq-RW test with the file size set to 100G and the duration at 5 mins. Note that: read and writes buffered: READ: bw=3024MiB/s (3171MB/s), 186MiB/s-191MiB/s (195MB/s-201MB/s), io=889GiB (954GB), run=300012-300966msec WRITE: bw=2015MiB/s (2113MB/s), 124MiB/s-128MiB/s (131MB/s-134MB/s), io=592GiB (636GB), run=300012-300966msec READ: bw=2902MiB/s (3043MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=851GiB (913GB), run=300027-300118msec WRITE: bw=1934MiB/s (2027MB/s), 119MiB/s-122MiB/s (124MB/s-128MB/s), io=567GiB (608GB), run=300027-300118msec READ: bw=2897MiB/s (3037MB/s), 178MiB/s-183MiB/s (186MB/s-192MB/s), io=849GiB (911GB), run=300006-300078msec WRITE: bw=1930MiB/s (2023MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=565GiB (607GB), run=300006-300078msec reads and writes RWF_DONTCACHE: READ: bw=3090MiB/s (3240MB/s), 190MiB/s-195MiB/s (199MB/s-205MB/s), io=906GiB (972GB), run=300015-300113msec WRITE: bw=2060MiB/s (2160MB/s), 126MiB/s-130MiB/s (132MB/s-137MB/s), io=604GiB (648GB), run=300015-300113msec READ: bw=3057MiB/s (3205MB/s), 188MiB/s-193MiB/s (198MB/s-203MB/s), io=897GiB (963GB), run=300329-300450msec WRITE: bw=2037MiB/s (2136MB/s), 126MiB/s-129MiB/s (132MB/s-135MB/s), io=598GiB (642GB), run=300329-300450msec READ: bw=3166MiB/s (3320MB/s), 196MiB/s-200MiB/s (205MB/s-210MB/s), io=928GiB (996GB), run=300021-300090msec WRITE: bw=2111MiB/s (2213MB/s), 131MiB/s-133MiB/s (137MB/s-140MB/s), io=619GiB (664GB), run=300021-300090msec reads and writes witg O_DIRECT: READ: bw=3115MiB/s (3267MB/s), 192MiB/s-198MiB/s (201MB/s-208MB/s), io=913GiB (980GB), run=300025-300078msec WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-138MB/s), io=609GiB (653GB), run=300025-300078msec READ: bw=3189MiB/s (3343MB/s), 197MiB/s-202MiB/s (207MB/s-211MB/s), io=934GiB (1003GB), run=300023-300096msec WRITE: bw=2125MiB/s (2228MB/s), 132MiB/s-134MiB/s (138MB/s-140MB/s), io=623GiB (669GB), run=300023-300096msec READ: bw=3113MiB/s (3264MB/s), 191MiB/s-197MiB/s (200MB/s-207MB/s), io=912GiB (979GB), run=300020-300098msec WRITE: bw=2075MiB/s (2175MB/s), 127MiB/s-131MiB/s (134MB/s-138MB/s), io=608GiB (653GB), run=300020-300098msec RWF_DONTCACHE on reads and stable writes + fadvise DONTNEED after COMMIT: READ: bw=2888MiB/s (3029MB/s), 178MiB/s-182MiB/s (187MB/s-191MB/s), io=846GiB (909GB), run=300012-300109msec WRITE: bw=1924MiB/s (2017MB/s), 118MiB/s-121MiB/s (124MB/s-127MB/s), io=564GiB (605GB), run=300012-300109msec READ: bw=2899MiB/s (3040MB/s), 180MiB/s-183MiB/s (188MB/s-192MB/s), io=852GiB (915GB), run=300022-300940msec WRITE: bw=1931MiB/s (2025MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=567GiB (609GB), run=300022-300940msec READ: bw=2902MiB/s (3043MB/s), 179MiB/s-184MiB/s (188MB/s-193MB/s), io=853GiB (916GB), run=300913-301146msec WRITE: bw=1933MiB/s (2027MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=568GiB (610GB), run=300913-301146msec The fadvise case is clearly slower than the others. Interestingly it also slowed down read performance, which leads me to believe that maybe the fadvise calls were interfering with concurrent reads. Given the disappointing numbers, I'll probably drop the last patch. There is probably a case to be made for patch #1, on the general principle of expediting sending the reply as much as possible. Chuck, let me know if you want me to submit that individually. -- Jeff Layton <jlayton@kernel.org> [-- Attachment #2: 0001-nfsd-add-a-NFSD_IO_FADVISE-setting-to-io_cache_write.patch --] [-- Type: text/x-patch, Size: 4348 bytes --] From 14958516bf45f92a8609cb6ad504e92550b416d7 Mon Sep 17 00:00:00 2001 From: Jeff Layton <jlayton@kernel.org> Date: Mon, 7 Jul 2025 11:00:34 -0400 Subject: [PATCH] nfsd: add a NFSD_IO_FADVISE setting to io_cache_write Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/nfsd/debugfs.c | 2 ++ fs/nfsd/nfs3proc.c | 24 +++++++++++++++++++----- fs/nfsd/nfsd.h | 3 ++- fs/nfsd/vfs.c | 4 ++++ fs/nfsd/xdr3.h | 1 + 5 files changed, 28 insertions(+), 6 deletions(-) diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c index fadf1d88f640..d0f7dfe6d621 100644 --- a/fs/nfsd/debugfs.c +++ b/fs/nfsd/debugfs.c @@ -89,6 +89,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get, * %0: NFS WRITE will use buffered IO (default) * %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind) * %2: NFS WRITE will use direct IO + * %3: NFS stable WRITE will use dontcache + fadvise DONTNEED after COMMIT * * The default value of this setting is zero (buffered IO is * used). This setting takes immediate effect for all NFS @@ -107,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val) case NFSD_IO_BUFFERED: case NFSD_IO_DONTCACHE: case NFSD_IO_DIRECT: + case NFSD_IO_FADVISE: nfsd_io_cache_write = val; break; default: diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c index b6d03e1ef5f7..6737d22c001d 100644 --- a/fs/nfsd/nfs3proc.c +++ b/fs/nfsd/nfs3proc.c @@ -9,6 +9,7 @@ #include <linux/ext2_fs.h> #include <linux/magic.h> #include <linux/namei.h> +#include <linux/fadvise.h> #include "cache.h" #include "xdr3.h" @@ -748,7 +749,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp) { struct nfsd3_commitargs *argp = rqstp->rq_argp; struct nfsd3_commitres *resp = rqstp->rq_resp; - struct nfsd_file *nf; dprintk("nfsd: COMMIT(3) %s %u@%Lu\n", SVCFH_fmt(&argp->fh), @@ -757,17 +757,31 @@ nfsd3_proc_commit(struct svc_rqst *rqstp) fh_copy(&resp->fh, &argp->fh); resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE | - NFSD_MAY_NOT_BREAK_LEASE, &nf); + NFSD_MAY_NOT_BREAK_LEASE, &resp->nf); if (resp->status) goto out; - resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset, + resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset, argp->count, resp->verf); - nfsd_file_put(nf); out: resp->status = nfsd3_map_status(resp->status); return rpc_success; } +static void +nfsd3_release_commit(struct svc_rqst *rqstp) +{ + struct nfsd3_commitargs *argp = rqstp->rq_argp; + struct nfsd3_commitres *resp = rqstp->rq_resp; + + fh_put(&resp->fh); + if (resp->nf) { + if (nfsd_io_cache_write == NFSD_IO_FADVISE) + vfs_fadvise(nfsd_file_file(resp->nf), argp->offset, + argp->count, POSIX_FADV_DONTNEED); + nfsd_file_put(resp->nf); + } +} + /* * NFSv3 Server procedures. @@ -1039,7 +1053,7 @@ static const struct svc_procedure nfsd_procedures3[22] = { .pc_func = nfsd3_proc_commit, .pc_decode = nfs3svc_decode_commitargs, .pc_encode = nfs3svc_encode_commitres, - .pc_release = nfs3svc_release_fhandle, + .pc_release = nfsd3_release_commit, .pc_argsize = sizeof(struct nfsd3_commitargs), .pc_argzero = sizeof(struct nfsd3_commitargs), .pc_ressize = sizeof(struct nfsd3_commitres), diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h index 1ae38c5557c4..b21902e02982 100644 --- a/fs/nfsd/nfsd.h +++ b/fs/nfsd/nfsd.h @@ -156,7 +156,8 @@ extern bool nfsd_disable_splice_read __read_mostly; enum { NFSD_IO_BUFFERED = 0, NFSD_IO_DONTCACHE, - NFSD_IO_DIRECT + NFSD_IO_DIRECT, + NFSD_IO_FADVISE }; extern u64 nfsd_io_cache_read __read_mostly; diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 08350070e083..5d4588a106b2 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1271,6 +1271,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, case NFSD_IO_DONTCACHE: kiocb.ki_flags = IOCB_DONTCACHE; break; + case NFSD_IO_FADVISE: + if (stable) + kiocb.ki_flags = IOCB_DONTCACHE; + break; case NFSD_IO_BUFFERED: break; } diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h index 522067b7fd75..ec91a24e651a 100644 --- a/fs/nfsd/xdr3.h +++ b/fs/nfsd/xdr3.h @@ -217,6 +217,7 @@ struct nfsd3_commitres { __be32 status; struct svc_fh fh; __be32 verf[2]; + struct nfsd_file *nf; }; struct nfsd3_getaclres { -- 2.50.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-08 14:34 ` Jeff Layton @ 2025-07-08 21:12 ` Mike Snitzer 0 siblings, 0 replies; 18+ messages in thread From: Mike Snitzer @ 2025-07-08 21:12 UTC (permalink / raw) To: Jeff Layton Cc: Chuck Lever, Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-kernel On Tue, Jul 08, 2025 at 10:34:15AM -0400, Jeff Layton wrote: > On Thu, 2025-07-03 at 16:07 -0400, Chuck Lever wrote: > > On 7/3/25 3:53 PM, Jeff Layton wrote: > > > Recent testing has shown that keeping pagecache pages around for too > > > long can be detrimental to performance with nfsd. Clients only rarely > > > revisit the same data, so the pages tend to just hang around. > > > > > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and > > > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed > > > range. > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > fs/nfsd/debugfs.c | 2 ++ > > > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- > > > fs/nfsd/nfsd.h | 1 + > > > fs/nfsd/nfsproc.c | 4 ++-- > > > fs/nfsd/vfs.c | 21 ++++++++++++++----- > > > fs/nfsd/vfs.h | 5 +++-- > > > fs/nfsd/xdr3.h | 3 +++ > > > 7 files changed, 77 insertions(+), 18 deletions(-) > > > > > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c > > > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 > > > --- a/fs/nfsd/debugfs.c > > > +++ b/fs/nfsd/debugfs.c > > > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) > > > > > > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, > > > nfsd_top_dir, NULL, &nfsd_dsr_fops); > > > + debugfs_create_bool("enable-fadvise-dontneed", 0644, > > > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); > > > > I prefer that this setting is folded into the new io_cache_read / > > io_cache_write tune-ables that Mike's patch adds, rather than adding > > a new boolean. > > > > That might make a hybrid "DONTCACHE for READ and fadvise for WRITE" > > pretty easy. > > > > I ended up rebasing Mike's dontcache branch on top of v6.16-rc5 with > all of Chuck's trees in. I then added the attached patch and did some > testing with a couple of machines I checked out internally at Meta. > This is the throughput results with the fio-seq-RW test with the file > size set to 100G and the duration at 5 mins. > > Note that: > > read and writes buffered: > READ: bw=3024MiB/s (3171MB/s), 186MiB/s-191MiB/s (195MB/s-201MB/s), io=889GiB (954GB), run=300012-300966msec > WRITE: bw=2015MiB/s (2113MB/s), 124MiB/s-128MiB/s (131MB/s-134MB/s), io=592GiB (636GB), run=300012-300966msec > > READ: bw=2902MiB/s (3043MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=851GiB (913GB), run=300027-300118msec > WRITE: bw=1934MiB/s (2027MB/s), 119MiB/s-122MiB/s (124MB/s-128MB/s), io=567GiB (608GB), run=300027-300118msec > > READ: bw=2897MiB/s (3037MB/s), 178MiB/s-183MiB/s (186MB/s-192MB/s), io=849GiB (911GB), run=300006-300078msec > WRITE: bw=1930MiB/s (2023MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=565GiB (607GB), run=300006-300078msec > > reads and writes RWF_DONTCACHE: > READ: bw=3090MiB/s (3240MB/s), 190MiB/s-195MiB/s (199MB/s-205MB/s), io=906GiB (972GB), run=300015-300113msec > WRITE: bw=2060MiB/s (2160MB/s), 126MiB/s-130MiB/s (132MB/s-137MB/s), io=604GiB (648GB), run=300015-300113msec > > READ: bw=3057MiB/s (3205MB/s), 188MiB/s-193MiB/s (198MB/s-203MB/s), io=897GiB (963GB), run=300329-300450msec > WRITE: bw=2037MiB/s (2136MB/s), 126MiB/s-129MiB/s (132MB/s-135MB/s), io=598GiB (642GB), run=300329-300450msec > > READ: bw=3166MiB/s (3320MB/s), 196MiB/s-200MiB/s (205MB/s-210MB/s), io=928GiB (996GB), run=300021-300090msec > WRITE: bw=2111MiB/s (2213MB/s), 131MiB/s-133MiB/s (137MB/s-140MB/s), io=619GiB (664GB), run=300021-300090msec > > reads and writes witg O_DIRECT: > READ: bw=3115MiB/s (3267MB/s), 192MiB/s-198MiB/s (201MB/s-208MB/s), io=913GiB (980GB), run=300025-300078msec > WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-138MB/s), io=609GiB (653GB), run=300025-300078msec > > READ: bw=3189MiB/s (3343MB/s), 197MiB/s-202MiB/s (207MB/s-211MB/s), io=934GiB (1003GB), run=300023-300096msec > WRITE: bw=2125MiB/s (2228MB/s), 132MiB/s-134MiB/s (138MB/s-140MB/s), io=623GiB (669GB), run=300023-300096msec > > READ: bw=3113MiB/s (3264MB/s), 191MiB/s-197MiB/s (200MB/s-207MB/s), io=912GiB (979GB), run=300020-300098msec > WRITE: bw=2075MiB/s (2175MB/s), 127MiB/s-131MiB/s (134MB/s-138MB/s), io=608GiB (653GB), run=300020-300098msec > > RWF_DONTCACHE on reads and stable writes + fadvise DONTNEED after COMMIT: > READ: bw=2888MiB/s (3029MB/s), 178MiB/s-182MiB/s (187MB/s-191MB/s), io=846GiB (909GB), run=300012-300109msec > WRITE: bw=1924MiB/s (2017MB/s), 118MiB/s-121MiB/s (124MB/s-127MB/s), io=564GiB (605GB), run=300012-300109msec > > READ: bw=2899MiB/s (3040MB/s), 180MiB/s-183MiB/s (188MB/s-192MB/s), io=852GiB (915GB), run=300022-300940msec > WRITE: bw=1931MiB/s (2025MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=567GiB (609GB), run=300022-300940msec > > READ: bw=2902MiB/s (3043MB/s), 179MiB/s-184MiB/s (188MB/s-193MB/s), io=853GiB (916GB), run=300913-301146msec > WRITE: bw=1933MiB/s (2027MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=568GiB (610GB), run=300913-301146msec > > > The fadvise case is clearly slower than the others. Interestingly it > also slowed down read performance, which leads me to believe that maybe > the fadvise calls were interfering with concurrent reads. Given the > disappointing numbers, I'll probably drop the last patch. > > There is probably a case to be made for patch #1, on the general > principle of expediting sending the reply as much as possible. Chuck, > let me know if you want me to submit that individually. > -- > Jeff Layton <jlayton@kernel.org> > From 14958516bf45f92a8609cb6ad504e92550b416d7 Mon Sep 17 00:00:00 2001 > From: Jeff Layton <jlayton@kernel.org> > Date: Mon, 7 Jul 2025 11:00:34 -0400 > Subject: [PATCH] nfsd: add a NFSD_IO_FADVISE setting to io_cache_write > > Signed-off-by: Jeff Layton <jlayton@kernel.org> Ah, you raced ahead and did what I suggested in my previous reply just now. Too bad the performance is lacking... but I applaud you're trying out a new/interesting idea! BTW, measuring CPU and memory use is also important to capture to get full picture. Mike ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 20:07 ` Chuck Lever 2025-07-08 14:34 ` Jeff Layton @ 2025-07-08 21:07 ` Mike Snitzer 1 sibling, 0 replies; 18+ messages in thread From: Mike Snitzer @ 2025-07-08 21:07 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, linux-kernel On Thu, Jul 03, 2025 at 04:07:51PM -0400, Chuck Lever wrote: > On 7/3/25 3:53 PM, Jeff Layton wrote: > > Recent testing has shown that keeping pagecache pages around for too > > long can be detrimental to performance with nfsd. Clients only rarely > > revisit the same data, so the pages tend to just hang around. > > > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and > > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed > > range. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/nfsd/debugfs.c | 2 ++ > > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- > > fs/nfsd/nfsd.h | 1 + > > fs/nfsd/nfsproc.c | 4 ++-- > > fs/nfsd/vfs.c | 21 ++++++++++++++----- > > fs/nfsd/vfs.h | 5 +++-- > > fs/nfsd/xdr3.h | 3 +++ > > 7 files changed, 77 insertions(+), 18 deletions(-) > > > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c > > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 > > --- a/fs/nfsd/debugfs.c > > +++ b/fs/nfsd/debugfs.c > > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) > > > > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, > > nfsd_top_dir, NULL, &nfsd_dsr_fops); > > + debugfs_create_bool("enable-fadvise-dontneed", 0644, > > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); > > I prefer that this setting is folded into the new io_cache_read / > io_cache_write tune-ables that Mike's patch adds, rather than adding > a new boolean. > > That might make a hybrid "DONTCACHE for READ and fadvise for WRITE" > pretty easy. That'd be really easy. Jeff, maybe have a look to rebase your changes on this patchset: https://lore.kernel.org/linux-nfs/20250708160619.64800-1-snitzer@kernel.org/ Ontop of this patch in particular: https://lore.kernel.org/linux-nfs/20250708160619.64800-8-snitzer@kernel.org/ My git branch with this patchset at the tip is available here: https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=kernel-6.12.24/nfsd-testing-snitm Mike ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton 2025-07-03 20:07 ` Chuck Lever @ 2025-07-03 23:44 ` NeilBrown 2025-07-03 23:49 ` Jeff Layton 2025-07-04 7:26 ` NeilBrown 1 sibling, 2 replies; 18+ messages in thread From: NeilBrown @ 2025-07-03 23:44 UTC (permalink / raw) To: Jeff Layton Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel, Jeff Layton On Fri, 04 Jul 2025, Jeff Layton wrote: > Recent testing has shown that keeping pagecache pages around for too > long can be detrimental to performance with nfsd. Clients only rarely > revisit the same data, so the pages tend to just hang around. > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed > range. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/nfsd/debugfs.c | 2 ++ > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- > fs/nfsd/nfsd.h | 1 + > fs/nfsd/nfsproc.c | 4 ++-- > fs/nfsd/vfs.c | 21 ++++++++++++++----- > fs/nfsd/vfs.h | 5 +++-- > fs/nfsd/xdr3.h | 3 +++ > 7 files changed, 77 insertions(+), 18 deletions(-) > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 > --- a/fs/nfsd/debugfs.c > +++ b/fs/nfsd/debugfs.c > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) > > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, > nfsd_top_dir, NULL, &nfsd_dsr_fops); > + debugfs_create_bool("enable-fadvise-dontneed", 0644, > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); > } > diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c > index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644 > --- a/fs/nfsd/nfs3proc.c > +++ b/fs/nfsd/nfs3proc.c > @@ -9,6 +9,7 @@ > #include <linux/ext2_fs.h> > #include <linux/magic.h> > #include <linux/namei.h> > +#include <linux/fadvise.h> > > #include "cache.h" > #include "xdr3.h" > @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp) > > fh_copy(&resp->fh, &argp->fh); > resp->status = nfsd_read(rqstp, &resp->fh, argp->offset, > - &resp->count, &resp->eof); > + &resp->count, &resp->eof, &resp->nf); > resp->status = nfsd3_map_status(resp->status); > return rpc_success; > } > > +static void > +nfsd3_release_read(struct svc_rqst *rqstp) > +{ > + struct nfsd3_readargs *argp = rqstp->rq_argp; > + struct nfsd3_readres *resp = rqstp->rq_resp; > + > + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok) > + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, > + POSIX_FADV_DONTNEED); > + if (resp->nf) > + nfsd_file_put(resp->nf); > + fh_put(&resp->fh); This looks wrong - testing resp->nf after assuming it was non-NULL. I don't think it *is* wrong because ->state == nfs_ok ensures ->nf is valid. But still.... How about: fh_put(resp->fh); if (!resp->nf) return; if (nfsd_enable_fadvise_dontneed) generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, POSIX_FADV_DONTNEED); nfsd_file_put(resp->nf); ?? Note that we don't test ->status because that is identical to testing ->nf. Ditto for other release functions. Otherwise it makes sense for exploring how to optimise IO. Reviewed-by: NeilBrown <neil@brown.name> NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 23:44 ` NeilBrown @ 2025-07-03 23:49 ` Jeff Layton 2025-07-04 7:26 ` NeilBrown 1 sibling, 0 replies; 18+ messages in thread From: Jeff Layton @ 2025-07-03 23:49 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel On Fri, 2025-07-04 at 09:44 +1000, NeilBrown wrote: > On Fri, 04 Jul 2025, Jeff Layton wrote: > > Recent testing has shown that keeping pagecache pages around for too > > long can be detrimental to performance with nfsd. Clients only rarely > > revisit the same data, so the pages tend to just hang around. > > > > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and > > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed > > range. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/nfsd/debugfs.c | 2 ++ > > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++--------- > > fs/nfsd/nfsd.h | 1 + > > fs/nfsd/nfsproc.c | 4 ++-- > > fs/nfsd/vfs.c | 21 ++++++++++++++----- > > fs/nfsd/vfs.h | 5 +++-- > > fs/nfsd/xdr3.h | 3 +++ > > 7 files changed, 77 insertions(+), 18 deletions(-) > > > > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c > > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644 > > --- a/fs/nfsd/debugfs.c > > +++ b/fs/nfsd/debugfs.c > > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void) > > > > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO, > > nfsd_top_dir, NULL, &nfsd_dsr_fops); > > + debugfs_create_bool("enable-fadvise-dontneed", 0644, > > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed); > > } > > diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c > > index b6d03e1ef5f7a5e8dd111b0d56c061f1e91abff7..11261cf67ea817ec566626f08b733e09c9e121de 100644 > > --- a/fs/nfsd/nfs3proc.c > > +++ b/fs/nfsd/nfs3proc.c > > @@ -9,6 +9,7 @@ > > #include <linux/ext2_fs.h> > > #include <linux/magic.h> > > #include <linux/namei.h> > > +#include <linux/fadvise.h> > > > > #include "cache.h" > > #include "xdr3.h" > > @@ -206,11 +207,25 @@ nfsd3_proc_read(struct svc_rqst *rqstp) > > > > fh_copy(&resp->fh, &argp->fh); > > resp->status = nfsd_read(rqstp, &resp->fh, argp->offset, > > - &resp->count, &resp->eof); > > + &resp->count, &resp->eof, &resp->nf); > > resp->status = nfsd3_map_status(resp->status); > > return rpc_success; > > } > > > > +static void > > +nfsd3_release_read(struct svc_rqst *rqstp) > > +{ > > + struct nfsd3_readargs *argp = rqstp->rq_argp; > > + struct nfsd3_readres *resp = rqstp->rq_resp; > > + > > + if (nfsd_enable_fadvise_dontneed && resp->status == nfs_ok) > > + generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, > > + POSIX_FADV_DONTNEED); > > + if (resp->nf) > > + nfsd_file_put(resp->nf); > > + fh_put(&resp->fh); > > This looks wrong - testing resp->nf after assuming it was non-NULL. > I don't think it *is* wrong because ->state == nfs_ok ensures > ->nf is valid. But still.... > That was my thinking, but I agree that it's a bit fragile. > How about: > > fh_put(resp->fh); > if (!resp->nf) > return; > if (nfsd_enable_fadvise_dontneed) > generic_fadvise(nfsd_file_file(resp->nf), argp->offset, resp->count, > POSIX_FADV_DONTNEED); > nfsd_file_put(resp->nf); > > ?? > Note that we don't test ->status because that is identical to testing ->nf. > Ditto for other release functions. > That looks good. I'll plan to do that in the next respin. > Otherwise it makes sense for exploring how to optimise IO. > > Reviewed-by: NeilBrown <neil@brown.name> > > NeilBrown Thanks for the reviews! -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-03 23:44 ` NeilBrown 2025-07-03 23:49 ` Jeff Layton @ 2025-07-04 7:26 ` NeilBrown 2025-07-05 11:21 ` Jeff Layton 1 sibling, 1 reply; 18+ messages in thread From: NeilBrown @ 2025-07-04 7:26 UTC (permalink / raw) To: Jeff Layton Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel, Jeff Layton On Fri, 04 Jul 2025, NeilBrown wrote: > > Otherwise it makes sense for exploring how to optimise IO. > > Reviewed-by: NeilBrown <neil@brown.name> Actually - I take that back. generic_fadvise() is the wrong interface. It is for filesystems to use if the don't have any special requirements, and for vfs_fadvise() to use if the file system hasn't give a function to use. nfsd should be calling vfs_fadvise(). NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT 2025-07-04 7:26 ` NeilBrown @ 2025-07-05 11:21 ` Jeff Layton 0 siblings, 0 replies; 18+ messages in thread From: Jeff Layton @ 2025-07-05 11:21 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel On Fri, 2025-07-04 at 17:26 +1000, NeilBrown wrote: > On Fri, 04 Jul 2025, NeilBrown wrote: > > > > Otherwise it makes sense for exploring how to optimise IO. > > > > Reviewed-by: NeilBrown <neil@brown.name> > > Actually - I take that back. generic_fadvise() is the wrong interface. > It is for filesystems to use if the don't have any special requirements, > and for vfs_fadvise() to use if the file system hasn't give a function > to use. > > nfsd should be calling vfs_fadvise(). Good catch. I'll fix that up in the next version. Thanks, -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT 2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton 2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton 2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton @ 2025-07-03 23:16 ` NeilBrown 2025-07-03 23:28 ` Chuck Lever 2025-07-05 11:32 ` Jeff Layton 2 siblings, 2 replies; 18+ messages in thread From: NeilBrown @ 2025-07-03 23:16 UTC (permalink / raw) To: Jeff Layton Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel, Jeff Layton On Fri, 04 Jul 2025, Jeff Layton wrote: > Chuck and I were discussing RWF_DONTCACHE and he suggested that this > might be an alternate approach. My main gripe with DONTCACHE was that it > kicks off writeback after every WRITE operation. With NFS, we generally > get a COMMIT operation at some point. Allowing us to batch up writes > until that point has traditionally been considered better for > performance. I wonder if that traditional consideration is justified, give your subsequent results. The addition of COMMIT in v3 allowed us to both: - delay kicking off writes - not wait for writes to complete I think the second was always primary. Maybe we didn't consider the value of the first enough. Obviously the client caches writes and delays the start of writeback. Adding another delay on the serve side does not seem to have a clear justification. Maybe we *should* kick-off writeback immediately. There would still be opportunity for subsequent WRITE requests to be merged into the writeback queue. Ideally DONTCACHE should only affect cache usage and the latency of subsequence READs. It shouldn't affect WRITE behaviour. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT 2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown @ 2025-07-03 23:28 ` Chuck Lever 2025-07-04 7:34 ` NeilBrown 2025-07-05 11:32 ` Jeff Layton 1 sibling, 1 reply; 18+ messages in thread From: Chuck Lever @ 2025-07-03 23:28 UTC (permalink / raw) To: NeilBrown, Jeff Layton Cc: Trond Myklebust, Anna Schumaker, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel On 7/3/25 7:16 PM, NeilBrown wrote: > On Fri, 04 Jul 2025, Jeff Layton wrote: >> Chuck and I were discussing RWF_DONTCACHE and he suggested that this >> might be an alternate approach. My main gripe with DONTCACHE was that it >> kicks off writeback after every WRITE operation. With NFS, we generally >> get a COMMIT operation at some point. Allowing us to batch up writes >> until that point has traditionally been considered better for >> performance. > > I wonder if that traditional consideration is justified, give your > subsequent results. The addition of COMMIT in v3 allowed us to both: > - delay kicking off writes > - not wait for writes to complete > > I think the second was always primary. Maybe we didn't consider the > value of the first enough. > Obviously the client caches writes and delays the start of writeback. > Adding another delay on the serve side does not seem to have a clear > justification. Maybe we *should* kick-off writeback immediately. There > would still be opportunity for subsequent WRITE requests to be merged > into the writeback queue. Dave Chinner had the same thought a while back. So I've experimented with starting writes as part of nfsd_write(). Kicking off writes, even without waiting, is actually pretty costly, and it resulted in worse performance. Now that .pc_release is called /after/ the WRITE response has been sent, though, that might be a place where kicking off writeback could be done without charging that latency to the client. -- Chuck Lever ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT 2025-07-03 23:28 ` Chuck Lever @ 2025-07-04 7:34 ` NeilBrown 0 siblings, 0 replies; 18+ messages in thread From: NeilBrown @ 2025-07-04 7:34 UTC (permalink / raw) To: Chuck Lever Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel On Fri, 04 Jul 2025, Chuck Lever wrote: > On 7/3/25 7:16 PM, NeilBrown wrote: > > On Fri, 04 Jul 2025, Jeff Layton wrote: > >> Chuck and I were discussing RWF_DONTCACHE and he suggested that this > >> might be an alternate approach. My main gripe with DONTCACHE was that it > >> kicks off writeback after every WRITE operation. With NFS, we generally > >> get a COMMIT operation at some point. Allowing us to batch up writes > >> until that point has traditionally been considered better for > >> performance. > > > > I wonder if that traditional consideration is justified, give your > > subsequent results. The addition of COMMIT in v3 allowed us to both: > > - delay kicking off writes > > - not wait for writes to complete > > > > I think the second was always primary. Maybe we didn't consider the > > value of the first enough. > > Obviously the client caches writes and delays the start of writeback. > > Adding another delay on the serve side does not seem to have a clear > > justification. Maybe we *should* kick-off writeback immediately. There > > would still be opportunity for subsequent WRITE requests to be merged > > into the writeback queue. > > Dave Chinner had the same thought a while back. So I've experimented > with starting writes as part of nfsd_write(). Kicking off writes, > even without waiting, is actually pretty costly, and it resulted in > worse performance. Was this with filemap_fdatawrite_range_kick() or something else? > > Now that .pc_release is called /after/ the WRITE response has been sent, > though, that might be a place where kicking off writeback could be done > without charging that latency to the client. Certainly after is better. I was imagining kicking the writeback threads, but they seems to only do whole filesystems. Maybe that is OK? I doubt we want more than one thread kicking off write for any given inode at a time. Maybe it would be best to add a "kicking_write" flag in the filecache and if it is already set, then just add the range to some range of pending-writes. Maybe. I guess I should explore how to test :-) NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT 2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown 2025-07-03 23:28 ` Chuck Lever @ 2025-07-05 11:32 ` Jeff Layton 2025-07-10 8:00 ` Christoph Hellwig 1 sibling, 1 reply; 18+ messages in thread From: Jeff Layton @ 2025-07-05 11:32 UTC (permalink / raw) To: NeilBrown Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel On Fri, 2025-07-04 at 09:16 +1000, NeilBrown wrote: > On Fri, 04 Jul 2025, Jeff Layton wrote: > > Chuck and I were discussing RWF_DONTCACHE and he suggested that this > > might be an alternate approach. My main gripe with DONTCACHE was that it > > kicks off writeback after every WRITE operation. With NFS, we generally > > get a COMMIT operation at some point. Allowing us to batch up writes > > until that point has traditionally been considered better for > > performance. > > I wonder if that traditional consideration is justified, give your > subsequent results. The addition of COMMIT in v3 allowed us to both: > - delay kicking off writes > - not wait for writes to complete > > I think the second was always primary. Maybe we didn't consider the > value of the first enough. > Obviously the client caches writes and delays the start of writeback. > Adding another delay on the serve side does not seem to have a clear > justification. Maybe we *should* kick-off writeback immediately. There > would still be opportunity for subsequent WRITE requests to be merged > into the writeback queue. > That is the fundamental question: should we delay writeback or not? It seems like delaying it is probably best, even in the modern era with SSDs, but we do need more numbers here (ideally across a range of workloads). > Ideally DONTCACHE should only affect cache usage and the latency of > subsequence READs. It shouldn't affect WRITE behaviour. > It definitely does affect it today. The ideal thing IMO would be to just add the dropbehind flag to the folios on writes but not call filemap_fdatawrite_range_kick() on every write operation. After a COMMIT the pages should be clean and the vfs_fadvise call should just drop them from the cache, so this approach shouldn't materially change how writeback behaves. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT 2025-07-05 11:32 ` Jeff Layton @ 2025-07-10 8:00 ` Christoph Hellwig 0 siblings, 0 replies; 18+ messages in thread From: Christoph Hellwig @ 2025-07-10 8:00 UTC (permalink / raw) To: Jeff Layton Cc: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever, Olga Kornievskaia, Dai Ngo, Tom Talpey, Mike Snitzer, linux-nfs, linux-kernel, axboe, kundanthebest, linux-fsdevel On Sat, Jul 05, 2025 at 07:32:58AM -0400, Jeff Layton wrote: > That is the fundamental question: should we delay writeback or not? It > seems like delaying it is probably best, even in the modern era with > SSDs, but we do need more numbers here (ideally across a range of > workloads). If you have asynchronous writeback there's probably no good reason to delay it per-se. But it does make sense to wait long enough to have a large I/O size, especially with some form of parity raid you'll want to fill up the chunk, but also storage devices themselves will perform much better with a larger size. e.g. for HDD you'll want to write 1MB batches, and similar write sizes also help with for SSDs. While the write performance itself might not be much worse with smaller I/O especially for high quality ones, large I/O helps to reduce the internal fragmentation and thus later reduces garbage collection overhead and thus increases life time. > > Ideally DONTCACHE should only affect cache usage and the latency of > > subsequence READs. It shouldn't affect WRITE behaviour. > > > > It definitely does affect it today. The ideal thing IMO would be to > just add the dropbehind flag to the folios on writes but not call > filemap_fdatawrite_range_kick() on every write operation. Yes, a mode that sets drop behind but leaves writeback to the writeback threads can be interesting. Right now it will still be bottlenecked by the single writeback thread, but work on this is underway. ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-07-10 8:00 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton 2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton 2025-07-03 23:33 ` NeilBrown 2025-07-04 0:05 ` Jeff Layton 2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton 2025-07-03 20:07 ` Chuck Lever 2025-07-08 14:34 ` Jeff Layton 2025-07-08 21:12 ` Mike Snitzer 2025-07-08 21:07 ` Mike Snitzer 2025-07-03 23:44 ` NeilBrown 2025-07-03 23:49 ` Jeff Layton 2025-07-04 7:26 ` NeilBrown 2025-07-05 11:21 ` Jeff Layton 2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown 2025-07-03 23:28 ` Chuck Lever 2025-07-04 7:34 ` NeilBrown 2025-07-05 11:32 ` Jeff Layton 2025-07-10 8:00 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).