From: Jeff Layton <jlayton@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>,
Trond Myklebust <trondmy@kernel.org>,
Anna Schumaker <anna@kernel.org>, NeilBrown <neil@brown.name>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
Mike Snitzer <snitzer@kernel.org>
Cc: linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT
Date: Tue, 08 Jul 2025 10:34:15 -0400
Message-ID: <cda4542e4ae8b30a6f5628386388f813d3209558.camel@kernel.org>
In-Reply-To: <520bd301-4526-4364-bbfa-5f591ab8f60a@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 5447 bytes --]
On Thu, 2025-07-03 at 16:07 -0400, Chuck Lever wrote:
> On 7/3/25 3:53 PM, Jeff Layton wrote:
> > Recent testing has shown that keeping pagecache pages around for too
> > long can be detrimental to performance with nfsd. Clients only rarely
> > revisit the same data, so the pages tend to just hang around.
> >
> > This patch changes the pc_release callbacks for NFSv3 READ, WRITE and
> > COMMIT to call generic_fadvise(..., POSIX_FADV_DONTNEED) on the accessed
> > range.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/nfsd/debugfs.c | 2 ++
> > fs/nfsd/nfs3proc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++---------
> > fs/nfsd/nfsd.h | 1 +
> > fs/nfsd/nfsproc.c | 4 ++--
> > fs/nfsd/vfs.c | 21 ++++++++++++++-----
> > fs/nfsd/vfs.h | 5 +++--
> > fs/nfsd/xdr3.h | 3 +++
> > 7 files changed, 77 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
> > index 84b0c8b559dc90bd5c2d9d5e15c8e0682c0d610c..b007718dd959bc081166ec84e06f577a8fc2b46b 100644
> > --- a/fs/nfsd/debugfs.c
> > +++ b/fs/nfsd/debugfs.c
> > @@ -44,4 +44,6 @@ void nfsd_debugfs_init(void)
> >
> > debugfs_create_file("disable-splice-read", S_IWUSR | S_IRUGO,
> > nfsd_top_dir, NULL, &nfsd_dsr_fops);
> > + debugfs_create_bool("enable-fadvise-dontneed", 0644,
> > + nfsd_top_dir, &nfsd_enable_fadvise_dontneed);
>
> I prefer that this setting is folded into the new io_cache_read /
> io_cache_write tune-ables that Mike's patch adds, rather than adding
> a new boolean.
>
> That might make a hybrid "DONTCACHE for READ and fadvise for WRITE"
> pretty easy.
>
I ended up rebasing Mike's dontcache branch on top of v6.16-rc5 with
all of Chuck's trees in. I then added the attached patch and did some
testing with a couple of machines I checked out internally at Meta.
These are the throughput results from the fio-seq-RW test, with the file
size set to 100G and the duration set to 5 minutes.
Note that:
reads and writes buffered:
READ: bw=3024MiB/s (3171MB/s), 186MiB/s-191MiB/s (195MB/s-201MB/s), io=889GiB (954GB), run=300012-300966msec
WRITE: bw=2015MiB/s (2113MB/s), 124MiB/s-128MiB/s (131MB/s-134MB/s), io=592GiB (636GB), run=300012-300966msec
READ: bw=2902MiB/s (3043MB/s), 177MiB/s-183MiB/s (186MB/s-192MB/s), io=851GiB (913GB), run=300027-300118msec
WRITE: bw=1934MiB/s (2027MB/s), 119MiB/s-122MiB/s (124MB/s-128MB/s), io=567GiB (608GB), run=300027-300118msec
READ: bw=2897MiB/s (3037MB/s), 178MiB/s-183MiB/s (186MB/s-192MB/s), io=849GiB (911GB), run=300006-300078msec
WRITE: bw=1930MiB/s (2023MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=565GiB (607GB), run=300006-300078msec
reads and writes RWF_DONTCACHE:
READ: bw=3090MiB/s (3240MB/s), 190MiB/s-195MiB/s (199MB/s-205MB/s), io=906GiB (972GB), run=300015-300113msec
WRITE: bw=2060MiB/s (2160MB/s), 126MiB/s-130MiB/s (132MB/s-137MB/s), io=604GiB (648GB), run=300015-300113msec
READ: bw=3057MiB/s (3205MB/s), 188MiB/s-193MiB/s (198MB/s-203MB/s), io=897GiB (963GB), run=300329-300450msec
WRITE: bw=2037MiB/s (2136MB/s), 126MiB/s-129MiB/s (132MB/s-135MB/s), io=598GiB (642GB), run=300329-300450msec
READ: bw=3166MiB/s (3320MB/s), 196MiB/s-200MiB/s (205MB/s-210MB/s), io=928GiB (996GB), run=300021-300090msec
WRITE: bw=2111MiB/s (2213MB/s), 131MiB/s-133MiB/s (137MB/s-140MB/s), io=619GiB (664GB), run=300021-300090msec
reads and writes with O_DIRECT:
READ: bw=3115MiB/s (3267MB/s), 192MiB/s-198MiB/s (201MB/s-208MB/s), io=913GiB (980GB), run=300025-300078msec
WRITE: bw=2077MiB/s (2178MB/s), 128MiB/s-131MiB/s (134MB/s-138MB/s), io=609GiB (653GB), run=300025-300078msec
READ: bw=3189MiB/s (3343MB/s), 197MiB/s-202MiB/s (207MB/s-211MB/s), io=934GiB (1003GB), run=300023-300096msec
WRITE: bw=2125MiB/s (2228MB/s), 132MiB/s-134MiB/s (138MB/s-140MB/s), io=623GiB (669GB), run=300023-300096msec
READ: bw=3113MiB/s (3264MB/s), 191MiB/s-197MiB/s (200MB/s-207MB/s), io=912GiB (979GB), run=300020-300098msec
WRITE: bw=2075MiB/s (2175MB/s), 127MiB/s-131MiB/s (134MB/s-138MB/s), io=608GiB (653GB), run=300020-300098msec
RWF_DONTCACHE on reads and stable writes + fadvise DONTNEED after COMMIT:
READ: bw=2888MiB/s (3029MB/s), 178MiB/s-182MiB/s (187MB/s-191MB/s), io=846GiB (909GB), run=300012-300109msec
WRITE: bw=1924MiB/s (2017MB/s), 118MiB/s-121MiB/s (124MB/s-127MB/s), io=564GiB (605GB), run=300012-300109msec
READ: bw=2899MiB/s (3040MB/s), 180MiB/s-183MiB/s (188MB/s-192MB/s), io=852GiB (915GB), run=300022-300940msec
WRITE: bw=1931MiB/s (2025MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=567GiB (609GB), run=300022-300940msec
READ: bw=2902MiB/s (3043MB/s), 179MiB/s-184MiB/s (188MB/s-193MB/s), io=853GiB (916GB), run=300913-301146msec
WRITE: bw=1933MiB/s (2027MB/s), 119MiB/s-122MiB/s (125MB/s-128MB/s), io=568GiB (610GB), run=300913-301146msec
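One way to compare the four configurations is to average the aggregate READ bandwidth over the three result lines listed for each. The sketch below is only illustrative arithmetic on the numbers above; the array and function names are made up for this example and are not part of fio or nfsd.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Aggregate READ bandwidth in MiB/s, copied from the fio output above,
 * three result lines per configuration.
 */
const double bw_buffered[]  = { 3024.0, 2902.0, 2897.0 };
const double bw_dontcache[] = { 3090.0, 3057.0, 3166.0 };
const double bw_direct[]    = { 3115.0, 3189.0, 3113.0 };
const double bw_fadvise[]   = { 2888.0, 2899.0, 2902.0 };

/* Arithmetic mean of n bandwidth samples. */
double mean_bw(const double *runs, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += runs[i];
	return sum / (double)n;
}

/* Relative difference of value against baseline, in percent. */
double pct_delta(double value, double baseline)
{
	return (value - baseline) / baseline * 100.0;
}
```

Calling mean_bw() on each array gives roughly 2941 MiB/s for buffered, 3104 MiB/s for RWF_DONTCACHE, 3139 MiB/s for O_DIRECT and 2896 MiB/s for the fadvise variant. pct_delta() can then express any of those against a chosen baseline.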
The fadvise case is clearly slower than the others. Interestingly it
also slowed down read performance, which leads me to believe that maybe
the fadvise calls were interfering with concurrent reads. Given the
disappointing numbers, I'll probably drop the last patch.
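For readers less familiar with the hint being tested, here is a rough userspace analogue of the "write, flush, then drop the range" pattern, using the POSIX posix_fadvise() call rather than the in-kernel vfs_fadvise(). It is a minimal sketch and not the nfsd code path; the helper name write_then_drop() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes at offset, flush them to stable storage, then tell the
 * kernel the cached pages for that range are no longer needed.
 * Returns 0 on success or a positive errno value on failure.
 */
int write_then_drop(int fd, const void *buf, size_t len, off_t offset)
{
	const char *p = buf;
	size_t done = 0;

	assert(buf != NULL || len == 0);

	while (done < len) {
		ssize_t n = pwrite(fd, p + done, len - done,
				   offset + (off_t)done);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return errno;
		}
		if (n == 0)
			return EIO;	/* no progress; avoid spinning */
		done += (size_t)n;
	}

	/* DONTNEED only drops clean pages, so write the data back first. */
	if (fsync(fd) < 0)
		return errno;

	/* posix_fadvise() returns the error number directly, not -1. */
	return posix_fadvise(fd, offset, (off_t)len, POSIX_FADV_DONTNEED);
}
```

The hint is advisory: the kernel may keep pages that are still dirty, mapped or otherwise in use, which is one reason the flush comes before the fadvise call here.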
There is probably a case to be made for patch #1, on the general
principle of expediting sending the reply as much as possible. Chuck,
let me know if you want me to submit that individually.
--
Jeff Layton <jlayton@kernel.org>
[-- Attachment #2: 0001-nfsd-add-a-NFSD_IO_FADVISE-setting-to-io_cache_write.patch --]
[-- Type: text/x-patch, Size: 4348 bytes --]
From 14958516bf45f92a8609cb6ad504e92550b416d7 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@kernel.org>
Date: Mon, 7 Jul 2025 11:00:34 -0400
Subject: [PATCH] nfsd: add a NFSD_IO_FADVISE setting to io_cache_write
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/debugfs.c | 2 ++
fs/nfsd/nfs3proc.c | 24 +++++++++++++++++++-----
fs/nfsd/nfsd.h | 3 ++-
fs/nfsd/vfs.c | 4 ++++
fs/nfsd/xdr3.h | 1 +
5 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/fs/nfsd/debugfs.c b/fs/nfsd/debugfs.c
index fadf1d88f640..d0f7dfe6d621 100644
--- a/fs/nfsd/debugfs.c
+++ b/fs/nfsd/debugfs.c
@@ -89,6 +89,7 @@ DEFINE_DEBUGFS_ATTRIBUTE(nfsd_io_cache_read_fops, nfsd_io_cache_read_get,
* %0: NFS WRITE will use buffered IO (default)
* %1: NFS WRITE will use dontcache (buffered IO w/ dropbehind)
* %2: NFS WRITE will use direct IO
+ * %3: NFS stable WRITE will use dontcache + fadvise DONTNEED after COMMIT
*
* The default value of this setting is zero (buffered IO is
* used). This setting takes immediate effect for all NFS
@@ -107,6 +108,7 @@ static int nfsd_io_cache_write_set(void *data, u64 val)
case NFSD_IO_BUFFERED:
case NFSD_IO_DONTCACHE:
case NFSD_IO_DIRECT:
+ case NFSD_IO_FADVISE:
nfsd_io_cache_write = val;
break;
default:
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index b6d03e1ef5f7..6737d22c001d 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -9,6 +9,7 @@
#include <linux/ext2_fs.h>
#include <linux/magic.h>
#include <linux/namei.h>
+#include <linux/fadvise.h>
#include "cache.h"
#include "xdr3.h"
@@ -748,7 +749,6 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
{
struct nfsd3_commitargs *argp = rqstp->rq_argp;
struct nfsd3_commitres *resp = rqstp->rq_resp;
- struct nfsd_file *nf;
dprintk("nfsd: COMMIT(3) %s %u@%Lu\n",
SVCFH_fmt(&argp->fh),
@@ -757,17 +757,31 @@ nfsd3_proc_commit(struct svc_rqst *rqstp)
fh_copy(&resp->fh, &argp->fh);
resp->status = nfsd_file_acquire_gc(rqstp, &resp->fh, NFSD_MAY_WRITE |
- NFSD_MAY_NOT_BREAK_LEASE, &nf);
+ NFSD_MAY_NOT_BREAK_LEASE, &resp->nf);
if (resp->status)
goto out;
- resp->status = nfsd_commit(rqstp, &resp->fh, nf, argp->offset,
+ resp->status = nfsd_commit(rqstp, &resp->fh, resp->nf, argp->offset,
argp->count, resp->verf);
- nfsd_file_put(nf);
out:
resp->status = nfsd3_map_status(resp->status);
return rpc_success;
}
+static void
+nfsd3_release_commit(struct svc_rqst *rqstp)
+{
+ struct nfsd3_commitargs *argp = rqstp->rq_argp;
+ struct nfsd3_commitres *resp = rqstp->rq_resp;
+
+ fh_put(&resp->fh);
+ if (resp->nf) {
+ if (nfsd_io_cache_write == NFSD_IO_FADVISE)
+ vfs_fadvise(nfsd_file_file(resp->nf), argp->offset,
+ argp->count, POSIX_FADV_DONTNEED);
+ nfsd_file_put(resp->nf);
+ }
+}
+
/*
* NFSv3 Server procedures.
@@ -1039,7 +1053,7 @@ static const struct svc_procedure nfsd_procedures3[22] = {
.pc_func = nfsd3_proc_commit,
.pc_decode = nfs3svc_decode_commitargs,
.pc_encode = nfs3svc_encode_commitres,
- .pc_release = nfs3svc_release_fhandle,
+ .pc_release = nfsd3_release_commit,
.pc_argsize = sizeof(struct nfsd3_commitargs),
.pc_argzero = sizeof(struct nfsd3_commitargs),
.pc_ressize = sizeof(struct nfsd3_commitres),
diff --git a/fs/nfsd/nfsd.h b/fs/nfsd/nfsd.h
index 1ae38c5557c4..b21902e02982 100644
--- a/fs/nfsd/nfsd.h
+++ b/fs/nfsd/nfsd.h
@@ -156,7 +156,8 @@ extern bool nfsd_disable_splice_read __read_mostly;
enum {
NFSD_IO_BUFFERED = 0,
NFSD_IO_DONTCACHE,
- NFSD_IO_DIRECT
+ NFSD_IO_DIRECT,
+ NFSD_IO_FADVISE
};
extern u64 nfsd_io_cache_read __read_mostly;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 08350070e083..5d4588a106b2 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1271,6 +1271,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp,
case NFSD_IO_DONTCACHE:
kiocb.ki_flags = IOCB_DONTCACHE;
break;
+ case NFSD_IO_FADVISE:
+ if (stable)
+ kiocb.ki_flags = IOCB_DONTCACHE;
+ break;
case NFSD_IO_BUFFERED:
break;
}
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 522067b7fd75..ec91a24e651a 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -217,6 +217,7 @@ struct nfsd3_commitres {
__be32 status;
struct svc_fh fh;
__be32 verf[2];
+ struct nfsd_file *nf;
};
struct nfsd3_getaclres {
--
2.50.0
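The mode handling the patch adds can be modelled in userspace roughly as follows. This is a sketch for reasoning about the behaviour, not kernel code: the enum values mirror the ones in fs/nfsd/nfsd.h above, while MODEL_DONTCACHE and the model_* functions are hypothetical stand-ins. The NFSD_IO_DIRECT write path is not visible in the hunk, so it is deliberately left unmodelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as the enum in fs/nfsd/nfsd.h after the patch. */
enum nfsd_io_mode {
	NFSD_IO_BUFFERED = 0,
	NFSD_IO_DONTCACHE,
	NFSD_IO_DIRECT,
	NFSD_IO_FADVISE
};

/* Hypothetical stand-in for the kernel's IOCB_DONTCACHE flag. */
#define MODEL_DONTCACHE 0x1u

/* Mirrors the value check in nfsd_io_cache_write_set(): 0..3 accepted. */
bool model_mode_valid(unsigned long long val)
{
	return val <= NFSD_IO_FADVISE;
}

/* Flags a WRITE would carry, following the nfsd_vfs_write() hunk. */
unsigned int model_write_flags(enum nfsd_io_mode mode, bool stable)
{
	assert(model_mode_valid((unsigned long long)mode));

	switch (mode) {
	case NFSD_IO_DONTCACHE:
		return MODEL_DONTCACHE;
	case NFSD_IO_FADVISE:
		/* Only stable writes get dropbehind in this mode. */
		return stable ? MODEL_DONTCACHE : 0;
	case NFSD_IO_DIRECT:	/* not shown in the hunk; not modelled */
	case NFSD_IO_BUFFERED:
		break;
	}
	return 0;
}

/* Whether the COMMIT release callback would issue POSIX_FADV_DONTNEED. */
bool model_commit_drops_range(enum nfsd_io_mode mode)
{
	return mode == NFSD_IO_FADVISE;
}
```

In this model, unstable writes under NFSD_IO_FADVISE stay in the page cache until the client sends COMMIT, at which point the committed range is dropped.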
Thread overview: 18+ messages
2025-07-03 19:53 [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT Jeff Layton
2025-07-03 19:53 ` [PATCH RFC 1/2] sunrpc: delay pc_release callback until after sending a reply Jeff Layton
2025-07-03 23:33 ` NeilBrown
2025-07-04 0:05 ` Jeff Layton
2025-07-03 19:53 ` [PATCH RFC 2/2] nfsd: call generic_fadvise after v3 READ, stable WRITE or COMMIT Jeff Layton
2025-07-03 20:07 ` Chuck Lever
2025-07-08 14:34 ` Jeff Layton [this message]
2025-07-08 21:12 ` Mike Snitzer
2025-07-08 21:07 ` Mike Snitzer
2025-07-03 23:44 ` NeilBrown
2025-07-03 23:49 ` Jeff Layton
2025-07-04 7:26 ` NeilBrown
2025-07-05 11:21 ` Jeff Layton
2025-07-03 23:16 ` [PATCH RFC 0/2] nfsd: issue POSIX_FADV_DONTNEED after READ/WRITE/COMMIT NeilBrown
2025-07-03 23:28 ` Chuck Lever
2025-07-04 7:34 ` NeilBrown
2025-07-05 11:32 ` Jeff Layton
2025-07-10 8:00 ` Christoph Hellwig