Re: nfsd: add the ability to enable use of RWF_DONTCACHE for all nfsd IO

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org, Neil Brown <neilb@suse.de>,
	Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	axboe@kernel.dk
Subject: Re: nfsd: add the ability to enable use of RWF_DONTCACHE for all nfsd IO
Date: Fri, 21 Feb 2025 10:02:12 -0500	[thread overview]
Message-ID: <Z7iVdHcnGveg-gbg@kernel.org> (raw)
In-Reply-To: <ce92e5f6-d7cb-47ef-ad96-d334212a51f1@oracle.com>

On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
> [ Adding NFSD reviewers ... ]
> 
> On 2/20/25 12:12 PM, Mike Snitzer wrote:
> > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> > by nfsd will be removed from the page cache upon completion."
> > 
> > nfsd_dontcache is disabled by default.  It may be enabled with:
> >   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
> 
> A per-export setting like an export option would be nicer. Also, does
> it make sense to make it a separate control for READ and one for WRITE?
> My trick knee suggests caching read results is still going to add
> significant value, but write, not so much.

My intent was to make 6.14's DONTCACHE feature able to be tested in
the context of nfsd in a no-frills way.  I realize adding the
nfsd_dontcache knob skews toward too raw, lacks polish.  But I'm
inclined to expose such course-grained opt-in knobs to encourage
others' discovery (and answers to some of the questions you pose
below).  I also hope to enlist all NFSD reviewers' help in
categorizing/documenting where DONTCACHE helps/hurts. ;)

And I agree that ultimately per-export control is needed.  I'll take
the time to implement that, hopeful to have something more suitable in
time for LSF.

> However, to add any such administrative control, I'd like to see some
> performance numbers. I think we need to enumerate the cases (I/O types)
> that are most interesting to examine: small memory NFS servers; lots of
> small unaligned I/O; server-side CPU per byte; storage interrupt rates;
> any others?
> 
> And let's see some user/admin documentation (eg when should this setting
> be enabled? when would it be contra-indicated?)
> 
> The same arguments that applied to Cedric's request to make maximum RPC
> size a tunable setting apply here. Do we want to carry a manual setting
> for this mechanism for a long time, or do we expect that the setting can
> become automatic/uninteresting after a period of experimentation?
> 
> * It might be argued that putting these experimental tunables under /sys
>   eliminates the support longevity question, since there aren't strict
>   rules about removing files under /sys.

Right, I do think a sysfs knob (that defaults to disabled, requires
user opt-in) is a pretty useful and benign means to expose
experimental functionality.

And I agree with all you said needed above, I haven't had the time to
focus on DONTCACHE since ~Decemeber, I just picked up my old patches
from that time and decided to send the NFSD one since DONTCACHE has
been merged for 6.14.
 
> > FOP_DONTCACHE must be advertised as supported by the underlying
> > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
> > all IO will fail with -EOPNOTSUPP.
> 
> It would be better all around if NFSD simply ignored the setting in the
> cases where the underlying file system doesn't implement DONTCACHE.

I'll work on making it so.

> 
> 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/vfs.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 29cb7b812d71..d7e49004e93d 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
> >  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
> >  }
> >  
> > +static bool nfsd_dontcache __read_mostly = false;
> > +module_param(nfsd_dontcache, bool, 0644);
> > +MODULE_PARM_DESC(nfsd_dontcache,
> > +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
> > +
> >  /*
> >   * Grab and keep cached pages associated with a file in the svc_rqst
> >   * so that they can be passed to the network sendmsg routines
> > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	loff_t ppos = offset;
> >  	struct page *page;
> >  	ssize_t host_err;
> > +	rwf_t flags = 0;
> >  
> >  	v = 0;
> >  	total = *count;
> > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	}
> >  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
> >  
> > +	if (nfsd_dontcache)
> > +		flags |= RWF_DONTCACHE;
> > +
> >  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> >  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
> > -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
> > +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
> >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >  }
> >  
> > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> >  	if (stable && !fhp->fh_use_wgather)
> >  		flags |= RWF_SYNC;
> >  
> > +	if (nfsd_dontcache)
> > +		flags |= RWF_DONTCACHE;
> > +
> >  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
> >  	since = READ_ONCE(file->f_wb_err);
> >  	if (verf)
> > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> >   */
> >  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
> >  {
> > +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
> > +		return false;
> > +
> 
> Urgh.

Heh, yeah, bypassing splice was needed given dontcache hooks off vfs_iter_read.

> So I've been mulling over simply removing the splice read path.
> 
>  - Less code, less complexity, smaller test matrix
> 
>  - How much of a performance loss would result?
> 
>  - Would such a change make it easier to pass whole folios from
>    the file system directly to the network layer?

Good to know.

Thanks,
Mike

next prev parent reply	other threads:[~2025-02-21 15:02 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-20 17:12 [PATCH] nfsd: add the ability to enable use of RWF_DONTCACHE for all nfsd IO Mike Snitzer
2025-02-20 18:17 ` Chuck Lever
2025-02-21 15:02   ` Mike Snitzer [this message]
2025-02-21 15:25     ` Jeff Layton
2025-02-21 15:36       ` Mike Snitzer
2025-02-21 15:42         ` Jeff Layton
2025-02-21 15:46         ` Chuck Lever
2025-02-21 16:13           ` Trond Myklebust
2025-02-21 18:42             ` Jeff Layton
2025-02-21 19:18               ` Trond Myklebust
2025-02-21 15:39     ` Chuck Lever
2025-02-21 15:46       ` Jeff Layton
2025-02-21 15:50         ` Chuck Lever
2025-02-20 19:00 ` [PATCH] " Jeff Layton
2025-02-20 19:15   ` Chuck Lever
2025-02-21 15:25     ` Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z7iVdHcnGveg-gbg@kernel.org \
    --to=snitzer@kernel.org \
    --cc=Dai.Ngo@oracle.com \
    --cc=axboe@kernel.dk \
    --cc=chuck.lever@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=okorniev@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.