All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: NeilBrown <neilb@suse.de>
Cc: linux-nfs@vger.kernel.org, Anna Schumaker <anna@kernel.org>,
	Trond Myklebust <trondmy@hammerspace.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	Jeff Layton <jlayton@kernel.org>
Subject: Re: nfsd/filecache: add nfsd_file_acquire_gc_cached
Date: Fri, 25 Oct 2024 10:21:36 -0400	[thread overview]
Message-ID: <ZxupcAD4Keohs0TL@kernel.org> (raw)
In-Reply-To: <172982525650.81717.5861053414648479623@noble.neil.brown.name>

On Fri, Oct 25, 2024 at 02:00:56PM +1100, NeilBrown wrote:
> On Fri, 25 Oct 2024, Mike Snitzer wrote:
> > Rather than make nfsd_file_do_acquire() more complex (by training
> > it to conditionally skip both fh_verify() and nfsd_file allocation
> > and contruction) introduce nfsd_file_acquire_gc_cached() -- which
> > duplicates the minimalist subset of nfsd_file_do_acquire() needed to
> > achieve nfsd_file lookup using an opaque @inode_key.
> > 
> > nfsd_file_acquire_gc_cached() only returns a cached and GC'd nfsd_file
> > obtained using the opaque @inode_key, established from a previous call
> > to nfsd_file_do_acquire_local() that originally added the GC'd
> > nfsd_file to the filecache.
> > 
> > Update nfsd_open_local_fh to store @inode_key in @nfs_fh so later
> > calls can check if it maps to an open GC'd nfsd_file in the filecache
> > using nfsd_file_acquire_gc_cached().  Its nfsd_file_lookup_locked()
> > call will only find a match if @cred matches the nfsd_file's nf_cred.
> > 
> > And care is taken to clear the inode_key in nfsd_file_free() if the
> > nfsd_file has a non-NULL nf_inodep (which is a pointer to the address
> > of the opaque inode_key stored in the nfs_fh).  This avoids any risk
> > of re-using the old inode_key for a different nfsd_file.
> > 
> > This commit's cached nfsd_file lookup dramatically speeds up LOCALIO
> > performance, especially for small 4K O_DIRECT IO, e.g.:
> > 
> > before: read: IOPS=376k,  BW=1469MiB/s (1541MB/s)(28.7GiB/20001msec)
> > after:  read: IOPS=1037k, BW=4050MiB/s (4247MB/s)(79.1GiB/20002msec)
> > 
> > Note that LOCALIO calls nfsd_open_local_fh() for every IO it issues to
> > the underlying filesystem using the returned nfsd_file.  This is why
> > caching the opaque 'inode_key' in 'struct nfs_fh' is so helpful for
> > LOCALIO to quickly return the cached nfsd_file.  Whereas regular NFS
> > avoids fh_verify()'s costly duplicate lookups of the underlying
> > filesystem's dentry by caching it in 'fh_dentry' of 'struct svc_fh'.
> > LOCALIO cannot take the same approach, of storing the dentry, without
> > creating object lifetime issues associated with dentry references
> > holding server mounts open and preventing unmounts.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> 
> I think this is a good idea.  If we are to avoid a complete lookup for
> every IO we need some back-pointer from the nfsd filecache to something
> in nfs so that a cached lookup can be invalidated.  Various other
> schemes have been suggested before.  This one seems particularly simple.
> 
> I'm wondering about the request for a garbage-collected nfsd_file
> though.  For NFSv3 that makes sense.  For NFSv4 we would expect the file
> to already be open as a non-garbage-collected nfsd_file and opening it
> again seems wasteful.  That doesn't need to be fixed for this patch and
> maybe doesn't need to be fixed at all, but it seemed worth highlighting.

Maybe you're referring to the nfsd_file_acquire_gc_cached() kernel doc
text?  The code doesn't impose the nfsd_file must be a GC'd nfsd_file
(nor do I ever create an nfsd_file, if it isn't in the filecache then
it returns NULL).

GC'd just buys us a more likely chance of finding the nfsd_file in the
cache so it pairs nicely with GC having been requested/used when the
nfsd_file originally created and added to the cache.

Anyway, what follows in my reply is all moot given this thread has
teased out the desire to utilize a callback mechanism to allow the NFS
client to hold a longterm reference on the open nfsd_file.

BUT I will be folding in all the things covered below so I can at
least put nfsd_file_acquire_cached() out to pasture in more fully
formed state (should there ever be a need to revisit it)...

> More below
> 
> > ---
> >  fs/nfs/inode.c             |  3 ++
> >  fs/nfs_common/nfslocalio.c |  2 +-
> >  fs/nfsd/filecache.c        | 78 ++++++++++++++++++++++++++++++++++++++
> >  fs/nfsd/filecache.h        |  7 ++++
> >  fs/nfsd/localio.c          | 46 +++++++++++++++++++---
> >  include/linux/nfs.h        |  4 ++
> >  include/linux/nfslocalio.h |  6 +--
> >  7 files changed, 136 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > index cc7a32b32676..3051d65e3a89 100644
> > --- a/fs/nfs/inode.c
> > +++ b/fs/nfs/inode.c
> > @@ -2413,6 +2413,9 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
> >  #endif /* CONFIG_NFS_V4 */
> >  #ifdef CONFIG_NFS_V4_2
> >  	nfsi->xattr_cache = NULL;
> > +#endif
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	nfsi->fh.inode_key = NULL;
> >  #endif
> >  	nfs_netfs_inode_init(nfsi);
> >  
> > diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
> > index 09404d142d1a..bacebaa1e15c 100644
> > --- a/fs/nfs_common/nfslocalio.c
> > +++ b/fs/nfs_common/nfslocalio.c
> > @@ -130,7 +130,7 @@ EXPORT_SYMBOL_GPL(nfs_uuid_invalidate_one_client);
> >  
> >  struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *uuid,
> >  		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
> > -		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
> > +		   struct nfs_fh *nfs_fh, const fmode_t fmode)
> >  {
> >  	struct net *net;
> >  	struct nfsd_file *localio;
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 1408166222c5..5ab978ac3555 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -221,6 +221,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
> >  	INIT_LIST_HEAD(&nf->nf_gc);
> >  	nf->nf_birthtime = ktime_get();
> >  	nf->nf_file = NULL;
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	nf->nf_inodep = NULL;
> > +#endif
> 
> All these "#if IS_ENABLED" are ugly.  I wonder if we could get rid of
> them.
> Using __GFP_ZERO for the alloc would work here, but might be an unwanted
> cost.  Maybe nf_inodep could be ignored if nf_file is NULL.

I did originally nest the nf_inodep backpointer reset within
nfsd_file_free()'s nf->nf_file check, will go back to that.

Then nf_inodep is otherwise ignored everywhere.

> >  	nf->nf_cred = get_current_cred();
> >  	nf->nf_net = net;
> >  	nf->nf_flags = want_gc ?
> > @@ -285,6 +288,12 @@ nfsd_file_free(struct nfsd_file *nf)
> >  		nfsd_file_check_write_error(nf);
> >  		nfsd_filp_close(nf->nf_file);
> >  	}
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	if (nf->nf_inodep) {
> > +		*(nf->nf_inodep) = NULL;
> > +		nf->nf_inodep = NULL;
> > +	}
> > +#endif
> 
> This one is harder to hide.  We don't really need the final assignment
> though so maybe we could
>
> #define NF_INODEP(nf) (nf->nf_inodep)
> or
> #define NF_INODEP(nf) (NULL)
> 
> in a header (where #if are more acceptable), then make this code:
> 
>  if (NF_INODEP(nf))
> 	*NF_INODEP(nf) = NULL;
> 
> Is that better or worse I wonder.

Not opposed to this, will do.

> >  
> >  	/*
> >  	 * If this item is still linked via nf_lru, that's a bug.
> > @@ -1255,6 +1264,75 @@ nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
> >  	return beres;
> >  }
> >  
> > +/**
> > + * nfsd_file_acquire_cached - Get cached GC'd open file using inode
> > + * @net: The network namespace in which to perform a lookup
> > + * @cred: the user credential with which to validate access
> > + * @inode_key: inode to use as opaque lookup key
> > + * @may_flags: NFSD_MAY_ settings for the file
> > + * @pnf: OUT: found cached GC'd "struct nfsd_file" object
> > + *
> > + * Rather than make nfsd_file_do_acquire more complex (by training
> > + * it to conditionally skip fh_verify(), nfsd_file allocation and
> > + * contruction) duplicate the minimalist subset of it that is
> > + * needed to achieve nfsd_file lookup using the opaque @inode_key.
> > + *
> > + * The nfsd_file object returned by this API is reference-counted
> > + * and garbage-collected. The object is retained for a few
> > + * seconds after the final nfsd_file_put() in case the caller
> > + * wants to re-use it.
> > + *
> > + * Return values:
> > + *   %nfs_ok - @pnf points to an nfsd_file with its reference
> > + *   count boosted.
> > + *
> > + * On error, an nfsstat value in network byte order is returned.
> > + */
> > +__be32
> > +nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
> > +			 void *inode_key, unsigned int may_flags,
> > +			 struct nfsd_file **pnf)
> > +{
> > +	unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
> > +	struct nfsd_file *nf;
> > +	__be32 status;
> > +
> > +	rcu_read_lock();
> > +	nf = nfsd_file_lookup_locked(net, cred, inode_key, need, true);
> > +	rcu_read_unlock();
> > +
> > +	if (unlikely(!nf))
> > +		return nfserr_noent;
> > +
> > +	/*
> > +	 * If the nf is on the LRU then it holds an extra reference
> > +	 * that must be put if it's removed. It had better not be
> > +	 * the last one however, since we should hold another.
> > +	 */
> > +	if (nfsd_file_lru_remove(nf))
> > +		WARN_ON_ONCE(refcount_dec_and_test(&nf->nf_ref));
> 
> Just use refcount_dec(&nf->nf_ref).  It will provide a warning if the
> count reaches zero.  In general you should never need warnings when
> using refcount as it generates the needed warnings itself.

OK, I lifted the code from nfsd_file_do_acquire() as a starting point;
so it should probably be cleaned up there (independent of this patch,
not it).

> > +
> > +	if (WARN_ON_ONCE(test_bit(NFSD_FILE_PENDING, &nf->nf_flags) ||
> > +			 !test_bit(NFSD_FILE_HASHED, &nf->nf_flags))) {
> > +		status = nfserr_inval;
> > +		goto error;
> > +	}
> 
> Do we really want the above?  I guess you were following the pattern in
> nfsd_file_do_acquire() which waits for FILE_PENDING and then re-tests
> FILE_HASHED (nfsd_file_lookup_locked() already tested it).
> I guess it doesn't hurt, but I'm not sure it helps.

Right, was just trying to maintain status-quo.  I think it doesn't
hurt, and may actually help given spin_lock isn't held around
nfsd_file_lookup_locked?  But the WARN_ON_ONCE should be removed.

> > diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
> > index f441cb9f74d5..34a229409117 100644
> > --- a/fs/nfsd/localio.c
> > +++ b/fs/nfsd/localio.c
> > @@ -58,33 +58,67 @@ void nfsd_localio_ops_init(void)
> >  struct nfsd_file *
> >  nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
> >  		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
> > -		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
> > +		   struct nfs_fh *nfs_fh, const fmode_t fmode)
> >  {
> >  	int mayflags = NFSD_MAY_LOCALIO;
> >  	struct svc_cred rq_cred;
> >  	struct svc_fh fh;
> >  	struct nfsd_file *localio;
> > +	void *nf_inode_key;
> >  	__be32 beres;
> >  
> >  	if (nfs_fh->size > NFS4_FHSIZE)
> >  		return ERR_PTR(-EINVAL);
> >  
> > -	/* nfs_fh -> svc_fh */
> > -	fh_init(&fh, NFS4_FHSIZE);
> > -	fh.fh_handle.fh_size = nfs_fh->size;
> > -	memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
> > -
> >  	if (fmode & FMODE_READ)
> >  		mayflags |= NFSD_MAY_READ;
> >  	if (fmode & FMODE_WRITE)
> >  		mayflags |= NFSD_MAY_WRITE;
> >  
> > +	/*
> > +	 * Check if 'inode_key' stored in @nfs_fh maps to an
> > +	 * open nfsd_file in the filecache (from a previous
> > +	 * nfsd_open_local_fh).
> > +	 *
> > +	 * The 'inode_key' may have become stale (due to nfsd_file
> > +	 * being free'd by filecache GC) so the lookup will fail
> > +	 * gracefully by falling back to using nfsd_file_acquire_local().
> > +	 */
> > +	nf_inode_key = READ_ONCE(nfs_fh->inode_key);
> 
> I think you want the above in an rcu-locked region with
> nfsd_file_acquire_cached().  Else the inode could be freed and
> reallocated after the READ_ONCE and before the lookup.
> Maybe pass the address of inode_key and have nfsd_file_acquire_cached()
> deref after getting rcu_read_lock().

Good point, will do.

> > +	if (nf_inode_key) {
> > +		beres = nfsd_file_acquire_cached(net, cred,
> > +						 nf_inode_key,
> > +						 mayflags, &localio);
> > +		if (beres == nfs_ok)
> > +			return localio;
> > +		/*
> > +		 * reset stale nfs_fh->inode_key and fallthru
> > +		 * to use normal nfsd_file_acquire_local().
> > +		 */
> > +		nfs_fh->inode_key = NULL;
> > +	}

> > diff --git a/include/linux/nfs.h b/include/linux/nfs.h
> > index 9ad727ddfedb..56c894575c70 100644
> > --- a/include/linux/nfs.h
> > +++ b/include/linux/nfs.h
> > @@ -29,6 +29,10 @@
> >  struct nfs_fh {
> >  	unsigned short		size;
> >  	unsigned char		data[NFS_MAXFHSIZE];
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	/* 'inode_key' is an opaque key used for nfsd_file hash lookups */
> > +	void *			inode_key;
> > +#endif
> 
> I wonder if this is really the right place to store the inode_key.
> 'struct nfs_fh' appears in lots of places and the only place where you
> want the inode_key is inside the nfs_inode.  Maybe store an augmented
> nfs_fh in there...

Yeah, I went with nfs_fh because it served my needs (and nfs_fh is
passed to nfs_open_local_fh).  But very much open to suggestions.
While I'm not yet sure about your nfs_inode suggestion, I'll certainly
take a closer look after addressing your other feedback.

Thanks for your review!

Mike

      parent reply	other threads:[~2024-10-25 14:21 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-24 18:55 [PATCH] nfsd/filecache: add nfsd_file_acquire_gc_cached Mike Snitzer
2024-10-25  3:00 ` NeilBrown
2024-10-25 12:52   ` Chuck Lever III
2024-10-25 13:31     ` Trond Myklebust
2024-10-25 13:51       ` Chuck Lever III
2024-10-25 14:00       ` Jeff Layton
2024-10-27 21:49         ` NeilBrown
2024-10-25 14:21   ` Mike Snitzer [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZxupcAD4Keohs0TL@kernel.org \
    --to=snitzer@kernel.org \
    --cc=anna@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.