From: Mike Snitzer <snitzer@kernel.org>
To: NeilBrown <neilb@suse.de>
Cc: linux-nfs@vger.kernel.org, Anna Schumaker <anna@kernel.org>,
Trond Myklebust <trondmy@hammerspace.com>,
Chuck Lever <chuck.lever@oracle.com>,
Jeff Layton <jlayton@kernel.org>
Subject: Re: nfsd/filecache: add nfsd_file_acquire_gc_cached
Date: Fri, 25 Oct 2024 10:21:36 -0400 [thread overview]
Message-ID: <ZxupcAD4Keohs0TL@kernel.org> (raw)
In-Reply-To: <172982525650.81717.5861053414648479623@noble.neil.brown.name>
On Fri, Oct 25, 2024 at 02:00:56PM +1100, NeilBrown wrote:
> On Fri, 25 Oct 2024, Mike Snitzer wrote:
> > Rather than make nfsd_file_do_acquire() more complex (by training
> > it to conditionally skip both fh_verify() and nfsd_file allocation
> > and contruction) introduce nfsd_file_acquire_gc_cached() -- which
> > duplicates the minimalist subset of nfsd_file_do_acquire() needed to
> > achieve nfsd_file lookup using an opaque @inode_key.
> >
> > nfsd_file_acquire_gc_cached() only returns a cached and GC'd nfsd_file
> > obtained using the opaque @inode_key, established from a previous call
> > to nfsd_file_do_acquire_local() that originally added the GC'd
> > nfsd_file to the filecache.
> >
> > Update nfsd_open_local_fh to store @inode_key in @nfs_fh so later
> > calls can check if it maps to an open GC'd nfsd_file in the filecache
> > using nfsd_file_acquire_gc_cached(). Its nfsd_file_lookup_locked()
> > call will only find a match if @cred matches the nfsd_file's nf_cred.
> >
> > And care is taken to clear the inode_key in nfsd_file_free() if the
> > nfsd_file has a non-NULL nf_inodep (which is a pointer to the address
> > of the opaque inode_key stored in the nfs_fh). This avoids any risk
> > of re-using the old inode_key for a different nfsd_file.
> >
> > This commit's cached nfsd_file lookup dramatically speeds up LOCALIO
> > performance, especially for small 4K O_DIRECT IO, e.g.:
> >
> > before: read: IOPS=376k, BW=1469MiB/s (1541MB/s)(28.7GiB/20001msec)
> > after: read: IOPS=1037k, BW=4050MiB/s (4247MB/s)(79.1GiB/20002msec)
> >
> > Note that LOCALIO calls nfsd_open_local_fh() for every IO it issues to
> > the underlying filesystem using the returned nfsd_file. This is why
> > caching the opaque 'inode_key' in 'struct nfs_fh' is so helpful for
> > LOCALIO to quickly return the cached nfsd_file. Whereas regular NFS
> > avoids fh_verify()'s costly duplicate lookups of the underlying
> > filesystem's dentry by caching it in 'fh_dentry' of 'struct svc_fh'.
> > LOCALIO cannot take the same approach, of storing the dentry, without
> > creating object lifetime issues associated with dentry references
> > holding server mounts open and preventing unmounts.
> >
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>
> I think this is a good idea. If we are to avoid a complete lookup for
> every IO we need some back-pointer from the nfsd filecache to something
> in nfs so that a cached lookup can be invalidated. Various other
> schemes have been suggested before. This one seems particularly simple.
>
> I'm wondering about the request for a garbage-collected nfsd_file
> though. For NFSv3 that makes sense. For NFSv4 we would expect the file
> to already be open as a non-garbage-collected nfsd_file and opening it
> again seems wasteful. That doesn't need to be fixed for this patch and
> maybe doesn't need to be fixed at all, but it seemed worth highlighting.
Maybe you're referring to the nfsd_file_acquire_gc_cached() kernel doc
text? The code doesn't impose the nfsd_file must be a GC'd nfsd_file
(nor do I ever create an nfsd_file, if it isn't in the filecache then
it returns NULL).
GC'd just buys us a more likely chance of finding the nfsd_file in the
cache so it pairs nicely with GC having been requested/used when the
nfsd_file originally created and added to the cache.
Anyway, what follows in my reply is all moot given this thread has
teased out the desire to utilize a callback mechanism to allow the NFS
client to hold a longterm reference on the open nfsd_file.
BUT I will be folding in all the things covered below so I can at
least put nfsd_file_acquire_cached() out to pasture in more fully
formed state (should there ever be a need to revisit it)...
> More below
>
> > ---
> > fs/nfs/inode.c | 3 ++
> > fs/nfs_common/nfslocalio.c | 2 +-
> > fs/nfsd/filecache.c | 78 ++++++++++++++++++++++++++++++++++++++
> > fs/nfsd/filecache.h | 7 ++++
> > fs/nfsd/localio.c | 46 +++++++++++++++++++---
> > include/linux/nfs.h | 4 ++
> > include/linux/nfslocalio.h | 6 +--
> > 7 files changed, 136 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > index cc7a32b32676..3051d65e3a89 100644
> > --- a/fs/nfs/inode.c
> > +++ b/fs/nfs/inode.c
> > @@ -2413,6 +2413,9 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
> > #endif /* CONFIG_NFS_V4 */
> > #ifdef CONFIG_NFS_V4_2
> > nfsi->xattr_cache = NULL;
> > +#endif
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > + nfsi->fh.inode_key = NULL;
> > #endif
> > nfs_netfs_inode_init(nfsi);
> >
> > diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
> > index 09404d142d1a..bacebaa1e15c 100644
> > --- a/fs/nfs_common/nfslocalio.c
> > +++ b/fs/nfs_common/nfslocalio.c
> > @@ -130,7 +130,7 @@ EXPORT_SYMBOL_GPL(nfs_uuid_invalidate_one_client);
> >
> > struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *uuid,
> > struct rpc_clnt *rpc_clnt, const struct cred *cred,
> > - const struct nfs_fh *nfs_fh, const fmode_t fmode)
> > + struct nfs_fh *nfs_fh, const fmode_t fmode)
> > {
> > struct net *net;
> > struct nfsd_file *localio;
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 1408166222c5..5ab978ac3555 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -221,6 +221,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
> > INIT_LIST_HEAD(&nf->nf_gc);
> > nf->nf_birthtime = ktime_get();
> > nf->nf_file = NULL;
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > + nf->nf_inodep = NULL;
> > +#endif
>
> All these "#if IS_ENABLED" are ugly. I wonder if we could get rid of
> them.
> Using __GFP_ZERO for the alloc would work here, but might be an unwanted
> cost. Maybe nf_inodep could be ignored if nf_file is NULL.
I did originally nest the nf_inodep backpointer reset within
nfsd_file_free()'s nf->nf_file check, will go back to that.
Then nf_inodep is otherwise ignored everywhere.
> > nf->nf_cred = get_current_cred();
> > nf->nf_net = net;
> > nf->nf_flags = want_gc ?
> > @@ -285,6 +288,12 @@ nfsd_file_free(struct nfsd_file *nf)
> > nfsd_file_check_write_error(nf);
> > nfsd_filp_close(nf->nf_file);
> > }
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > + if (nf->nf_inodep) {
> > + *(nf->nf_inodep) = NULL;
> > + nf->nf_inodep = NULL;
> > + }
> > +#endif
>
> This one is harder to hide. We don't really need the final assignment
> though so maybe we could
>
> #define NF_INODEP(nf) (nf->nf_inodep)
> or
> #define NF_INODEP(nf) (NULL)
>
> in a header (where #if are more acceptable), then make this code:
>
> if (NF_INODEP(nf))
> *NF_INODEP(nf) = NULL;
>
> Is that better or worse I wonder.
Not opposed to this, will do.
> >
> > /*
> > * If this item is still linked via nf_lru, that's a bug.
> > @@ -1255,6 +1264,75 @@ nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
> > return beres;
> > }
> >
> > +/**
> > + * nfsd_file_acquire_cached - Get cached GC'd open file using inode
> > + * @net: The network namespace in which to perform a lookup
> > + * @cred: the user credential with which to validate access
> > + * @inode_key: inode to use as opaque lookup key
> > + * @may_flags: NFSD_MAY_ settings for the file
> > + * @pnf: OUT: found cached GC'd "struct nfsd_file" object
> > + *
> > + * Rather than make nfsd_file_do_acquire more complex (by training
> > + * it to conditionally skip fh_verify(), nfsd_file allocation and
> > + * contruction) duplicate the minimalist subset of it that is
> > + * needed to achieve nfsd_file lookup using the opaque @inode_key.
> > + *
> > + * The nfsd_file object returned by this API is reference-counted
> > + * and garbage-collected. The object is retained for a few
> > + * seconds after the final nfsd_file_put() in case the caller
> > + * wants to re-use it.
> > + *
> > + * Return values:
> > + * %nfs_ok - @pnf points to an nfsd_file with its reference
> > + * count boosted.
> > + *
> > + * On error, an nfsstat value in network byte order is returned.
> > + */
> > +__be32
> > +nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
> > + void *inode_key, unsigned int may_flags,
> > + struct nfsd_file **pnf)
> > +{
> > + unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
> > + struct nfsd_file *nf;
> > + __be32 status;
> > +
> > + rcu_read_lock();
> > + nf = nfsd_file_lookup_locked(net, cred, inode_key, need, true);
> > + rcu_read_unlock();
> > +
> > + if (unlikely(!nf))
> > + return nfserr_noent;
> > +
> > + /*
> > + * If the nf is on the LRU then it holds an extra reference
> > + * that must be put if it's removed. It had better not be
> > + * the last one however, since we should hold another.
> > + */
> > + if (nfsd_file_lru_remove(nf))
> > + WARN_ON_ONCE(refcount_dec_and_test(&nf->nf_ref));
>
> Just use refcount_dec(&nf->nf_ref). It will provide a warning if the
> count reaches zero. In general you should never need warnings when
> using refcount as it generates the needed warnings itself.
OK, I lifted the code from nfsd_file_do_acquire() as a starting point;
so it should probably be cleaned up there (independent of this patch,
not it).
> > +
> > + if (WARN_ON_ONCE(test_bit(NFSD_FILE_PENDING, &nf->nf_flags) ||
> > + !test_bit(NFSD_FILE_HASHED, &nf->nf_flags))) {
> > + status = nfserr_inval;
> > + goto error;
> > + }
>
> Do we really want the above? I guess you were following the pattern in
> nfsd_file_do_acquire() which waits for FILE_PENDING and then re-tests
> FILE_HASHED (nfsd_file_lookup_locked() already tested it).
> I guess it doesn't hurt, but I'm not sure it helps.
Right, was just trying to maintain status-quo. I think it doesn't
hurt, and may actually help given spin_lock isn't held around
nfsd_file_lookup_locked? But the WARN_ON_ONCE should be removed.
> > diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
> > index f441cb9f74d5..34a229409117 100644
> > --- a/fs/nfsd/localio.c
> > +++ b/fs/nfsd/localio.c
> > @@ -58,33 +58,67 @@ void nfsd_localio_ops_init(void)
> > struct nfsd_file *
> > nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
> > struct rpc_clnt *rpc_clnt, const struct cred *cred,
> > - const struct nfs_fh *nfs_fh, const fmode_t fmode)
> > + struct nfs_fh *nfs_fh, const fmode_t fmode)
> > {
> > int mayflags = NFSD_MAY_LOCALIO;
> > struct svc_cred rq_cred;
> > struct svc_fh fh;
> > struct nfsd_file *localio;
> > + void *nf_inode_key;
> > __be32 beres;
> >
> > if (nfs_fh->size > NFS4_FHSIZE)
> > return ERR_PTR(-EINVAL);
> >
> > - /* nfs_fh -> svc_fh */
> > - fh_init(&fh, NFS4_FHSIZE);
> > - fh.fh_handle.fh_size = nfs_fh->size;
> > - memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
> > -
> > if (fmode & FMODE_READ)
> > mayflags |= NFSD_MAY_READ;
> > if (fmode & FMODE_WRITE)
> > mayflags |= NFSD_MAY_WRITE;
> >
> > + /*
> > + * Check if 'inode_key' stored in @nfs_fh maps to an
> > + * open nfsd_file in the filecache (from a previous
> > + * nfsd_open_local_fh).
> > + *
> > + * The 'inode_key' may have become stale (due to nfsd_file
> > + * being free'd by filecache GC) so the lookup will fail
> > + * gracefully by falling back to using nfsd_file_acquire_local().
> > + */
> > + nf_inode_key = READ_ONCE(nfs_fh->inode_key);
>
> I think you want the above in an rcu-locked region with
> nfsd_file_acquire_cached(). Else the inode could be freed and
> reallocated after the READ_ONCE and before the lookup.
> Maybe pass the address of inode_key and have nfsd_file_acquire_cached()
> deref after getting rcu_read_lock().
Good point, will do.
> > + if (nf_inode_key) {
> > + beres = nfsd_file_acquire_cached(net, cred,
> > + nf_inode_key,
> > + mayflags, &localio);
> > + if (beres == nfs_ok)
> > + return localio;
> > + /*
> > + * reset stale nfs_fh->inode_key and fallthru
> > + * to use normal nfsd_file_acquire_local().
> > + */
> > + nfs_fh->inode_key = NULL;
> > + }
> > diff --git a/include/linux/nfs.h b/include/linux/nfs.h
> > index 9ad727ddfedb..56c894575c70 100644
> > --- a/include/linux/nfs.h
> > +++ b/include/linux/nfs.h
> > @@ -29,6 +29,10 @@
> > struct nfs_fh {
> > unsigned short size;
> > unsigned char data[NFS_MAXFHSIZE];
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > + /* 'inode_key' is an opaque key used for nfsd_file hash lookups */
> > + void * inode_key;
> > +#endif
>
> I wonder if this is really the right place to store the inode_key.
> 'struct nfs_fh' appears in lots of places and the only place where you
> want the inode_key is inside the nfs_inode. Maybe store an augmented
> nfs_fh in there...
Yeah, I went with nfs_fh because it served my needs (and nfs_fh is
passed to nfs_open_local_fh). But very much open to suggestions.
While I'm not yet sure about your nfs_inode suggestion, I'll certainly
take a closer look after addressing your other feedback.
Thanks for your review!
Mike
prev parent reply other threads:[~2024-10-25 14:21 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-24 18:55 [PATCH] nfsd/filecache: add nfsd_file_acquire_gc_cached Mike Snitzer
2024-10-25 3:00 ` NeilBrown
2024-10-25 12:52 ` Chuck Lever III
2024-10-25 13:31 ` Trond Myklebust
2024-10-25 13:51 ` Chuck Lever III
2024-10-25 14:00 ` Jeff Layton
2024-10-27 21:49 ` NeilBrown
2024-10-25 14:21 ` Mike Snitzer [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZxupcAD4Keohs0TL@kernel.org \
--to=snitzer@kernel.org \
--cc=anna@kernel.org \
--cc=chuck.lever@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@suse.de \
--cc=trondmy@hammerspace.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox