From: "J. Bruce Fields" <bfields@fieldses.org>
To: NeilBrown <neilb@suse.de>
Cc: Chuck Lever <chuck.lever@oracle.com>,
linux-nfs@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export
Date: Thu, 26 Aug 2021 16:19:16 -0400 [thread overview]
Message-ID: <20210826201916.GB10730@fieldses.org> (raw)
In-Reply-To: <162995778427.7591.11743795294299207756@noble.neil.brown.name>
On Thu, Aug 26, 2021 at 04:03:04PM +1000, NeilBrown wrote:
>
> [[ Hi Bruce and Chuck,
> I've rebased this patch on the earlier patch I sent which allows
> me to use the name "fh_flags". I've also added a missing #include.
> I've added the 'Acked-by' which Joesf provided earlier for the
> btrfs-part. I don't have an 'ack' for the stat.h part, but no-one
> has complained wither.
> I think it is as ready as it can be, and am keen to know what you
> think.
> I'm not *very* keen on testing s_magic in nfsd code (though we
> already have a couple of such tests in nfs3proc.c), but it does have
> the advantage of ensuring that no other filesystem can use this
> functionality without landing a patch in fs/nfsd/.
>
> Thanks for any review that you can provide,
> NeilBrown
> ]]
This seems hairy, but *somebody* has hated every single solution
proposed, so, argh, I don't know, maybe it's best.
There was a ton of "but why can't we just..." in previous threads, could
we include URLs for those and/or the lwn articles? E.g.:
https://lore.kernel.org/linux-nfs/162742539595.32498.13687924366155737575.stgit@noble.brown/#b
https://lwn.net/Articles/866709/
> BTRFS does not provide unique inode numbers across a filesystem.
> It only provides unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
>
> nfsd cannot use these varying synthetic device numbers. If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server. Also, NFSv3
> doesn't support changing the filesystem id without a new explicit mount
> on the client (this is partially supported in practice, but violates the
> protocol specification and has problems in some edge cases).
>
> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
>
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
>
> The number that btrfs provides is a swab64() version of the subvolume
> identifier. This has most entropy in the high bits (the low bits of the
> subvolume identifer), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.
>
> If an upgrade of the NFS server caused all inode numbers in an exportfs
> BTRFS filesystem to appear to the client to change, the client may not
> handle this well. The Linux client will cause any open files to become
> 'stale'. If the mount point changed inode number, the whole mount would
> become inaccessible.
>
> To avoid this, an unused byte in the filehandle (fh_auth) has been
> repurposed as "fh_flags". The new behaviour of uniquifying inode number
> is only activated when a new flag is set.
>
> NFSD will only set this flag in filehandles it reports if the filehandle
> of the parent (provided by the client) contains the bit, or if
> - the filehandle for the parent is not provided or is for a different
> export and
> - the filehandle refers to a BTRFS filesystem.
>
> Thus if you have a BTRFS filesystem originally mounted from a server
> without this patch, the flag will never be set and the current behaviour
> will continue. Only once you re-mount the filesystem (or the filesystem
> is re-auto-mounted) will the inode numbers change. When that happens,
> it is likely that the filesystem st_dev number seen on the client will
> change anyway.
>
> Acked-by: Josef Bacik <josef@toxicpanda.com> (for BTFS change)
s/BTFS/BTRFS/.
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/btrfs/inode.c | 4 ++++
> fs/nfsd/nfs3xdr.c | 15 ++++++++++++++-
> fs/nfsd/nfs4xdr.c | 7 ++++---
> fs/nfsd/nfsfh.c | 14 ++++++++++++--
> fs/nfsd/nfsfh.h | 34 +++++++++++++++++++++++++++++++---
> fs/nfsd/xdr3.h | 2 ++
> include/linux/stat.h | 18 ++++++++++++++++++
> 7 files changed, 85 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 06f9f167222b..d35e2a30f25f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
> generic_fillattr(&init_user_ns, inode, stat);
> stat->dev = BTRFS_I(inode)->root->anon_dev;
>
> + if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
> + stat->ino_uniquifier =
> + swab64(BTRFS_I(inode)->root->root_key.objectid);
> +
> spin_lock(&BTRFS_I(inode)->lock);
> delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
> inode_bytes = inode_get_bytes(inode);
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 3d37923afb06..5e2d5c352ecd 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
> {
> struct user_namespace *userns = nfsd_user_namespace(rqstp);
> __be32 *p;
> + u64 ino;
> u64 fsid;
>
> p = xdr_reserve_space(xdr, XDR_UNIT * 21);
> @@ -377,7 +378,8 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
> p = xdr_encode_hyper(p, fsid);
>
> /* fileid */
> - p = xdr_encode_hyper(p, stat->ino);
> + ino = nfsd_uniquify_ino(fhp, stat);
> + p = xdr_encode_hyper(p, ino);
>
> p = encode_nfstime3(p, &stat->atime);
> p = encode_nfstime3(p, &stat->mtime);
> @@ -1151,6 +1153,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
> if (xdr_stream_encode_item_present(xdr) < 0)
> return false;
> /* fileid */
> + if (!resp->dir_have_uniquifier) {
> + struct kstat stat;
> + if (fh_getattr(&resp->fh, &stat) == nfs_ok)
> + resp->dir_ino_uniquifier =
> + nfsd_ino_uniquifier(&resp->fh, &stat);
> + else
> + resp->dir_ino_uniquifier = 0;
> + resp->dir_have_uniquifier = true;
This took me a minute. So we're assuming the uniquifier stays the same
across a directory and its children (because you can't hard link across
subvolumes), and this code is just caching the uniquifier for use across
the directory--is that right?
> + }
> + if (resp->dir_ino_uniquifier != ino)
> + ino ^= resp->dir_ino_uniquifier;
I guess this check (here and in nfsd_uniquify_ino) is just to prevent
returning inode number zero?
--b.
> if (xdr_stream_encode_u64(xdr, ino) < 0)
> return false;
> /* name */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index a54b2845473b..6e31f6286e0b 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3114,10 +3114,11 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> fhp->fh_handle.fh_size);
> }
> if (bmval0 & FATTR4_WORD0_FILEID) {
> + u64 ino = nfsd_uniquify_ino(fhp, &stat);
> p = xdr_reserve_space(xdr, 8);
> if (!p)
> goto out_resource;
> - p = xdr_encode_hyper(p, stat.ino);
> + p = xdr_encode_hyper(p, ino);
> }
> if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
> p = xdr_reserve_space(xdr, 8);
> @@ -3274,7 +3275,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>
> p = xdr_reserve_space(xdr, 8);
> if (!p)
> - goto out_resource;
> + goto out_resource;
> /*
> * Get parent's attributes if not ignoring crossmount
> * and this is the root of a cross-mounted filesystem.
> @@ -3284,7 +3285,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
> err = get_parent_attributes(exp, &parent_stat);
> if (err)
> goto out_nfserr;
> - ino = parent_stat.ino;
> + ino = nfsd_uniquify_ino(fhp, &parent_stat);
> }
> p = xdr_encode_hyper(p, ino);
> }
> diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
> index 7695c0f1eefe..18bb139f8bfe 100644
> --- a/fs/nfsd/nfsfh.c
> +++ b/fs/nfsd/nfsfh.c
> @@ -9,6 +9,7 @@
> */
>
> #include <linux/exportfs.h>
> +#include <linux/magic.h>
>
> #include <linux/sunrpc/svcauth_gss.h>
> #include "nfsd.h"
> @@ -173,7 +174,7 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp)
>
> if (--data_left < 0)
> return error;
> - if (fh->fh_auth_type != 0)
> + if ((fh->fh_flags & ~NFSD_FH_FLAG_ALL) != 0)
> return error;
> len = key_len(fh->fh_fsid_type) / 4;
> if (len == 0)
> @@ -532,6 +533,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
>
> struct inode * inode = d_inode(dentry);
> dev_t ex_dev = exp_sb(exp)->s_dev;
> + u8 flags = 0;
>
> dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n",
> MAJOR(ex_dev), MINOR(ex_dev),
> @@ -548,6 +550,14 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
> /* If we have a ref_fh, then copy the fh_no_wcc setting from it. */
> fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false;
>
> + if (ref_fh && ref_fh->fh_export == exp) {
> + flags = ref_fh->fh_handle.fh_flags;
> + } else {
> + /* Set flags as needed */
> + if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC)
> + flags |= NFSD_FH_FLAG_INO_UNIQUIFY;
> + }
> +
> if (ref_fh == fhp)
> fh_put(ref_fh);
>
> @@ -565,7 +575,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry,
>
> fhp->fh_handle.fh_size =
> key_len(fhp->fh_handle.fh_fsid_type) + 4;
> - fhp->fh_handle.fh_auth_type = 0;
> + fhp->fh_handle.fh_flags = flags;
>
> mk_fsid(fhp->fh_handle.fh_fsid_type,
> fhp->fh_handle.fh_fsid,
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f36234c474dc..bbc7ddd34143 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -18,11 +18,17 @@
> * The file handle starts with a sequence of four-byte words.
> * The first word contains a version number (1) and three descriptor bytes
> * that tell how the remaining 3 variable length fields should be handled.
> - * These three bytes are auth_type, fsid_type and fileid_type.
> + * These three bytes are flags, fsid_type and fileid_type.
> *
> * All four-byte values are in host-byte-order.
> *
> - * The auth_type field is deprecated and must be set to 0.
> + * The flags field (previously auth_type) can be used when nfsd behaviour
> + * needs to change in a non-compatible way, usually for some specific
> + * filesystem. Flags should only be set in filehandles for filesystems which
> + * need them.
> + * Current values:
> + * 1 - BTRFS only. Cause stat->ino_uniquifier to be used to improve inode
> + * number uniqueness.
> *
> * The fsid_type identifies how the filesystem (or export point) is
> * encoded.
> @@ -53,7 +59,7 @@ struct knfsd_fh {
> __u32 fh_raw[NFS4_FHSIZE/4];
> struct {
> __u8 fh_version; /* == 1 */
> - __u8 fh_auth_type; /* deprecated */
> + __u8 fh_flags;
> __u8 fh_fsid_type;
> __u8 fh_fileid_type;
> __u32 fh_fsid[]; /* flexible-array member */
> @@ -131,6 +137,28 @@ enum fsid_source {
> };
> extern enum fsid_source fsid_source(const struct svc_fh *fhp);
>
> +enum nfsd_fh_flags {
> + NFSD_FH_FLAG_INO_UNIQUIFY = 1, /* BTRFS only */
> +
> + NFSD_FH_FLAG_ALL = 1
> +};
> +
> +static inline u64 nfsd_ino_uniquifier(const struct svc_fh *fhp,
> + const struct kstat *stat)
> +{
> + if (fhp->fh_handle.fh_flags & NFSD_FH_FLAG_INO_UNIQUIFY)
> + return stat->ino_uniquifier;
> + return 0;
> +}
> +
> +static inline u64 nfsd_uniquify_ino(const struct svc_fh *fhp,
> + const struct kstat *stat)
> +{
> + u64 u = nfsd_ino_uniquifier(fhp, stat);
> + if (u != stat->ino)
> + return stat->ino ^ u;
> + return stat->ino;
> +}
>
> /*
> * This might look a little large to "inline" but in all calls except
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index 933008382bbe..d9b6c8314bbb 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -179,6 +179,8 @@ struct nfsd3_readdirres {
> struct xdr_buf dirlist;
> struct svc_fh scratch;
> struct readdir_cd common;
> + u64 dir_ino_uniquifier;
> + bool dir_have_uniquifier;
> unsigned int cookie_offset;
> struct svc_rqst * rqstp;
>
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index fff27e603814..0f3f74d302f8 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -46,6 +46,24 @@ struct kstat {
> struct timespec64 btime; /* File creation time */
> u64 blocks;
> u64 mnt_id;
> + /*
> + * BTRFS does not provide unique inode numbers within a filesystem,
> + * depending on a synthetic 'dev' to provide uniqueness.
> + * NFSd cannot make use of this 'dev' number so clients often see
> + * duplicate inode numbers.
> + * For BTRFS, 'ino' is unlikely to use the high bits until the filesystem
> + * has created a great many inodes.
> + * It puts another number in ino_uniquifier which:
> + * - has most entropy in the high bits
> + * - is different precisely when 'dev' is different
> + * - is stable across unmount/remount
> + * NFSd can xor this with 'ino' to get a substantially more unique
> + * number for reporting to the client.
> + * The ino_uniquifier for a directory can reasonably be applied
> + * to inode numbers reported by the readdir filldir callback.
> + * It is NOT currently exported to user-space.
> + */
> + u64 ino_uniquifier;
> };
>
> #endif
> --
> 2.32.0
>
>
next prev parent reply other threads:[~2021-08-26 20:19 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-08-26 4:28 [PATCH] NFSD: drop support for ancient file-handles NeilBrown
2021-08-26 6:03 ` [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-26 20:19 ` J. Bruce Fields [this message]
2021-08-26 22:10 ` NeilBrown
2021-08-27 14:53 ` Frank Filz
2021-08-27 22:57 ` NeilBrown
2021-08-27 23:46 ` Frank Filz
2021-08-27 23:55 ` NeilBrown
2021-08-28 2:21 ` Frank Filz
2021-08-27 18:32 ` J. Bruce Fields
2021-08-27 23:01 ` NeilBrown
2021-08-27 16:20 ` Christoph Hellwig
2021-08-27 23:05 ` NeilBrown
2021-08-28 7:09 ` Christoph Hellwig
2021-08-31 4:59 ` NeilBrown
2021-09-01 7:20 ` Christoph Hellwig
2021-09-01 15:22 ` J. Bruce Fields
2021-09-02 4:14 ` NeilBrown
2021-09-05 16:07 ` J. Bruce Fields
2021-09-06 1:29 ` NeilBrown
2021-09-11 14:12 ` Amir Goldstein
2021-09-13 0:43 ` NeilBrown
2021-09-13 10:04 ` Amir Goldstein
2021-09-13 22:59 ` NeilBrown
2021-09-14 5:45 ` Amir Goldstein
2021-09-20 22:09 ` NeilBrown
2021-09-02 7:11 ` Christoph Hellwig
2021-09-02 4:06 ` NeilBrown
2021-09-02 7:16 ` Christoph Hellwig
2021-09-02 7:53 ` Miklos Szeredi
2021-09-02 14:16 ` Frank Filz
2021-09-02 23:02 ` NeilBrown
2021-08-26 14:10 ` [PATCH] NFSD: drop support for ancient file-handles Chuck Lever III
2021-08-26 21:38 ` NeilBrown
2021-08-26 14:51 ` J. Bruce Fields
2021-08-26 21:41 ` NeilBrown
2021-08-27 15:15 ` Christoph Hellwig
2021-08-27 23:24 ` NeilBrown
2021-08-31 4:41 ` [PATCH 1/2 v2] NFSD: drop support for ancient filehandles NeilBrown
2021-08-31 4:42 ` [PATCH 2/2] NFSD: simplify struct nfsfh NeilBrown
2021-09-01 7:44 ` Christoph Hellwig
2021-09-01 7:44 ` [PATCH 1/2 v2] NFSD: drop support for ancient filehandles Christoph Hellwig
2021-09-01 14:21 ` Chuck Lever III
2021-09-02 1:14 ` [PATCH 1/3 v3] NFSD: move filehandle format declarations out of "uapi" NeilBrown
2021-09-02 1:15 ` [PATCH 2/3 v3] NFSD: drop support for ancient filehandles NeilBrown
2021-09-02 1:16 ` [PATCH 3/3 v3] NFSD: simplify struct nfsfh NeilBrown
2021-09-02 7:22 ` [PATCH 2/3 v3] NFSD: drop support for ancient filehandles Christoph Hellwig
2021-09-02 7:21 ` [PATCH 1/3 v3] NFSD: move filehandle format declarations out of "uapi" Christoph Hellwig
2021-09-23 21:21 ` Bruce Fields
2021-09-25 4:21 ` NeilBrown
-- strict thread matches above, loose matches on Subject: below --
2021-07-27 22:37 [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly NeilBrown
2021-08-13 1:45 ` [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export NeilBrown
2021-08-15 7:39 ` Goffredo Baroncelli
2021-08-15 19:35 ` Roman Mamedov
2021-08-15 22:17 ` NeilBrown
2021-08-23 4:05 ` [PATCH v2] BTRFS/NFSD: " NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210826201916.GB10730@fieldses.org \
--to=bfields@fieldses.org \
--cc=chuck.lever@oracle.com \
--cc=josef@toxicpanda.com \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).