Re: any idea about auto export multiple btrfs snapshots?

public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Wang Yugui <wangyugui@e16-tech.com>
To: "NeilBrown" <neilb@suse.de>
Cc: linux-nfs@vger.kernel.org
Subject: Re: any idea about auto export multiple btrfs snapshots?
Date: Wed, 23 Jun 2021 14:14:44 +0800	[thread overview]
Message-ID: <20210623141444.C4F2.409509F4@e16-tech.com> (raw)
In-Reply-To: <162440994038.28671.7338874000115610814@noble.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 15976 bytes --]

Hi,

This patch works very well. Thanks a lot.
-  crossmnt of btrfs subvol works as expected.
-  nfs/umount subvol works well.
-  pseudo mount point inode(255) is good.

I test it in 5.10.45 with a few minor rebase.
( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
just fs/nfsd/nfs3xdr.c rebase)

But when I tested it with another btrfs system without subvol but with
more data, 'find /nfs/test' caused a OOPS .  and this OOPS will not
happen just without this patch.

The data in this filesystem is created/left by xfstest(FSTYP=nfs,
TEST_DEV).

#nfs4 option: default mount.nfs4, nfs-utils-2.3.3

# access btrfs directly
$ find /mnt/test | wc -l
6612

# access btrfs through nfs
$ find /nfs/test | wc -l

[  466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
[  466.172123] #PF: supervisor read access in kernel mode
[  466.177857] #PF: error_code(0x0000) - not-present page
[  466.183601] PGD 0 P4D 0
[  466.186443] Oops: 0000 [#1] SMP NOPTI
[  466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
[  466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
[  466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[  466.212067] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[  466.233061] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[  466.238894] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[  466.246872] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[  466.254848] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[  466.262825] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[  466.270802] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[  466.278779] FS:  0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[  466.287823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.294246] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[  466.302222] Call Trace:
[  466.304970]  nfsd4_encode_fattr+0x15ac/0x1940 [nfsd]
[  466.310557]  ? btrfs_verify_level_key+0xad/0xf0 [btrfs]
[  466.316413]  ? btrfs_search_slot+0x8e3/0x900 [btrfs]
[  466.321973]  nfsd4_encode_dirent+0x160/0x3b0 [nfsd]
[  466.327434]  nfsd_readdir+0x199/0x240 [nfsd]
[  466.332215]  ? nfsd4_encode_getattr+0x30/0x30 [nfsd]
[  466.337771]  ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
[  466.343714]  ? security_prepare_creds+0x6f/0xa0
[  466.348788]  nfsd4_encode_readdir+0xd9/0x1c0 [nfsd]
[  466.354250]  nfsd4_encode_operation+0x9b/0x1b0 [nfsd]
[  466.360430]  nfsd4_proc_compound+0x4e3/0x710 [nfsd]
[  466.366352]  nfsd_dispatch+0xd4/0x180 [nfsd]
[  466.371620]  svc_process_common+0x392/0x6c0 [sunrpc]
[  466.377650]  ? svc_recv+0x3c4/0x8a0 [sunrpc]
[  466.382883]  ? nfsd_svc+0x300/0x300 [nfsd]
[  466.387908]  ? nfsd_destroy+0x60/0x60 [nfsd]
[  466.393126]  svc_process+0xb7/0xf0 [sunrpc]
[  466.398234]  nfsd+0xe8/0x140 [nfsd]
[  466.402555]  kthread+0x116/0x130
[  466.406579]  ? kthread_park+0x80/0x80
[  466.411091]  ret_from_fork+0x1f/0x30
[  466.415499] Modules linked in: acpi_ipmi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill intel_rapl_m
sr intel_rapl_common iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas ipmi_ssif sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_si rapl intel_cstate mei_me ipmi_devintf intel_uncore j
oydev mei ipmi_msghandler lpc_ich acpi_power_meter nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt nfsd rdma
_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl lockd grace nfs_ssc ip_tables xfs mgag200
 drm_kms_helper crct10dif_pclmul crc32_pclmul btrfs cec crc32c_intel xor bnx2x raid6_pq drm igb mpt3sas ghash_
clmulni_intel pcspkr nvme megaraid_sas mdio nvme_core dca raid_class i2c_algo_bit scsi_transport_sas wmi dm_mu
ltipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc i2c_dev
[  466.499551] CR2: 0000000000000004
[  466.503759] ---[ end trace 91eb52bf0cb65801 ]---
[  466.511948] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[  466.517714] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[  466.539753] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[  466.546122] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[  466.554625] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[  466.563096] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[  466.571572] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[  466.580024] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[  466.588487] FS:  0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[  466.598032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.604973] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[  466.613467] Kernel panic - not syncing: Fatal exception
[  466.807651] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffff
fffbfffffff)
[  466.823190] ---[ end Kernel panic - not syncing: Fatal exception ]---


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/06/23

> On Tue, 22 Jun 2021, Wang Yugui wrote:
> > > 
> > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
> > 
> > btrfs subvol crossmnt begin to work, although buggy.
> > 
> > some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> > wrongly crossmnt-ed
> > 
> > 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> > to happen.
> > 
> > This is the current patch based on 5.10.44. 
> > At least nfsd_follow_up() is buggy.
> > 
> 
> I don't think the approach you are taking makes sense.  Let me explain
> why.
> 
> The problem is that applications on the NFS client can see different
> files or directories on the same (apparent) filesystem with the same
> inode number.  Most application won't care and NFS itself doesn't get
> confused by the duplicate inode numbers, but 'find' and similar programs
> (probably 'tar' for example) do get upset.
> 
> This happens because BTRFS reuses inode numbers in subvols which it
> presents to the kernel as all part of the one filesystem (or at least,
> all part of the one mount point).  NFSD only sees one filesystem, and so
> reports the same filesystem-id (fsid) for all objects.  The NFS client
> then sees that the fsid is the same and tells applications that the
> objects are all in the one filesystem.
> 
> To fix this, we need to make sure that nfsd reports a different fsid for
> objects in different subvols.  There are two obvious ways to do this.
> 
> One is to teach nfsd to recognize btrfs subvolumes exactly like separate
> filesystems (as nfsd already ensure each filesystem gets its own fsid).
> This is the approach of my first suggestion.  It requires changing
> nfsd_mountpoint() and follow_up() and any other code that is aware of
> different filesytems.  As I mentioned, it also requires changing mountd
> to be able to extract a list of subvols from btrfs because they don't
> appear in /proc/mounts.  
> 
> As you might know an NFS filehandle has 3 parts: a header, a filesystem
> identifier, and an inode identifier.  This approach would involve giving
> different subvols different filesystem identifiers in the filehandle.
> This, it turns out is a very big change - bigger than I at first
> imagined.
> 
> The second obvious approach is to leave the filehandles unchanged and to
> continue to treat an entire btrfs filesystem as a single filesystem
> EXCEPT when reporting the fsid to the NFS client.  All we *really* need
> to do is make sure the client sees a different fsid when it enters a
> part of the filesystem which re-uses inode numbers.  This is what my
> latest patch did.
> 
> Your patch seems to combine ideas from both approaches.  It includes my
> code to replace the fsid, but also intercepts follow_up etc.  This
> cannot be useful.
> 
> As I noted when I posted it, there is a problem with my patch.  I now
> understand that problem.
> 
> When NFS sees that fsid change it needs to create 2 inodes for that
> directory.  One inode will be in the parent filesystem and will be
> marked as an auto-mount point so that any lookup below that directory
> will trigger an internal mount.  The other inode is the root of the
> child filesystem.  It gets mounted on the first inode.
> 
> With normal filesystem mounts, there really is an inode in the parent
> filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
> attribute.  This fileid will be different from all other inode numbers
> in the parent filesystem.
> 
> With BTRFS there is no inode in the parent volume (as far as I know) so
> there is nothing useful to return for MOUNTED_ON_FILEID.  This results
> in NFS using the same inode number for the inode in the parent
> filesystem as the inode in the child filesystem.  For btrfs, this will
> be 256.  As there is already an inode in the parent filesystem with inum
> 256, 'find' complains.
> 
> The following patch addresses this by adding code to nfsd when it
> determines MOUINTD_ON_FILEID to choose an number that should be unused
> in btrfs.  With this change, 'find' seems to work correctly with NFSv4
> mounts of btrfs.
> 
> This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
> attribute - strictly speaking, the NFSv3 protocol doesn't support
> crossing mount points, though the Linux implementation does allow it.
> 
> So this patch works and, I think, is the best we can do in terms of
> functionality.  I don't like the details of the implementation though.
> It requires NFSD to know too much about BTRFS internals.
> 
> I think I would like btrfs to make it clear where a subvol started,
> maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
> hint, not a promise of anything, so other code should get confused.
> This would have nfsd calling vfs_statfs quite so often ....  maybe that
> isn't such a big deal.
> 
> More importantly, there needs to be some way for NFSD to find an inode
> number to report for the MOUNTED_ON_FILEID.  This needs to be a number
> not used elsewhere in the filesystem.  It might be safe to use the
> same fileid for all subvols (as my patch currently does), but we would
> need to confirm that 'find' and 'tar' don't complain about that or
> mishandle it.  If it is safe to use the same fileid, then a new field in
> the superblock to store it might work.  If a different fileid is needed,
> the we might need a new field in 'struct kstatfs', so vfs_statfs can
> report it.
> 
> Anyway, here is my current patch.  It includes support for NFSv3 as well
> as NFSv4.
> 
> NeilBrown
> 
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
>  #include <linux/slab.h>
>  #include <linux/namei.h>
>  #include <linux/module.h>
> +#include <linux/statfs.h>
>  #include <linux/exportfs.h>
>  #include <linux/sunrpc/svc_xprt.h>
>  
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	int err;
>  	struct auth_domain *dom = NULL;
>  	struct svc_export exp = {}, *expp;
> +	struct kstatfs statfs;
>  	int an_int;
>  
>  	if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	err = kern_path(buf, 0, &exp.ex_path);
>  	if (err)
>  		goto out1;
> +	err = vfs_statfs(&exp.ex_path, &statfs);
> +	if (err)
> +		goto out3;
> +	exp.ex_fsid64 = statfs.f_fsid;
>  
>  	exp.ex_client = dom;
>  	exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
>  	new->ex_anon_uid = item->ex_anon_uid;
>  	new->ex_anon_gid = item->ex_anon_gid;
>  	new->ex_fsid = item->ex_fsid;
> +	new->ex_fsid64 = item->ex_fsid64;
>  	new->ex_devid_map = item->ex_devid_map;
>  	item->ex_devid_map = NULL;
>  	new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
>  	kuid_t			ex_anon_uid;
>  	kgid_t			ex_anon_gid;
>  	int			ex_fsid;
> +	__kernel_fsid_t		ex_fsid64;
>  	unsigned char *		ex_uuid; /* 16 byte fsid */
>  	struct nfsd4_fs_locations ex_fslocs;
>  	uint32_t		ex_nflavors;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..f11ba3434fd6 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  	case FSIDSOURCE_FSID:
>  		fsid = (u64)fhp->fh_export->ex_fsid;
>  		break;
> -	case FSIDSOURCE_UUID:
> +	case FSIDSOURCE_UUID: {
> +		struct kstatfs statfs;
> +
>  		fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
>  		fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
> +		if (fh_getstafs(fhp, &statfs) == 0 &&
> +		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
> +		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
> +			/* looks like a btrfs subvol */
> +			fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
>  		break;
> +		}
>  	default:
>  		fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
>  	}
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..5f614d1b362e 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -42,6 +42,7 @@
>  #include <linux/sunrpc/svcauth_gss.h>
>  #include <linux/sunrpc/addr.h>
>  #include <linux/xattr.h>
> +#include <linux/btrfs_tree.h>
>  #include <uapi/linux/xattr.h>
>  
>  #include "idmap.h"
> @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  	if (err)
>  		goto out_nfserr;
>  	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> +		       FATTR4_WORD0_FSID |
>  			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
>  	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> +		       FATTR4_WORD1_MOUNTED_ON_FILEID |
>  		       FATTR4_WORD1_SPACE_TOTAL))) {
>  		err = vfs_statfs(&path, &statfs);
>  		if (err)
> @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		case FSIDSOURCE_UUID:
>  			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
>  								EX_UUID_LEN);
> +			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> +				/* looks like a btrfs subvol */
> +				p[-2] ^= statfs.f_fsid.val[0];
> +				p[-1] ^= statfs.f_fsid.val[1];
> +			}
>  			break;
>  		}
>  	}
> @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  				goto out_nfserr;
>  			ino = parent_stat.ino;
>  		}
> +		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
> +		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
> +			    /* btrfs subvol pseudo mount point */
> +			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
> +
>  		p = xdr_encode_hyper(p, ino);
>  	}
>  #ifdef CONFIG_NFSD_PNFS
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index b21b76e6b9a8..82b76b0b7bec 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
>  				    AT_STATX_SYNC_AS_STAT));
>  }
>  
> +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
> +{
> +	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> +			 .dentry = fh->fh_dentry};
> +	return nfserrno(vfs_statfs(&p, statfs));
> +}
> +
>  static inline int nfsd_create_is_exclusive(int createmode)
>  {
>  	return createmode == NFS3_CREATE_EXCLUSIVE


[-- Attachment #2: 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch --]
[-- Type: application/octet-stream, Size: 10440 bytes --]

From 7f674853edde79f37589586ef219b8650e409677 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Wed, 23 Jun 2021 10:59:00 +1000
Subject: [PATCH] any idea about auto export multiple btrfs snapshots?

On Tue, 22 Jun 2021, Wang Yugui wrote:
> >
> > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
>
> btrfs subvol crossmnt begin to work, although buggy.
>
> some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> wrongly crossmnt-ed
>
> 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> to happen.
>
> This is the current patch based on 5.10.44.
> At least nfsd_follow_up() is buggy.
>

I don't think the approach you are taking makes sense.  Let me explain
why.

The problem is that applications on the NFS client can see different
files or directories on the same (apparent) filesystem with the same
inode number.  Most application won't care and NFS itself doesn't get
confused by the duplicate inode numbers, but 'find' and similar programs
(probably 'tar' for example) do get upset.

This happens because BTRFS reuses inode numbers in subvols which it
presents to the kernel as all part of the one filesystem (or at least,
all part of the one mount point).  NFSD only sees one filesystem, and so
reports the same filesystem-id (fsid) for all objects.  The NFS client
then sees that the fsid is the same and tells applications that the
objects are all in the one filesystem.

To fix this, we need to make sure that nfsd reports a different fsid for
objects in different subvols.  There are two obvious ways to do this.

One is to teach nfsd to recognize btrfs subvolumes exactly like separate
filesystems (as nfsd already ensure each filesystem gets its own fsid).
This is the approach of my first suggestion.  It requires changing
nfsd_mountpoint() and follow_up() and any other code that is aware of
different filesytems.  As I mentioned, it also requires changing mountd
to be able to extract a list of subvols from btrfs because they don't
appear in /proc/mounts.

As you might know an NFS filehandle has 3 parts: a header, a filesystem
identifier, and an inode identifier.  This approach would involve giving
different subvols different filesystem identifiers in the filehandle.
This, it turns out is a very big change - bigger than I at first
imagined.

The second obvious approach is to leave the filehandles unchanged and to
continue to treat an entire btrfs filesystem as a single filesystem
EXCEPT when reporting the fsid to the NFS client.  All we *really* need
to do is make sure the client sees a different fsid when it enters a
part of the filesystem which re-uses inode numbers.  This is what my
latest patch did.

Your patch seems to combine ideas from both approaches.  It includes my
code to replace the fsid, but also intercepts follow_up etc.  This
cannot be useful.

As I noted when I posted it, there is a problem with my patch.  I now
understand that problem.

When NFS sees that fsid change it needs to create 2 inodes for that
directory.  One inode will be in the parent filesystem and will be
marked as an auto-mount point so that any lookup below that directory
will trigger an internal mount.  The other inode is the root of the
child filesystem.  It gets mounted on the first inode.

With normal filesystem mounts, there really is an inode in the parent
filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
attribute.  This fileid will be different from all other inode numbers
in the parent filesystem.

With BTRFS there is no inode in the parent volume (as far as I know) so
there is nothing useful to return for MOUNTED_ON_FILEID.  This results
in NFS using the same inode number for the inode in the parent
filesystem as the inode in the child filesystem.  For btrfs, this will
be 256.  As there is already an inode in the parent filesystem with inum
256, 'find' complains.

The following patch addresses this by adding code to nfsd when it
determines MOUINTD_ON_FILEID to choose an number that should be unused
in btrfs.  With this change, 'find' seems to work correctly with NFSv4
mounts of btrfs.

This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
attribute - strictly speaking, the NFSv3 protocol doesn't support
crossing mount points, though the Linux implementation does allow it.

So this patch works and, I think, is the best we can do in terms of
functionality.  I don't like the details of the implementation though.
It requires NFSD to know too much about BTRFS internals.

I think I would like btrfs to make it clear where a subvol started,
maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
hint, not a promise of anything, so other code should get confused.
This would have nfsd calling vfs_statfs quite so often ....  maybe that
isn't such a big deal.

More importantly, there needs to be some way for NFSD to find an inode
number to report for the MOUNTED_ON_FILEID.  This needs to be a number
not used elsewhere in the filesystem.  It might be safe to use the
same fileid for all subvols (as my patch currently does), but we would
need to confirm that 'find' and 'tar' don't complain about that or
mishandle it.  If it is safe to use the same fileid, then a new field in
the superblock to store it might work.  If a different fileid is needed,
the we might need a new field in 'struct kstatfs', so vfs_statfs can
report it.

Anyway, here is my current patch.  It includes support for NFSv3 as well
as NFSv4.

NeilBrown
---
 fs/nfsd/export.c  |  7 +++++++
 fs/nfsd/export.h  |  1 +
 fs/nfsd/nfs3xdr.c | 10 +++++++++-
 fs/nfsd/nfs4xdr.c | 15 +++++++++++++++
 fs/nfsd/vfs.h     |  7 +++++++
 5 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 9421dae22737..790a3357525d 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/namei.h>
 #include <linux/module.h>
+#include <linux/statfs.h>
 #include <linux/exportfs.h>
 #include <linux/sunrpc/svc_xprt.h>
 
@@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	int err;
 	struct auth_domain *dom = NULL;
 	struct svc_export exp = {}, *expp;
+	struct kstatfs statfs;
 	int an_int;
 
 	if (mesg[mlen-1] != '\n')
@@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 	err = kern_path(buf, 0, &exp.ex_path);
 	if (err)
 		goto out1;
+	err = vfs_statfs(&exp.ex_path, &statfs);
+	if (err)
+		goto out3;
+	exp.ex_fsid64 = statfs.f_fsid;
 
 	exp.ex_client = dom;
 	exp.cd = cd;
@@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
 	new->ex_fsid = item->ex_fsid;
+	new->ex_fsid64 = item->ex_fsid64;
 	new->ex_devid_map = item->ex_devid_map;
 	item->ex_devid_map = NULL;
 	new->ex_uuid = item->ex_uuid;
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index ee0e3aba4a6e..d3eb9a599918 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -68,6 +68,7 @@ struct svc_export {
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
+	__kernel_fsid_t		ex_fsid64;
 	unsigned char *		ex_uuid; /* 16 byte fsid */
 	struct nfsd4_fs_locations ex_fslocs;
 	uint32_t		ex_nflavors;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..f11ba3434fd6 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -153,11 +153,19 @@ static __be32 *encode_fsid(__be32 *p, struct svc_fh *fhp)
 	case FSIDSOURCE_FSID:
 		p = xdr_encode_hyper(p, (u64) fhp->fh_export->ex_fsid);
 		break;
-	case FSIDSOURCE_UUID:
+	case FSIDSOURCE_UUID: {
+		struct kstatfs statfs;
+
 		f = ((u64*)fhp->fh_export->ex_uuid)[0];
 		f ^= ((u64*)fhp->fh_export->ex_uuid)[1];
+		if (fh_getstafs(fhp, &statfs) == 0 &&
+		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
+			/* looks like a btrfs subvol */
+			f = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
 		p = xdr_encode_hyper(p, f);
 		break;
+		}
 	}
 	return p;
 }
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..5f614d1b362e 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -42,6 +42,7 @@
 #include <linux/sunrpc/svcauth_gss.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/xattr.h>
+#include <linux/btrfs_tree.h>
 #include <uapi/linux/xattr.h>
 
 #include "idmap.h"
@@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 	if (err)
 		goto out_nfserr;
 	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
+		       FATTR4_WORD0_FSID |
 			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
 	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
+		       FATTR4_WORD1_MOUNTED_ON_FILEID |
 		       FATTR4_WORD1_SPACE_TOTAL))) {
 		err = vfs_statfs(&path, &statfs);
 		if (err)
@@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		case FSIDSOURCE_UUID:
 			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
 								EX_UUID_LEN);
+			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
+				/* looks like a btrfs subvol */
+				p[-2] ^= statfs.f_fsid.val[0];
+				p[-1] ^= statfs.f_fsid.val[1];
+			}
 			break;
 		}
 	}
@@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 				goto out_nfserr;
 			ino = parent_stat.ino;
 		}
+		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
+		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
+		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
+			    /* btrfs subvol pseudo mount point */
+			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
+
 		p = xdr_encode_hyper(p, ino);
 	}
 #ifdef CONFIG_NFSD_PNFS
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index b21b76e6b9a8..82b76b0b7bec 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
 				    AT_STATX_SYNC_AS_STAT));
 }
 
+static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
+{
+	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
+			 .dentry = fh->fh_dentry};
+	return nfserrno(vfs_statfs(&p, statfs));
+}
+
 static inline int nfsd_create_is_exclusive(int createmode)
 {
 	return createmode == NFS3_CREATE_EXCLUSIVE
-- 
2.30.2

next prev parent reply	other threads:[~2021-06-23  6:14 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-13  3:53 any idea about auto export multiple btrfs snapshots? Wang Yugui
2021-06-14 22:50 ` NeilBrown
2021-06-15 15:13   ` Wang Yugui
2021-06-15 15:41     ` Wang Yugui
2021-06-16  5:47     ` Wang Yugui
2021-06-17  3:02     ` NeilBrown
2021-06-17  4:28       ` Wang Yugui
2021-06-18  0:32         ` NeilBrown
2021-06-18  7:26           ` Wang Yugui
2021-06-18 13:34             ` Wang Yugui
2021-06-19  6:47               ` Wang Yugui
2021-06-20 12:27             ` Wang Yugui
2021-06-21  4:52             ` NeilBrown
2021-06-21  5:13               ` NeilBrown
2021-06-21  8:34                 ` Wang Yugui
2021-06-22  1:28                   ` NeilBrown
2021-06-22  3:22                     ` Wang Yugui
2021-06-22  7:14                       ` Wang Yugui
2021-06-23  0:59                         ` NeilBrown
2021-06-23  6:14                           ` Wang Yugui [this message]
2021-06-23  6:29                             ` NeilBrown
2021-06-23  9:34                               ` Wang Yugui
2021-06-23 23:38                                 ` NeilBrown
2021-06-23 15:35                           ` J. Bruce Fields
2021-06-23 22:04                             ` NeilBrown
2021-06-23 22:25                               ` J. Bruce Fields
2021-06-23 23:29                                 ` NeilBrown
2021-06-23 23:41                                   ` Frank Filz
2021-06-24  0:01                                   ` J. Bruce Fields
2021-06-24 21:58                               ` Patrick Goetz
2021-06-24 23:27                                 ` NeilBrown
2021-06-21 14:35               ` Frank Filz
2021-06-21 14:55                 ` Wang Yugui
2021-06-21 17:49                   ` Frank Filz
2021-06-21 22:41                     ` Wang Yugui
2021-06-22 17:34                       ` Frank Filz
2021-06-22 22:48                         ` Wang Yugui
2021-06-17  2:15   ` Wang Yugui
     [not found] ` <20210310074620.GA2158@tik.uni-stuttgart.de>
     [not found]   ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
2021-07-15 14:09     ` [PATCH/RFC] NFSD: handle BTRFS subvolumes better Josef Bacik
2021-07-15 16:45       ` Christoph Hellwig
2021-07-15 17:11         ` Josef Bacik
2021-07-15 17:24           ` Christoph Hellwig
2021-07-15 18:01             ` Josef Bacik
2021-07-15 22:37               ` NeilBrown
2021-07-19 15:40                 ` Josef Bacik
2021-07-19 20:00                   ` J. Bruce Fields
2021-07-19 20:44                     ` Josef Bacik
2021-07-19 23:53                       ` NeilBrown
2021-07-19 15:49                 ` J. Bruce Fields
2021-07-20  0:02                   ` NeilBrown
2021-07-19  9:16               ` Christoph Hellwig
2021-07-19 23:54                 ` NeilBrown
2021-07-20  6:23                   ` Christoph Hellwig
2021-07-20  7:17                     ` NeilBrown
2021-07-20  8:00                       ` Christoph Hellwig
2021-07-20 23:11                         ` NeilBrown
2021-07-20 22:10               ` J. Bruce Fields
2021-07-15 23:02       ` NeilBrown
2021-07-15 15:45     ` J. Bruce Fields
2021-07-15 23:08       ` NeilBrown

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:9421dae2273 dfblob:790a3357525 dfblob:ee0e3aba4a6
dfblob:d3eb9a59991 dfblob:0a5ebc52e6a dfblob:f11ba3434fd
dfblob:7abeccb975b dfblob:5f614d1b362 dfblob:b21b76e6b9a
dfblob:82b76b0b7be )
 OR (
bs:"any idea about auto export multiple btrfs snapshots?" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210623141444.C4F2.409509F4@e16-tech.com \
    --to=wangyugui@e16-tech.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox