From: Josef Bacik <josef@toxicpanda.com>
To: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>,
"J. Bruce Fields" <bfields@fieldses.org>,
Chuck Lever <chuck.lever@oracle.com>, Chris Mason <clm@fb.com>,
David Sterba <dsterba@suse.com>,
linux-nfs@vger.kernel.org, Wang Yugui <wangyugui@e16-tech.com>,
Ulli Horlacher <framstag@rus.uni-stuttgart.de>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.
Date: Mon, 19 Jul 2021 11:40:28 -0400 [thread overview]
Message-ID: <15d0f450-cae5-22bc-eef3-8a973e6dda27@toxicpanda.com> (raw)
In-Reply-To: <162638862766.13764.8566962032225976326@noble.neil.brown.name>
On 7/15/21 6:37 PM, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
>> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
>>> On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
>>>> Because there's no alternative. We need a way to tell userspace they've
>>>> wandered into a different inode namespace. There's no argument that what
>>>> we're doing is ugly, but there's never been a clear "do X instead". Just a
>>>> lot of whinging that btrfs is broken. This makes userspace happy and is
>>>> simple and straightforward. I'm open to alternatives, but there have been 0
>>>> workable alternatives proposed in the last decade of complaining about it.
>>>
>>> Make sure we cross a vfsmount when crossing the "st_dev" domain so
>>> that it is properly reported. Suggested many times and ignored all
>>> the time beause it requires a bit of work.
>>>
>>
>> You keep telling me this but forgetting that I did all this work when you
>> originally suggested it. The problem I ran into was the automount stuff
>> requires that we have a completely different superblock for every vfsmount.
>> This is fine for things like nfs or samba where the automount literally points
>> to a completely different mount, but doesn't work for btrfs where it's on the
>> same file system. If you have 1000 subvolumes and run sync() you're going to
>> write the superblock 1000 times for the same file system. You are going to
>> reclaim inodes on the same file system 1000 times. You are going to reclaim
>> dcache on the same filesytem 1000 times. You are also going to pin 1000
>> dentries/inodes into memory whenever you wander into these things because the
>> super is going to hold them open.
>>
>> This is not a workable solution. It's not a matter of simply tying into
>> existing infrastructure, we'd have to completely rework how the VFS deals with
>> this stuff in order to be reasonable. And when I brought this up to Al he told
>> me I was insane and we absolutely had to have a different SB for every vfsmount,
>> which means we can't use vfsmount for this, which means we don't have any other
>> options. Thanks,
>
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS". NFS still shares a
> lot between the multiple superblock - certainly it shares the same
> connection to the server.
>
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems. It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches. So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
>
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in. Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....
>
> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me. The "st_dev" is just part of the name, and already a
> fairly blurry part. Thanks to bind mounts, multiple mounts can have the
> same st_dev. I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.
>
Ok so setting aside btrfs for the moment, how does NFS deal with exporting a
directory that has multiple other file systems under that tree? I assume the
same sort of problem doesn't occur, but why is that? Is it because it's a
different vfsmount/sb or is there some other magic making this work? Thanks,
Josef
next prev parent reply other threads:[~2021-07-19 16:12 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-13 3:53 any idea about auto export multiple btrfs snapshots? Wang Yugui
2021-06-14 22:50 ` NeilBrown
2021-06-15 15:13 ` Wang Yugui
2021-06-15 15:41 ` Wang Yugui
2021-06-16 5:47 ` Wang Yugui
2021-06-17 3:02 ` NeilBrown
2021-06-17 4:28 ` Wang Yugui
2021-06-18 0:32 ` NeilBrown
2021-06-18 7:26 ` Wang Yugui
2021-06-18 13:34 ` Wang Yugui
2021-06-19 6:47 ` Wang Yugui
2021-06-20 12:27 ` Wang Yugui
2021-06-21 4:52 ` NeilBrown
2021-06-21 5:13 ` NeilBrown
2021-06-21 8:34 ` Wang Yugui
2021-06-22 1:28 ` NeilBrown
2021-06-22 3:22 ` Wang Yugui
2021-06-22 7:14 ` Wang Yugui
2021-06-23 0:59 ` NeilBrown
2021-06-23 6:14 ` Wang Yugui
2021-06-23 6:29 ` NeilBrown
2021-06-23 9:34 ` Wang Yugui
2021-06-23 23:38 ` NeilBrown
2021-06-23 15:35 ` J. Bruce Fields
2021-06-23 22:04 ` NeilBrown
2021-06-23 22:25 ` J. Bruce Fields
2021-06-23 23:29 ` NeilBrown
2021-06-23 23:41 ` Frank Filz
2021-06-24 0:01 ` J. Bruce Fields
2021-06-24 21:58 ` Patrick Goetz
2021-06-24 23:27 ` NeilBrown
2021-06-21 14:35 ` Frank Filz
2021-06-21 14:55 ` Wang Yugui
2021-06-21 17:49 ` Frank Filz
2021-06-21 22:41 ` Wang Yugui
2021-06-22 17:34 ` Frank Filz
2021-06-22 22:48 ` Wang Yugui
2021-06-17 2:15 ` Wang Yugui
[not found] ` <20210310074620.GA2158@tik.uni-stuttgart.de>
[not found] ` <162632387205.13764.6196748476850020429@noble.neil.brown.name>
2021-07-15 14:09 ` [PATCH/RFC] NFSD: handle BTRFS subvolumes better Josef Bacik
2021-07-15 16:45 ` Christoph Hellwig
2021-07-15 17:11 ` Josef Bacik
2021-07-15 17:24 ` Christoph Hellwig
2021-07-15 18:01 ` Josef Bacik
2021-07-15 22:37 ` NeilBrown
2021-07-19 15:40 ` Josef Bacik [this message]
2021-07-19 20:00 ` J. Bruce Fields
2021-07-19 20:44 ` Josef Bacik
2021-07-19 23:53 ` NeilBrown
2021-07-19 15:49 ` J. Bruce Fields
2021-07-20 0:02 ` NeilBrown
2021-07-19 9:16 ` Christoph Hellwig
2021-07-19 23:54 ` NeilBrown
2021-07-20 6:23 ` Christoph Hellwig
2021-07-20 7:17 ` NeilBrown
2021-07-20 8:00 ` Christoph Hellwig
2021-07-20 23:11 ` NeilBrown
2021-07-20 22:10 ` J. Bruce Fields
2021-07-15 23:02 ` NeilBrown
2021-07-15 15:45 ` J. Bruce Fields
2021-07-15 23:08 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=15d0f450-cae5-22bc-eef3-8a973e6dda27@toxicpanda.com \
--to=josef@toxicpanda.com \
--cc=bfields@fieldses.org \
--cc=chuck.lever@oracle.com \
--cc=clm@fb.com \
--cc=dsterba@suse.com \
--cc=framstag@rus.uni-stuttgart.de \
--cc=hch@infradead.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neilb@suse.de \
--cc=wangyugui@e16-tech.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox